Next: 4 Performance Evaluation
Up: 3 FastReplica Algorithm
Previous: 3.4 FastReplica in the
3.5 Reliable FastReplica Algorithm
In this Section, we extend the FastReplica algorithm to be able
to deal with node failures. The basic algorithm presented in
Sections 3.2, 3.4 is sensitive to node failures. For
example, if node
fails during either transfer shown in
Figures 1, 2 then this event
may impact all nodes
in the group because each node
depends on node
to receive subfile
. In the described
scenario, node
is acting as a recipient node in the replication
set. If a node fails when it acts as the origin node, e.g. node
in
Figure 6, this failure impacts all of the
replication groups in the replication subtree rooted in node
.
The reliable FastReplica algorithm proposed below efficiently
deals with node failures by making the local repair decision within
the particular group of nodes. It keeps the main structure of the
FastReplica algorithm practically unchanged while adding the desired
property of resilience to node failures.
In reliable FastReplica, the nodes of each group are exchanging
the heartbeat messages with their origin node. The heartbeat messages
from nodes to their origin node are augmented with additional
information on the corresponding algorithm step and group (list) of
nodes to which the nodes currently perform their transfers.
Figure 8:
Heartbeat group: the recipient nodes in
send heartbeat messages to
the origin node
.
 |
In Figure 8, the nodes
of group
form the heartbeat group with
their origin node
. Each node
sends to
the heartbeat messages with additional information on
node state in the replication process. Similarly, node
belongs to group
with the corresponding origin node
. Thus node
sends the heartbeat messages and
its node state to
.
There are different repair procedures depending on whether a failed
node was acting as a recipient node, e.g. node
in
replication set
, or a failed node was acting as an origin
node, e.g.
for replication set
.
- If node
fails while acting as a recipient node in
replication set
during the distribution step then the
communication pattern is similar to the pattern shown in
Figure 1. In this case, node
is aware of
the node
failure. Node
performs the
following repair step: it uses
already opened connections to the
rest of the nodes in group
to send the missing
file
to each node in the group as shown in Figure 9.
Figure 9:
Repair procedure for node
failed during distribution step.
 |
In this way, each node in group
receives all of the subfiles
of the original file
.
Additionally, node
acts as a ``substitute'' for the failed
node
in the next algorithm step. If node
was supposed to serve as the origin node to group
for the next algorithm iteration, then node
acts as the origin node to group
for this iteration.
- If node
fails while acting as a recipient node in
replication set
during the collection step then the
communication pattern is similar to the pattern shown in
Figure 2. Using the heartbeat messages, the
failure of node
is detected by node
. Node
performs the following repair
step: it opens connections to the impacted nodes in group
to send missing file
(similar to the repair step shown in
Figure 9). In this way, each node in group
receives all of the subfiles of the original file
.
Analogously, node
acts as a substitute for the failed
node
in the next algorithm step.
- If node
fails while acting as the origin node for
replication group
during the distribution step then
replication group
should be ``reattached'' to a
higher-level origin node. Let
be the corresponding origin
node for
from the previous iteration step as shown in
Figure 8. From heartbeat messages, node
detects node
failure. Node
analyzes what was the node
state in the replication
process preceding its failure. Then node
acts as a
replacement for
: it opens connections to the impacted
nodes in group
to send corresponding missing
files. Additionally,
updates every node in
about the change of the origin node (for future exchange of heartbeat
messages).
Reliable FastReplica, described above, aims to
minimize the impact of node failures by making the local repair
decision within the particular group of nodes. These groups are
relatively small, e.g. 10-30 nodes. Each group has the origin node
(with the original file for replication) and the recipient nodes. The
number of heart-bit messages in such a group is very small because
only the recipient nodes send heart-bit messages to their origin node,
and there are no heart-bit messages between the recipient nodes. This
structure significantly simplifies the protocol. Proposed failure
mechanism easily handles a single node failure within the group with
minimal performance penalty. The main structure of the
FastReplica algorithm is practically unchanged during the repair
steps.
Next: 4 Performance Evaluation
Up: 3 FastReplica Algorithm
Previous: 3.4 FastReplica in the