Next: 4 Performance Evaluation
Up: 3 FastReplica Algorithm
Previous: 3.4 FastReplica in the
3.5 Reliable FastReplica Algorithm
In this Section, we extend the FastReplica algorithm to be able
to deal with node failures. The basic algorithm presented in
Sections 3.2, 3.4 is sensitive to node failures. For
example, if node fails during either transfer shown in
Figures 1, 2 then this event
may impact all nodes in the group because each node
depends on node to receive subfile . In the described
scenario, node is acting as a recipient node in the replication
set. If a node fails when it acts as the origin node, e.g. node in
Figure 6, this failure impacts all of the
replication groups in the replication subtree rooted in node .
The reliable FastReplica algorithm proposed below efficiently
deals with node failures by making the local repair decision within
the particular group of nodes. It keeps the main structure of the
FastReplica algorithm practically unchanged while adding the desired
property of resilience to node failures.
In reliable FastReplica, the nodes of each group are exchanging
the heartbeat messages with their origin node. The heartbeat messages
from nodes to their origin node are augmented with additional
information on the corresponding algorithm step and group (list) of
nodes to which the nodes currently perform their transfers.
Figure 8:
Heartbeat group: the recipient nodes in
send heartbeat messages to
the origin node .
|
In Figure 8, the nodes
of group form the heartbeat group with
their origin node . Each node sends to
the heartbeat messages with additional information on
node state in the replication process. Similarly, node
belongs to group with the corresponding origin node
. Thus node sends the heartbeat messages and
its node state to .
There are different repair procedures depending on whether a failed
node was acting as a recipient node, e.g. node in
replication set , or a failed node was acting as an origin
node, e.g. for replication set .
- If node fails while acting as a recipient node in
replication set during the distribution step then the
communication pattern is similar to the pattern shown in
Figure 1. In this case, node is aware of
the node failure. Node performs the
following repair step: it uses already opened connections to the
rest of the nodes in group to send the missing file
to each node in the group as shown in Figure 9.
Figure 9:
Repair procedure for node failed during distribution step.
|
In this way, each node in group receives all of the subfiles
of the original file .
Additionally, node acts as a ``substitute'' for the failed
node in the next algorithm step. If node
was supposed to serve as the origin node to group
for the next algorithm iteration, then node
acts as the origin node to group
for this iteration.
- If node fails while acting as a recipient node in
replication set during the collection step then the
communication pattern is similar to the pattern shown in
Figure 2. Using the heartbeat messages, the
failure of node is detected by node
. Node performs the following repair
step: it opens connections to the impacted nodes in group
to send missing file (similar to the repair step shown in
Figure 9). In this way, each node in group
receives all of the subfiles of the original file .
Analogously, node acts as a substitute for the failed
node in the next algorithm step.
- If node fails while acting as the origin node for
replication group during the distribution step then
replication group should be ``reattached'' to a
higher-level origin node. Let be the corresponding origin
node for from the previous iteration step as shown in
Figure 8. From heartbeat messages, node
detects node failure. Node
analyzes what was the node state in the replication
process preceding its failure. Then node acts as a
replacement for : it opens connections to the impacted
nodes in group to send corresponding missing
files. Additionally, updates every node in
about the change of the origin node (for future exchange of heartbeat
messages).
Reliable FastReplica, described above, aims to
minimize the impact of node failures by making the local repair
decision within the particular group of nodes. These groups are
relatively small, e.g. 10-30 nodes. Each group has the origin node
(with the original file for replication) and the recipient nodes. The
number of heart-bit messages in such a group is very small because
only the recipient nodes send heart-bit messages to their origin node,
and there are no heart-bit messages between the recipient nodes. This
structure significantly simplifies the protocol. Proposed failure
mechanism easily handles a single node failure within the group with
minimal performance penalty. The main structure of the
FastReplica algorithm is practically unchanged during the repair
steps.
Next: 4 Performance Evaluation
Up: 3 FastReplica Algorithm
Previous: 3.4 FastReplica in the