Check out the new USENIX Web site. next up previous
Next: 4 Performance Evaluation Up: 3 FastReplica Algorithm Previous: 3.4 FastReplica in the


3.5 Reliable FastReplica Algorithm

In this Section, we extend the FastReplica algorithm to be able to deal with node failures. The basic algorithm presented in Sections 3.23.4 is sensitive to node failures. For example, if node $N_1$ fails during either transfer shown in Figures 12 then this event may impact all nodes $N_2, ... , N_n$ in the group because each node depends on node $N_1$ to receive subfile $F_1$. In the described scenario, node $N_1$ is acting as a recipient node in the replication set. If a node fails when it acts as the origin node, e.g. node $N_1^1$ in Figure 6, this failure impacts all of the replication groups in the replication subtree rooted in node $N_1^1$. The reliable FastReplica algorithm proposed below efficiently deals with node failures by making the local repair decision within the particular group of nodes. It keeps the main structure of the FastReplica algorithm practically unchanged while adding the desired property of resilience to node failures. In reliable FastReplica, the nodes of each group are exchanging the heartbeat messages with their origin node. The heartbeat messages from nodes to their origin node are augmented with additional information on the corresponding algorithm step and group (list) of nodes to which the nodes currently perform their transfers.
Figure 8: Heartbeat group: the recipient nodes in $G^{\prime } = \{N^{\prime }_1, ..., N^{\prime }_k\}$ send heartbeat messages to the origin node $N^{\prime }_0$.
\begin{figure}
\centering
\def 1 ...
In Figure 8, the nodes $N^{\prime}_1, ...,
N^{\prime}_k$ of group $G^{\prime }$ form the heartbeat group with their origin node $N^{\prime }_0$. Each node $N^{\prime }_i$ sends to $N^{\prime }_0$ the heartbeat messages with additional information on node state in the replication process. Similarly, node $N^{\prime }_0$ belongs to group $G$ with the corresponding origin node $\hat{N}_0$. Thus node $N^{\prime }_0$ sends the heartbeat messages and its node state to $\hat{N}_0$. There are different repair procedures depending on whether a failed node was acting as a recipient node, e.g. node $N^{\prime }_i$ in replication set $G^{\prime }$, or a failed node was acting as an origin node, e.g. $N^{\prime }_0$ for replication set $G^{\prime }$. Reliable FastReplica, described above, aims to minimize the impact of node failures by making the local repair decision within the particular group of nodes. These groups are relatively small, e.g. 10-30 nodes. Each group has the origin node (with the original file for replication) and the recipient nodes. The number of heart-bit messages in such a group is very small because only the recipient nodes send heart-bit messages to their origin node, and there are no heart-bit messages between the recipient nodes. This structure significantly simplifies the protocol. Proposed failure mechanism easily handles a single node failure within the group with minimal performance penalty. The main structure of the FastReplica algorithm is practically unchanged during the repair steps.
next up previous
Next: 4 Performance Evaluation Up: 3 FastReplica Algorithm Previous: 3.4 FastReplica in the