Fault Tolerance

Because the control token is a single point of failure, RETHER incorporates a built-in fault tolerance mechanism that keeps the network operating despite token loss due to machine failures or token corruption due to random bit errors. Each RETHER node is required to monitor the health of its successor in the token passing schedule. When a node N sends the token to its successor S, it starts an acknowledgment timer and waits to hear from S. If S is alive, it sends an acknowledgment back to N when it forwards the token to its own successor. If no acknowledgment arrives, the timer at N expires and N pings S to confirm whether S has indeed failed. This extra ping is necessary to distinguish a dead successor from one that is still alive but has dropped the token, for example because of bit errors. On detecting a failure, the monitoring node broadcasts a message announcing the failure and regenerates the token. The choice of the timeout value and other failure scenarios are discussed in greater detail in [VENKthesis].

Although token recovery can in principle complete within the same token cycle, in practice it takes slightly longer than one token cycle. The exact recovery time depends on the precision with which the acknowledgment timer can be set in the operating system⁴. RETHER also handles many other failure scenarios, such as multiple node failures; these are described in [VENKthesis].

When a new node boots up, it broadcasts a message identifying itself. The node currently holding the token adds the new node to the list of live nodes maintained in the token. As a result, the token visits the new node in the next cycle.
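
The monitoring logic described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not RETHER's actual implementation: the names `Token`, `Node`, `pass_token`, and `join` are hypothetical, the acknowledgment timer and ping are modeled as simple state checks rather than real timers or packets, the monitoring node is assumed to hold the regenerated token and resend it, and a newly booted node is assumed to be appended to the end of the live-node list.

```python
from dataclasses import dataclass


@dataclass
class Token:
    live_nodes: list        # visiting order carried inside the token
    generation: int = 0     # hypothetical counter, bumped on each regeneration


@dataclass
class Node:
    name: str
    alive: bool = True
    drops_next_token: bool = False  # models a token lost to bit errors


def pass_token(token, sender, nodes, log):
    """Sender hands the token to its successor and monitors it.

    Returns (holder_name, token) after the hand-off attempt.
    """
    ring = token.live_nodes
    succ_name = ring[(ring.index(sender) + 1) % len(ring)]
    succ = nodes[succ_name]

    if succ.alive and not succ.drops_next_token:
        # Successor forwards the token and acknowledges before the timer expires.
        log.append(f"{sender}: ack from {succ_name}")
        return succ_name, token

    # No acknowledgment: the timer expires and the sender pings the successor.
    log.append(f"{sender}: ack timeout for {succ_name}")
    if succ.alive:
        # Ping answered: the successor is alive but the token was lost.
        succ.drops_next_token = False
        log.append(f"{sender}: {succ_name} answered ping, token was lost")
        new_ring = list(ring)
    else:
        # Ping unanswered: announce the failure and drop the dead node.
        log.append(f"{sender}: {succ_name} is dead, broadcasting failure")
        new_ring = [n for n in ring if n != succ_name]

    # Assumption: the monitoring node regenerates the token and keeps it.
    return sender, Token(new_ring, token.generation + 1)


def join(token, new_name, nodes):
    """Token holder adds a newly booted node to the live list in the token."""
    nodes[new_name] = Node(new_name)
    if new_name not in token.live_nodes:
        token.live_nodes.append(new_name)


def demo():
    nodes = {n: Node(n) for n in "ABC"}
    token = Token(["A", "B", "C"])
    log = []
    nodes["B"].alive = False                            # B crashes
    holder, token = pass_token(token, "A", nodes, log)  # A detects the failure
    holder, token = pass_token(token, holder, nodes, log)  # A passes to C
    join(token, "D", nodes)                             # D boots and is added
    return holder, token, log
```

In this sketch, calling `demo()` leaves node C holding a regenerated token whose live list omits the failed node B and includes the new node D, so D is visited in the following cycle.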

Tzi-cker Chiueh
1999-03-18