Fault Tolerance
As the control token represents a single point of failure,
RETHER incorporates a built-in fault tolerance mechanism to ensure continued
network operation despite token loss due to machine failures,
or token corruption due to random bit errors. Each RETHER node
is required to monitor the health of
its successor in the token passing schedule.
When a node N sends the token to its successor S,
it starts an acknowledgment
timer waiting to hear from S. If S is alive, it
sends an acknowledgment back to N when it sends the token forward to
its own successor. If the successor node
has failed for some reason, the timer at the
monitoring node expires and the monitoring node
pings the successor to confirm that it is indeed dead.
This extra ping is necessary to
distinguish a dead successor from one that is still alive but
dropped the token, for example because of bit errors.
On detecting a failure, the monitoring node
broadcasts a message announcing the
failure and regenerates the token. The choice of the timeout value and
other failure scenarios are discussed in greater detail in [VENKthesis].
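The monitoring logic described above can be sketched as a small decision procedure. This is only an illustrative model, not RETHER's actual implementation: the class name `Monitor`, the method names, and the string return values are hypothetical, and timers and pings are reduced to boolean inputs. The behavior when the ping is answered (regenerate the token but keep the successor in the ring) is an assumption, since the text only states what happens on a confirmed failure.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Node N's view while it waits for its successor S (hypothetical model)."""
    node: str
    live_nodes: list                      # token-passing order
    log: list = field(default_factory=list)

    def successor(self):
        i = self.live_nodes.index(self.node)
        return self.live_nodes[(i + 1) % len(self.live_nodes)]

    def after_sending_token(self, ack_received, ping_answered):
        """Decide what N does after handing the token to S.

        ack_received:  S's acknowledgment arrived before the timer expired.
        ping_answered: S replied to the follow-up ping (used only on timeout).
        """
        s = self.successor()
        if ack_received:
            self.log.append(f"ack from {s}")
            return "ok"
        self.log.append(f"timeout waiting for {s}; ping {s}")
        if ping_answered:
            # Assumption: S is alive, so only the token was lost.
            self.log.append("regenerate token")
            return "token_lost"
        # S did not answer: announce the failure and drop S from the schedule.
        self.log.append(f"broadcast: {s} failed")
        self.live_nodes.remove(s)
        self.log.append("regenerate token")
        return "node_failed"
```

For example, `Monitor("A", ["A", "B", "C"]).after_sending_token(False, False)` models the case where B never acknowledges and never answers the ping, after which A's successor becomes C.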
Although token recovery can in principle complete within
the same token cycle, in practice it takes slightly longer than
one token cycle.
The extra delay depends on the precision with which the
acknowledgment timer value can be set in the operating system (see footnote 4).
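To illustrate why timer precision matters, the sketch below assumes an operating system that can only fire timers on multiples of a fixed tick and rounds a requested timeout up to the next tick. Both the rounding rule and the function name `effective_timeout` are assumptions made for illustration; the text does not specify how the timer is implemented.

```python
def effective_timeout(requested_ms, tick_ms):
    """Round a requested timeout up to the next multiple of the timer tick.

    Hypothetical model: both arguments are positive integers in milliseconds.
    """
    ticks = -(-requested_ms // tick_ms)   # integer ceiling division
    return ticks * tick_ms


def overshoot(requested_ms, tick_ms):
    """Extra waiting time caused purely by timer granularity."""
    return effective_timeout(requested_ms, tick_ms) - requested_ms
```

Under this model, a coarser tick means the monitoring node may wait noticeably longer than the intended timeout before it starts recovery.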
RETHER also handles many
other failure scenarios, such as multiple node failures; these are described in [VENKthesis].
When a new node boots up, it broadcasts a message identifying itself.
The node currently holding the token adds the
new node to the list of live nodes
maintained in the token. As a result, the token will visit the new node in the
next cycle.
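The join procedure can be sketched as follows. The `Token` class and `handle_join` function are hypothetical names, and appending the new node at the end of the list is an assumption; the text only says that the node is added to the list of live nodes carried in the token.

```python
class Token:
    """Control token carrying the list of live nodes (hypothetical model)."""

    def __init__(self, live_nodes):
        self.live_nodes = list(live_nodes)

    def visit_order(self):
        """Nodes the token will visit in one full cycle."""
        return list(self.live_nodes)


def handle_join(token, new_node):
    """Called by the current token holder when it hears a join broadcast."""
    if new_node not in token.live_nodes:
        # Assumption: new nodes are placed at the end of the schedule.
        token.live_nodes.append(new_node)
    return token
```

After `handle_join(token, "C")` on a token whose list is `["A", "B"]`, the next cycle's visit order includes C, and a repeated join broadcast from C leaves the list unchanged.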
Tzi-cker Chiueh
1999-03-18