Self-Healing

The ability of a system to heal itself without administrator assistance greatly simplifies management. However, accurate fault detection is difficult, and acting on an incorrect diagnosis such as a false positive usually degrades system performance, availability, correctness, or throughput. In SSM, any component can be rebooted without affecting correctness or availability and, to a degree, performance and throughput. Coupled with a generic fault-detection mechanism such as Pinpoint, this gives rise to a self-healing system.

As discussed earlier, a single SSM brick can be restarted without affecting correctness, performance, availability, or throughput. The cost of acting on a false positive in SSM is therefore very low, as long as false positives do not occur too frequently. For transient faults, Pinpoint can detect anomalies in brick performance and restart bricks accordingly.
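The detect-and-restart loop described above can be sketched roughly as follows. This is a minimal illustration, not Pinpoint's actual algorithm: it assumes each brick reports a single latency figure, and it flags a brick when that figure deviates from the peer median by more than a few median absolute deviations. The names `find_anomalous_bricks`, `heal`, `threshold`, and `min_spread_ms` are hypothetical.

```python
from statistics import median


def find_anomalous_bricks(latencies_ms, threshold=3.0, min_spread_ms=1.0):
    """Return ids of bricks whose latency deviates strongly from their peers.

    Assumption: peer comparison via median absolute deviation (MAD) stands in
    for whatever anomaly detection the real monitor performs.
    """
    values = list(latencies_ms.values())
    med = median(values)
    # MAD is a robust spread estimate; the floor avoids division by zero
    # when most bricks report identical latencies.
    mad = max(median([abs(v - med) for v in values]), min_spread_ms)
    return sorted(
        brick for brick, v in latencies_ms.items()
        if abs(v - med) / mad > threshold
    )


def heal(latencies_ms, restart):
    """Restart every brick flagged as anomalous and return their ids.

    `restart` is a caller-supplied callable (hypothetical); because a restart
    is cheap, acting on an occasional false positive is tolerable.
    """
    flagged = find_anomalous_bricks(latencies_ms)
    for brick in flagged:
        restart(brick)
    return flagged
```

For example, with six bricks reporting latencies of 10, 11, 10, 12, 11, and 95 ms, only the brick at 95 ms is flagged and restarted. The threshold trades detection speed against the false-positive rate, which matters less here because a restart costs little.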

The following microbenchmarks demonstrate SSM's ability to recover from transient faults. We attempt to inject realistic faults for each of SSM's hardware components: processor, memory, and network interface. For CPU faults, we assume that the brick will hang or reboot, as is typical for most such faults [18].

To model transient memory errors, we inject bit-flip errors into various locations in the brick's address space. To model a faulty network interface, we use FAUMachine [9], a Linux-based VM that allows fault injection at the network level, to drop a specified percentage of packets. We also model performance faults, where one brick runs more slowly than the others. In all of the following experiments, we use six bricks; Pinpoint actively monitors all of the bricks.
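The two injected fault types can be illustrated with a small in-process simulation. This sketch is not the harness used in the experiments and does not use FAUMachine's interface: it assumes memory is modeled as a `bytearray` and the network as a list of packets, and the names `flip_random_bit` and `drop_packets` are hypothetical.

```python
import random


def flip_random_bit(memory, rng):
    """Flip one randomly chosen bit in `memory` (a bytearray) in place.

    Returns (byte_index, bit_index) so the fault can be logged or undone.
    """
    byte_index = rng.randrange(len(memory))
    bit_index = rng.randrange(8)
    memory[byte_index] ^= 1 << bit_index  # XOR toggles exactly one bit
    return byte_index, bit_index


def drop_packets(packets, drop_fraction, rng):
    """Return the packets that survive a lossy link.

    Each packet is dropped independently with probability `drop_fraction`,
    so the realized loss rate only approximates the requested percentage.
    """
    return [p for p in packets if rng.random() >= drop_fraction]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed keeps a run reproducible
    memory = bytearray(64)
    print("flipped (byte, bit):", flip_random_bit(memory, rng))
    survivors = drop_packets(list(range(1000)), 0.10, rng)
    print("packets delivered:", len(survivors))
```

Seeding the generator makes a fault schedule repeatable, which helps when comparing runs with and without monitoring enabled.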

Benjamin Chan-Bin Ling 2004-03-04