
Fault Tolerance v. Fault Resilience

Pfister[1] long ago pointed out that the tendency of marketing departments to redefine HA terms at will makes Humpty Dumpty[*] look like a paragon of linguistic virtue. To avoid confusion, we define:

Fault Tolerance to mean that no user of the service exported from the cluster observes any fault (other than possibly a longer than normal delay) during a switch or failover, and

Fault Resilience to mean that a fault may be observed, but only in uncommitted data (e.g. the database may respond with an error to an attempt to commit a transaction).

This distinction is important because a fault tolerant service can be regarded as suffering no down time even if the machine it is running on crashes, whereas the potential data fault in a fault resilient service counts toward down time.
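
The accounting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Failover`, `classify` and `downtime_s` are hypothetical, and charging the full failover duration as down time for a fault resilient event is our assumption, since the text says only that such a fault counts toward down time, not how much.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceLevel(Enum):
    FAULT_TOLERANT = "fault tolerant"
    FAULT_RESILIENT = "fault resilient"


@dataclass
class Failover:
    """One switch or failover, as seen by a user of the service (hypothetical record)."""
    duration_s: float      # how long the switch took, in seconds
    user_saw_error: bool   # e.g. a transaction commit was rejected during the switch


def classify(event: Failover) -> ServiceLevel:
    # If the user noticed nothing worse than extra delay, the event is fault
    # tolerant; if an error surfaced (on uncommitted data), it is fault resilient.
    if event.user_saw_error:
        return ServiceLevel.FAULT_RESILIENT
    return ServiceLevel.FAULT_TOLERANT


def downtime_s(events) -> float:
    # Assumption: a fault resilient failover is charged its whole duration as
    # down time; a fault tolerant failover is charged nothing.
    return sum(
        e.duration_s
        for e in events
        if classify(e) is ServiceLevel.FAULT_RESILIENT
    )
```

For example, given one 30 second failover that users saw only as a delay and one 45 second failover during which a commit failed, only the second would be counted as down time under this sketch.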



James Bottomley 2004-05-12