OSDI 2000 Abstract
Exploring Failure Transparency and the Limits of Generic Recovery
David E. Lowell, Compaq Computer Corp.; Subhachandra Chandra and Peter M. Chen, University of Michigan
Abstract
We explore the abstraction of failure transparency, in which the operating system provides the illusion of failure-free
operation. To provide failure transparency, an operating system must recover applications after hardware, operating system,
and application failures, and must do so without help from the programmer or unduly slowing failure-free performance. We
describe two invariants that must be upheld to provide failure transparency: one that ensures sufficient application state is saved
to guarantee the user cannot discern failures, and another that ensures sufficient application state is lost to allow recovery from
failures affecting application state. We find that several real applications get failure transparency in the presence of simple stop
failures with overhead of 0-12%. Less encouragingly, we find that applications violate one invariant in the course of upholding
the other for more than 90% of application faults and 3-15% of operating system faults, rendering transparent recovery
impossible for these cases.