OSDI '04 Abstract
Pp. 31–44 of the Proceedings
Microreboot: A Technique for Cheap Recovery
George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox, Stanford University
Abstract
A significant fraction of software failures in large-scale
Internet systems are cured by rebooting, even when the
exact failure causes are unknown. However, rebooting
can be expensive, causing nontrivial service disruption or
downtime even when clusters and failover are employed.
In this work we use separation of process recovery from
data recovery to enable microrebooting, a fine-grain technique
for surgically recovering faulty application components,
without disturbing the rest of the application.
We evaluate microrebooting in an Internet auction system
running on an application server. Microreboots recover
most of the same failures as full reboots, but do so an
order of magnitude faster and result in an order of magnitude
savings in lost work. This cheap form of recovery engenders
a new approach to high availability: microreboots
can be employed at the slightest hint of failure, prior to
node failover in multi-node clusters, even when mistakes
in failure detection are likely; failure and recovery can be
masked from end users through transparent call-level retries;
and systems can be rejuvenated by parts, without ever being shut down.
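
The recovery policy described above (microreboot the suspect component first, retry the call transparently, and fall back to a coarser reboot only if that fails) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Component` and `Application` classes, the `microreboot_cures` flag, and the `max_retries` parameter are all hypothetical names introduced here for illustration, and persistent data is assumed to live in a separate store that a reboot does not touch.

```python
from dataclasses import dataclass
from typing import Dict


class ComponentFailure(Exception):
    """Raised by a (hypothetical) component when a call fails."""


@dataclass
class Component:
    name: str
    healthy: bool = True
    microreboot_cures: bool = True  # assumption: models whether a microreboot helps
    microreboots: int = 0

    def call(self, request: str) -> str:
        if not self.healthy:
            raise ComponentFailure(self.name)
        return f"{self.name} handled {request}"

    def microreboot(self) -> None:
        # Restart only this component; other components keep running.
        self.microreboots += 1
        if self.microreboot_cures:
            self.healthy = True


@dataclass
class Application:
    components: Dict[str, Component]
    full_reboots: int = 0

    def full_reboot(self) -> None:
        # Coarse recovery: every component is restarted.
        for component in self.components.values():
            component.healthy = True
        self.full_reboots += 1

    def invoke(self, name: str, request: str, max_retries: int = 1) -> str:
        """Call a component, masking failures with microreboot plus retry."""
        component = self.components[name]
        for attempt in range(max_retries + 1):
            try:
                return component.call(request)
            except ComponentFailure:
                if attempt < max_retries:
                    component.microreboot()  # cheap recovery first
        # Microreboots did not cure the failure: escalate to a full reboot.
        self.full_reboot()
        return component.call(request)


app = Application({"bids": Component("bids"), "search": Component("search")})
app.components["bids"].healthy = False
print(app.invoke("bids", "request-1"))
print(app.components["bids"].microreboots, app.full_reboots)
```

In this sketch the caller of `invoke` never sees the failure when the microreboot succeeds, and the `search` component is left undisturbed; only a failure that survives the microreboot triggers the expensive full reboot.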
The Proceedings are published as a collective work, © 2004 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.