NSDI '08 – Abstract
Pp. 161–174 of the Proceedings
Awarded Best Paper!
Remus: High Availability via Asynchronous Virtual Machine Replication
Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, and Norm Hutchinson, University of British Columbia; Andrew Warfield, University of British Columbia and Citrix Systems, Inc.
Abstract
Allowing applications to survive hardware failure is an expensive
undertaking, which generally involves re-engineering software to
include complicated recovery logic as well as deploying
special-purpose hardware; this represents a severe barrier to
improving the dependability of large or legacy applications. We
describe the construction of a general and transparent
high availability service that allows existing, unmodified
software to be protected from the failure of the physical machine on
which it runs. Remus provides an extremely high degree of
fault tolerance, to the point that a running system can
transparently continue execution on an alternate physical host in
the face of failure with only seconds of downtime, while completely
preserving host state such as active network connections.
Our approach encapsulates protected software in a
virtual machine, asynchronously propagates changed state to a backup host
at frequencies as high as forty times a second, and uses speculative
execution to concurrently run the active VM slightly ahead of the
replicated system state.
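
The replication scheme described above can be illustrated with a minimal sketch. It is a toy model, not the Remus implementation: the names `Primary`, `Backup`, `run`, and the `lag` parameter are hypothetical, the "VM" is just a dictionary of pages, and the in-flight queue stands in for asynchronous transmission. The only figure taken from the abstract is the checkpoint rate of up to forty per second, which gives an epoch of 25 ms.

```python
from collections import deque

# 40 checkpoints per second (from the abstract) gives a 25 ms epoch.
EPOCH_MS = 1000 // 40


class Primary:
    """Toy stand-in for the protected VM: pages plus a dirty set."""

    def __init__(self):
        self.pages = {}
        self.dirty = set()

    def write(self, page, value):
        self.pages[page] = value
        self.dirty.add(page)

    def checkpoint(self):
        # Capture only the state changed since the previous checkpoint.
        delta = {p: self.pages[p] for p in self.dirty}
        self.dirty.clear()
        return delta


class Backup:
    """Toy stand-in for the backup host holding replicated state."""

    def __init__(self):
        self.pages = {}
        self.epochs_applied = 0

    def apply(self, delta):
        self.pages.update(delta)
        self.epochs_applied += 1


def run(writes_per_epoch, lag=1):
    """Simulate epochs; `lag` checkpoints may be in flight (assumption)."""
    primary, backup = Primary(), Backup()
    in_flight = deque()  # checkpoints sent but not yet applied
    for writes in writes_per_epoch:
        # The primary keeps executing without waiting for the backup.
        for page, value in writes:
            primary.write(page, value)
        in_flight.append(primary.checkpoint())
        # The backup catches up on older checkpoints in the background.
        while len(in_flight) > lag:
            backup.apply(in_flight.popleft())
    return primary, backup, in_flight


primary, backup, in_flight = run(
    [[("a", 1)], [("a", 2), ("b", 3)], [("c", 4)]]
)
```

After the three epochs in the usage example, the primary has run one checkpoint ahead of the backup: the backup holds the state as of the second checkpoint, and the third is still in flight. On failover, execution would resume from the backup's last fully applied checkpoint.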
The Proceedings are published as a collective work, © 2008 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.