Flavio Junqueira, Ranjita Bhagwan, Alejandro Hevia, Keith Marzullo and Geoffrey M. Voelker
Department of Computer Science and Engineering
University of California, San Diego
In this paper, we propose a new approach, called informed replication,
for designing distributed systems to survive Internet catastrophes,
and demonstrate this approach with the design and evaluation of a
cooperative backup system called the Phoenix Recovery Service.
Informed replication uses a model of correlated failures to exploit
software diversity. The key observation that makes our approach both
feasible and practical is that Internet catastrophes result from
shared vulnerabilities. By replicating a system service on hosts that
do not have the same vulnerabilities, an Internet pathogen that
exploits a vulnerability is unlikely to cause all replicas to fail.
To characterize software diversity in an Internet setting, we measure
the software diversity of host operating systems and network services
in a large organization. We then use insights from our measurement
study to develop and evaluate heuristics for computing replica sets
that have a number of attractive features. Our heuristics provide
excellent reliability guarantees, result in a low degree of replication,
limit the storage burden on each host in the system, and lend
themselves to a fully distributed implementation. Finally, we present the
design and prototype implementation of Phoenix, and evaluate it on the
PlanetLab testbed.
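
To make the idea of informed replication concrete, the following is a minimal sketch of choosing a replica set so that no single software attribute is shared by every replica. It is an illustration under stated assumptions, not the heuristics evaluated in the paper: the host names, the attribute labels, and the functions pick_replicas and is_diverse are all hypothetical, and the greedy rule is only one plausible way to select hosts.

```python
from typing import Dict, FrozenSet, List, Optional

# Hypothetical model: each host is described by the set of software
# attributes (operating system, network services) it runs.
Config = Dict[str, FrozenSet[str]]


def is_diverse(hosts: List[str], config: Config) -> bool:
    """True if no single attribute is present on every host in `hosts`."""
    common = frozenset.intersection(*(config[h] for h in hosts))
    return not common


def pick_replicas(origin: str, config: Config) -> Optional[List[str]]:
    """Greedily add hosts until no attribute is common to all replicas.

    Illustrative heuristic (an assumption, not the paper's algorithm):
    at each step, pick the candidate that lacks the most attributes
    still shared by every host chosen so far. Returns None if the
    available hosts cannot break all shared attributes.
    """
    chosen = [origin]
    common = set(config[origin])  # attributes shared by all chosen hosts
    candidates = sorted(h for h in config if h != origin)
    while common:
        best = max(candidates, key=lambda h: len(common - config[h]), default=None)
        if best is None or not (common - config[best]):
            return None  # no remaining host removes a shared attribute
        chosen.append(best)
        candidates.remove(best)
        common &= config[best]
    return chosen


if __name__ == "__main__":
    # Hypothetical hosts and attributes, for illustration only.
    example: Config = {
        "a": frozenset({"windows", "iis"}),
        "b": frozenset({"windows", "apache"}),
        "c": frozenset({"linux", "apache"}),
        "d": frozenset({"linux", "iis"}),
    }
    print(pick_replicas("a", example))
```

In this sketch, a pathogen that exploits any one attribute cannot reach every host in the returned set, because at least one replica lacks that attribute. Preferring the candidate that breaks the most shared attributes tends to keep the replica set small, which corresponds to the low degree of replication described above.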