The Internet today is highly vulnerable to Internet catastrophes: events in which an exceptionally successful Internet pathogen, such as a worm or email virus, causes data loss on a significant percentage of the computers connected to the Internet. Incidents of successful wide-scale pathogens are becoming increasingly common, as exemplified by the Code Red and related worms [6] and by LoveBug and other recent email viruses [11]. Given the ease with which such pathogens could be augmented to erase data on the hosts they infect, it is only a matter of time before an Internet catastrophe results in large-scale data loss.
In this paper, we explore the feasibility of using data redundancy, a model of dependent host vulnerabilities, and distributed storage to tolerate such events. In particular, we motivate the design of a cooperative, distributed remote backup system called the Phoenix recovery system. The usage model of Phoenix is straightforward: a user specifies how much local disk space the system can use, and in return the system protects a proportional amount of that user's data using storage provided by other hosts.
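To make this usage model concrete, the following is a minimal sketch of the client interface it implies. The names (PhoenixClient, set_budget, backup, restore) are hypothetical, chosen only for illustration, and the stub records requests locally rather than actually shipping replicas to remote peers.

```python
# Hypothetical sketch of the Phoenix usage model; names and behavior are
# illustrative assumptions, not the actual Phoenix interface.

class PhoenixClient:
    def __init__(self):
        self.budget_bytes = 0   # disk space the user donates to the system
        self.protected = {}     # local record of data handed to the system

    def set_budget(self, budget_bytes: int) -> None:
        """Donate this much local disk space for storing other hosts' replicas."""
        self.budget_bytes = budget_bytes

    def backup(self, name: str, data: bytes) -> None:
        """Ask the system to protect this data on other, diversely configured hosts."""
        # A real client would place replicas on remote peers chosen for
        # diversity; this stub only records the request locally.
        self.protected[name] = data

    def restore(self, name: str) -> bytes:
        """After a catastrophe, recover the data from surviving replicas."""
        return self.protected[name]


client = PhoenixClient()
client.set_budget(10 * 2**30)                   # donate 10 GiB of local disk
client.backup("projects.tar", b"important data")
print(client.restore("projects.tar"))           # b'important data'
```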
In general, to recover the lost data of a host that was a victim in an Internet catastrophe, there must be copies of that data stored on a host or set of hosts that survived the catastrophe. A typical replication approach [10] creates k additional replicas if up to k copies of the data can be lost in a failure. In our case, k would need to be as large as the largest Internet catastrophe. As an example, the Code Red worm infected over 359,000 computers, and so k would need to be larger than 359,000 for hosts to survive a similar kind of event. Using such a large degree of replication would make cooperative remote backup useless for at least two reasons. First, the amount of data each user can protect is inversely proportional to the degree of replication, and with such a vast degree of replication the system could only protect a minuscule amount of data per user. Second, ensuring that such a large number of replicas are written would take an impractical amount of time.
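The following back-of-the-envelope calculation illustrates the first point; the 10 GB donated budget is an assumed figure chosen only for illustration.

```python
# Illustrative arithmetic only; the 10 GB budget is an assumption, not a
# figure from the system's design.
budget_bytes = 10 * 10**9   # disk space one user donates to the system
k = 359_000                 # replicas needed to ride out a Code Red-sized catastrophe

# Every protected byte must be stored k times somewhere in the system, so in a
# balanced cooperative system each user can protect roughly budget / k bytes.
protectable = budget_bytes / k
print(f"~{protectable / 1e3:.0f} KB protectable per user")   # ~28 KB
```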
Our key observation that makes Phoenix both feasible and practical is that an Internet catastrophe, like any large-scale Internet attack, exploits shared vulnerabilities. Hence, users should replicate their data on hosts that do not share the same vulnerabilities. That is, the replication mechanism should take the dependencies of host failures, determined in this case by host diversity, into account [5]. To do so, we characterize each host by a set of attributes, such as its operating system, web browser, mail client, and web server. The system can then use the attributes of all hosts in the system to determine how many replicas are needed to ensure recoverability, and on which hosts those replicas should be placed, to survive an Internet catastrophe that exploits any one of a host's attributes. For example, for a host that runs a Microsoft web server, the system will avoid placing its replicas on other hosts that run the same server software, so that the replicas survive Internet worms that exploit bugs in that server. Such a system could naturally be extended to tolerate simultaneous catastrophes that use multiple exploits, although at the cost of reducing the amount of data that can be protected. Using a simulation model, we show that, by performing informed placement of replicas, a Phoenix recovery system can provide highly resilient and available cooperative backup with low overhead.
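As a rough illustration of this placement idea, the sketch below greedily chooses replica holders whose attribute sets are disjoint from the source host's. The attribute names and the greedy strategy are illustrative assumptions, not the actual Phoenix placement algorithm.

```python
# Minimal sketch of attribute-aware replica placement, assuming each host is
# described by a set of configuration attributes (OS, web server, mail client,
# ...). The greedy rule below picks replica holders that share no attribute
# with the source host, so an exploit of any one of the source's attributes
# cannot take out both the source and its replicas.

def place_replicas(source: str, hosts: dict[str, set[str]], k: int) -> list[str]:
    """Choose up to k hosts whose attribute sets are disjoint from the source's."""
    source_attrs = hosts[source]
    replicas = []
    for host, attrs in hosts.items():
        if host == source:
            continue
        if attrs.isdisjoint(source_attrs):   # no shared vulnerability class
            replicas.append(host)
        if len(replicas) == k:
            break
    return replicas


hosts = {
    "alice": {"windows", "iis", "outlook"},
    "bob":   {"windows", "iis", "thunderbird"},
    "carol": {"linux", "apache", "mutt"},
    "dave":  {"macos", "apache", "mail.app"},
}

# A replica of alice's data should not live on bob, who runs the same web server.
print(place_replicas("alice", hosts, k=2))   # ['carol', 'dave']
```

Requiring fully disjoint attribute sets is a deliberately conservative choice for the sketch; a system could instead avoid sharing only the attribute a given class of exploit targets, trading some resilience for a larger pool of eligible hosts.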
In the rest of this paper, Section 2 discusses various approaches for tolerating Internet catastrophes and motivates the use of a cooperative, distributed recovery system like Phoenix for surviving them. Section 3 then describes our model of dependent failures and how we apply it to tolerate catastrophes. In Section 4, we explore the design space in terms of the amount of storage available in the system and the redundancy required to survive Internet catastrophes under various degrees of host diversity and shared vulnerabilities. We then discuss system design issues in Section 5. Finally, Section 6 concludes the paper.