We now examine the bandwidth requirements for recovering from an Internet catastrophe. In a catastrophe, many hosts will lose their data. When the failed hosts come online again, they will want to recover their data from the remaining hosts that survived the catastrophe. With a large fraction of the hosts recovering simultaneously, a key question is what bandwidth demands the recovering hosts will place on the system.
The aggregate bandwidth required to recover from a catastrophe is a
function of the amount of data stored by the failed hosts, the time
window for recovery, and the fraction of hosts that fail. Consider a
system of 10,000 hosts that have software configurations analogous to
those presented in Section 4, where a fraction of the hosts
run Windows and the remainder run some other operating system. Next
consider a catastrophe similar to the one above in which all Windows
hosts, independent of version, lose the data they store.
Table 6 shows the bandwidth required to recover the Windows
hosts for various storage capacities and recovery periods. The first
column shows the average amount of data a host stores in the system.
The remaining columns show the bandwidth required to recover that data
for different periods.
The first four rows show the aggregate system bandwidth required to recover the failed hosts: the total amount of data to recover divided by the recovery time. This bandwidth reflects the load on the Internet during recovery. Even for relatively large backup sizes and short recovery periods, this load is small. Note that these results are for a system with 10,000 hosts and that, for an equivalent catastrophe, the aggregate bandwidth requirements will scale linearly with the number of hosts in the system and the amount of data backed up.
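The aggregate figure is simple arithmetic, and a minimal sketch makes the linear scaling explicit. The function name and parameters below are illustrative, decimal units are assumed (1 GB = 10^9 bytes), and the failed fraction of 0.5 is a placeholder rather than the fraction used for the table.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def aggregate_bandwidth_bps(total_hosts, failed_fraction, bytes_per_host, recovery_seconds):
    """Aggregate recovery bandwidth in bits/s: total data to recover / recovery time."""
    failed_hosts = total_hosts * failed_fraction
    return failed_hosts * bytes_per_host * 8 / recovery_seconds

# HYPOTHETICAL inputs: failed_fraction=0.5 is a placeholder, not the paper's figure.
bw = aggregate_bandwidth_bps(10_000, 0.5, 1e9, SECONDS_PER_DAY)
print(f"{bw / 1e6:.0f} Mb/s")  # 463 Mb/s

# Doubling the number of hosts (or the data per host) doubles the aggregate load.
bw_double = aggregate_bandwidth_bps(20_000, 0.5, 1e9, SECONDS_PER_DAY)
```

Halving the recovery period has the same effect as doubling the data, since the window appears only in the denominator.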
The second four rows show the average per-host bandwidth required by
the hosts in the system responding to recovery requests. Recall that
the system imposes a load limit that caps the number of replicas
any host will store. As a result, a host will have to help recover at
most as many other hosts as the load limit allows.
Note that, because of the load limit, per-host
bandwidth requirements for hosts involved in recovery are independent
of both the number of hosts in the system and the number of hosts that
fail.
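This bound can be written compactly. The symbols are shorthand introduced here for illustration, not notation taken from the table: L is the load limit, D is the average amount of data a host stores in the system (in bytes), and T is the recovery period (in seconds).

```latex
B_{\text{host}} \;\le\; \frac{8 \, L \, D}{T} \quad \text{bits per second}
```

Neither the total number of hosts nor the number of failed hosts appears on the right-hand side, which is the independence property noted above.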
The results in the table show the per-host bandwidth requirements with
a load limit of three, so that each host responds to at most three recovery
requests. The results indicate that Phoenix can recover from a severe
catastrophe in reasonable time periods for useful backup sizes. As
with other cooperative backup systems like Pastiche [8],
per-host recovery time will depend significantly on the connectivity
of hosts in the system. For example, hosts connected by modems can
serve as recovery hosts for a modest amount of backed-up data (28 Kb/s
for 100 MB of data recovered in a day). Such backup amounts would
only be useful for recovering particularly critical data, or
recovering frequent incremental backups stored in Phoenix relative to
infrequent full backups using other methods (e.g., for users who take
monthly full backups on media but use Phoenix for storing and
recovering daily incrementals). Broadband hosts can recover failed
hosts storing orders of magnitude more data (1-10 GB) in a day, and
high-bandwidth hosts can recover either an order of magnitude more
quickly (hours) or even an order of magnitude more data (100 GB).
Further, Phoenix could potentially exploit the parallelism of
recovering from all surviving hosts in a core to further reduce
recovery time.
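The per-host figures can be checked with a short calculation. This is a minimal sketch, not code from Phoenix: the function names are illustrative, decimal units are assumed (1 MB = 10^6 bytes), and the 1 Mb/s broadband uplink is an assumed value chosen only to illustrate the 1-10 GB range.

```python
SECONDS_PER_DAY = 24 * 60 * 60
LOAD_LIMIT = 3  # each host responds to at most three recovery requests

def per_host_bandwidth_bps(data_bytes, load_limit, recovery_seconds):
    """Upstream bits/s a surviving host needs to serve `load_limit` recoveries."""
    return load_limit * data_bytes * 8 / recovery_seconds

def recoverable_bytes(uplink_bps, load_limit, recovery_seconds):
    """Largest per-host backup a surviving host can serve within the window."""
    return uplink_bps * recovery_seconds / (8 * load_limit)

# Modem example from the text: 100 MB per host, recovered in one day.
modem_bps = per_host_bandwidth_bps(100e6, LOAD_LIMIT, SECONDS_PER_DAY)
print(f"{modem_bps / 1e3:.1f} kb/s")  # 27.8 kb/s, i.e. roughly 28 Kb/s

# ASSUMED broadband uplink of 1 Mb/s (illustrative value, not from the paper).
broadband_bytes = recoverable_bytes(1e6, LOAD_LIMIT, SECONDS_PER_DAY)
print(f"{broadband_bytes / 1e9:.1f} GB")  # 3.6 GB
```

The two functions are inverses of each other, so the same relation answers both how much bandwidth a given backup size needs and how much data a given link can serve in the recovery window.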
Although there is no design constraint on the amount of data hosts back up in Phoenix, for current disk usage patterns, disk capacities, and host bandwidth connectivity, we envision users typically storing 1-10 GB in Phoenix and waiting a day to recover their data. According to a recent study, desktops with substantial disks (> 40 GB) use less than 10% of their local disk capacity, and operating system and temporary user files consume up to 4 GB [3]. Recovery times on the order of a day are also practical. For example, organizations took longer than a day to recover from previous worm catastrophes, and recovery using an organization's backup service can take a day simply for an administrator to respond to a request.