Traditionally, reliable distributed systems are designed using the threshold model: out of all the components in the system, no more than a fixed threshold are faulty at any time. Although this model can always be applied when the probability of a total failure is negligible, it only expresses the worst-case failure scenario, and it is best suited to settings where failures are independent and identically distributed. The worst case, however, can be one in which the failures of components are highly correlated.
Failures of hosts in a distributed system can be correlated for several reasons. Hosts may run the same code or be located in the same room, for example. In the former case, if there is a vulnerability in the code, then it can be exploited in all the hosts executing the target software. In the latter case, a power outage can crash all machines plugged into the same electrical circuit.
As a first step towards the design of a cooperative backup system for tolerating catastrophes, we need a concise way of representing failure correlation. We use the core abstraction to represent correlation among host failures [5]. A core is a minimal subset of components that is reliable: the probability that all the hosts in a core fail is negligible, for some definition of negligible. In a backup system, a core corresponds to the minimal replica set required for resilience.
Determining the cores of a system depends on the failure model used and the desired degree of resilience for the system. The failure model prescribes the possible types of failures for components, and these types of failures determine how host failures can be correlated. In our case, hosts are the components of interest and software vulnerabilities are the causes of failures. Consequently, hosts executing the same piece of software exhibit high failure correlation. This information on failure correlation is not sufficient, however, to determine the cores of a system: the cores also depend on the desired degree of resilience. As one increases the degree of resilience, more components may be necessary to fulfill the core property stated above.
To reason about the correlation of host failures, we associate attributes with hosts. The attributes represent characteristics of a host that can make it prone to failures. For example, the operating system a host runs is a point of attack: an attack that targets Linux is less likely to be effective against hosts running Solaris, and is even less effective against hosts running Windows XP. We could represent this point of attack with a multi-valued attribute that indicates the operating system, where the value of the attribute is 0 for Linux, 1 for Windows XP, 2 for Solaris, and so on. Throughout this paper, we characterize each host by such a set of attributes.
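As a minimal illustration of this representation, the sketch below models each host as a mapping from attributes to values and lists the points of attack two hosts have in common. The attribute names, software values, and host configurations are hypothetical and chosen only for illustration.

    # A minimal sketch of the attribute-based host representation.
    # The attributes, values, and hosts are hypothetical examples.
    ATTRIBUTES = ("operating_system", "web_server", "mail_server")

    hosts = {
        "h1": {"operating_system": "linux", "web_server": "apache", "mail_server": "sendmail"},
        "h2": {"operating_system": "windows_xp", "web_server": "iis", "mail_server": "exchange"},
        "h3": {"operating_system": "solaris", "web_server": "apache", "mail_server": "exchange"},
    }

    def shared_points_of_attack(a, b):
        # Attributes on which hosts a and b run the same software, i.e.
        # single vulnerabilities that can make both hosts fail together.
        return [attr for attr in ATTRIBUTES if hosts[a][attr] == hosts[b][attr]]

    print(shared_points_of_attack("h1", "h3"))  # ['web_server']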
To illustrate the concepts introduced in this section, consider the system described in Example 3.1. In this system, hosts are characterized by three attributes and each attribute has two possible values. We assume that hosts fail due to crashes caused by software vulnerabilities, and at most one vulnerability can be exploited at a time. Note that the cores shown in the example have maximum resilience according to the given set of attributes.
Example 3.1.
Attributes: the three attributes of the system, each with two possible values.
Hosts: the hosts of the system, each characterized by its value for every attribute.
Cores: the cores of the system, i.e., the minimal sets of hosts that together cover all attributes.
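To make the computation of cores concrete, the following sketch enumerates the cores of a small hypothetical system with the same shape as Example 3.1: three attributes with two values each, and hosts that fail only through software vulnerabilities, at most one of which is exploited at a time. The host names and software values are assumptions made for illustration, not the values used in the example.

    from itertools import combinations

    # Hypothetical hosts, each described by (operating system, web server,
    # mail server); h1 is the only host running a Unix flavor.
    ATTRIBUTES = ("operating_system", "web_server", "mail_server")
    hosts = {
        "h1": ("unix",    "apache", "sendmail"),
        "h2": ("windows", "iis",    "exchange"),
        "h3": ("windows", "apache", "exchange"),
        "h4": ("windows", "iis",    "sendmail"),
    }

    def survives_single_exploit(members):
        # The set survives any single exploited vulnerability if, for every
        # attribute, its members do not all run the same piece of software.
        return all(
            len({hosts[m][i] for m in members}) >= 2
            for i in range(len(ATTRIBUTES))
        )

    def cores():
        # A core is a minimal set with the survival property above.
        found = []
        names = sorted(hosts)
        for size in range(2, len(names) + 1):
            for candidate in combinations(names, size):
                contains_smaller = any(set(c) < set(candidate) for c in found)
                if not contains_smaller and survives_single_exploit(candidate):
                    found.append(candidate)
        return found

    print(cores())  # [('h1', 'h2'), ('h1', 'h3', 'h4')]

Under these assumptions the two-host core is orthogonal in the sense defined next, while the three-host core covers all attributes without being orthogonal.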
There are a few interesting facts to be observed about Example 3.1. First, two of the hosts form what we call an orthogonal core: a core composed of hosts that have different values for every attribute. Note that in this case the size of the orthogonal core is two because of our assumption that at most one vulnerability can be exploited at a time; it is therefore necessary and sufficient to have two hosts with different values for every attribute. A larger set of hosts can also be a core without being orthogonal, as long as it covers all attributes. Second, when a host chooses a core to store replicas of its data, there can be more than one possibility, and the candidate cores can have different sizes. Choosing a core larger than necessary leads to unnecessary replication, so the optimal choice in terms of storage overhead is the smallest candidate core.
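The orthogonality property itself is easy to check mechanically. The sketch below, which assumes hosts are given as tuples of attribute values as in the previous sketch, tests whether a candidate set of hosts is orthogonal; the configurations passed in are again hypothetical.

    from itertools import combinations

    def is_orthogonal(members):
        # Orthogonal core: every pair of members differs in every attribute.
        return all(
            all(x != y for x, y in zip(a, b))
            for a, b in combinations(members, 2)
        )

    print(is_orthogonal([("unix", "apache", "sendmail"),
                         ("windows", "iis", "exchange")]))    # True
    print(is_orthogonal([("unix", "apache", "sendmail"),
                         ("windows", "apache", "exchange")])) # False: same web server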
Choosing a smallest core may seem a good choice at first because it requires fewer replicas. We observe, however, that such a choice can adversely impact the system. In environments with highly skewed diversity, the total capacity of the system may suffer if hosts always choose the smallest core1. Back in Example 3.1, only one host runs some flavor of Unix as its operating system; consequently, a core for every other host has to contain that host. For a system as small as the one in the example this should not be a problem, but it is a potential problem for large-scale deployments. This raises the question of how host diversity affects storage overhead, storage load, and resilience. We address this question in the next section.
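The effect of skewed diversity on storage load can be sketched with the same kind of hypothetical population: if every host greedily picks the smallest set of other hosts that, together with itself, forms a core for its data, hosts with rare configurations are selected by everyone. The hosts, software values, and greedy selection rule below are illustrative assumptions, not the policy of any particular system.

    from itertools import combinations
    from collections import Counter

    ATTRIBUTES = ("operating_system", "web_server", "mail_server")
    hosts = {
        "h1": ("unix",    "apache", "sendmail"),   # the only Unix host
        "h2": ("windows", "iis",    "exchange"),
        "h3": ("windows", "apache", "exchange"),
        "h4": ("windows", "iis",    "sendmail"),
    }

    def covers(owner, members):
        # `members` plus the owner form a core for the owner's data if, for
        # every attribute, some member runs different software than the
        # owner, so no single exploit can destroy all copies of the data.
        return all(
            any(hosts[m][i] != hosts[owner][i] for m in members)
            for i in range(len(ATTRIBUTES))
        )

    def smallest_replica_set(owner):
        # Greedy choice: the smallest set of other hosts covering the owner.
        others = [h for h in sorted(hosts) if h != owner]
        for size in range(1, len(others) + 1):
            for candidate in combinations(others, size):
                if covers(owner, candidate):
                    return candidate
        return tuple(others)

    load = Counter()
    for owner in hosts:
        load.update(smallest_replica_set(owner))
    print(load)
    # h1 is unavoidable: no Windows host can cover the operating-system
    # attribute for another Windows host, so every other host's replica
    # set contains h1.

In a larger population with the same skew, the load on the rare host grows with the number of hosts that must include it, which is the scalability concern raised above.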