Traditionally, reliable distributed systems are designed using the threshold model: out of $n$ components, no more than $t$ are faulty at any time. Although this model can always be applied when the probability of a total failure is negligible, it is only capable of expressing the worst-case failure scenario, and it is best suited when failures are independent and identically distributed. The worst case, however, can be one in which the failures of components are highly correlated.
Failures of hosts in a distributed system can be correlated for several reasons. Hosts may run the same code or be located in the same room, for example. In the former case, a vulnerability in the code can be exploited on every host executing the target software. In the latter case, a power outage can crash all machines plugged into the same electrical circuit.
As a first step towards the design of a cooperative backup system for tolerating catastrophes, we need a concise way of representing failure correlation. We use the core abstraction to represent correlation among host failures [5]. A core is a minimal reliable subset of components: the probability that all hosts in a core fail is negligible, for some definition of negligible, and no proper subset has this property. In a backup system, a core corresponds to the minimal replica set required for resilience.
Determining the cores of a system depends on the failure model used and on the desired degree of resilience. The failure model prescribes the possible types of failures for components, and these types determine how host failures can be correlated. In our case, hosts are the components of interest and software vulnerabilities are the causes of failures; consequently, hosts executing the same piece of software present high failure correlation. This information on failure correlation is not sufficient, however, to determine the cores of a system: the desired degree of resilience also matters. As one increases the degree of resilience, more components may be necessary to fulfill the core property stated above.
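To make the core property concrete, here is a minimal Python sketch of the check it implies. The interface is ours: `all_fail_prob` is a hypothetical stand-in for whatever joint failure probability the chosen failure model provides, and `negligible` is the application-defined threshold.

```python
def is_core(subset, all_fail_prob, negligible=1e-6):
    """Check whether `subset` satisfies the core property.

    `all_fail_prob(S)` is assumed to return the probability, under the
    failure model, that every host in S fails together (a hypothetical
    interface, not one defined in this paper).
    """
    hosts = frozenset(subset)
    # Reliability: simultaneous failure of the whole set is negligible.
    if all_fail_prob(hosts) > negligible:
        return False
    # Minimality: dropping any single host must break reliability.
    # Shrinking a set can only raise its all-fail probability, so
    # checking the one-host-removed subsets suffices.
    return all(all_fail_prob(hosts - {h}) > negligible for h in hosts)
```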
To reason about the correlation of host failures, we associate attributes with hosts. The attributes represent characteristics of a host that can make it prone to failures. For example, the operating system a host runs is a point of attack: an attack that targets Linux is less likely to be effective against hosts running Solaris, and even less likely to be effective against hosts running Windows XP. We could represent this point of attack by a $k$-ary attribute that indicates the operating system, where the value of the attribute is 0 for Linux, 1 for Windows XP, 2 for Solaris, and so on. Throughout this paper, we use $A$ as the set of attributes that characterize a host.
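As a small illustration of this encoding, a host can be represented by the vector of values it takes on each attribute in $A$. The attribute names and values below are our own, chosen to match the operating-system example above:

```python
# Hypothetical encoding of the operating-system attribute.
OS = {"Linux": 0, "WindowsXP": 1, "Solaris": 2}

# A host is characterized by its value on each attribute in A;
# here A = {os, location} is an illustrative choice.
host_a = {"os": OS["Linux"],   "location": 0}
host_b = {"os": OS["Solaris"], "location": 1}
```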
To illustrate the concepts introduced in this section, consider the system described in Example 3.1. In this system, hosts are characterized by three attributes and each attribute has two possible values. We assume that hosts fail due to crashes caused by software vulnerabilities, and at most one vulnerability can be exploited at a time. Note that the cores shown in the example have maximum resilience according to the given set of attributes.
Example 3.1. [Table omitted: the original lists the three two-valued attributes, the attribute values of each host, and the resulting cores of the system.]
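Since the table itself is not reproduced here, the sketch below sets up a hypothetical configuration consistent with the example's description (three attributes, two values each, and one host that is the sole machine running a Unix-flavored operating system) and enumerates its cores under the stated failure model, in which at most one vulnerability, i.e., one (attribute, value) pair, is exploitable at a time. The host names and attribute values are illustrative assumptions, not the original data.

```python
from itertools import combinations

ATTRS = ("os", "server_software", "location")   # illustrative names
hosts = {                                       # hypothetical values
    "h1": (0, 0, 0),   # the only host with os = 0 (the Unix flavor)
    "h2": (1, 1, 1),
    "h3": (1, 0, 1),
    "h4": (1, 1, 0),
}

def unreliable(subset):
    # A single exploit targets one (attribute, value) pair and crashes
    # every host with that value, so a set can be wiped out iff all of
    # its members share the same value on some attribute.
    return any(len({hosts[h][i] for h in subset}) == 1
               for i in range(len(ATTRS)))

def cores():
    # Cores are the minimal reliable sets: enumerate subsets by size,
    # keeping those that are reliable and contain no smaller core.
    found = []
    for size in range(1, len(hosts) + 1):
        for combo in combinations(hosts, size):
            s = frozenset(combo)
            if not unreliable(s) and not any(c < s for c in found):
                found.append(s)
    return found

print(cores())   # the two cores here: {h1, h2} and {h1, h3, h4}
```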
There are a few interesting facts to be observed about Example 3.1. First, two of the hosts form what we call an orthogonal core: a core composed of hosts that have different values for every attribute. Note that in this case the size of the orthogonal core is two because of our assumption that at most one vulnerability can be exploited at a time: it is necessary and sufficient to have two hosts with different values for every attribute. The example also contains a larger core that, even though it is not orthogonal, is still a core since it covers all attributes.
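Continuing the sketch above, orthogonality is a direct check on the attribute values; with two-valued attributes, an orthogonal core necessarily has exactly two hosts.

```python
def is_orthogonal(core):
    # Orthogonal: every attribute takes a different value on every
    # host of the core, so each attribute needs at least |core|
    # distinct values; with binary attributes this forces |core| == 2.
    return all(len({hosts[h][i] for h in core}) == len(core)
               for i in range(len(ATTRS)))
```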
Second, when a host chooses a core to store replicas of its data, there can be more than one candidate, and the candidates can differ in size. Choosing a core larger than necessary leads to unnecessary replication; the optimal choice in terms of storage overhead is therefore the smallest candidate.
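In the sketch's terms, and assuming (consistently with the example's discussion) that a host's replica set is a core containing the host itself, the optimal choice is a brute-force minimum:

```python
def smallest_core_for(owner):
    # Assumption: the owner stores replicas on a core that includes
    # itself.  Brute force over all cores; ties broken arbitrarily.
    candidates = [c for c in cores() if owner in c]
    return min(candidates, key=len, default=None)
```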
Choosing a smallest core may seem a good choice at first because it requires fewer replicas. We observe, however, that such a choice can adversely impact the system. In environments with highly skewed diversity, the total capacity of the system may suffer when hosts always choose the smallest core. Back in Example 3.1, one host is the only one running a flavor of Unix as its operating system; consequently, a core for every other host has to contain it. For a small system such as the one in the example this should not be a problem, but it is a potential problem for large-scale deployments.
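The skew is easy to see in code: counting how often each host appears when every host picks a smallest core (again using our hypothetical configuration) shows the lone Unix-flavored host turning up in every replica set.

```python
from collections import Counter

load = Counter(h for owner in hosts
                 for h in smallest_core_for(owner))
# With skewed diversity one host dominates the counts, so its storage
# fills up long before the rest of the system's capacity is used.
print(load.most_common())   # h1 appears in all four choices
```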
These observations raise the question of how host diversity impacts storage overhead, storage load, and resilience. We address this question in the next section.