If one knew the probability of attack for each vulnerability, then given a target system resilience one could enumerate cores with that target resilience. In our case, it is not clear how one would determine such probabilities. Instead, we define a core for a host to be a minimal set of hosts with the following additional properties: 1) ; 2) for every attribute aA, either there is a host in that differs from in the value of or there is no host in the system that differs from in the value of . Such a subset of hosts is a core for a host if we assume that, in any Internet catastrophe, an attack targets a single attribute value. Although it is not hard to generalize this definition to allow for attacks targeted against multiple attribute values, in the rest of this paper we focus on attacks against a single attribute value.
Smaller cores means less replication, which is desirable for reducing storage overhead. A core will contain between 2 and hosts. If the hosts' attributes are well distributed, then the cores will be small on average: for any host , it is likely that there is a host that has different values of each of the attributes, and so and constitute an orthogonal core. That is, a fair number of orthogonal cores are likely to exist. If there is less diversity, though, then the smallest cores may not be orthogonal for many hosts, thus increasing storage overhead.
A lack of diversity, especially when trying to keep core sizes small, can lead to a more severe problem. Suppose there are hosts and an attribute such that all have the same value for . Moreover, there is only one host that differs in the value of . A core for each host hence contains , meaning that will maintain copies for all of the . Since the amount of disk space donates for storing backup data is fixed, each can only use of this space. In other words, if donates bytes for common storage to the system, then each can back up only bytes. Note that can be as large as the number of hosts, and so can be minuscule. In Example 3.1, host is the only one to have a different value for attribute ``Operating System'', and hence has to store copies for all the other hosts.
Characterizing the diversity of a set of hosts is a challenging task. In particular, considering all possible distributions for attribute configurations is not feasible. Instead, we define a measure that condenses the diversity of a system into a single number. According to our definition, a system with diversity is one in which a share of the servers is characterized by a share of the combinations of attributes. Although this metric is coarse and does not capture all possible scenarios, it is expressive enough to enable one to observe how the behavior of a backup system is affected by skewed diversity. Note that is in the interval . The value corresponds to a uniform distribution, and a value of close to 1 indicates a highly skewed diversity.
We use this metric to study how storage overhead, storage load, and resilience vary with skew in diversity. We define the storage overhead of a host as the size of the core that uses to backup its data. Host maintains copies for other hosts. We define the storage load of to be such a value . Note that storage load may vary among the hosts. Thus, we define the storage load of the system as the maximum value of across all the hosts. In the remainder of this paper, we refer to the storage load of the system as just storage load. Finally, resilience depends on the number of attributes covered in a core, and it decreases as the number of non-covered attributes increases. We then define the resilience of the system for a host as the percentage of attributes covered by the core that uses to backup its data.
The problem of finding a smallest core given a set of hosts and an attribute configuration for each host, however, is NP-hard (reduction from SET-COVER). For this reason, we used a randomized heuristic to find cores. This heuristic finds a core for a host as follows:
This heuristic may not be the best; We have not yet done a thorough study of heuristics for finding cores. The results we present below, however, indicates that it is efficient in terms of storage overhead and resilience.