Before presenting experimental results, we illustrate the relationship between the MTTF of an individual brick and the availability of data for SSM as a whole. We assume independent failures; correlated failures in Internet server clusters are often the result of a larger catastrophic failure that session state would not be expected to survive [20]. We describe a natural extension to SSM to survive site failures in section 8.
Let brick failure be modeled by a Poisson process with rate $\mu$ (i.e., the brick's MTTF is $1/\mu$), and let writes for a particular user's data be modeled by a Poisson process with rate $\lambda$. (In practice, $1/\lambda$ is the session expiration time, usually on the order of minutes or tens of minutes.) Then $\rho = \lambda/\mu$ is intuitively the ratio of the write rate to the failure rate, or equivalently, the ratio of the MTTF of a brick to the write interarrival time.
A session state object is lost if all $W$ copies of it are lost. Since every successful write re-creates $W$ copies of the data, the object is not lost if at most $W-1$ failures occur between successive writes of the object. Equations 1 and 2 show this probability for $W=2$ and $W=3$ respectively; figure 2 shows the probabilities graphically.
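One way to arrive at a probability of this kind is sketched below. It is an illustration under stated assumptions, not necessarily the form of Equations 1 and 2: $\mu$ denotes the per-brick failure rate, $\lambda$ the write rate, $\rho = \lambda/\mu$, and $W$ the number of copies; brick lifetimes and write interarrival times are taken to be independent and exponential, and a failed brick is assumed not to recover before the next write. Because the exponential distribution is memoryless, losing the object amounts to the failures winning $W$ consecutive races against the next write.

```latex
% Assumptions (not taken from the source): independent exponential brick
% lifetimes with rate \mu, exponential write interarrival times with rate
% \lambda, and no brick recovery between successive writes.
\begin{align*}
\Pr[\text{next event is a failure} \mid k \text{ copies alive}]
  &= \frac{k\mu}{k\mu + \lambda} = \frac{k}{k + \rho},
  \qquad \rho = \frac{\lambda}{\mu}, \\
\Pr[\text{loss}] &= \prod_{k=1}^{W} \frac{k}{k + \rho}, \\
\Pr[\text{no loss} \mid W = 2] &= 1 - \frac{2}{(\rho + 1)(\rho + 2)}, \\
\Pr[\text{no loss} \mid W = 3] &= 1 - \frac{6}{(\rho + 1)(\rho + 2)(\rho + 3)}.
\end{align*}
```

As a sanity check, $W = 3$ and $\rho = 16.2$ give a loss probability of $6 / (17.2 \times 18.2 \times 19.2) \approx 0.000998$, i.e. roughly three nines, which agrees with an MTTF of 81 minutes against a 5-minute write interval ($81 / 5 = 16.2$).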
Table 2 summarizes the implication of the equations in terms of "number of nines" of availability. For example, to achieve "three nines" of availability, or probability 0.999 that data will not be lost, a system with $W=2$ copies of each object must be able to keep an individual brick from crashing for an interval many times as long as the average time between writes (the exact multiple is listed in Table 2). Adding redundancy ($W=3$) reduces this, requiring an MTTF that is only 16.2 times the average time between writes. For example, if the average time between writes is 5 minutes and $W=3$, three nines can be achieved as long as brick MTTF is at least 81 minutes.
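The arithmetic behind such a requirement can be sketched numerically. The snippet below is a minimal illustration, not the method used to produce Table 2: it assumes the loss probability has the product form $\prod_{k=1}^{W} k/(k+\rho)$ (independent exponential brick lifetimes, exponential write intervals, no recovery between writes), and the names `loss_probability` and `required_ratio` are hypothetical. It searches for the smallest ratio of MTTF to write interval that meets a target number of nines.

```python
from math import prod


def loss_probability(rho: float, copies: int) -> float:
    """Assumed model: probability that all `copies` fail before the next write.

    `rho` is the ratio of brick MTTF to the mean time between writes.
    """
    return prod(k / (k + rho) for k in range(1, copies + 1))


def required_ratio(copies: int, nines: int, upper: float = 1e9) -> float:
    """Smallest rho whose loss probability is at most 10**-nines (bisection)."""
    target = 10.0 ** (-nines)
    lo, hi = 0.0, upper
    for _ in range(200):
        mid = (lo + hi) / 2
        if loss_probability(mid, copies) > target:
            lo = mid  # too much loss: a larger MTTF ratio is needed
        else:
            hi = mid  # target met: try a smaller ratio
    return hi


ratio = required_ratio(copies=3, nines=3)
print(f"required MTTF / write interval: {ratio:.1f}")
print(f"MTTF for 5-minute write interval: {ratio * 5:.0f} minutes")
```

Under these assumptions the result for three copies and three nines is about 16.2, or roughly 81 minutes for a 5-minute write interval, matching the example in the text.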
Another way to look at it is to fix the ratio $\rho$ of MTTF to the write interval. Figure 3 sets this ratio to 10 (intuitively, this means roughly that writes occur ten times as often as failures) and illustrates the effect of adding redundancy (modifying the number of copies $W$) on data loss.
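The effect of redundancy at a fixed ratio can also be tabulated directly. The sketch below is illustrative only and does not reproduce Figure 3: it assumes the same product-form loss probability, $\prod_{k=1}^{W} k/(k+\rho)$, under independent exponential failures and writes with no recovery between writes, and `loss_probability` is a hypothetical helper name.

```python
from math import prod


def loss_probability(rho: float, copies: int) -> float:
    """Assumed model: probability that all `copies` fail before the next write."""
    return prod(k / (k + rho) for k in range(1, copies + 1))


RHO = 10.0  # ratio of brick MTTF to mean write interval, as in the text

# Print the loss probability for one through five copies at the fixed ratio.
for copies in range(1, 6):
    p = loss_probability(RHO, copies)
    print(f"W={copies}: P(loss) = {p:.6f}")
```

In this model each additional copy multiplies the loss probability by $W/(W+\rho)$, so at a ratio of 10 the fourth copy brings the loss probability to just under $10^{-3}$.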