The cost and complexity of system administration is now the dominant factor in total cost of ownership for both hardware and software. In addition, since human operator error is the source of a large fraction of outages [5], attention has recently focused on simplifying and ultimately automating administration and management to reduce the impact of failures [13,19], and, where this is not fully possible, on building self-monitoring components [20]. However, fast, accurate failure detection and recovery management remain difficult: initiating recovery on ``false alarms'' often incurs an unacceptable performance penalty and, even worse, can cause incorrect system behavior when system invariants are violated [20].
Operators of both network infrastructure and interactive Internet services have come to appreciate the high-availability and maintainability advantages of stateless and soft-state [33] protocols and systems. The stateless Web server tier of a typical three-tier service [3] can be managed with a simple policy: misbehaving components can be reactively or proactively rebooted (which is fast, since they perform no special-case recovery) or removed from service without affecting correctness. Further, since all instances of a given type of stateless component are functionally equivalent, overprovisioning for load redirection [3] is easy; the net result is that both stateless and soft-state components can be made highly available through simple replication.
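The management policy described above can be made concrete with a minimal sketch. The class and method names below (`StatelessPool`, `suspect`, `recovered`) are hypothetical, not from the paper; the sketch only illustrates why statelessness makes the policy safe: any replica can serve any request, and rebooting a suspected replica on a false alarm costs capacity, never correctness.

```python
import random

class StatelessPool:
    """Hypothetical pool of functionally equivalent stateless replicas."""

    def __init__(self, replicas):
        self.healthy = set(replicas)
        self.rebooting = set()

    def route(self, request):
        # Any healthy replica can serve any request: no session affinity
        # is needed because the tier holds no per-request state.
        replica = random.choice(sorted(self.healthy))
        return replica, request

    def suspect(self, replica):
        # A suspected replica is simply taken out and rebooted. Because
        # the tier is stateless, this cannot violate correctness; at worst
        # a false alarm briefly reduces capacity.
        if replica in self.healthy:
            self.healthy.discard(replica)
            self.rebooting.add(replica)

    def recovered(self, replica):
        # Reboot is fast: no special-case recovery code runs on restart.
        self.rebooting.discard(replica)
        self.healthy.add(replica)
```

Overprovisioning for load redirection amounts to starting the pool with more replicas than the steady-state load requires, so that removing one for reboot leaves enough capacity.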
However, this simplicity does not extend to the stateful tiers. General-purpose persistent-state subsystems, such as filesystem appliances and relational databases, cannot typically use redundancy so simply, either to provide failover capacity or to scale the system incrementally. We argue that these high-availability techniques can in fact be applied if we subdivide ``persistent state'' into distinct categories based on durability and consistency requirements. Several large Internet services have already done so [31,38,28], because it allows individual subsystems to be optimized for performance, fault tolerance, recovery, and ease of management.
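The subdivision argument can be illustrated with a small sketch. The specific category names, fields, and store choices below are illustrative assumptions (only soft state, session state, and durable database state are drawn from the surrounding discussion); the point is that once durability and consistency requirements are explicit, each category can be routed to a subsystem optimized for exactly those requirements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StateCategory:
    """Illustrative category of 'persistent' state, distinguished by its
    durability and consistency requirements (hypothetical fields)."""
    name: str
    durability: str   # how much loss, if any, is tolerable
    consistency: str  # guarantee the application requires

CATEGORIES = [
    StateCategory("soft state",    "reconstructible; loss tolerable",  "eventual"),
    StateCategory("session state", "lifetime of one user session",     "read-your-writes per user"),
    StateCategory("durable state", "must never be lost (e.g. orders)", "transactional"),
]

def pick_store(category: StateCategory) -> str:
    # Hypothetical mapping from requirements to a specialized subsystem:
    # only state that must never be lost pays for a full database.
    if category.durability.startswith("must never"):
        return "relational database"
    if category.name == "session state":
        return "replicated in-memory store"
    return "cache / regenerate on demand"
```

Under this framing, only the last category needs the full generality (and management cost) of a relational database; the others can use lighter-weight, replication-friendly stores.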
In this paper, we make three main contributions:
In Section 2, we define a category of session state, its associated workload, and existing solutions. In Section 3, we present the design and implementation of SSM, a recovery-friendly and self-managing session state store. In Section 4, we describe the integration of SSM with Pinpoint to enable the system to be self-healing. In Section 5, we present benchmarks demonstrating the features of SSM. In Section 6, we insert SSM into an existing production Internet application and compare its performance, failure, and recovery characteristics with those of the original implementation. In Section 7, we discuss the design principles extracted from SSM. We then discuss related and future work, and conclude.