
Reducing Down Time

Reducing down time largely means recovering as quickly as possible from a failure when it occurs. To keep down time to a minimum, this recovery should be automated. The automation is often done by a High Availability (HA) Harness.

The key quantity to consider is the time it takes to restore the application to full functionality, which is given by:

$\displaystyle T_{\rm Restore} = T_{\rm Detect} + T_{\rm Recover}$ (1)

The detection time, $ T_{\rm Detect}$, is determined entirely by the HA Harness and should be easily tunable. The application recovery time, $ T_{\rm Recover}$, is usually less susceptible to tuning, although it can be reduced, for example by making sure the necessary data is on a journaling file system.
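
As a rough illustration of equation (1), the sketch below adds a detection time to a recovery time. It assumes, purely for the sake of the example, that the harness declares a failure after a fixed number of consecutive missed heartbeats, so that the detection time is approximately the heartbeat interval multiplied by that count. The function names, the heartbeat model and all the numbers are hypothetical and are not taken from any particular HA Harness.

```python
def detection_time(heartbeat_interval_s: float, missed_beats: int) -> float:
    """Approximate T_Detect under the assumed heartbeat model.

    Assumption: the harness declares a failure after `missed_beats`
    consecutive heartbeats, sent every `heartbeat_interval_s` seconds,
    have gone unanswered.
    """
    return heartbeat_interval_s * missed_beats


def restore_time(t_detect_s: float, t_recover_s: float) -> float:
    """Equation (1): T_Restore = T_Detect + T_Recover."""
    return t_detect_s + t_recover_s


# Hypothetical figures: 2 s heartbeats, failure declared after 3 misses,
# and an application that takes 30 s to recover.
t_detect = detection_time(2.0, 3)
t_restore = restore_time(t_detect, 30.0)
print(f"T_Detect = {t_detect} s, T_Restore = {t_restore} s")
```

With these assumed figures, tuning the harness can only shrink the 6 second detection term; the 30 second recovery term dominates, which is why measures such as a journaling file system matter.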




James Bottomley 2004-05-12