Next: Linux Specific Problems Up: Implementing Clusters for High Previous: Infrastructure Failures and Service

Reducing Down Time

By and large this is recovering as quickly as possible from a failure when it occurs. In order to reduce the Down Time to a minimum, this recovery should be automated. This automation is often done by a High Availability Harness.

The cardinal thing to consider is the time it takes to restore the application to full functionality, which is given by:

$\displaystyle T_{\rm Restore} = T_{\rm Detect} + T_{\rm Recover}$

(1)

The detection time, $T_{\rm Detect}$ , is entirely driven by the HA Harness (and should be easily tunable). The application recovery time, $T_{\rm Recover}$ , is usually less susceptible to tuning (although it can be minimised by making sure necessary data is on a journaling file-system for example).

Subsections

Linux Specific Problems
Replication
2.6 Kernel Enhancements for HA
The HA Harness
Considering More than Two Nodes

James Bottomley 2004-05-12