Infrastructure Failures and Service Export Problems



Another key question is what exactly the criterion is for a service being available. In the past, it was enough to know that the service was running in the mainframe room to say that it was available. Nowadays, however, the users of a service are more often than not remote from it, reaching it over the Internet. The availability of the service as its users see it may therefore be affected by factors beyond the control of an HA cluster.

To control vulnerability to these external factors, one must treat the SPOF (single point of failure) reduction program as extending into the Internet domain itself: your external router and your ISP may also be SPOFs, so you may wish to provision two of each. The expense of doing this with two full-blown T1 or higher lines is likely to be prohibitive. However, the primary Internet line can be backed by a much cheaper alternative (such as DSL or a cable modem), so that if the primary fails the service is degraded but still functional.

Even within a cluster, it may be possible to recover a service, apparently successfully, in a way that leaves it practically useless. For example, a web server exporting a service to the Internet should not be recovered on a node that cannot see the Internet gateway.
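As a rough illustration of the gateway check described above, the following Python sketch tests whether a node can open a TCP connection to some endpoint that lies beyond the gateway. The function names, the choice of a TCP probe, and the host and port are all assumptions made for this example; a real cluster might instead use ICMP, a routing-table check, or whatever probe its own tooling provides.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds in time.

    The probe target is a placeholder: in practice it would be an
    endpoint that is only reachable through the Internet gateway.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def eligible_nodes(probe_results: dict) -> list:
    """Given {node_name: probe_succeeded}, list nodes fit for recovery."""
    return sorted(node for node, ok in probe_results.items() if ok)
```

Each candidate node would run `can_reach` locally and report the result; only the nodes returned by `eligible_nodes` would then be considered as recovery targets for the Internet-facing service.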

For this reason, a utility function could be calculated for each hierarchy, measuring the actual usefulness of recovering that hierarchy on a given node, and taken into consideration when performing recovery.
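One possible shape for such a utility function is sketched below in Python. Everything here is hypothetical: the `Hierarchy` and `NodeState` types, the split into required and preferred resources, and the scoring scale (0.0 for useless, 0.5 for degraded, 1.0 for fully useful) are assumptions chosen for illustration, not a description of any particular cluster product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    name: str
    alive: bool
    reachable: frozenset  # resources this node can currently see


@dataclass(frozen=True)
class Hierarchy:
    name: str
    required: frozenset                # without these the service is useless
    preferred: frozenset = frozenset() # without these it is only degraded


def utility(hierarchy: Hierarchy, node: NodeState) -> float:
    """Score in [0, 1] for recovering `hierarchy` on `node`."""
    if not node.alive:
        return 0.0
    if not hierarchy.required <= node.reachable:
        # e.g. a web server on a node that cannot see the gateway
        return 0.0
    if not hierarchy.preferred:
        return 1.0
    met = len(hierarchy.preferred & node.reachable)
    # Assumed scale: half credit for being functional at all, the rest
    # in proportion to the preferred resources that are available.
    return 0.5 + 0.5 * met / len(hierarchy.preferred)


def pick_recovery_node(hierarchy: Hierarchy, nodes: list):
    """Return the node with the highest utility, or None if all score 0."""
    best = max(nodes, key=lambda n: utility(hierarchy, n), default=None)
    if best is None or utility(hierarchy, best) == 0.0:
        return None
    return best
```

With a web hierarchy that requires the Internet gateway and prefers the primary line, a node that sees only the backup path scores lower than one on the primary line, and a node that cannot see the gateway at all is never chosen. Returning `None` when every node scores zero lets the caller decide to leave the hierarchy down rather than perform a useless recovery.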


James Bottomley 2004-05-12