Next: Infrastructure Failures and Service Up: Availability Previous: Server Failures


Eliminating Single Points Of Failure

Eliminating Single Points Of Failure (SPOFs) is one of the keys to controlling uptime. It is also crucial for cluster components that the HA harness doesn't protect, most often the actual data storage on an external array.

External data protection can be achieved by RAID [4], which comes in several possible implementations:

  1. Software RAID: This uses the md driver (or possibly the EVMS md personality). This is the cheapest solution, because it requires no specialised hardware.
  2. Host Based RAID: This is a slightly more expensive solution where the RAID function is supplied by a special card in the server. This can cause problems when clustering, though: only some of these cards support clustering in both the hardware and the driver, and even if the card supports it, the HA package might not.
  3. External RAID Array: This is the most expensive, but easiest to manage, solution. The RAID is provided in an external package which attaches to the server via either SCSI or Fibre Channel (FC).

A particular problem with both software and Host Based RAID is that the individual node is responsible for updating the array, including the redundancy data. If the node crashes in the middle of an update, the data and the redundancy information will be out of sync (although this can be easily detected and corrected). The problem becomes acute if the array is being operated in a degraded state: for all RAID levels other than RAID-1, the data on the array may then have become undetectably corrupt, because the contents of the missing disk must be reconstructed from redundancy information that no longer matches the data. For this reason, only RAID-1 should be considered when implementing either of these array types.
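The degraded-mode hazard can be illustrated with a toy model. The following is a minimal sketch, assuming a hypothetical three-disk stripe protected by simple XOR parity (as in RAID-5); the block contents and variable names are invented for illustration and do not correspond to any real driver.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical parity stripe: two data blocks plus one parity block.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Degraded state: the disk holding d1 has failed, so d1 now exists
# only implicitly, as d0 XOR parity.
assert xor_blocks(d0, parity) == b"BBBB"

# The node rewrites d0, then crashes before it rewrites the parity.
d0 = b"CCCC"

# Reconstructing d1 from the stale parity silently yields wrong data.
# With the third disk gone there is nothing left to cross-check against,
# so the corruption cannot be detected.
reconstructed_d1 = xor_blocks(d0, parity)
raid5_corrupt = reconstructed_d1 != b"BBBB"

# Degraded mirror (RAID-1): each block on the surviving disk is a
# complete copy, so an interrupted write to block0 cannot alter block1.
mirror = {"block0": b"AAAA", "block1": b"BBBB"}
mirror["block0"] = b"CCCC"
raid1_block1_intact = mirror["block1"] == b"BBBB"
```

In this model the interrupted write damages a block that was never written to, which is what makes the parity case so much worse than the mirrored one.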

Although RAID eliminates the actual storage medium of the data as a SPOF, the path to storage (and also the RAID controller for hardware RAID) is still a SPOF. The simplest way to eliminate this (applying to both software and Host Based RAID) is to employ two controllers and two separate SCSI buses, as in figure 2.

Figure 2: Achieving no Single Point of Failure
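One way to reason about such a layout is to list every path from the node to a copy of the data and look for components common to all of them. The sketch below assumes a mirrored (RAID-1) pair in which the data stays reachable while at least one path is fully intact; the function and the component names (`controller0`, `bus0`, and so on) are hypothetical and are not taken from the figure.

```python
def single_points_of_failure(paths):
    """Return the components whose individual failure breaks every path.

    Each path is a collection of component names. The data is assumed
    to remain reachable while at least one path has no failed component,
    so a SPOF is any component that appears in all paths.
    """
    if not paths:
        return set()
    return set.intersection(*(set(path) for path in paths))


# Both halves of the mirror hang off one controller and one SCSI bus.
shared = single_points_of_failure([
    {"controller0", "bus0", "disk0"},
    {"controller0", "bus0", "disk1"},
])

# Two controllers and two separate SCSI buses, one per mirror half.
dual = single_points_of_failure([
    {"controller0", "bus0", "disk0"},
    {"controller1", "bus1", "disk1"},
])
```

Here `shared` contains the controller and the bus, while `dual` is empty. The node itself is deliberately left out of the model, on the assumption that node failure is what the HA harness already handles.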

Hardware RAID arrays also come with a variety of SPOF elimination techniques, usually in the form of multiple paths and multiple controllers. The downside here is that almost every one of these techniques is proprietary to the individual RAID vendor and often requires driver add-ons (sometimes binary only) to the Linux kernel[*] to operate.


James Bottomley 2004-05-12