Server Failures

The easiest (although certainly not the cheapest) way to get better uptime is to buy better hardware. Vendors often sell apparently similar machines labelled "server" and "workstation", the only differences between them being the quality of the components and the addition of redundancy features.

Server redundancy features can be divided into two categories: those that work without Operating System support and those that require it. Of those that need no Operating System support:

Redundant Fans: Ironically, in these days of ever smaller solid state components, we still rely on simple mechanical devices (which are therefore prone to wear and failure) for cooling: fans. They are often the cheapest separate component of any system, and yet if anything goes wrong with them the entire system will crash or, in the extreme case of an on-chip CPU fan failure, the CPU will burn its way through the motherboard. The first thing to note is that a well engineered box should have no on-component fans at all. All fans should be arranged in external banks to direct airflow over heat-sinks. The fans should be arranged so that, after any single fan failure, the remaining fans are sufficient to cool the machine correctly until the failed fan is replaced.

Redundant Power Supplies: After fans, these are the next most commonly failing components. A good server usually has two (or more) separate and fully functional power supply modules arranged so that for any single failure, the remaining PSUs can still fully power the box.
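The rule shared by both features above (any single failure must leave enough capacity to carry the full load) can be sketched as a short check. This is only an illustration: the function name and the wattage figures are hypothetical, and "capacity" may stand for PSU output or for fan cooling capability.

```python
from typing import Sequence


def survives_any_single_failure(capacities: Sequence[float], demand: float) -> bool:
    """Return True if the units left after any one failure still meet demand."""
    if len(capacities) < 2:
        # With fewer than two units there is nothing left to take over.
        return False
    # The worst case is losing the largest unit, so checking that case is enough.
    return sum(capacities) - max(capacities) >= demand


# Hypothetical figures: two 500 W supplies feeding a box that draws 450 W.
print(survives_any_single_failure([500.0, 500.0], 450.0))
# The same pair with a 600 W load: two supplies, but no redundancy.
print(survives_any_single_failure([500.0, 500.0], 600.0))
```

The second case shows why counting modules is not enough: two supplies that are both needed to carry the load give no protection against a single failure.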

Features that do require Operating System support include:

Storage Redundancy: This is provided both by multiple paths to the storage and by multiple controllers within the storage (see section 3.4).

Active Power Management: With the advent of ACPI, the trend is toward the Operating System managing power to the server components. In this scenario, it becomes the responsibility of the OS to detect any power failure and, possibly, to lower the system's power consumption until the fault is rectified.

Monitoring: This is the most overlooked part of the whole Server Failure problem. However much expensive hardware you buy, undetected faults will eventually cause it to die, primarily because the hardware is engineered to withstand a single fault in any subsystem, but a second fault (which will eventually occur) is usually fatal. Therefore, if you are going to run your systems unmonitored, you might just as well have bought the cheaper hardware and let the HA harness take over on any single failure.
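The monitoring argument can be made concrete with a minimal sketch that separates a "degraded" subsystem (one fault absorbed, redundancy gone) from a healthy or failed one. The degraded state is exactly the one that goes unnoticed without monitoring. All names here (Subsystem, assess, alerts) and the unit counts are hypothetical, and how the unit counts are read from the hardware is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"  # still running, but the next fault is fatal
    FAILED = "failed"


@dataclass
class Subsystem:
    name: str
    units_total: int    # units installed (for example, four fans)
    units_needed: int   # units required to keep the machine running
    units_working: int  # units currently reporting healthy


def assess(sub: Subsystem) -> Health:
    """Classify a subsystem from its unit counts."""
    if sub.units_working < sub.units_needed:
        return Health.FAILED
    if sub.units_working < sub.units_total:
        return Health.DEGRADED
    return Health.OK


def alerts(subsystems: Iterable[Subsystem]) -> List[str]:
    """Report every subsystem that is not fully healthy."""
    return [f"{s.name}: {assess(s).value}"
            for s in subsystems if assess(s) is not Health.OK]


# Hypothetical machine: one of four fans has died, both PSUs are fine.
machine = [Subsystem("fans", 4, 3, 3), Subsystem("psu", 2, 1, 2)]
for line in alerts(machine):
    print(line)
```

The machine in this example is still running normally, so nothing but the alert reveals that its fan redundancy has been used up.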

James Bottomley 2004-05-12