
Application Failures

These are often the most insidious failures, since they can only be fixed by finding and squashing the particular bug in the application (and even if you have the source, you may not have the expertise or time to do this). There are two types of failure:

Non-Deterministic: The failure occurs because of some internal error that depends on the state of everything that went before it (often due to stack overruns or memory management problems). This type of failure can be "fixed" simply by restarting the application and trying again (because the internal state that triggered the failure will have been wiped clean). Non-deterministic failures may also occur as a result of interference from another node in the cluster (called a "rogue" node) which believes it has the right to update the same data the current node is using. To prevent this kind of failure, where one node steps on another node's data, from ever occurring in a cluster, I/O fencing (see section 6) is vitally important.

Deterministic: The crash is a direct response to a particular data input, regardless of internal state. This is the pathological failure case, since even if you restart the application, it will crash again when you resend it the data it initially failed on. Therefore, an automated restart cannot help; someone must manually clean the failure-causing data from the application's input stream. This is what Pfister[1] calls the "Toxic Data Syndrome".

Fortunately, deterministic application failures are very rare (although they do occur), so they're more something to be aware of than something to expect. It is important to note that nothing can recover from a toxic data transaction that the application is legitimately required to process (as opposed to one introduced maliciously simply to crash the service), since the application itself must be fixed before the transaction can be processed.
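The distinction between the two failure types suggests a simple restart policy, which the following Python sketch illustrates. It is only an illustration under stated assumptions, not a description of any particular cluster product: a "restart" is modelled as building a fresh handler (so internal state is wiped), a crash is modelled as a raised exception, and the names `process_with_restart`, `ToxicInputSuspected`, `make_flaky_factory` and `toxic_factory` are all hypothetical. The policy retries a bounded number of times and, if the same input fails on every attempt, stops and hands the input to an operator rather than restarting forever.

```python
class ToxicInputSuspected(Exception):
    """Raised when the same input fails on every restart attempt."""


def process_with_restart(handler_factory, item, max_restarts=3):
    """Process `item`, restarting the handler after each failure.

    Assumption: calling `handler_factory()` stands in for restarting the
    application, so each attempt begins with clean internal state.
    Returns (result, number_of_restarts_needed).
    """
    failures = 0
    while True:
        handler = handler_factory()  # fresh instance == "restart"
        try:
            return handler(item), failures
        except Exception as exc:  # a crash, modelled as an exception
            failures += 1
            if failures > max_restarts:
                # Clean restarts did not help, so the input itself is the
                # likely cause: escalate instead of looping forever.
                raise ToxicInputSuspected(
                    f"{item!r} failed {failures} times; manual cleanup needed"
                ) from exc


def make_flaky_factory():
    """Hypothetical non-deterministic failure: only the first instance
    of the handler has bad internal state."""
    starts = {"count": 0}

    def factory():
        starts["count"] += 1
        instance = starts["count"]

        def handler(item):
            if instance == 1:
                raise RuntimeError("corrupted internal state")
            return item.upper()

        return handler

    return factory


def toxic_factory():
    """Hypothetical deterministic failure: one specific input always
    crashes the handler, no matter how often it is restarted."""

    def handler(item):
        if item == "bad-record":
            raise ValueError("cannot parse input")
        return item.upper()

    return handler
```

With these definitions, `process_with_restart(make_flaky_factory(), "order-1")` succeeds after a single restart, while `process_with_restart(toxic_factory, "bad-record", max_restarts=2)` raises `ToxicInputSuspected`. Note that the sketch can only suspect toxic data from repeated failure; it cannot repair it, which matches the point above that the input must be cleaned manually or the application fixed.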


James Bottomley 2004-05-12