POSIX.1h SRASS and POSIX.1m Checkpoint/Restart
Helmut Roth <firstname.lastname@example.org> reports on the October 1998 meeting in San Diego, CA.
The POSIX.1h Services for Reliable, Available and Serviceable Systems (SRASS) and the POSIX.1m Checkpoint/Restart working group met in San Diego, California, in late October 1998. The SRASS working group is developing a set of APIs for fault-management and serviceability applications. Its goal is to support fault-tolerant, serviceable, reliable, and highly available systems in a portable way. Wherever feasible, POSIX.1h should also be useful for general applications, such as distributed parallel database transaction systems and safety-related systems.
Checkpoint/Restart allows an application to save the entire state of the machine, along with the operating system and the process activities, so that if something goes wrong a saved backup state can be brought on line quickly. A ballot group has been formed, and the Checkpoint/Restart API should go out for ballot in November 1998. Some minor corrections are being made to the draft to support a more rigid set of rules for use of the namespace and to address other backward-compatibility issues. The draft is intended to reach the ballot group in time for balloting before the next meeting in January 1999.
One part of the SRASS draft deals with logging APIs. These are aimed at allowing an application to log application-specific events and system events to a system log and at allowing the subsequent processing of those events. Fault-management applications can use this API to register for notification of events as they enter the system log. An example of an event is a limit being exceeded. Events can have a severity associated with them. Event notification provides a way to react proactively and initiate steps to prevent a subsequent system failure. In addition to these logging APIs, the de facto standard syslog interfaces have been added to support backward compatibility.
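The register-and-notify pattern described above can be illustrated with a small sketch. It is not the POSIX.1h interface: the names `Severity`, `Event`, `EventLog`, `register`, and `write` are hypothetical, and the numeric severity values merely borrow the syslog convention that a lower number means a more severe event.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List, Tuple


class Severity(IntEnum):
    # Hypothetical levels; lower value means more severe, as in syslog.
    CRITICAL = 2
    ERROR = 3
    WARNING = 4
    INFO = 6


@dataclass(frozen=True)
class Event:
    source: str
    severity: Severity
    message: str


class EventLog:
    """In-memory stand-in for a system log with event notification."""

    def __init__(self) -> None:
        self.records: List[Event] = []
        self._subscribers: List[Tuple[Severity, Callable[[Event], None]]] = []

    def register(self, threshold: Severity,
                 callback: Callable[[Event], None]) -> None:
        # Ask to be notified of events at least as severe as `threshold`.
        self._subscribers.append((threshold, callback))

    def write(self, event: Event) -> None:
        # Every event is logged; only sufficiently severe ones are notified.
        self.records.append(event)
        for threshold, callback in self._subscribers:
            if event.severity <= threshold:
                callback(event)


log = EventLog()
alerts: List[Event] = []
log.register(Severity.WARNING, alerts.append)
log.write(Event("disk0", Severity.INFO, "scrub finished"))
log.write(Event("disk0", Severity.WARNING, "corrected-error limit exceeded"))
```

Here both events land in the log, but only the warning reaches the registered fault-management callback, which could then start preventive action before the device fails outright.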
Another feature of the proposed SRASS interfaces is a single core-dump-control API. This is intended to enable an application to specify the location of the file to which a core dump will be written in the event of an abnormal termination.
A shutdown/reboot API has been included in this draft. This includes options such as fast shutdown and graceful shutdown, and features such as rebooting with optional scripts.
The configuration-space-management API is intended to provide a portable method of traversing a system's configuration space, and for manipulating the data content of nodes in that configuration space. This API will provide a fault-management application access to underlying system configuration information and the means to direct reconfiguration of the system. The most recent changes in this area have been to move from a tree traversal to a directed graph.
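To show why moving from a tree to a directed graph matters for traversal, here is a minimal sketch. It is not the draft API: the `traverse` function and the sample `config` layout are hypothetical. In a directed graph a node can be reachable by more than one path, and cycles are possible, so the walk has to remember which nodes it has already visited.

```python
from collections import deque
from typing import Dict, List


def traverse(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Breadth-first walk that returns each reachable node exactly once."""
    seen = {root}
    queue = deque([root])
    order: List[str] = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in graph.get(node, ()):
            if child not in seen:  # skip nodes reached by an earlier path
                seen.add(child)
                queue.append(child)
    return order


# Hypothetical configuration space: disk0 is reachable through two buses,
# and it also refers back to bus0, which creates a cycle.
config = {
    "system": ["bus0", "bus1"],
    "bus0": ["disk0"],
    "bus1": ["disk0"],
    "disk0": ["bus0"],
}

visited = traverse(config, "system")
```

A plain tree walk over the same data would either report disk0 twice or loop forever on the back-reference; the visited set avoids both.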
The working group has approved the current SRASS draft 4.0, with changes, to go out for ballot in December 1998; an extra meeting is planned at DISA in early December 1998 to verify that the latest changes have been incorporated correctly. If you are interested in helping support fault management (including serviceability and fault-tolerance aspects of systems), get in touch with Helmut Roth at <email@example.com> or Dr. Arun Chandra at <firstname.lastname@example.org>. To subscribe to the SRASS mailing list, send a message to <email@example.com> and ask to be included.