POSIX.1h: Services for Reliable, Available, and Serviceable Systems (SRASS)
Helmut Roth <firstname.lastname@example.org> reports on the January 1998 meeting of the PASC.1h Working Group in Fort Lauderdale, Florida.
The POSIX.1h Services for Reliable, Available, and Serviceable Systems (SRASS) draft has just completed a mock ballot. The goal of the SRASS Working Group is to support fault-tolerant, serviceable, reliable, and available systems in a portable way. Where feasible, POSIX.1h should also be useful for general applications, such as distributed, parallel database transaction systems, and safety-related systems.
Right now the SRASS Working Group is in the process of refining draft 3.0 of POSIX.1h by reviewing the mock ballots received for the standard APIs for event logging, core dump control, shutdown/reboot, and configuration space management.
The logging APIs allow an application to log application-specific events and system events to a system log, and they support the subsequent processing of those events. Fault management applications can use this API to register for notification of events that enter the system log. Events of interest may be those that exceed some limit. A notification can also have a severity associated with it. Notifications give an application a way to react proactively and initiate steps to prevent a later system failure.
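To make the register-and-notify idea concrete, here is a minimal sketch in Python. It is an in-memory model only, not the draft's C interface: the names EventLog, Event, register, write, and the SEV_* severity constants are all hypothetical and do not come from POSIX.1h.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical severity levels; the report does not list the draft's values.
SEV_INFO, SEV_WARNING, SEV_ERROR = 0, 1, 2


@dataclass
class Event:
    source: str      # who logged the event (application or system component)
    severity: int    # one of the SEV_* constants above
    message: str


@dataclass
class EventLog:
    """Toy system log that notifies registered fault managers."""
    records: List[Event] = field(default_factory=list)
    subscribers: List[Tuple[int, Callable[[Event], None]]] = field(default_factory=list)

    def register(self, min_severity: int, callback: Callable[[Event], None]) -> None:
        # Ask to be notified of events at or above min_severity.
        self.subscribers.append((min_severity, callback))

    def write(self, event: Event) -> None:
        # Every event enters the log; only matching subscribers are notified.
        self.records.append(event)
        for min_severity, callback in self.subscribers:
            if event.severity >= min_severity:
                callback(event)


log = EventLog()
alerts: List[Event] = []
log.register(SEV_WARNING, alerts.append)  # fault manager wants warnings and worse

log.write(Event("disk0", SEV_INFO, "spin-up complete"))
log.write(Event("disk0", SEV_WARNING, "correctable read errors above limit"))
```

Both events are stored in the log, but only the warning reaches the registered callback, which is where a fault management application could start preventive action.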
There is a single core dump control API, which enables an application to specify the path where the core dump file is written if a process terminates with a core dump. The SRASS Working Group felt that an analyst should at least be able to find the core dump file when a system really does crash.
A shutdown/reboot API has been included in the draft. On careful review, several options were considered for inclusion, such as fast shutdown, graceful shutdown, and optional features such as rebooting with scripts. This was the second thorough review of this API; a new rationale has been developed and several corrections identified. Several issues still need to be resolved, and we will be correcting this API based on the mock ballots received.
A corrective action, such as reconfiguration, is often needed to keep a system alive. The configuration space management API is intended to provide a portable method of traversing the configuration space and manipulating the data content of nodes in that space. This API will give a fault management application access to underlying system configuration information and the means to direct reconfiguration of the system. In particular, the proposed set of APIs will allow a fault management application to keep track of the system configuration view and to change the system configuration dynamically. The view of the configuration space is similar to a filesystem. The configuration space is accessed by means of mount and unmount operations, linking and unlinking operations, operations to add nodes to the configuration description, and several functions that allow an application to access any part of the current configuration description.
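The filesystem-like view can be illustrated with a small Python model. This is a sketch under stated assumptions rather than the draft's interface: ConfigSpace, ConfigNode, and the method names add_node, link, unlink, read, and write are hypothetical, the example paths and node data are invented, and mount and unmount are left out for brevity.

```python
class ConfigNode:
    """One node in the configuration space: a data payload plus named children."""

    def __init__(self, data=None):
        self.data = data
        self.children = {}


class ConfigSpace:
    """Toy configuration space addressed by slash-separated paths."""

    def __init__(self):
        self.root = ConfigNode()

    def _walk(self, path):
        # Follow each path component from the root; KeyError if one is missing.
        node = self.root
        for part in (p for p in path.split("/") if p):
            node = node.children[part]
        return node

    def _split(self, path):
        # Return (parent node, final path component).
        parent_path, _, name = path.rstrip("/").rpartition("/")
        return self._walk(parent_path), name

    def add_node(self, path, data=None):
        parent, name = self._split(path)
        parent.children[name] = ConfigNode(data)

    def link(self, existing_path, new_path):
        # Make new_path a second name for the node at existing_path.
        parent, name = self._split(new_path)
        parent.children[name] = self._walk(existing_path)

    def unlink(self, path):
        parent, name = self._split(path)
        del parent.children[name]

    def read(self, path):
        return self._walk(path).data

    def write(self, path, data):
        self._walk(path).data = data


space = ConfigSpace()
space.add_node("/cpu")
space.add_node("/cpu/0", {"state": "online"})
space.link("/cpu/0", "/boot_cpu")                # second name, same node
space.write("/boot_cpu", {"state": "offline"})   # reconfigure through the link
state_seen_via_cpu_path = space.read("/cpu/0")
space.unlink("/boot_cpu")                        # the node stays under /cpu/0
```

Because a link is just another name for the same node, a change written through /boot_cpu is visible at /cpu/0, and removing the link leaves the original node in place.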
The working group approved the current draft 3.0, which went out to mock ballot on November 24, 1997. At the January 1998 POSIX meeting, we began looking at the 111 internal ballots we received. Of those, 55 were objections, 31 were comments, and 25 were editorial changes. We will continue to correct the draft and will shortly be forming the official ballot group. Our ballot coordinator is Richard Scalzo. His email address is <email@example.com>. If you are interested in helping support fault management (including serviceability and fault tolerance aspects of systems), please contact me or Dr. Arun Chandra <firstname.lastname@example.org>.
The other project that the SRASS Working Group is responsible for at present is POSIX.1m, Checkpoint/Restart. This work was originally balloted as a part of POSIX.1a, but it was felt to be too far from consensus and was holding that project back. POSIX.1m allows an application to save the entire state of the machine, the operating system, and the application's activities so that, if something goes wrong, a saved backup state can be brought online quickly. This draft has been developed further by the working group and will be entering a new ballot soon. Please contact Richard Scalzo for further details.