USENIX - Snitch Report for PASC .1h Working Group

POSIX.1h: Services for Reliable, Available, and Serviceable Systems (SRASS)
Helmut Roth <hroth@nswc.navy.mil> reports on the April 1997 meeting in Jackson Hole, WY.

If you are a vendor or user spending too much money and too many resources tracking down errors or events, you probably find yourself rewriting the same fault management or serviceability applications over and over again. You may also be dealing with changing and inconsistent operating system APIs that break your applications and anger your customers. There is hope. Vendors, users, and general interest groups can get together to reach consensus on common practices in fault management and serviceability, and then standardize them.

The POSIX.1h working group is developing a useful set of APIs for fault management and serviceability applications. The goal of the SRASS Working Group is to support fault-tolerant, serviceable, reliable, and available systems in a portable way. Where feasible, POSIX.1h should also be useful for general applications, such as distributed, parallel, database, transaction, and safety-related systems.

Right now the SRASS Working Group (POSIX.1h) is in the process of refining draft 2.5 of standard APIs for event logging, core dump control, shutdown/reboot and configuration space management.

The logging APIs are aimed at allowing an application to log both application-specific and system-related events to a specified log, and at supporting the subsequent processing of those events. Fault management applications can use this API to register for notification of events that enter the system log. Events of interest may be those that exceed some limit, and a notification can have a severity associated with it. A notification gives the application a way to react proactively and initiate steps to prevent a later system failure. The first cut at these APIs is nearly ready and will go out to mock ballot once this round of edits makes it into the next SRASS draft.
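To make the logging and notification model concrete, here is a minimal in-memory sketch in C. Every name in it (sketch_log_write, sketch_log_watch, the severity levels, the fixed table sizes) is hypothetical and chosen for illustration only; none of it is taken from the POSIX.1h draft, whose actual interfaces may look quite different. It shows only the idea described above: an application writes events with a severity, and a fault management application registers to be notified of events at or above a severity it cares about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names throughout; not the POSIX.1h draft interfaces. */
enum sketch_severity { SEV_INFO, SEV_WARNING, SEV_ERROR, SEV_CRITICAL };

struct sketch_event {
    enum sketch_severity severity;
    char text[64];
};

/* Callback invoked for each event that matches a registration. */
typedef void (*sketch_notify_fn)(const struct sketch_event *ev, void *ctx);

#define SKETCH_MAX_EVENTS   16
#define SKETCH_MAX_WATCHERS 4

struct sketch_watcher {
    enum sketch_severity min_severity;
    sketch_notify_fn fn;
    void *ctx;
};

struct sketch_log {
    struct sketch_event events[SKETCH_MAX_EVENTS];
    size_t n_events;
    struct sketch_watcher watchers[SKETCH_MAX_WATCHERS];
    size_t n_watchers;
};

void sketch_log_init(struct sketch_log *log)
{
    assert(log != NULL);
    memset(log, 0, sizeof *log);
}

/* Register for events at or above min_severity.
 * Returns 0 on success, -1 if the watcher table is full. */
int sketch_log_watch(struct sketch_log *log, enum sketch_severity min_severity,
                     sketch_notify_fn fn, void *ctx)
{
    assert(log != NULL && fn != NULL);
    if (log->n_watchers == SKETCH_MAX_WATCHERS)
        return -1;
    log->watchers[log->n_watchers++] =
        (struct sketch_watcher){ min_severity, fn, ctx };
    return 0;
}

/* Append an event, then notify every matching watcher.
 * Returns 0 on success, -1 if the log is full. */
int sketch_log_write(struct sketch_log *log, enum sketch_severity sev,
                     const char *text)
{
    assert(log != NULL && text != NULL);
    if (log->n_events == SKETCH_MAX_EVENTS)
        return -1;
    struct sketch_event *ev = &log->events[log->n_events++];
    ev->severity = sev;
    snprintf(ev->text, sizeof ev->text, "%s", text); /* truncates long text */
    for (size_t i = 0; i < log->n_watchers; i++) {
        const struct sketch_watcher *w = &log->watchers[i];
        if (sev >= w->min_severity)
            w->fn(ev, w->ctx);
    }
    return 0;
}

/* Example callback: counts notifications through an int passed as ctx. */
void sketch_count_cb(const struct sketch_event *ev, void *ctx)
{
    (void)ev;
    ++*(int *)ctx;
}
```

A fault management application in this model would call sketch_log_watch once with the lowest severity it wants to hear about, and then react inside its callback, for example by starting a reconfiguration before the fault becomes a failure.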

There is a single core dump control API that enables an application to specify the path to the core-dump file written if a process terminates abnormally. Should your system really crash, this file will be the first one you look for, so it should be easy to find. This API is very stable and is ready to go out for mock ballot.
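As a rough illustration of what a single "set the core-dump path" call might look like to an application, the following C sketch models the setting as a per-process string. The names sketch_set_core_path and sketch_get_core_path, the default value "core", the absolute-path rule, and the length limit are all assumptions made for this example; they are not taken from the draft, and the code only records the path rather than changing any operating system behavior.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit and default; not taken from the POSIX.1h draft. */
#define SKETCH_PATH_MAX 256

static char sketch_core_path[SKETCH_PATH_MAX] = "core";

/* Record where this process would like its core dump written.
 * Assumed rules for this sketch: the path must be absolute and must fit
 * in SKETCH_PATH_MAX bytes including the terminating NUL.
 * Returns 0 on success, or -1 with errno set on failure. */
int sketch_set_core_path(const char *path)
{
    assert(path != NULL);
    size_t len = strlen(path);
    if (len == 0 || path[0] != '/') {
        errno = EINVAL;
        return -1;
    }
    if (len >= SKETCH_PATH_MAX) {
        errno = ENAMETOOLONG;
        return -1;
    }
    memcpy(sketch_core_path, path, len + 1); /* copy including the NUL */
    return 0;
}

/* Return the currently recorded core-dump path. */
const char *sketch_get_core_path(void)
{
    return sketch_core_path;
}
```

An application would typically make such a call once at startup, so that whoever services the system afterwards knows exactly where to look.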

A shutdown/reboot API has been included in the draft. Several options are provided, such as fast shutdown and graceful shutdown, along with features such as rebooting with optional scripts. These interfaces have now been through a second review. New rationale has been developed, and several corrections have been identified. However, there are still a number of unresolved areas, and we will be working to correct these between meetings to allow the draft to be mock balloted fairly.

Sometimes a corrective action such as reconfiguration is needed to keep your system alive. The configuration space management API is intended to provide a portable method of traversing the configuration space and of manipulating the data content of nodes in that configuration space. This API will give a fault management application access to the underlying system configuration information and the means to direct reconfiguration of the system. In particular, the proposed set of APIs will allow a fault management application to keep track of the system configuration view and dynamically change the system's configuration. This is achieved by means of a variety of operations:

mount and unmount operations,
linking and unlinking operations,
operations to add nodes to the configuration description,
several operations to allow an application to access any part of the current description of the configuration picture.
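The operations listed above can be pictured as manipulations of a tree of configuration nodes. The C sketch below is one simple way to model that, under stated assumptions: the cfg_* names, the fixed-size name and data fields, and the child/sibling layout are all hypothetical, and the draft's real interfaces (modeled on File Tree Streams) are not reproduced here. It covers adding a node, unlinking a node, and traversing the current configuration picture; mount and unmount are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical configuration node; not the POSIX.1h draft data type. */
struct cfg_node {
    char name[32];
    char data[32];                 /* data content of the node */
    struct cfg_node *first_child;
    struct cfg_node *next_sibling;
};

/* Allocate a detached node; returns NULL if allocation fails. */
struct cfg_node *cfg_node_new(const char *name, const char *data)
{
    assert(name != NULL && data != NULL);
    struct cfg_node *n = calloc(1, sizeof *n);
    if (n == NULL)
        return NULL;
    snprintf(n->name, sizeof n->name, "%s", name);
    snprintf(n->data, sizeof n->data, "%s", data);
    return n;
}

/* Add a node to the configuration by appending it to parent's children. */
void cfg_add_child(struct cfg_node *parent, struct cfg_node *child)
{
    assert(parent != NULL && child != NULL);
    struct cfg_node **slot = &parent->first_child;
    while (*slot != NULL)
        slot = &(*slot)->next_sibling;
    *slot = child;
}

/* Unlink child from parent without freeing it.
 * Returns 0 if it was found, -1 otherwise. */
int cfg_unlink_child(struct cfg_node *parent, struct cfg_node *child)
{
    assert(parent != NULL && child != NULL);
    for (struct cfg_node **slot = &parent->first_child; *slot != NULL;
         slot = &(*slot)->next_sibling) {
        if (*slot == child) {
            *slot = child->next_sibling;
            child->next_sibling = NULL;
            return 0;
        }
    }
    return -1;
}

typedef void (*cfg_visit_fn)(const struct cfg_node *n, int depth, void *ctx);

/* Pre-order, depth-first traversal of the subtree rooted at n. */
void cfg_walk(const struct cfg_node *n, int depth, cfg_visit_fn visit,
              void *ctx)
{
    if (n == NULL)
        return;
    visit(n, depth, ctx);
    for (const struct cfg_node *c = n->first_child; c != NULL;
         c = c->next_sibling)
        cfg_walk(c, depth + 1, visit, ctx);
}

/* Example visitor: counts visited nodes through an int passed as ctx. */
void cfg_count_visit(const struct cfg_node *n, int depth, void *ctx)
{
    (void)n;
    (void)depth;
    ++*(int *)ctx;
}

/* Free a node and everything below it. */
void cfg_free(struct cfg_node *n)
{
    if (n == NULL)
        return;
    struct cfg_node *c = n->first_child;
    while (c != NULL) {
        struct cfg_node *next = c->next_sibling;
        cfg_free(c);
        c = next;
    }
    free(n);
}
```

In this model, a fault management application that detects a failing component could unlink its node, walk the remaining tree to confirm the new configuration view, and later add a replacement node back in.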

Texas Instruments has completed implementing and testing these APIs in a rapid prototype, its Dynamic Reconfiguration Demonstration system. The results of that prototype implementation were presented at the January 1997 meeting. The text was reviewed for consistency with File Tree Streams in POSIX.1a (on which many of these operations have been modeled), and some wording changes were made to the draft to enhance clarity. This API is ready for mock ballot.

The working group will continue to work on the document between meetings. The balloting directions, cover letter, etc. will be prepared by email so that the draft document can be distributed for mock ballot. When we get the responses back, IBM has offered to host an additional SRASS meeting to speed the draft's progress to formal ballot.

If you are interested in helping to produce standard APIs that support fault management (including serviceability and fault tolerance aspects of systems), just get in touch with Helmut Roth at hroth@nswc.navy.mil or Dr. Arun Chandra at achandra@vnet.ibm.com.

A mailing list is also available; send email to srass-request@pasc.org and ask to be included.