A Transparent Checkpoint Facility On NT
Johny Srouji, Paul Schuster, Maury Bach, and Yulik Kuzmin
Intel Corporation
Abstract
With the increased use of networks of NT workstations for long-running engineering applications, process checkpointing and process migration can avoid wasted computer cycles and improve system utilization. The problem we solve is how to capture and reconstruct process state transparently and efficiently without affecting the correctness of the application.
A checkpoint facility enables the intermediate state of a process to be saved to a file. Users can later resume execution of the process from the checkpoint file. This prevents the loss of data generated by long-running processes due to program or system failures, and it also facilitates debugging when the bug appears after the program has executed for a long time.
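The save-and-resume cycle described above can be illustrated with a minimal sketch. It is not the library this paper presents: it checkpoints explicitly at the application level, whereas the facility described here is transparent to the program. The names `save_checkpoint`, `load_checkpoint`, and `run`, the `fail_at` parameter, and the use of Python's `pickle` are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint file exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every=100, fail_at=None):
    """Sum 0..total_steps-1, resuming from the checkpoint file if one is present.

    fail_at is a hypothetical hook that simulates a program or system failure.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a run fails partway through, calling `run` again with the same path resumes from the last saved step rather than from zero, so only the work done since the most recent checkpoint is repeated.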
This paper describes the implementation of a checkpoint library that permits users to save the intermediate state of long-running multi-threaded programs on a Windows NT system and to resume execution from the checkpointed state at a later time. Our implementation is the first that we are aware of for this operating system. It is portable, maintains good performance, and is transparent.
The checkpoint facility is currently used in several major internal projects at Intel.