USENIX Technical Program - Abstract - Windows NT Symposium 99
Efficient User-Level Thread Migration and Checkpointing on Windows NT Clusters
Hazim Abdel-Shafi, Evan Speight, and John K. Bennett, Department of Electrical and Computer Engineering, Rice University
Abstract
Clusters of industry-standard multiprocessors are emerging as a
competitive alternative for large-scale parallel computing. However,
these systems have several disadvantages compared with large-scale
multiprocessors, including complex thread scheduling and increased
susceptibility to failure. This paper describes the design and
implementation of two user-level mechanisms in the Brazos parallel
programming environment that address these issues on clusters of
multiprocessors running Windows NT: thread migration and
checkpointing. These mechanisms offer several benefits: (1) The
ability to tolerate the failure of multiple computing nodes with
minimal runtime overhead and short recovery time. (2) The ability to
add and remove computing nodes while applications continue to run,
simplifying scheduled maintenance operations and facilitating load
balancing. (3) The ability to tolerate power failures by performing a
checkpoint before shutdown or by migrating computation threads to
other stable nodes. Brazos is a distributed system that supports both
shared-memory and message-passing parallel programming paradigms on
networks of Intel x86-based multiprocessors running Windows NT. Thread
migration in Brazos is an order of magnitude faster than previously
reported Windows NT implementations and is competitive with
implementations on other operating systems. The
checkpoint facility exhibits low runtime overhead and fast recovery
time.