Adaptive and Reliable Parallel Computing
on Networks of Workstations
Robert D. Blumofe, University of Texas, and Philip A. Lisiecki, MIT
In this paper, we present the design of Cilk-NOW, a runtime system
that adaptively and reliably executes functional Cilk programs in
parallel on a network of UNIX workstations. Cilk (pronounced "silk")
is a parallel multithreaded extension of the C language, and all Cilk runtime
systems employ a provably efficient thread-scheduling algorithm. Cilk-NOW
is one such runtime system, and in addition, it automatically delivers
adaptive and reliable execution for a functional subset of Cilk programs.
By adaptive execution, we mean that each Cilk program dynamically utilizes
a changing set of otherwise-idle workstations. By reliable execution, we
mean that the Cilk-NOW system as a whole and each executing Cilk program
are able to tolerate machine and network faults. Cilk-NOW provides these
features while programs remain fault oblivious, meaning that Cilk
programmers need not code for fault tolerance. Throughout this paper, we
focus on end-to-end design decisions, and we show how these decisions allow
the design to exploit high-level algorithmic properties of the Cilk programming
model in order to simplify and streamline the implementation.