USENIX 2003 Annual Technical Conference, General Track Paper   
Multiprocessor Support for Event-Driven Programs

Nickolai Zeldovich (Stanford University), Alexander Yip, Frank Dabek, Robert T. Morris, David Mazières (New York University), Frans Kaashoek
nickolai@cs.stanford.edu, {yipal, fdabek, rtm, kaashoek}@lcs.mit.edu, dm@cs.nyu.edu
Abstract

This paper presents a new asynchronous programming library (libasync-smp) that allows event-driven applications to take advantage of multiprocessors by running code for event handlers in parallel. To control the concurrency between events, the programmer can specify a color for each event: events with the same color (the default case) are handled serially; events with different colors can be handled in parallel. The programmer can incrementally expose parallelism in existing event-driven applications by assigning different colors to computationally-intensive events that do not share mutable state. An evaluation of libasync-smp demonstrates that applications achieve multiprocessor speedup with little programming effort. As an example, parallelizing the cryptography in the SFS file server required about 90 lines of changed code in two modules, out of a total of about 12,000 lines. Multiple clients were able to read large cached files from the libasync-smp SFS server running on a 4-CPU machine 2.5 times as fast as from an unmodified uniprocessor SFS server on one CPU. Applications without computationally intensive tasks also benefit: an event-driven Web server achieves 1.5 speedup on four CPUs with multiple clients reading small cached files.
Introduction

To obtain high performance, servers must overlap computation with I/O. Programs typically achieve this overlap using threads or events. Threaded programs typically process each request in a separate thread; when one thread blocks waiting for I/O, other threads can run. Threads provide an intuitive programming model and can take advantage of multiprocessors; however, they require coordination of accesses by different threads to shared state, even on a uniprocessor. In contrast, event-based programs are structured as a collection of callback functions which a main loop calls as I/O events occur. Event-based programs execute callbacks serially, so the programmer need not worry about concurrency control; however, event-based programs until now have been unable to take full advantage of multiprocessors without running multiple copies of an application or introducing fine-grained synchronization.

The contribution of this paper is libasync-smp, a library that supports event-driven programs on multiprocessors. libasync-smp is intended to support the construction of user-level systems programs, particularly network servers and clients; we show that these applications can achieve performance gains on multiprocessors by exploiting coarse-grained parallelism. libasync-smp is intended for programs that have natural opportunities for parallel speedup; it has no support for expressing very fine-grained parallelism. The goal of libasync-smp's concurrency control mechanisms is to provide enough concurrency to extract parallel speedup without requiring the programmer to reason about the correctness of a fine-grained parallel program.

Much of the effort required to make existing event-driven programs take advantage of multiprocessors is in specifying which events may be handled in parallel. libasync-smp provides a simple mechanism to allow the programmer to incrementally add parallelism to uniprocessor applications as an optimization. This mechanism allows the programmer to assign a color to each callback. Callbacks with different colors can execute in parallel. Callbacks with the same color execute serially. By default, libasync-smp assigns all callbacks the same color, so existing programs continue to work correctly without modification. As programmers discover opportunities to safely execute callbacks in parallel, they can assign different colors to those callbacks.

libasync-smp is based on the libasync library [16]. libasync uses operating system asynchronous I/O facilities to support event-based programs on uniprocessors. The modifications for libasync-smp include coordinating access to the shared internal state of a few libasync modules, adding support for colors, and scheduling callbacks on multiple CPUs.

An evaluation of libasync-smp demonstrates that applications achieve multiprocessor speedup with little programming effort. As an example, we modified the SFS [17] file server to use libasync-smp. This server uses more than 260 distinct callbacks. Most of the CPU time is spent in just two callbacks, those responsible for encrypting and decrypting client traffic; this meant that coloring just a few callbacks was sufficient to gain substantial parallel speedup. The changes affected 90 lines in two modules, out of a total of about 12,000 lines. When run on a machine with four Intel Xeon CPUs, the modified SFS server was able to serve large cached files to multiple clients 2.5 times as fast as an unmodified uniprocessor SFS server on one CPU.
Even servers without CPU-intensive operations such as cryptography can achieve speedup approaching that offered by the operating system, especially if the O/S kernel can take advantage of a multiprocessor. For example, with a workload of multiple clients reading small cached files, an event-driven web server achieves a 1.5 speedup on four CPUs.

The next section (Section 2) introduces libasync, on which libasync-smp is based, and describes its support for uniprocessor event-driven programs. Sections 3 and 4 describe the design and implementation of libasync-smp, and show examples of how applications use it. Section 5 uses two examples to show that libasync-smp requires little effort to achieve parallel speedup. Section 6 discusses related work, and Section 7 concludes.
main() {
    // listen on TCP port 80
    int afd = inetsocket(SOCK_STREAM, 80);
    // register callback for new connections
    fdcb(afd, READ, wrap(accept_cb, afd));
    amain(); // start main loop
}

// called when a new connection arrives
accept_cb(int afd) {
    int fd = accept(afd, ...);
    str inBuf(""); // new ref-counted buffer
    // register callback for incoming data
    fdcb(fd, READ, wrap(req_cb, fd, inBuf));
}

// called when data arrives
req_cb(int fd, str inBuf) {
    read(fd, buf, ...);
    append input to inBuf;
    if (complete request in inBuf) {
        // un-register callback
        fdcb(fd, READ, NULL);
        // parse the HTTP request
        parse_request(inBuf, serverName, file);
        // resolve serverName and connect;
        // both are asynchronous
        tcpconnect(serverName, 80, wrap(connect_cb, fd, file));
    } else {
        // do nothing; wait for more calls to req_cb()
    }
}

// called when we have connected to the server
connect_cb(int client_fd, str file, int server_fd) {
    // write the request when the socket is ready
    fdcb(server_fd, WRITE, wrap(write_cb, file, server_fd));
}

Figure 1: Outline of a web proxy that uses libasync.
Figure 1 shows an abbreviated fragment of a program written using libasync. The purpose of the application is to act as a web proxy. The example code accepts TCP connections, reads an HTTP request from each new connection, extracts the server name from the request, connects to the indicated server, etc. One way to view the example code is that it is the result of writing a single sequential function with all these steps, and then splitting it into callbacks at each point that the function would block for input.
main() calls inetsocket() to create a socket that listens for new connections on TCP port 80. UNIX makes such a socket appear readable when new connections arrive, so main() calls the libasync function fdcb() to register a read callback. Finally main() calls amain() to enter the libasync main loop.
The libasync main loop will call the callback wrap with no arguments when a new connection arrives on afd. The wrap calls accept_cb() with the other arguments passed to wrap(), in this case the file descriptor afd. After allocating a buffer in which to accumulate client input, accept_cb() registers a callback to req_cb() to read input from the new connection. The server keeps track of its state for the connection, which consists of the file descriptor and the buffer, by including it in each wrap() call and thus passing it from one callback to the next. If multiple clients connect to the proxy, the result will be multiple callbacks waiting for input from the client connections.
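The wrap() mechanism can be thought of as currying: it packages a function together with some arguments supplied now, yielding a callback object that receives any remaining arguments later. A rough analogy in modern C++ follows; this is our illustration only, and libasync implements wrap() differently, with its own callback objects.

    // Analogy only: what a wrap()-style callback conceptually does.
    #include <cstdio>
    #include <functional>

    // Hypothetical handler; its argument is bound at "wrap" time.
    static void accept_cb(int afd) {
        std::printf("new connection on fd %d\n", afd);
    }

    int main() {
        int afd = 7; // placeholder for a listening socket descriptor

        // Analogous to wrap(accept_cb, afd): capture afd now, call later.
        std::function<void()> cb = [afd]() { accept_cb(afd); };

        // The event loop would invoke the stored callback when afd is readable.
        cb();
        return 0;
    }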
When a complete request has arrived, the proxy server needs to look up the target web server's DNS host name and connect to it. The function tcpconnect() performs both of these tasks. The DNS lookup itself involves waiting for a response from a DNS server, perhaps more than one in the case of timeouts; thus the libasync DNS resolver is internally structured as a set of callbacks. Waiting for TCP connection establishment to complete also involves callbacks. For these reasons, tcpconnect() takes a wrap as one of its arguments, carries that wrap along in its own callbacks, and finally calls the wrap when the connection process completes or fails. This style of programming is reminiscent of the continuation-passing style [21], and makes it easy for programmers to compose modules.
Name  | #Wraps | Lines of Code
SFS   | 229    | 39871
SFSRO | 58     | 4836
Chord | 65     | 5445
CFS   | 87     | 4960

Figure 2: Some applications based on libasync, with the number of distinct wrap() calls and the lines of code in each.
A number of applications are based on libasync; Figure 2 lists some of them, along with the number of distinct calls to wrap() in each program. These numbers give a feel for the level of complexity in the programs' use of callbacks.
A single event-driven process derives no direct benefit from a multi-processor. There may be an indirect speedup if the operating system or helper processes can make use of the multiprocessor's other CPUs.
It is common practice to run multiple independent copies of an event-driven program on a multiprocessor. This N-copy approach might work in the case of a web server, since the processing of different client requests can be made independent. The N-copy approach does not work if the program maintains mutable state that is shared among multiple clients or requests. For example, a user-level file server might maintain a table of leases for client cache consistency. In other cases, running multiple independent copies of a server may lead to a decrease in efficiency. A web proxy might maintain a cache of recently accessed pages: multiple copies of the proxy could maintain independent caches, but content duplicated in these caches would waste memory.
The focus of this paper is libasync-smp, a multiprocessor extension of libasync. The goal of libasync-smp is to execute event-driven programs faster by running callbacks on multiple CPUs. Much of the design of libasync-smp is motivated by the desire to make it easy to adapt existing libasync-based servers to multiprocessors. The goal of the libasync-smp design is to allow both the parallelism of the N-copy arrangement and the advantages of shared data structures.
A server based on libasync-smp consists of a single process containing one worker thread per available CPU. Each thread repeatedly chooses a callback from a set of runnable callbacks and runs it. The threads share an address space, file descriptors, and signals. The library assumes that the number of CPUs available to the process is static over its running time. A mechanism such as scheduler activations [2] could be used to dynamically determine the number of available CPUs.
There are a number of design challenges to making the single address space approach work, the most interesting of which is coordination of access to application data shared by multiple callbacks. An effective concurrency control mechanism should allow the programmer to easily (and incrementally) identify which parts of a server can safely be run in parallel.
Figure 3: The single-process event-driven architecture (left) and the libasync-smp architecture (right). Note that in the libasync-smp architecture callbacks of the same color appear in the same queue. This guarantees that callbacks with the same color are never run in parallel and always run in the order in which they were scheduled.
The design of the concurrency control mechanisms in libasync-smp is motivated by two observations. First, system software often has natural coarse-grained parallelism, because different requests don't interact or because each request passes through a sequence of independent processing stages. Second, existing event-driven programs are already structured as non-blocking units of execution (callbacks), often associated with one stage of the processing for a particular client. Together, these observations suggest that individual callbacks are an appropriate unit of coordination of execution.
libasync-smp associates a color with each registered callback, and ensures that no two callbacks with the same color execute in parallel. Colors are arbitrary 32-bit values. Application code can optionally specify a color for each callback it creates; if it specifies no color, the callback has color zero. Thus, by default, callbacks execute sequentially on a single CPU. This means that unmodified event-driven applications written for libasync will execute correctly with libasync-smp.
The orthogonality of color to the callback's code eases the adaptation of existing libasync-based servers. A typical arrangement is to run the code that accepts new client connections in the default color. If the processing for different connections is largely independent, the programmer assigns each new connection a new unique color that applies to all the callbacks involved in processing that connection. If a particular stage in request processing shares mutable data among requests (e.g. a cache of web pages), the programmer chooses a color for that stage and applies it to all callbacks that use the shared data, regardless of which connection the callback is associated with.
In some cases, application code may need to be re-structured to permit callbacks to be parallelized. For example, a single callback might use shared data but also have significant computation that does not use shared data. It may help to split such a callback; the first half would use a special libasync-smp call (cpucb()) to schedule the second half with a different color.
The color mechanism is less expressive than locking; for example, a callback can have only one color, which is equivalent to holding a single lock for the complete duration of a callback. However, experience suggests that fine-grained and sophisticated locking, while it may be necessary for correctness with concurrent threads, is rarely necessary to achieve reasonable speedup on multiple CPUs for server applications. Parallel speedup usually comes from the parts of the code that don't need much locking; coloring allows this speedup to be easily captured, and also makes it easy to port existing event-driven code to multiprocessors.
The API that libasync-smp presents differs slightly from that exposed by libasync. The cwrap() function is analogous to the wrap() function described in Section 2 but takes an optional color argument; Table 1 shows the cwrap() interface. The color specified at the callback's creation (i.e. when cwrap() is called) dictates the color it will be executed under. Embedding color information in the callback object rather than in an argument to fdcb() (and other calls which register callbacks) allows the programmer to write modular functions which accept callbacks and remain agnostic to the color under which those callbacks will be executed. Note that colors are not inherited by new callbacks created inside a callback running under a non-zero color. While color inheritance might seem convenient, it makes it very difficult to write modular code as colors ``leak'' into modules which assume that callbacks they create carry color zero.

Since colors are arbitrary 32-bit values, programmers have considerable latitude in how to assign colors. One reasonable convention is to use each request's file descriptor number as the color for its parallelizable callbacks. Another possibility is to use the address of a data structure to which access must be serialized; for example, a per-client or per-request state structure. Depending on the convention, it could be the case that unrelated modules accidentally choose the same color. This might reduce performance, but not correctness.

libasync-smp provides a cpucb() function that schedules a callback for execution as soon as a CPU is idle. The cpucb() function can be used to register a callback with a color different from that of the currently executing callback. A common use of cpucb() is to split a CPU-intensive callback into two callbacks with different colors, one to perform a computation and the other to synchronize with shared state. To minimize programming errors associated with splitting an existing callback into a chain of cpucb() callbacks, libasync-smp guarantees that all CPU callbacks of the same color will be executed in the order they were scheduled. This maintains assumptions about sequential execution that the original single callback may have been relying on. Execution order isn't defined for callbacks with different colors.
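As a concrete illustration of the color-assignment conventions described above, the two choices might look as follows in the style of Figure 1; the exact cwrap() argument order shown here (color as the trailing argument) is an assumption, since Table 1 is not reproduced in this text.

    // Sketch only: assumes cwrap() takes the color as its final argument.

    // Convention 1: use the connection's file descriptor as the color, so
    // callbacks for different connections may run in parallel while those
    // for the same connection stay serialized.
    fdcb(fd, READ, cwrap(req_cb, fd, inBuf, fd));

    // Convention 2: use the address of the shared structure whose access
    // must be serialized (e.g. a per-client state object) as the color.
    fdcb(fd, READ, cwrap(req_cb, fd, inBuf, (u_int32_t) client_state));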
Example

Consider the web proxy example from Section 2. For illustrative purposes assume that the parse_request() routine uses a large amount of CPU time and does not depend on any shared data. We could rewrite req_cb() to parse different requests in parallel on different CPUs by calling cpucb() and assigning the callback a unique color. Figure 4 shows this change to req_cb(). In this example only the parse_request() workload is distributed across CPUs. As a further optimization, reading requests could be parallelized by creating the read request callback using cwrap() and specifying the request's file descriptor as the callback's color.
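A sketch of what such a restructured req_cb() might look like, in the style of Figure 1; the cpucb() calling convention and the new_unique_color() helper shown here are assumptions for illustration, not the code from Figure 4.

    // Sketch only: cpucb() signature and new_unique_color() are assumed.
    req_cb(int fd, str inBuf) {
        read(fd, buf, ...);
        append input to inBuf;
        if (complete request in inBuf) {
            fdcb(fd, READ, NULL);
            // parse this request on any idle CPU, under its own color,
            // so parses for different requests can proceed in parallel
            cpucb(cwrap(parse_cb, fd, inBuf, new_unique_color()));
        }
    }

    // runs under a unique color; touches no shared mutable state
    parse_cb(int fd, str inBuf) {
        parse_request(inBuf, serverName, file);
        // callbacks created here carry color zero, since colors are
        // not inherited
        tcpconnect(serverName, 80, wrap(connect_cb, fd, file));
    }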
Figure 6: The sequence of callbacks executed when the libasync-smp web server handles a request for a page not in the cache. Nodes represent callbacks; arrows indicate that the node at the source scheduled the callback represented by the node at the tip. Nodes on the same vertical line are run under distinct colors (and thus potentially in parallel). The stacked circles in the ``Check page cache'' stage indicate that a small number of threads (fewer than the number of concurrent requests) can access the cache simultaneously. Labels at the top of the figure describe each step of the processing.
To demonstrate that the web server can take advantage of multiprocessor hardware, we tested the performance of the parallelized web server on a cache-based workload while varying the number of CPUs available to the server. The workload consisted of 720 files whose sizes were distributed according to the SPECweb99 benchmark [20]; the total size of the data set was 100 MB, which fits completely into the server's in-memory page cache. Four machines simulated a total of 800 concurrent clients. A single instance of the load generation client is capable of reading over 20 MB/s from the web server. Each client made 10 requests over a persistent connection before closing the connection and opening a new one. The servers were started with cold caches and run for 4 minutes under load. The server's throughput was then measured for 60 seconds, to capture its behavior in the steady state.
Figure 7 shows the performance (in terms of total throughput) with different numbers of CPUs for the libasync-smp web server. Even though the HTTP server has no particularly processor-intensive operations, we still observe noticeable speedup on a multiprocessor system: the server's throughput is 1.28 times its single-CPU throughput on two CPUs, and 1.5 times on four CPUs.
Figure 7: The performance of the libasync-smp web server serving a cached workload on different numbers of CPUs, relative to its performance on one CPU (light bars). The performance of N copies of a libasync web server is also shown, relative to the libasync server's performance on one CPU (dark bars).
To provide an upper bound for the multiprocessor speedup we can expect from the libasync-smp-based web server we contrast its performance with N independent copies of a single process version of the web server (where N is the number of CPUs provided to the libasync-smp-based server). This single process version is based on an unmodified version of libasync and thus does not suffer the overhead associated with the libasync-smp library (callback queue locking, etc). Each copy of the N-copy server listens for client connections on a different TCP port number.
The speedup obtained by the libasync-smp server is well below the speedup obtained by N copies of the libasync server. Even on a single CPU, the libasync based server achieved higher throughput than the libasync-smp server. The throughput of the libasync server was 35.4 MB/s while the libasync-smp server's throughput was 30.4 MB/s.
Profiling the single-CPU case explains the base penalty that libasync-smp incurs. While running the libasync-smp web server under load, roughly 35% of the CPU time is spent at user level, in libasync-smp and the web server. At least 37% of that user-level time goes to tasks needed only by libasync-smp: atomic reference counting accounts for 26% of user-level CPU time, and task accounting (such as enqueuing and dequeuing tasks) for another 11%. Together, atomic reference counting and task management therefore consume about 13% of total CPU time (37% of 35%), which explains the libasync-smp web server's decreased single-CPU performance.
The reduced performance of the libasync-smp server is partly due to the fact that many of the libasync-smp server's operations must be serialized, such as accepting connections and checking caches. In the N-copy case, all of these operations run in parallel. In addition, locking overhead penalizes the libasync-smp server: some data is necessarily shared across threads and must be protected by expensive atomic operations, although the server has been written to minimize such sharing.
Because the N-copy server can perform all of these operations in parallel and, in addition, extract additional parallelism from the operating system which locks some structures on a per-process basis, the performance of the N-copy server represents a true upper bound for any architecture which operates in a single address space.
Figure 8: The performance of several web servers on multiprocessor hardware. Shown are the throughput of the libasync-smp based server (light bars), Apache 2.0.36 (dark bars), and Flash (black bars) on 1, 2, 3, and 4 processors.
To provide a more realistic performance goal than the N-copy server, we compared the libasync-smp server with two commonly used HTTP servers. Figure 8 shows the performance of Apache 2.0.36 (in both multithreaded and multiprocess mode) and Flash v0.1_990914 on different numbers of processors. Apache in multiprocess mode was configured to run with 32 servers. Apache-MT is a multithreaded version of the Apache server. It creates a single heavyweight process and 32 kernel threads within that process by calling clone(). The number of processes and threads used by the Apache servers was chosen to maximize throughput for the benchmarks presented here. Flash is an event-driven server; when run on multiprocessors it forks to create N independent copies, where N is the number of available CPUs.
The performance of the libasync-smp HTTP server is comparable to the performance of these servers: the libasync-smp server shows better absolute performance than both versions of the Apache server and slightly lower performance than N-copies of the Flash server.
These servers show better speedup than the libasync-smp server: Flash achieves a 1.68 speedup on four CPUs, while the libasync-smp server achieves 1.5. Because Flash runs four heavyweight processes, it is able to take advantage of many of the benefits of the N-copy approach; as a result its speedup and absolute performance are greater than those of the libasync-smp server. Although this approach is workable for a web server, such replication is not possible in applications that must coordinate shared state.
Like the libasync-smp server, Flash and multiprocess Apache do not show the same performance achieved by the N-copy server. Although these servers fully parallelize access to their caches and do not perform locking internally, they do exhibit some shared state. For instance, the servers must serialize access to the accept() system call since all requests arrive on a single TCP port.
The main reason to parallelize a web server is to increase its performance under heavy load. A key part of the ability to handle heavy load is stability: non-decreasing performance as the load increases past the server's point of peak performance. To explore whether servers based on libasync-smp can provide stable performance, we measured the web server's throughput with varying numbers of simultaneous clients. Each client selects a file according to the SPECweb99 distribution; the files all fit in the server's cache. The server uses all four CPUs. Figure 9 shows the results. The event-driven HTTP server offers consistent performance over a wide variety of loads.
Figure 9: The performance of the web server on a cached workload as the number of concurrent clients is varied.
To evaluate the performance of libasync-smp on existing libasync programs, we modified the SFS file server [17] to take advantage of a multiprocessor system.
The SFS server is a single user-level process. Clients communicate with it over persistent TCP connections. All communication is encrypted using a symmetric stream cipher, and authenticated with a keyed cryptographic hash. Clients send requests using an NFS-like protocol. The server process maintains significant mutable per-file-system state, such as lease records for client cache consistency. The server performs non-blocking disk I/O by sending NFS requests to the local kernel NFS server. Because of the encryption, the SFS server is compute-bound under some heavy workloads and therefore we expect that by using libasync-smp we can extract significant multiprocessor speedup.
We used the pct[5] statistical profiler to locate performance bottlenecks in the original SFS file server code. Encryption appeared to be an obvious target, using 75% of CPU time. We modified the server so that encryption operations for different clients executed in parallel and independently of the rest of the code. The resulting parallel SFS server spent about 65% of its time in encryption. The reduction from 75% is due to the time spent coordinating access to shared mutable data structures inside libasync-smp, as well as to additional memory-copy operations that allow for parallel execution of encryption.
The modifications to the SFS server are concentrated in the code that encrypts, decrypts, and authenticates data sent to and received from the clients. We split the main send callback function into three smaller callbacks. The first and last remain synchronized with the rest of the server code (i.e., have the default color), and copy data to be transmitted into and out of a per-client buffer. The second callback encrypts the data in the client buffer, and runs in parallel with other callbacks (i.e., has a different color for each client). This involved modifying about 40 lines of code in a single callback, largely having to do with variable name changes and data copying.
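A sketch of the resulting send-path structure, in the style of Figure 1; the callback and buffer names here are invented for illustration, and the real SFS code differs.

    // Sketch only: three-callback send path as described above.
    send_cb(client cl, str payload) {            // default color (zero)
        copy payload into cl.sendbuf;
        cpucb(cwrap(encrypt_cb, cl, cl.color));  // per-client color
    }

    encrypt_cb(client cl) {                      // runs in parallel per client
        encrypt and authenticate cl.sendbuf in place;
        cpucb(wrap(write_cb, cl));               // back to the default color
    }

    write_cb(client cl) {                        // default color (zero)
        copy cl.sendbuf to the output buffer and transmit;
    }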
Parallelization of the SFS server's receive code was slightly more complex because more code interacts with it. About 50 lines of code from four different callbacks were modified, splitting each callback into two. The first of these two callbacks received and decrypted data in parallel with other callbacks (i.e., with a different color for every client), and used cpucb() to execute the second callback. The second callback remained synchronized with the rest of the server code (i.e., had the default color), and performed the actual processing of the decrypted data.
We measured the total throughput of the file server to all clients, in bits per second, when multiple clients read a 200 MByte file whose contents remained in the server's disk buffer cache. We repeated this experiment for different numbers of processors. This test reflects how SFS is used in practice: an SFS client machine sends all of its requests over a single TCP connection to the server.
Figure 10: Performance of the SFS file server using different numbers of CPUs, relative to the performance on one CPU. The light bars indicate the performance of the server using libasync-smp; dark bars indicate the performance of N separate copies of the original server. Each bar represents the average of three runs; the variation from run to run was not significant.
The bars labeled ``libasync-smp'' in Figure 10 show the performance of the parallelized SFS server on the throughput test. On a single CPU, the parallelized server achieves 96 percent of the throughput of the original uniprocessor server. The parallelized server is 1.66, 2.20, and 2.5 times as fast as the original uniprocessor server on two, three and four CPUs, respectively.
Because only 65% of the cycles (just encryption) have been parallelized, the remaining 35% creates a bottleneck. In particular, when the remaining 35% of the code runs continuously on one processor, we can achieve a maximum utilization of 1/0.35 ≈ 2.9 processors. This number is close to the maximum speedup (2.5) of the parallelized server. Further parallelization of the SFS server code would allow it to incrementally take advantage of more processors.
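The bound above follows from a simple throughput argument; restated below, with s denoting the serial fraction (a symbol of our choosing, not the paper's).

    % A fraction s = 0.35 of each request's CPU work is serialized on a
    % single processor. That processor can sustain at most 1/s requests'
    % worth of serial work per unit time, so aggregate throughput relative
    % to one CPU is bounded by
    \[
      \text{speedup} \;\le\; \frac{1}{s} \;=\; \frac{1}{0.35} \;\approx\; 2.9,
    \]
    % regardless of how many CPUs handle the remaining parallel 65%.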
To explore the performance limits imposed by the hardware and operating system, we also measured the total performance of multiple independent copies of the original libasync SFS server code, as many separate processes as CPUs. In practice, such a configuration would not work unless each server were serving a distinct file system. An SFS server maintains mutable per-file-system state, such as attribute leases, that would require shared memory and synchronization among the server processes. This test thus gives an upper bound on the performance that SFS with libasync-smp could achieve.
The results of this test are labeled ``N-copy'' in Figure 10. The SFS server with libasync-smp roughly follows the aggregate performance of multiple independent server copies. The performance difference between the libasync-smp-based SFS server and the N-copy server is due to the penalty incurred due to shared state maintained by the server, such as file lease data and user ID mapping tables.
Despite comparatively modest changes to the SFS server to expose parallelism, the server's parallel performance was close to the maximum speedup offered by the underlying operating system (as measured by the speedup obtained by multiple copies of the server).
Table 2 shows how much the use of per-thread work queues improves performance. The numbers in the table indicate how fast a synthetic benchmark executes tasks. The benchmark program creates 16 callbacks with unique colors. Each callback performs a small amount of computation, and then registers a child callback of the same color. The benchmark intentionally assigns colors so that all but one of the task queues are populated, in order to explore the effects of work stealing. The benchmark was run with four CPUs.
The first line shows the task rate with a single task queue shared among all the worker threads. The second line shows the task completion rate when using per-thread task queues; the rate is dramatically higher due to better cache locality and because there is no contention for the task-queue locks. The third line shows the task completion rate when per-thread task free-lists are used in addition to per-thread queues. The fourth configuration adds work stealing between worker threads. Without work stealing, tasks were never run on one of the four CPUs. Work stealing allows the worker thread on that CPU to find work, at the expense of increased contention for the other threads' task queues.
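The design that the table evaluates can be illustrated with a small self-contained sketch of per-thread queues with work stealing. This is our simplified illustration in modern C++, without colors or free-lists; it is not libasync-smp's implementation.

    // Each worker pops from its own queue and, when that queue is empty,
    // steals from the back of another worker's queue.
    #include <atomic>
    #include <cstdio>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    struct Worker {
        std::deque<std::function<void()>> q;
        std::mutex m;
    };

    static bool pop_or_steal(std::vector<Worker>& ws, size_t self,
                             std::function<void()>& task) {
        { // try our own queue first
            std::lock_guard<std::mutex> lk(ws[self].m);
            if (!ws[self].q.empty()) {
                task = std::move(ws[self].q.front());
                ws[self].q.pop_front();
                return true;
            }
        }
        // otherwise scan the other workers' queues and steal from the back
        for (size_t i = 0; i < ws.size(); i++) {
            if (i == self) continue;
            std::lock_guard<std::mutex> lk(ws[i].m);
            if (!ws[i].q.empty()) {
                task = std::move(ws[i].q.back());
                ws[i].q.pop_back();
                return true;
            }
        }
        return false;
    }

    int main() {
        const size_t ncpu = 4;
        std::vector<Worker> workers(ncpu);
        std::atomic<int> done{0};

        // Populate only one queue, mimicking the benchmark's skewed coloring:
        // without stealing, three of the four workers would stay idle.
        for (int i = 0; i < 1000; i++)
            workers[0].q.push_back([&done] { done++; });

        std::vector<std::thread> threads;
        for (size_t self = 0; self < ncpu; self++) {
            threads.emplace_back([&workers, self] {
                std::function<void()> task;
                while (pop_or_steal(workers, self, task))
                    task();
            });
        }
        for (auto& t : threads) t.join();
        std::printf("completed %d tasks\n", done.load());
        return 0;
    }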
There is a large body of work exploring the relative merits of thread-based I/O concurrency and the event-driven architecture [18,11,12,15,1]. This paper does not attempt to argue that either is superior. Instead, we present a technique which improves the performance of the event-driven model on multiprocessors. The work described below also considers performance of event-driven software.
Pai et al. characterized approaches to achieving concurrency in network servers in [19]. They evaluate a number of architectures: multi-process, multi-threaded, single-process event-driven, and asymmetric multi-process event-driven (AMPED). In this taxonomy, libasync-smp could be characterized as symmetric multi-threaded event-driven; its main difference from AMPED is that its goal is to increase CPU concurrency rather than I/O concurrency.
Like libasync-smp, the AMPED architecture introduces limited concurrency into an event-driven system. Under the AMPED architecture, a small number of helper processes are used to handle file I/O, to overcome the lack of non-blocking support for file I/O in most operating systems. In contrast, libasync-smp uses additional execution contexts to execute callbacks in parallel. libasync-smp achieves greater CPU concurrency on multiprocessors than the AMPED architecture but places greater demands on the programmer to control concurrency. Like the AMPED-based Flash web server, libasync-smp must also cope with the lack of non-blocking file I/O: libasync-smp uses an NFS-loopback server to access files asynchronously. This allows libasync-smp to use non-blocking local RPC requests rather than blocking system calls.
The Apache web server serves concurrent requests with a pool of independent processes, one per active request [3]. This approach provides both I/O and CPU concurrency. Apache processes cannot easily share mutable state such as a page cache.
The staged, event-driven architecture (SEDA) is a structuring technique for high-performance servers [24]. It divides request processing into a series of well-defined stages, connected by queues of requests. Within each stage, one or more threads dequeue requests from input queue(s), perform that stage's processing, and enqueue the requests for subsequent stages. A thread can block (to wait for disk I/O, for example), so a stage often contains multiple threads in order to achieve I/O concurrency.
SEDA can take advantage of multiprocessors, since a SEDA server may contain many concurrent threads. One of SEDA's primary goals is to dynamically manage the number of threads in each stage in order to achieve good I/O and CPU concurrency but avoid unstable behavior under overload. Both libasync-smp and SEDA use a mixture of events and concurrent threads; from a programmer's perspective, SEDA exposes more thread-based concurrency which the programmer may need to synchronize, while libasync-smp tries to preserve the serial callback execution model.
Cohort scheduling organizes threaded computation into stages in order to increase performance by increasing cache locality, reducing TLB pressure, and reducing branch mispredictions [14]. The staged computation model used by cohort scheduling is more general than the colored callback model presented here. However, the partitioned stage scheduling policy is somewhat analogous to coloring callbacks for parallel execution (the key corresponds to a callback color). Like SEDA, cohort scheduling exposes more thread-based concurrency to the programmer. Cohort scheduling can also take advantage of multiprocessor hardware.
This paper describes a library that allows event-driven programs to take advantage of multiprocessors with a minimum of programming effort. When high loads make multiple events available for processing, the library can execute event handler callbacks on multiple CPUs. To control the concurrency between events, the programmer can specify a color for each event: events with the same color (the default case) are handled serially; events with different colors can be handled in parallel. The programmer can incrementally expose parallelism in existing event-driven applications by assigning different colors to computationally-intensive events that don't share mutable state.
Experience with libasync-smp demonstrates that applications can achieve multi-processor speedup with little programming effort. Parallelizing the cryptography in the SFS file server required about 90 lines of changed code in two modules, out of a total of about 12,000 lines. Multiple clients were able to read large cached files from the libasync-smp SFS server running on a 4-CPU machine 2.5 times as fast as from an unmodified uniprocessor SFS server on one CPU. Applications without computationally intensive tasks also benefit: an event-driven Web server achieves 1.5 speedup on four CPUs with multiple clients reading small cached files relative to its performance on one CPU.
We are grateful to the many people who aided this work. The NMS group at LCS provided (on short notice) the SMP server used to test an early version of this software. The anonymous reviewers and our shepherd, Edouard Bugnion, provided helpful feedback.
This research was sponsored by the Defense Advanced Research Projects Agency (DARPA) and the Space and Naval Warfare Systems Center, San Diego, under contract N66001-00-1-8927.
This paper was originally published in the Proceedings of the USENIX Annual Technical Conference (General Track), June 9-14, 2003, San Antonio, TX, USA.