We begin by reviewing the design and implementation of the select() API. The system call is declared as:
    int select(int nfds, fd_set *readfds, fd_set *writefds,
               fd_set *exceptfds, struct timeval *timeout);

An fd_set is simply a bitmap; the maximum size (in bits) of these bitmaps is the largest legal file descriptor value, which is a system-specific parameter. The readfds, writefds, and exceptfds are in-out arguments, respectively corresponding to the sets of file descriptors that are "interesting" for reading, writing, and exceptional conditions. A given file descriptor might be in more than one of these sets. The nfds argument gives the number of descriptors to check, i.e., one more than the largest bitmap index actually used. The timeout argument controls whether, and how soon, select() will return if no file descriptors become ready.
Before calling select(), the application builds one or more of the readfds, writefds, or exceptfds bitmaps, by setting the bits corresponding to the set of interesting file descriptors. On its return, select() overwrites these bitmaps with new values, corresponding to subsets of the input sets, indicating which file descriptors are available for I/O. A member of the readfds set is considered readable if any input data is available; a member of writefds is considered writable if the available buffer space exceeds a system-specific threshold (usually 2048 bytes, for TCP sockets). The application then scans the result bitmaps to discover the readable or writable file descriptors, and normally invokes handlers for those descriptors.
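To make this calling convention concrete, the following is a minimal sketch of the pattern (even simpler than the example in Figure 2), assuming two already-open sockets and a caller-supplied buffer:

    #include <sys/select.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* sock1 and sock2 are assumed to be already-open descriptors. */
    int wait_and_read(int sock1, int sock2, char *buf, size_t len)
    {
        fd_set readfds;
        struct timeval tv = { 5, 0 };            /* give up after 5 seconds */
        int maxfd = (sock1 > sock2) ? sock1 : sock2;

        FD_ZERO(&readfds);                       /* build the input bitmap ...      */
        FD_SET(sock1, &readfds);                 /* ... marking each interesting fd */
        FD_SET(sock2, &readfds);

        if (select(maxfd + 1, &readfds, NULL, NULL, &tv) <= 0)
            return -1;                           /* timeout or error */

        /* On return, readfds holds the subset that is actually readable. */
        if (FD_ISSET(sock1, &readfds))
            return (int) read(sock1, buf, len);
        if (FD_ISSET(sock2, &readfds))
            return (int) read(sock2, buf, len);
        return 0;
    }

Note that select() overwrites readfds, so the input bitmap must be reconstructed (or re-copied) before every call; the consequences of doing this naively are discussed next.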
Figure 2 is an oversimplified example of how an application typically uses select(). One of us has shown[15] that the programming style used here is quite inefficient for large numbers of file descriptors, independent of the problems with select(). For example, the construction of the input bitmaps (lines 8 through 12 of Figure 2) should not be done explicitly before each call to select(); instead, the application should maintain shadow copies of the input bitmaps, and simply copy these shadows to readfds and writefds. Also, the scan of the result bitmaps, which are usually quite sparse, is best done word-by-word, rather than bit-by-bit.
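The shadow-copy and word-by-word scanning techniques might be sketched roughly as follows. This sketch assumes the traditional BSD fd_set layout, in which the bitmap is an array of long words exposed as the fds_bits member (not a portable assumption); all function names here are ours, and handle_readable() stands in for whatever per-descriptor handler the application uses:

    #include <string.h>
    #include <sys/select.h>
    #include <sys/time.h>

    extern void handle_readable(int fd);         /* hypothetical per-fd handler */

    static fd_set shadow_read;                   /* maintained incrementally as */
                                                 /* connections come and go     */

    void watch_for_read(int fd)   { FD_SET(fd, &shadow_read); }
    void unwatch_for_read(int fd) { FD_CLR(fd, &shadow_read); }

    int read_events_once(int maxfd, struct timeval *tv)
    {
        fd_set rfds;
        int n, fd, word, nwords;

        /* Copy the shadow set in one operation, rather than rebuilding it
         * bit by bit; a write shadow would be handled the same way. */
        memcpy(&rfds, &shadow_read, sizeof(fd_set));

        n = select(maxfd + 1, &rfds, NULL, NULL, tv);
        if (n <= 0)
            return n;                            /* timeout or error */

        /* Scan the (usually sparse) result bitmap a word at a time,
         * skipping words in which no bits are set. */
        nwords = maxfd / NFDBITS + 1;
        for (word = 0; word < nwords; word++) {
            if (rfds.fds_bits[word] == 0)
                continue;
            for (fd = word * NFDBITS; fd < (word + 1) * NFDBITS && fd <= maxfd; fd++)
                if (FD_ISSET(fd, &rfds))
                    handle_readable(fd);
        }
        return n;
    }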
Once one has eliminated these inefficiencies, however, select() is still quite costly. Part of this cost comes from the use of bitmaps, which must be created, copied into the kernel, scanned by the kernel, subsetted, copied out of the kernel, and then scanned by the application. These costs clearly increase with the number of descriptors.
Other aspects of the select() implementation also scale poorly. Wright and Stevens provide a detailed discussion of the 4.4BSD implementation[23]; we limit ourselves to a sketch. In the traditional implementation, select() starts by checking, for each descriptor present in the input bitmaps, whether that descriptor is already available for I/O. If none are available, then select() blocks. Later, when a protocol processing (or file system) module's state changes to make a descriptor readable or writable, that module awakens the blocked process.
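Schematically, this scan-then-block cycle looks something like the following pseudocode; the helper names are ours, and the actual 4.4BSD code is considerably more involved:

    /* Pseudocode only; all helper functions are hypothetical stand-ins. */
    for (;;) {
        int nready = 0;
        for (fd = 0; fd < nfds; fd++)          /* full scan of the input sets */
            if (fd_in_input_sets(fd) && fd_is_ready_now(fd)) {
                mark_in_output_sets(fd);
                nready++;
            }
        if (nready > 0 || timeout_expired())
            return nready;                     /* report the ready subset */
        block_until_wakeup();                  /* a protocol module wakes us up,  */
                                               /* but does not say which fd       */
                                               /* changed, so we rescan them all  */
    }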
In the traditional implementation, the awakened process has no idea which descriptor has just become readable or writable, so it must repeat its initial scan. This is unfortunate, because the protocol module certainly knows which socket or file changed state, but that information is not preserved. In our previous work on improving select() performance[4], we showed that it was fairly easy to preserve this information, and thereby improve the performance of select() in the blocking case.
We also showed that one could avoid most of the initial scan by remembering which descriptors had previously been interesting to the calling process (i.e., had been in the input bitmap of a previous select() call), and scanning those descriptors only if their state had changed in the interim. The implementation of this technique is somewhat more complex, and depends on set-manipulation operations whose costs are inherently dependent on the number of descriptors.
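A schematic of this bookkeeping might look like the following; the structure and names here are purely illustrative and do not correspond to our actual kernel code:

    #include <sys/select.h>

    /* Entirely hypothetical sketch of the per-process state. */
    struct sel_cache {
        fd_set interested;   /* fds named in this process's earlier select() calls  */
        fd_set changed;      /* fds whose state changed since then, recorded by the */
                             /* protocol module at wakeup time                      */
    };

    /* A descriptor needs a full status check only if it is new to this
     * process, or if its state has changed since the previous call. */
    int must_scan(struct sel_cache *c, int fd)
    {
        return !FD_ISSET(fd, &c->interested) || FD_ISSET(fd, &c->changed);
    }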
In our previous work, we tested our modifications using the Digital UNIX V4.0B operating system, and version 1.1.20 of the Squid proxy software[5,18]. After doing our best to improve the kernel's implementation of select(), and Squid's implementation of the procedure that invokes select(), we measured the system's performance on a busy non-caching proxy, connected to the Internet and handling over 2.5 million requests/day.
We found that we had approximately doubled the system's efficiency (that is, roughly halved the CPU time per request), but select() still accounted for almost 25% of the total CPU time. Table 1 shows a profile, made with the DCPI[1] tools, of both kernel and user-mode CPU activity during a typical hour of high-load operation.
In this profile, comm_select(), the user-mode procedure that creates the input bitmaps for select() and scans its output bitmaps, takes only 0.54% of the non-idle CPU time. Some of the 2.85% attributed to memCopy() and memSet() should also be charged to the creation of the input bitmaps, because the modified Squid uses the shadow-copy method. (The profile also shows considerable time spent in malloc()-related procedures; a future version of Squid will use pre-allocated pools to avoid the overhead of excessive calls to malloc() and free()[22].)
However, the bulk of the select()-related overhead is in the kernel code, and accounts for about two thirds of the total non-idle kernel-mode CPU time. Moreover, this measurement reflects a select() implementation that we had already improved about as much as we thought possible. Finally, our implementation could not avoid costs dependent on the number of descriptors, implying that the select()-related overhead scales worse than linearly. Yet these costs did not seem to be related to intrinsically useful work. We decided to design a scalable replacement for select().