Operating system performance continues to be an active area of research, especially as demanding applications test OS scalability and performance limits. The kernel-user boundary becomes critically important as these applications spend a significant fraction, often a majority, of their time executing system calls. In the past, developers could expect to put data-sharing services, such as NFS, into the kernel to avoid the limitations stemming from running in user space. However, with the rapid rate of developments in HTTP servers, Web proxy servers, peer-to-peer systems, and other networked systems, using kernel integration to avoid performance problems becomes unrealistic. As a result, examining the interaction between operating systems and user processes remains a useful area of investigation.
Much of the earlier work on the kernel-user interface centered on developing new system calls more closely tailored to the needs of particular applications. In particular, zero-copy I/O (31,17) and scalable event delivery (23,9,10) are examples of techniques that have been adopted in mainstream operating systems, via calls such as sendfile(), transmitfile(), kevent(), and epoll(), to address performance issues for servers. Other approaches, such as allowing processes to declare their intentions to the OS (32), have also been proposed and implemented. Some system calls, such as madvise(), provide usage hints to the OS, but because operating systems are free to ignore such requests or to restrict them to mapped files, programs cannot rely on their behavior.
Some recent research takes the reverse approach: applications determine how the ``black box'' OS is likely to behave and then adapt accordingly. For example, the Flash Web Server (30) uses the mincore() system call to determine the memory residency of pages, and combines this information with heuristics to avoid blocking. The ``gray box'' approach (7,15) infers memory residency by observing page faults and correlating them with known replacement algorithms. In both systems, memory-resident files are treated differently from others, improving performance, latency, or both. These approaches depend on the quality of the information they can obtain from the operating system and on the accuracy of their heuristics. As workload complexity increases, we believe such inferences will become harder to make.
To remedy these problems, we propose a much more direct approach to making the OS transparent: make system call performance information a first-class result, returned in-band. In practice, this entails having each system call fill a ``performance result'' structure, providing information about what occurred while processing the call. The term first-class result means that this information is treated the same as other results, such as errno and the system call return value, rather than having to be explicitly requested via other system or library calls. The term in-band means that it is returned to the caller immediately, rather than being logged or processed by some other monitoring process. While the structure is much larger and more detailed than the errno global variable, the two are conceptually similar. Simply monitoring the system call boundary, the scheduler, page fault handlers, and function entry and exit is sufficient to provide detailed information about the inner workings of the operating system. This approach not only eliminates guesswork about what happens during call processing, but also gives the application control over how this information is collected, filtered, and analyzed, providing more customizable and narrowly-targeted performance debugging than is available in existing tools.
We evaluate the flexibility and performance of our implementation, DeBox, running on the FreeBSD operating system. DeBox allows us to determine where applications spend their time inside the kernel, what causes them to lose performance, what resources are under contention, and how the kernel behavior changes with the workload. The flexibility of DeBox allows us to measure very specific information, such as the kernel CPU consumption caused by a single call site in a program.
Our throughput experiments focus on analyzing and optimizing the performance of the Flash Web Server on the industry-standard SpecWeb99 benchmark (39). Using DeBox, we are able to diagnose a series of problematic interactions between the server and the operating system on this benchmark. The resulting system shows an overall factor of four improvement in SpecWeb99 score, throughput gains on other benchmarks, and latency reductions ranging from a factor of 4 to 47. Most of the issues are addressed by application redesign, and the resulting system is portable, as we demonstrate by showing improvements on Linux. Our kernel modifications, which optimize the sendfile() system call, have been integrated into FreeBSD.
DeBox is specifically designed for performance analysis of the interactions between the OS and applications, especially in server-style environments with complex workloads. Its combination of features and flexibility is novel, and differentiates it from other profiling-related approaches. However, it is not designed to be a general-purpose profiler, since it currently does not address applications that spend most of their time in user space or in the ``bottom half'' (interrupt-driven) portion of the kernel.
The rest of this paper is organized as follows. In Section 2 we motivate the approach of DeBox. The detailed design and implementation are described in Section 3. We describe experimental setup and workloads in Section 4, then show a case study using DeBox to analyze and optimize the Flash Web Server in Section 5. Section 6 contains further experiments on latency and Section 7 demonstrates the portability of our optimizations. We discuss related work in Section 8 and conclude in Section 9.