To compare DeBox's approach of making performance information a first-class result, we describe several categories of tools currently in use and explain how DeBox relates to each.
Function-based profilers - Programs such as prof, gprof (18), and their variants are often used to detect hot spots in programs and kernels. These tools use compiler assistance to add bookkeeping information (count and time). Data is gathered at run time and analyzed offline to reveal function call counts and CPU usage, often along edges in the call graph.
Coverage-based profilers - These profilers divide the program of interest into regions and use a clock interrupt to periodically sample the location of the program counter. Like function-based profilers, data gathering is done online while analysis is performed offline. Tools such as profil(), kernbb, and tcov can then use this information to show what parts of the program are most likely to consume CPU time. Coverage-only approaches may miss infrequently-called functions entirely and may not be able to show call graph behavior. Coverage information combined with compiler analysis can be used to show usage on a basic-block basis.
Hardware-assisted profilers - These profilers are similar to coverage-based profilers, but use special features of the microprocessor (event counters, timers, programmable interrupts) to obtain high-precision information at lower cost. The other major difference is that these profilers, such as DCPI, Morph (43), VTune (19), Oprofile (29), and PP (3), tend to be whole-system profilers, capturing activity across all processes and the operating system.
Among these profilers, DeBox is logically closest to kernel gprof, though it provides more than just timing information. DeBox's full call trace allows more complete call graph generation than gprof's arc counts, and because data compression and storage are performed in user space, overhead is moved from the kernel to the process. Compared to path profiling, DeBox allows developers to customize the level of detail they want about specific paths and to act on that information as it is generated. Compared to low-level statistical profilers such as DCPI, the coverage differs: DeBox measures only the functions invoked during the system call. This difference in approach affects both what can be gathered and the difficulty of doing so - DCPI can gather bottom-half information, which DeBox currently cannot. However, DeBox can easily isolate problematic paths and their call sites, which DCPI's aggregation makes more difficult.
System activity monitors - Tools such as top, vmstat, netstat, iostat, and systat can be used to monitor a running system or to determine a first-order cause of system slowdowns. The level of precision varies greatly, ranging from top, which shows per-process information on CPU usage, memory consumption, ownership, and running time, to vmstat, which shows only summary information on memory usage, fault rates, disk activity, and CPU usage.
Trace tools - Trace tools provide a means of observing the system call behavior of processes without access to source code. Tools such as truss, PCT (11), strace (2), and ktrace are able to show some details of system calls, such as parameters, return values, and timing information. Recent tools, such as Kitrace (21) and the Linux Trace Toolkit (42), also provide data on some kernel state that changes as a result of the system calls. These tools are intended for observing another process; as a result, they produce out-of-band measurements and aggregated data, often requiring post-processing to generate usable output.
Timing calls - Using gettimeofday() or similar calls, programmers can manually record the start and end times of events and infer information from the difference. The getrusage() call adds some information beyond timings (context switches, faults, messages, and I/O counts) and can be used similarly. If per-call information is required, not only do these approaches introduce many more system calls, but the information they return can be misleading.
DeBox compares favorably with a hypothetical merger of the timing calls and the trace tools: not only is timing information presented in-band, but so is the other information. In comparison with the Linux Trace Toolkit, our focus differs in that we gather the pieces of data most significant to performance and capture them at a much higher level of detail.
Microbenchmarks - Tools such as lmbench (24) and hbench:OS (13) can measure best-case times or the isolated cost of certain operations (cache misses, context switches, etc.). Common usage for these tools is to compare different operating systems, different hardware platforms, or possible optimizations.
Latency tools - Recent work on finding the sources of latency on desktop systems not designed for real-time use has yielded insight and some tools. The Intel Real-Time Performance Analyzer (33) helps automate the process of pinpointing latency. The work of Cota-Robles and Held (16) and of Jones and Regehr (20) demonstrates the benefits of successive measurement and searching.
Instrumentation - Dynamic instrumentation tools provide mechanisms to instrument running systems (processes or the kernel) under user control, and to obtain precise kernel information. Examples include DynInst (14), KernInst (40), ParaDyn (25), Etch (35), and ATOM (37). The appeal of this approach versus standard profilers is the flexibility (arbitrary code can be inserted) and the cost (no overhead until use). Information is presented out-of-band.
Since DeBox measures the performance of calls in their natural usage, it resembles the instrumentation tools. DeBox gains some flexibility by presenting this data to the application, which can filter it on-line. One major difference between DeBox and kernel instrumentation is that we provide a rich set of measurements to any process, rather than providing information only to privileged processes.
Beyond these performance analysis tools, the idea of observing kernel behavior to improve performance has appeared in many different forms. We share similarities with Scheduler Activations (5) in observing scheduler activity to optimize application performance, and with the Infokernel system by Arpaci-Dusseau et al. (8). Our goals differ, since we are more concerned with understanding why blocking occurs rather than reacting to it during a system call. Our non-blocking sendfile() modification is patterned on non-blocking sockets, but it could be used in other system calls as well. In a similar vein, RedHat has applied for a patent on a new flag to the open() call, which aborts if the necessary metadata is not in memory (26).
Our observations on blocking and its impact on latency may impact server design. Event-driven designs for network servers have been a popular approach since the performance studies of the Harvest Cache (12) and the Flash server (30). Schmidt and Hu (36) performed much of the early work in studying threaded architectures for improving server performance. A hybrid architecture was used by Welsh et al. (41) to support scheduling, while Larus and Parkes (22) demonstrate that such scheduling can also be performed in event-driven architectures. Qie et al. (34) show that such architectures can also be protected against denial-of-service attacks. Adya et al. (1) discuss the unification of the two models. We believe that DeBox can be used to identify problem areas in other servers and architectures, as our latency measurements of Apache suggest.