To place DeBox's approach of making performance information a first-class
result in context, we describe several categories of tools currently in
use, and explain how DeBox relates to these approaches.
Function-based profilers - Programs such as
prof, gprof (18), and their variants are often
used to detect hot spots in programs and kernels. These tools use
compiler assistance to add bookkeeping code (call counts and
times). Data is gathered while the program runs and analyzed offline to reveal
function call counts and CPU usage, often along edges in the call
graph.
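As a concrete illustration of compiler-assisted bookkeeping, the sketch
below uses GCC's -finstrument-functions hooks to count function entries;
this is a simplified stand-in for the mcount-based instrumentation that
gprof relies on, not gprof's actual mechanism.

    /* Sketch of compiler-assisted call accounting in the spirit of gprof.
     * Build with:  gcc -finstrument-functions counts.c
     * GCC calls these hooks on every instrumented entry/exit; a real
     * profiler would also record (caller, callee) arcs and timings. */
    #include <stdio.h>

    static unsigned long calls;        /* total instrumented function entries */

    __attribute__((no_instrument_function))
    void __cyg_profile_func_enter(void *fn, void *call_site)
    {
        (void)fn; (void)call_site;
        calls++;                       /* gprof would also charge the arc */
    }

    __attribute__((no_instrument_function))
    void __cyg_profile_func_exit(void *fn, void *call_site)
    {
        (void)fn; (void)call_site;     /* a timing profiler would stop a timer here */
    }

    static int work(int n) { return n * n; }

    int main(void)
    {
        for (int i = 0; i < 10; i++)
            work(i);
        printf("instrumented calls: %lu\n", calls);  /* main + 10 calls to work */
        return 0;
    }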
Coverage-based profilers - These profilers divide the program
of interest into regions and use a clock interrupt to periodically
sample the location of the program counter. Like function-based
profilers, data gathering is done online while analysis is performed
offline. Tools such as profil(), kernbb, and tcov
can then use this information to show which parts of the program are
most likely to consume CPU time. Coverage-only approaches may miss
infrequently called functions entirely and may not be able to show
call graph behavior. Coverage information combined with compiler
analysis can be used to show usage on a basic-block basis.
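The profil() interface mentioned above can be used directly. The sketch
below assumes a GNU toolchain (the __executable_start symbol is
linker-provided), samples the program counter across a spin loop, and
dumps the non-zero histogram buckets; the buffer size and text-segment
bounds are illustrative.

    /* Sketch of PC sampling with profil(3): on each profiling clock tick,
     * the bucket covering the sampled program counter is incremented.
     * With scale 0x10000, one unsigned short covers two bytes of text,
     * so this buffer covers the first 128 KB of the text segment. */
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <unistd.h>

    extern char __executable_start;        /* start of text (GNU ld) */

    static unsigned short samples[1 << 16];

    static void spin(void)
    {
        for (volatile long i = 0; i < (1L << 27); i++)
            ;                              /* workload to be sampled */
    }

    int main(void)
    {
        profil(samples, sizeof(samples), (size_t)&__executable_start, 0x10000);
        spin();
        profil(NULL, 0, 0, 0);             /* scale 0 turns sampling off */

        for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
            if (samples[i])
                printf("text offset %#zx: %u ticks\n",
                       i * 2, (unsigned)samples[i]);
        return 0;
    }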
Hardware-assisted profilers - These profilers are similar to
coverage-based profilers, but use special features of the
microprocessor (event counters, timers, programmable interrupts) to
obtain high-precision information at lower cost. The other major
difference is that these profilers, such as DCPI (), Morph (43), VTune (19),
Oprofile (29), and PP (3), tend to be
whole-system profilers, capturing activity across all processes and
the operating system.
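On Linux, hardware counters are now exposed through perf_event_open(2),
which post-dates the tools above but relies on the same event-counter
mechanism; a minimal sketch of counting retired instructions around a
region of interest follows.

    /* Sketch: counting one hardware event around a code region via
     * Linux's perf_event_open(2).  May require privileges or a relaxed
     * perf_event_paranoid setting. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* retired instructions */
        attr.disabled = 1;
        attr.exclude_kernel = 1;           /* user-level activity only */

        /* measure this process on any CPU; no group leader, no flags */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        for (volatile long i = 0; i < 1000000; i++)
            ;                              /* region of interest */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count;
        read(fd, &count, sizeof(count));
        printf("instructions retired: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }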
Among these profilers, DeBox is logically closest to kernel gprof,
though it provides more than just timing information. DeBox's full
call trace allows more complete call graph generation than gprof's arc
counts, and with the data compression and storage performed in user
space, overhead is moved from the kernel to the
process. Compared to path profiling, DeBox allows developers
to customize the level of detail they want about specific paths, and
it allows them to act on that information as it is generated. In
comparison to low-level statistical profilers such as DCPI, DeBox's
coverage differs: it measures only the functions exercised during the
system call. This difference in approach affects both what can be
collected and the difficulty of doing so - DCPI can
gather bottom-half information, which DeBox currently cannot. However,
DeBox can easily isolate problematic paths and their call sites, which
DCPI's aggregation makes more difficult.
System activity monitors - Tools such as top,
vmstat, netstat, iostat, and systat can be
used to monitor a running system or determine a first-order cause for
system slowdowns. The level of precision varies greatly, ranging from
top, which shows per-process information on CPU usage, memory
consumption, ownership, and running time, to vmstat, which shows only
summary information on memory usage, fault rates, disk activity, and
CPU usage.
Trace tools - Trace tools provide a means of observing the
system call behavior of processes without access to source code. Tools
such as truss, PCT (11),
strace (2), and ktrace are able to show some details
of system calls, such as parameters, return values, and timing
information. Recent tools, such as Kitrace (21)
and the Linux Trace Toolkit (42), also provide
data on some kernel state that changes as a result of the system
calls. These tools are intended for observing another process; as a
result, they produce out-of-band measurements and aggregated data,
often requiring post-processing to generate usable output.
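The mechanism underlying tools such as strace and truss is the ptrace()
facility; below is a minimal sketch (Linux on x86-64 assumed, where the
system call number appears in orig_rax) that stops a child at every
system-call boundary.

    /* Sketch of ptrace-based system call tracing, the mechanism behind
     * strace-like tools.  The tracer sees each call twice (entry and
     * exit); decoding arguments and return values is omitted. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/types.h>
    #include <sys/user.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t child = fork();
        if (child == 0) {
            ptrace(PTRACE_TRACEME, 0, NULL, NULL);
            execlp("ls", "ls", (char *)NULL);   /* the traced program */
            _exit(1);
        }

        int status;
        waitpid(child, &status, 0);             /* child stops at exec */
        for (;;) {
            ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* run to next boundary */
            waitpid(child, &status, 0);
            if (WIFEXITED(status))
                break;
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            printf("syscall %llu\n", (unsigned long long)regs.orig_rax);
        }
        return 0;
    }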
Timing calls - Using gettimeofday() or similar
calls, programmers can manually record the start and end times of
events to infer information based on the difference. The
getrusage() call adds some information beyond timings (context
switches, faults, message and I/O counts) and can be used
similarly. If per-call information is required, not only do these
approaches introduce many more system calls, but the information can
be misleading, since wall-clock differences conflate the call's own
work with blocking and scheduling effects.
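A sketch of this manual approach, bracketing a single (possibly
blocking) call with gettimeofday() and getrusage(); note that the
probes themselves add four system calls per measurement.

    /* Sketch: manual timing of one call.  The wall-clock difference
     * cannot distinguish time spent working from time spent blocked or
     * descheduled; the rusage deltas give only coarse hints. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct timeval t0, t1;
        struct rusage r0, r1;
        char buf[4096];

        getrusage(RUSAGE_SELF, &r0);
        gettimeofday(&t0, NULL);

        read(STDIN_FILENO, buf, sizeof(buf));   /* the call being measured */

        gettimeofday(&t1, NULL);
        getrusage(RUSAGE_SELF, &r1);

        long us = (t1.tv_sec - t0.tv_sec) * 1000000L
                + (t1.tv_usec - t0.tv_usec);
        printf("wall time: %ld us\n", us);
        printf("voluntary context switches: %ld\n", r1.ru_nvcsw - r0.ru_nvcsw);
        printf("major page faults: %ld\n", r1.ru_majflt - r0.ru_majflt);
        return 0;
    }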
DeBox compares favorably with a hypothetical merger of the timing
calls and the trace tools: not only the timing information but all of
the other measurements are presented in-band. In comparison with
the Linux Trace Toolkit, our focus differs in that we gather the
pieces of data most significant to performance, and we capture them
at a much higher level of detail.
Microbenchmarks - Tools such as
lmbench (24) and hbench:OS (13) can
measure best-case times or the isolated cost of certain operations
(cache misses, context switches, etc.). These tools are commonly used
to compare different operating systems, hardware platforms, or
possible optimizations.
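A typical microbenchmark of this kind times a trivial operation in a
tight loop and reports the per-iteration average; the sketch below
measures a null system call in the spirit of lmbench's null-call
benchmark (real suites add warm-up, repetition, and statistics).

    /* Sketch: null-syscall microbenchmark.  getppid() is a conventional
     * choice because it does almost no work in the kernel and is not
     * cached in user space. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>

    #define ITERS 1000000

    int main(void)
    {
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++)
            getppid();
        gettimeofday(&t1, NULL);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6
                  + (t1.tv_usec - t0.tv_usec);
        printf("getppid: %.3f us per call\n", us / ITERS);
        return 0;
    }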
Latency tools - Recent work on finding the sources of latency on
desktop systems not designed for real-time work has yielded insight
and some tools. The Intel Real-Time Performance
Analyzer (33) helps automate the process
of pinpointing latency. The work of Cota-Robles and
Held (16) and of Jones and
Regehr (20) demonstrates the benefits of successive
measurement and searching.
Instrumentation - Dynamic instrumentation tools provide
mechanisms to instrument running systems (processes or the kernel)
under user control, and to obtain precise kernel information. Examples
include DynInst (14),
KernInst (40), ParaDyn (25),
Etch (35), and
ATOM (37). The appeal of this approach over
standard profilers is its flexibility (arbitrary code can be inserted)
and its low cost (no overhead until use). Information is presented
out-of-band.
Since DeBox measures the performance of calls in their natural usage,
it resembles the instrumentation tools. DeBox gains some flexibility
by presenting this data to the application, which can filter it
online. One major difference between DeBox and kernel instrumentation
is that we provide a rich set of measurements to any process, rather
than providing information only to privileged processes.
Beyond these performance analysis tools, the idea of observing kernel
behavior to improve performance has appeared in many different
forms. We share similarities with Scheduler
Activations (5) in observing scheduler activity
to optimize application performance, and with the Infokernel system by
Arpaci-Dusseau et al. (8). Our goals differ,
since we are more concerned with understanding why blocking occurs
rather than reacting to it during a system call. Our non-blocking
sendfile() modification is patterned on non-blocking sockets (see the
sketch below), but the same approach could be applied to other system
calls as well. In a similar vein, Red Hat has applied for a patent on
a new flag to the open() call, which causes the call to fail if the
necessary metadata is not in memory (26).
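FreeBSD's sendfile() offers an SF_NODISKIO flag in a similar spirit:
the call fails with EBUSY instead of blocking on disk I/O, so an
event-driven server can keep its loop responsive. The sketch below
assumes that flag and illustrates the general pattern, not the
interface described in this paper.

    /* Sketch: using a non-blocking sendfile() (FreeBSD SF_NODISKIO).
     * Returns bytes queued, or -1 with errno set: EBUSY means the file
     * data was not resident (schedule a prefetch or hand off to a
     * blocking helper); EAGAIN means the socket buffer filled (poll for
     * writability and resume from off + the bytes already sent). */
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    ssize_t try_sendfile(int file_fd, int sock, off_t off, size_t len)
    {
        off_t sent = 0;
        if (sendfile(file_fd, sock, off, len, NULL, &sent, SF_NODISKIO) == -1
                && errno != EAGAIN)
            return -1;            /* includes the EBUSY "would block on disk" case */
        return (ssize_t)sent;     /* partial progress on EAGAIN is normal */
    }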
Our observations on blocking and its impact on latency may influence server design. Event-driven designs for network servers have been a popular approach since the performance studies of the Harvest Cache (12) and the Flash server (30). Schmidt and Hu (36) performed much of the early work in studying threaded architectures for improving server performance. A hybrid architecture was used by Welsh et al. (41) to support scheduling, while Larus and Parkes (22) demonstrate that such scheduling can also be performed in event-driven architectures. Qie et al. (34) show that such architectures can also be protected against denial-of-service attacks. Adya et al. (1) discuss the unification of the two models. We believe that DeBox can be used to identify problem areas in other servers and architectures, as our latency measurements of Apache suggest.