A Revisitation of Kernel Synchronization Schemes
Christopher Small and Stephen Manley
Harvard University
{chris,manley}@eecs.harvard.edu
Abstract
In an operating system kernel, critical sections of code
must be protected from interruption. This is traditionally
accomplished by masking the set of interrupts whose
handlers interfere with the correct operation of the
critical section. Because it can be expensive to
communicate with an off-chip interrupt controller, more
complex optimistic techniques for masking interrupts
have been proposed.
In this paper we present measurements of the
behavior of the NetBSD 1.2 kernel, and use the
measurements to explore the space of kernel
synchronization schemes. We show that (a) most critical
sections are very short, (b) very few are ever
interrupted, (c) using the traditional synchronization
technique, the synchronization cost is often higher than
the time spent in the body of the critical section, and (d)
under heavy load NetBSD 1.2 can spend 9% to 12% of
its time in synchronization primitives.
The simplest scheme we examined, disabling all
interrupts while in a critical section or interrupt handler,
can lead to loss of data under heavy load. A more
complex optimistic scheme functions correctly under
the heavy workloads we tested and has very low
overhead (at most 0.3%). Based on our measurements,
we present a new model that offers the simplicity of the
traditional scheme with the performance of the
optimistic schemes.
Given the relative CPU, memory, and device
performance of today's hardware, the newer techniques
we examined have a much lower synchronization cost
than the traditional technique. Under heavy load, such
as that incurred by a web server, a system using these
newer techniques will have noticeably better
performance.
1 Introduction
Although the BSD kernel is traditionally single-threaded,
it needs to perform synchronization in the face
of asynchronous interrupts from devices. In addition,
while processing an interrupt from a device, the kernel
needs to block delivery of new interrupts from that
device. This is traditionally accomplished by
communicating with an off-chip interrupt controller and
masking interrupts. Historically, the routines that
perform this service are named splintr, where intr is the
name of the interrupt class to disable (e.g., splbio for
block I/O). A special routine, splhigh, is used to
disable all interrupts.
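For illustration, a typical use of the traditional interface
looks like the following sketch; splbio and splx are the
classic 4.4BSD routines, and the body of the critical
section is hypothetical:

    int s = splbio();    /* mask block-I/O interrupts, save old level */
    /* ... manipulate buffer-cache state ... */
    splx(s);             /* restore the previous priority level */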
Most, if not all, conventional hardware platforms
allow device interrupts to be prioritized and assigned
levels; a pending interrupt from a lower-level device
cannot interrupt the processing of an interrupt from a
higher-level device. This method protects critical
sections (by disabling interrupts from all devices) and
device drivers (by disabling interrupts from the same
device). Interrupts can then be prioritized by their
frequency and the time required to handle them.
At Harvard we are in the process of developing a
new operating system, the VINO kernel [Selt96]. In
order to explore the space of kernel synchronization
methods, we constructed analytic models of four basic
interrupt handling schemes. We then used the NetBSD
1.2 kernel to analyze the performance of each scheme,
by measuring:
- how often NetBSD enters critical sections,
- how long critical sections last,
- how often they are interrupted,
- the frequency of different types of interrupts, and
- the frequency with which these interrupts are
delivered.
We found that critical sections are generally very
short (usually less than four microseconds), that little
total time is spent in critical sections, and that the
synchronization overhead of three of the four basic
schemes we examined is nominal. We also found that
the overhead of the traditional scheme can be as large as
12% of CPU time.
On many system architectures interrupt processing
is handled by an off-chip programmable interrupt
controller. On a Pentium PC the time required to access
the off-chip interrupt controller is quite high,
approximately three hundred cycles (2.5µs on a
120MHz Pentium).
In Section 2, we discuss related work on
synchronization. In Section 3, we propose our four
schemes for handling interrupts. In Section 4 we discuss
our experimental setup, and then in Section 5 measure
the overhead of the four schemes. Section 6 examines
the behavioral correctness of the on-chip schemes we
propose, measuring the frequency of, and amount of
time spent in, interrupt handlers. In Section 7 we discuss
our results and propose our new scheme, and conclude
in Section 8.
2 Related Work
The interrupt handlers of real-time systems often run
with interrupts disabled [Stan88], as in the V kernel
[Berg86]. In this paper we explore similar techniques,
using atomicity for synchronization, although we make
no assumptions about real-time behavior of the system
as a whole.
Synchronization does not require locking; other
techniques have been proposed, such as non-blocking
synchronization [Herl90, Mass91, Green96], lock-free
data structures [Wing92], and restartable atomic
sequences [Bers92]. Some of these schemes
require special hardware support, such as a compare-
and-swap instruction, which is not available on all
processors (e.g., the MIPS R3000 and the Intel i386).
These techniques are designed to work in environments
where it is expensive (or impossible) to disable
interrupts, and work best in environments where
contention is minimal. The cost and complexity of these
schemes is higher than the on-chip schemes we propose,
and our analysis shows that under typical loads, such
complexity is unnecessary.
Similarly, Stodolsky et al. [Stod93] proposed
decreasing the cost of synchronization by taking an
optimistic approach. When the kernel makes a request to
raise the global priority level, instead of communicating
with the interrupt controller, the kernel saves the desired
priority level in a variable. The critical section is then
run, and on completion, the former priority level is
restored. If no interrupts occur during the critical
section, the kernel saves the (high) cost of
communicating with the interrupt controller, paying
instead the (lower) cost of saving and restoring the
logical priority level.
If an interrupt is delivered while the kernel is in a
critical section, the interrupt handler compares the
logical priority level with the interrupt priority level. If
the interrupt is higher priority, it is handled
immediately; if it is lower priority, the system queues
the interrupt and falls back to the pessimistic scheme,
setting the actual priority level to match the logical
level. The cost of setting the global flag is quite low
compared to the cost of communicating with the
interrupt controller, so if the kernel is rarely interrupted
while in a critical section the performance of the
optimistic technique is superior to that of the standard
pessimistic technique. In the case of a null RPC
microbenchmark, they saw a 14% performance
improvement.
In addition to the off-chip interrupt controller, some
CPUs include an on-chip instruction to mask all
interrupts. On the Pentium this operation can be
performed in five to ten cycles, two orders of magnitude
more quickly than communicating with the interrupt
controller.
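As a concrete illustration, the on-chip operations reduce
to a single instruction each. A minimal sketch using
GCC-style inline assembly (the wrapper names are ours,
not NetBSD's; the code must run at privilege level 0):

    static inline void disable_all_interrupts(void)
    {
        __asm__ __volatile__("cli" ::: "memory");  /* clear interrupt flag */
    }

    static inline void enable_all_interrupts(void)
    {
        __asm__ __volatile__("sti" ::: "memory");  /* set interrupt flag */
    }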
If a critical section is short, the cost of going off-chip
to disable and re-enable interrupts can be much
greater than the time spent in the body of the critical
section. For example, if the critical section adds an
element to a linked list, it may only run for twenty or
thirty cycles; if the kernel needs to go off-chip to
synchronize, the time spent in the synchronization code
is an order of magnitude more than the time spent in the
body of the critical section. If we use the on-chip
instructions to disable and re-enable interrupts, the
synchronization cost is lowered to the point where it is
less than that of the critical section code.
Optimistic synchronization techniques are
particularly well-suited to short-duration critical
sections that are infrequently interrupted. On entry to a
critical section, instead of disabling interrupts, a global
flag is set. If an interrupt is delivered while the flag is
set, interrupts are only then disabled, and the interrupt is
postponed until the flag is cleared. If no interrupt is
delivered while the flag is set, the cost of
synchronization is just the cost of setting and clearing
the flag.
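A minimal C sketch of this optimistic technique on a
uniprocessor follows; all names are illustrative, not
taken from any particular kernel, and for brevity it omits
the step of masking interrupts until the flag clears:

    /* hypothetical helpers, assumed provided elsewhere in the kernel */
    void queue_interrupt(int intr_level);
    void run_deferred_interrupts(void);
    void handle_interrupt(int intr_level);

    static volatile int in_crit_sec;       /* the global flag */
    static volatile int intr_deferred;     /* an interrupt was postponed */

    void crit_sec_enter(void)
    {
        in_crit_sec = 1;                   /* a few cycles; no PIC access */
    }

    void crit_sec_leave(void)
    {
        if (intr_deferred) {
            intr_deferred = 0;
            run_deferred_interrupts();     /* drain postponed interrupts */
        }
        in_crit_sec = 0;
    }

    /* called from the low-level interrupt stub */
    void interrupt_entry(int intr_level)
    {
        if (in_crit_sec) {
            queue_interrupt(intr_level);   /* postpone until flag clears */
            intr_deferred = 1;
            return;
        }
        handle_interrupt(intr_level);
    }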
In this paper we model four schemes, based on two
variables: where to mask interrupts (on-chip or off-chip)
and how to mask them (pessimistically or
optimistically).
We note that the techniques that use on-chip
interrupt masking for synchronization may not be usable
on multiprocessor systems. In symmetric multiprocessor
systems, any processor may enter a critical section.
Hence, any processor must be able to disable interrupts
delivered to all processors. On this type of hardware the
synchronization scheme must communicate either with
the other processors (directly or through memory), or
with an off-processor interrupt controller; in either case,
it needs to go off-chip. An asymmetric multiprocessor
(where only one processor runs operating system kernel
code, and hence only one processor receives interrupts
and enters critical sections) would be able to take
advantage of the on-chip techniques we propose.
In the remainder of the paper we examine our four
schemes in detail and compare their costs. We use our
results to derive a fifth scheme which synthesizes the
benefits of the others.
Our work begins with the ideas developed by
Stodolsky and adds a second dimension of comparison,
the performance difference seen when masking
interrupts on-chip and off-chip.
Mogul and Ramakrishnan [Mogul96] have
explored a related issue. Interrupt processing can be
split across multiple priority levels, a high-priority level
that responds to the device interrupt and a low-priority
level that processes the interrupt. When this is the case,
it is possible for a flood of high priority device interrupts
to cause the starvation of the low-priority processing.
Mogul and Ramakrishnan refer to this state as receive
livelock. When the system reaches this state, their
strategy is to disable interrupts and poll for new requests
as the old ones are handled.
Their techniques are most appropriate in the face of
unsolicited I/O from network and serial devices. The
kernel has little or no control over how quickly data is
sent to this type of device. Unlike a network device, a
disk device only generates interrupts in response to a
request from the kernel. If the kernel is receiving disk
interrupts more quickly than it can process them, the
kernel can reduce the load by queueing fewer disk
requests. A natural feedback loop is present; if the
kernel is bogged down handling old work, little new
work will be started.
Mogul and Ramakrishnan's strategy is related to
ours. They find that, at times, it is more efficient to
ignore new interrupts in order to process the ones that
have already arrived. However, they propose
dynamically switching between two schemes based on
the current system load; we propose choosing a single
scheme for handling all interrupts. Their work is
compatible with ours; none of the synchronization
schemes we analyze preclude use of their techniques.
3 Synchronization Strategies
The first axis of our strategy space is a comparison of
optimistic and pessimistic schemes, as studied by
Stodolsky et al. The second axis of our space is a
comparison of off-chip and on-chip interrupt
management. This gives us four strategies to explore, as
seen in Table 1.
In the following sections we describe each scheme
in detail, sketch its synchronization functions, and
discuss its costs and benefits.
3.1 Spl-Pessim
The spl-pessim scheme is named after the kernel "set
priority level" function (spl), which was named after the
PDP-11 instruction of the same name. When a critical
section is entered, a subset of interrupts is disabled by
communicating with the off-chip programmable
interrupt controller (PIC).
crit_sec_enter()
    saved_mask = cur_mask
    PIC_mask(all)

crit_sec_leave()
    PIC_mask(saved_mask)

interrupt_handler(int intr_level)
    saved_mask = cur_mask
    PIC_mask(intr <= intr_level)
    handle interrupt
    PIC_mask(saved_mask)
The benefit of this scheme is fine-grained control of
which interrupts are disabled. The cost is high
per-critical section overhead.
3.2 Spl-Optim
When a critical section is entered a variable is set to the
logical priority level; the variable is restored on exit
from the critical section. When an interrupt is delivered,
the logical priority level is checked; if the interrupt has
been masked, the interrupt is queued. The hardware
interrupt mask is then set to be the logical mask (by
communicating with the interrupt controller).
crit_sec_enter()
    in_crit_sec = true

crit_sec_leave()
    if (any queued interrupts)
        handle queued interrupts
    in_crit_sec = false

interrupt_handler(int intr_level)
    if (in_crit_sec || intr_level < cur_level)
        queue interrupt
    else
        saved_mask = cur_mask
        PIC_mask(intr <= intr_level)
        handle interrupt
        PIC_mask(saved_mask)
        handle queued interrupts
Table 1: Synchronization Strategies
Explored. On the x86, NetBSD 1.2 uses spl-optim, the Linux
1.3.0 kernel uses cli-pessim, and BSD/OS 2.1 uses spl-pessim.
------------------------------------------
| | Pessimistic | Optimistic |
==========================================
| Off-chip | spl-pessim | spl-optim |
------------------------------------------
| On-chip | cli-pessim | cli-optim |
------------------------------------------
The benefit of this scheme is low per-critical
section overhead, and fine-grained control over which
interrupts are disabled.
The cost of this scheme is communication with the
off-chip interrupt controller if an interrupt is delivered
when the kernel is in a critical section, and a higher code
complexity than the pessimistic schemes.
3.3 Cli-Pessim
The cli-pessim scheme is named after the x86
instruction that clears the interrupt flag, cli. When a critical
section is entered, all interrupts are disabled. This
scheme is structurally similar to the spl-pessim scheme,
but instead of communicating with the PIC, on-chip
instructions are used to disable and enable interrupts.
crit_sec_enter()
    disable_all_interrupts()

crit_sec_leave()
    enable_all_interrupts()

interrupt_handler(int intr_level)
    crit_sec_enter()
    handle interrupt
    crit_sec_leave()
The benefit of this scheme is a low per-critical
section overhead. The cost is increased risk of dropped
interrupts (if multiple interrupts are delivered during
critical sections) and possible delay of high-priority
interrupts while processing a lower priority interrupt or
while the kernel is in a critical section.
3.4 Cli-Optim
When a critical section is entered, a global flag is set; it
is cleared on exit from the critical section. When an
interrupt is delivered, interrupts are disabled while the
interrupt is processed.
crit_sec_enter()
    in_crit_sec = true

crit_sec_leave()
    if (any queued interrupts)
        handle queued interrupts
    in_crit_sec = false

interrupt_handler(int intr_level)
    if (in_crit_sec)
        queue interrupt
    else
        disable_all_interrupts()
        handle interrupt
        enable_all_interrupts()
This scheme is very similar to the spl-optim
scheme, but, as with the cli-pessim scheme, instead of
communicating with the PIC, interrupts are disabled and
enabled through the use of on-chip instructions.
The benefit of this scheme is low per-critical
section overhead. Its cost is increased risk of lost
interrupts if multiple interrupts are delivered during
critical sections, and possible delay of high-priority
interrupts while processing a lower-priority interrupt or
while in a critical section. It also has a slightly higher
code complexity than the cli-pessim scheme.
3.5 Comparison of Schemes
While a scheme's performance is important, we
must also verify its correctness. Devices are designed to
queue a small number of pending interrupts for a short
period of time; if interrupts are disabled for a long
period of time, it is possible that multiple interrupts will
be merged together or lost, and data buffers can
overflow with an accompanying loss of data. In some
cases the loss of an interrupt is a performance issue, not
a correctness issue (e.g., TCP/IP packets are
retransmitted if dropped). However, we need to be sure
that the system as a whole behaves correctly in the face
of heavy load.
Fortunately, it is not necessary to test all four
schemes for correctness. The cli-pessim scheme is the
least responsive to external interrupts; if it behaves
correctly in the face of heavy load, the other systems can
do no worse.
Note that use of a cli scheme does not preclude
assigning priorities to interrupts. Although all interrupts
are disabled while the system is in a critical section,
pending interrupts can still be assigned priority levels,
and higher-priority interrupts will take precedence over
lower-priority interrupts when interrupts are once again
enabled. (In our implementation of the cli-pessim kernel
we assign the same priorities to interrupts as are used by
standard NetBSD 1.2, the spl-optim kernel to which we
compare it.)
The costs and benefits of the four strategies depend
on a number of components. First, the cost of going off-
chip to set the priority level needs to be measured, as
does the cost of disabling interrupts on-chip. Second,
the frequency with which interrupts are received, and
the frequency with which an interrupt arrives while the
kernel is in a critical section, must be measured. We
must also determine the time spent in critical sections in
order to learn whether a cli-pessim kernel will ignore
interrupts for too long a period of time, with too high a
risk of losing interrupts.
The overhead of synchronization for each strategy
can be described by a simple analytic model. The model
must take into account:
- number of critical sections entered: number of times
a critical section was entered, per second. This factor
is important when computing the overhead for any of
the schemes.
- number of interrupted critical sections: number of
times a critical section was interrupted, per second.
This is only a factor for the optimistic schemes.
- off-chip cost: the time required to communicate with
the interrupt controller, first raising the priority level,
then lowering it. This is only relevant for the spl
schemes.
- on-chip cost: the time required to disable interrupts
and re-enable them. This cost only applies to the cli
schemes.
- flag setting cost: the time required to set the variable
holding the mask. This cost is only a factor when
using an optimistic scheme.
The equations that describe the overhead for each
scheme are found in Table 2. In the following section we
discuss our experimental setup, and then in Section 5 we
measure the component costs and derive a total cost for
each scheme.
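To make the model concrete, the four equations of
Table 2 can be written as a few C helpers (a sketch;
the names are ours). As a consistency check, plugging in
the Serial 115.2 measurements from Section 5 (47,762
critical sections per second, 2.5µs off-chip cost) gives
47,762 × 2.5µs ≈ 0.119, matching the 11.9% spl-pessim
overhead reported in Table 6.

    /* Fraction of CPU time spent in synchronization primitives;
       rates are events per second, costs are in seconds. */
    double overhead_spl_pessim(double n_crit, double off_chip_cost)
    {
        return n_crit * off_chip_cost;
    }

    double overhead_spl_optim(double n_crit, double n_interrupted,
                              double flag_cost, double off_chip_cost)
    {
        return n_crit * flag_cost + n_interrupted * off_chip_cost;
    }

    double overhead_cli_pessim(double n_crit, double on_chip_cost)
    {
        return n_crit * on_chip_cost;
    }

    double overhead_cli_optim(double n_crit, double n_interrupted,
                              double flag_cost, double on_chip_cost)
    {
        return n_crit * flag_cost + n_interrupted * on_chip_cost;
    }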
4 Experimental Setup
In this section we discuss the kernels that we
constructed and measured, and the hardware platform used.
4.1 Kernels
The x86 version of NetBSD 1.2, a derivative of
4.4BSD-Lite, uses the spl-optim strategy for priority
level management. We started with an off-the-shelf copy
of NetBSD 1.2 for our spl-optim measurements. With
this kernel we were able to measure the frequency with
which critical sections are interrupted.
Starting with NetBSD 1.2 as a base, we developed a
cli-pessim kernel. This kernel disables all interrupts any
time a critical section or interrupt handler is entered, and
enables all interrupts when the critical section or
interrupt handler finishes. The cli-pessim kernel allowed
us to accurately measure the length of time spent in
critical sections, the number of critical sections, and the
time required to handle each type of interrupt.
4.2 Hardware
We ran our tests on an x86 PC with a 120MHz Pentium
processor and PCI bus. The memory system consisted of
the on-chip cache (8KB instruction + 8KB data), a 512KB pipeline
burst off-chip cache and 64MB of 60ns EDO RAM.
There was one IDE hard drive attached, a 1080 MB,
5400 RPM Western Digital WDC31000, with a 64KB
buffer and an average seek time of 10ms. The system
included a BusLogic BT946C PCI SCSI controller;
attached to it was a 1033 MB, 5400 RPM Fujitsu
M2694ES disk, with a 10ms average seek time and a
transfer rate of 5MB/second, and a Sony 4x SCSI CD-
ROM drive (model CDU-76S). The Ethernet adapter
was a 10Mbps Western Digital Elite 16 with 16KB of
100ns RAM. The serial ports were buffered (16550
UART).
Pentium processors include an on-chip cycle
counter, which enables very precise timing
measurements. We read the counter at the start and at
the end of each experiment; the difference was
multiplied by the processor cycle time (8 1/3
nanoseconds) to obtain the elapsed time. The cost of
reading the cycle counter is roughly 10 cycles; where
significant, the timing cost has been subtracted from our
measurements. We include code to read the cycle
counter in the Appendix.
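The authors' counter-reading code appears in their
Appendix (not reproduced here); a minimal equivalent
sketch using GCC-style inline assembly would be:

    /* Read the Pentium time-stamp counter (RDTSC): low 32 bits
       arrive in EAX, high 32 bits in EDX. */
    static inline unsigned long long read_cycle_counter(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }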
Table 2: Synchronization Scheme
Overheads. The model describing the overhead associated with the
four schemes discussed. The costs are functions of five variables:
critical sections entered, critical sections interrupted, off-chip
cost, on-chip cost, and flag-setting cost.
---------------------------------------------------------------------------------
| Scheme     | Synchronization Overhead                                         |
=================================================================================
| spl-pessim | (number of critical sections entered) × (off-chip cost)         |
---------------------------------------------------------------------------------
| spl-optim  | (number of critical sections entered) × (flag-setting cost) +   |
|            | (number of critical sections interrupted) × (off-chip cost)     |
---------------------------------------------------------------------------------
| cli-pessim | (number of critical sections entered) × (on-chip cost)          |
---------------------------------------------------------------------------------
| cli-optim  | (number of critical sections entered) × (flag-setting cost) +   |
|            | (number of critical sections interrupted) × (on-chip cost)      |
---------------------------------------------------------------------------------
5 Synchronization Costs
As shown in Table 2, the synchronization overhead of
each scheme is a function of five variables: the off-chip
priority setting cost, the on-chip priority setting cost, the
flag-setting cost, the number of times a critical section is
entered, and the number of times an interrupt arrives
while the system is in a critical section. In this section
we measure each component to derive the
synchronization overhead for the four schemes.
5.1 Critical Sections Per Second
We ran four tests to estimate the number of critical
sections entered per second under heavy load. These
tests supply the value used for the "number of critical
sections entered" variable in the equations above. They
measure the system under a mixed programming load,
heavy web traffic, and high-speed serial traffic.
The first two tests are the result of running the
Modified Andrew Benchmark [Ouster90] on a local file
system (Andrew local) and an NFS-mounted file system
(Andrew NFS). The Modified Andrew Benchmark
consists of creating, copying, and compiling a hierarchy
of files, and was designed to measure the scalability of
the Andrew filesystem [How88].
Table 3: Critical Sections Per Second.
The mean and maximum number of critical sections entered per second,
measured with the cli-pessim kernel. We also measured the mean
critical section duration, which is very short, especially
relative to the cost of off-chip synchronization.
--------------------------------------------------------
|              | Mean      | Max       | Mean      |
|              | Critical  | Critical  | Critical  |
|              | Sections  | Sections  | Section   |
|              | per sec   | per sec   | Duration  |
|              | (std dev) |           |           |
--------------------------------------------------------
| Andrew       | 36048     | 72521     | 3.1µs     |
| local        | (60%)     |           |           |
--------------------------------------------------------
| Andrew       | 25867     | 72940     | 4.6µs     |
| NFS          | (68%)     |           |           |
--------------------------------------------------------
| WebStone     | 34468     | 82653     | 3.3µs     |
|              | (31%)     |           |           |
--------------------------------------------------------
| Serial 115.2 | 47762     | 47807     | 0.6µs     |
|              | (0.05%)   |           |           |
--------------------------------------------------------
The third test was running the WebStone 2.0
benchmark from Silicon Graphics [SGI96]. WebStone
was designed to test the performance of web servers
under varying load. We installed the Apache 1.1.1
server on our test machine, and ran the WebStone clients
(which generate HTTP requests to the server) from a
SparcStation 20. The maximum number of critical
sections per second occurred with 50 client processes;
we report the results from this test.
The fourth test measured the kernel's behavior
while receiving a 64KB file via tip(1) at 115,200 bps.
This test measures the behavior of the kernel when it is
presented with a high rate of short-duration interrupts.
We instrumented the kernel to measure the duration
of each critical section, using the Pentium cycle counter.
A user-level process polled the kernel once per second
to retrieve the number of critical sections and the time
spent in critical sections since the last poll. We used these
results to derive the numbers in Table 3.
Table 4: Interrupted Critical
Sections. We measured the number of critical sections entered
per second on the spl-optim kernel, and the number of times per second
critical sections were interrupted (which causes the optimistic kernel
to revert to pessimistic synchronization behavior). We include the
percentage of critical sections interrupted.
---------------------------------------------------
| | Mean Critical | Interrupted |
| | Sections | Critical |
| | per second | Sections |
| | (std dev) | per second |
| | | (% intr) |
---------------------------------------------------
| Andrew local | 31904 (69%) | 15 (0.05%) |
---------------------------------------------------
| Andrew NFS | 25172 (67%) | 50 (0.2%) |
---------------------------------------------------
| WebStone | 33890 (33%) | 300 (0.9%) |
---------------------------------------------------
| Serial 115.2 | 45269 (8%) | 140 (0.3%) |
---------------------------------------------------
We plotted the distribution of critical section
durations for each test, and found that it roughly follows
the shape of an exponential distribution (see Figure 1 as
an example; the other distributions had the same shape).
The standard deviation of an exponential distribution is
equal to its mean, which in all measured cases is less
than 5µs. If we use the mean as an estimate of the
standard deviation, then one standard deviation above
the mean is simply twice the mean, and for an
exponential distribution about 86% of samples fall
below that point (the probability of exceeding twice the
mean is e^-2 ≈ 0.135). We therefore expect most critical
sections to take less than 10µs; in the case of the Serial
115.2 test, we expect most to take less than 1.2µs.
Figure 1: Critical Section Duration Distribution.
Distribution of critical section duration for Andrew NFS; other
distributions had the same shape.
5.2 Interrupted Critical Sections
In order to compute the overhead of the optimistic
techniques we need to determine the percentage of
critical sections that are interrupted. We reran the tests
from Table 3 with the standard NetBSD 1.2 kernel
(which uses the spl-optim scheme) and measured the
number of critical sections interrupted per second.
These results are shown in Table 4.
For comparison with the results of Table 3, we
include the mean critical sections per second
measured using this kernel; these results are very similar
to the values seen in Table 3. As we expected (based on
the short duration of critical sections), we saw that a
very small number of critical sections are interrupted.
5.3 Synchronization Primitive Cost
The synchronization overhead parameters were
measured by performing each operation pair (set and
unset) in a tight loop 100,000 times and computing the
mean and standard deviation. In all cases the standard
deviation was less than 1% of the mean. The results of
these measurements are given in Table 5.
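A sketch of such a measurement loop, assuming the
read_cycle_counter() and interrupt-masking wrappers
sketched earlier (all names ours), might look like:

    #define ITERATIONS 100000

    /* Mean cost, in cycles, of one disable/enable pair; must run in
       kernel mode.  The ~10-cycle counter-read cost is amortized
       across the 100,000 iterations and is negligible. */
    double measure_on_chip_pair_cost(void)
    {
        unsigned long long start = read_cycle_counter();
        for (int i = 0; i < ITERATIONS; i++) {
            disable_all_interrupts();
            enable_all_interrupts();
        }
        unsigned long long elapsed = read_cycle_counter() - start;
        return (double)elapsed / ITERATIONS;
    }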
Because these tests were run in a tight loop, a
minimal number of cache misses and other side-effects
are seen in these results, hence the synchronization
times we use are somewhat optimistic. Cache misses
could have a large percentage impact on the on-chip and
flag-setting measurements, because these instruction
sequences are very short. However, because they are so
short, even if all of the instructions and data referenced
by the code cause cache misses, the total absolute
additional cost (in cycles) would be very small.
Table 5: Synchronization Primitive Cost:
Time required to set priority level off-chip, on-chip, and to
set a global flag saving the logical priority level. The Pentium has a
write buffer, so the flag-setting cost does not include time to write
to main memory.
----------------------------------------------------------
| Synchronization Primitive       | Time                 |
==========================================================
| off-chip (spl-pessim)           | 2.5µs (304 cycles)   |
----------------------------------------------------------
| on-chip (cli-pessim)            | 0.18µs (21 cycles)   |
----------------------------------------------------------
| set flag (spl-optim, cli-optim) | 0.05µs (6 cycles)    |
----------------------------------------------------------
The synchronization cost of each of the four
schemes is directly driven by the costs shown in Table 5.
The off-chip cost is the synchronization cost of the
spl-pessim kernel (304 cycles, 2.5µs), the on-chip cost is
the synchronization cost of the cli-pessim kernel (21
cycles, 0.18µs), and the flag-setting cost is the
synchronization cost of the optimistic kernels (6 cycles,
0.05µs). Comparing the costs in Table 5 with the critical
section durations in Table 3, we immediately see that the
off-chip synchronization cost (2.5µs) is more than half
of the longest mean critical section duration (4.6µs), and
more than three times the mean Serial 115.2 critical
section duration (0.6µs).
5.4 Synchronization Overhead
Given the results in Table 3, Table 4, and Table 5,
the synchronization cost and overhead for each of the
four schemes can be computed. These costs are given in
Table 6.
Table 6: Synchronization Scheme
Overhead. Percentage of time spent synchronizing critical
sections using the described techniques. Computed using equations in
Table 2 and measurements from Table 3, Table 4, and Table 5.
----------------------------------------------------------------
| | Synchronization | | | |
| | Overhead | | | |
----------------------------------------------------------------
| Scheme | Andrew | Andrew | Web | Serial |
| | local | NFS | Stone | 115.2 |
================================================================
| spl-pessim | 9.0% | 6.5% | 8.6% | 11.9% |
----------------------------------------------------------------
| spl-optim | 0.2% | 0.1% | 0.3% | 0.3% |
----------------------------------------------------------------
| cli-pessim | 0.7% | 0.5% | 0.6% | 0.9% |
----------------------------------------------------------------
| cli-optim | 0.2% | 0.1% | 0.2% | 0.2% |
----------------------------------------------------------------
We see that with the large number of critical
sections per second seen under the Serial 115.2 test, the
spl-pessim kernel can spend nearly 12% of its time in
synchronization primitives. Even under the lightest
measured load, the system spends more than 6% of its
time in synchronization primitives.
Note also that the other three techniques all have an
overhead that is an order of magnitude less than that of
the spl-pessim technique. Even though the spl-optim
technique goes off-chip when a critical section is
interrupted, so few critical sections are interrupted (from
0.05% to 0.9%) that its overhead remains negligible
(0.1% to 0.3%).
6 Behavioral Correctness
With the knowledge that the absolute overhead of the
three proposed schemes is very low, we must be sure
that none of the schemes allows interrupts to be lost or
overly delayed.
The optimistic schemes do not delay high-priority
interrupts, nor does the spl-pessim scheme; the cli-
pessim scheme is the only one that masks all interrupts
while in a critical section or interrupt handler. If the cli-
pessim scheme can operate correctly, the other schemes
will also operate correctly.
In Table 3 we saw that the mean critical section
duration is between 0.6 and 4.6µs. We are concerned
that handling a device interrupt could take much longer,
especially with programmed I/O devices, such as the
IDE disk, which copy data from the device to the kernel
one word at a time.
We are also concerned that high-frequency interrupts
may be overly delayed while the kernel is in
long-duration interrupt handlers. We therefore gathered
data on both long-duration handlers (IDE disk
interrupts) and high-frequency interrupts (serial lines at
115.2Kbps).
We measured the time required to process interrupts
from a variety of devices attached to our test system.
Our tests involve reading from or writing to each device
and measuring the amount of time required to perform
the test and the number of interrupts delivered while
performing the test. We ran the following tests:
- ide disk: write 8MB to IDE disk.
- scsi disk: write 8MB to SCSI disk.
- scsi cd-rom: read 8MB from SCSI CD-ROM.
- serial-300: read 8KB (via tip) at 300bps.
- serial-38.4: read 64KB (via tip) at 38.4Kbps.
- serial-115.2: read 64KB (via tip) at 115.2Kbps.
- floppy: read 1MB from raw floppy disk.
- ether-small: flood ping of 1000 64-byte packets.
- ether-large: flood ping of 1000 1024-byte packets.
The results of these tests are shown in Table 7. The
mean time to handle an interrupt from a particular
device is useful, although we found that there can be
considerable variation. For each test we also include the
minimum, maximum, and median interrupt handling
time.
6.1 IDE, SCSI, and Floppy Interrupts
As mentioned above, the ide disk test performs
programmed I/O (PIO), i.e., data is copied by the CPU
one byte or word at a time between the disk controller
and main memory. This is very CPU intensive, and can
be quite slow. By comparison, the SCSI disk controller
uses direct memory access (DMA) to copy data directly
from the controller to main memory without the
intervention of the CPU. Each ide disk interrupt took
substantially longer than a scsi disk interrupt (1519µs
vs. 65µs).
Surprisingly, although the specifications of the IDE
and SCSI disks are very similar, it took nearly three times
as long to transfer 8MB to the SCSI disk as it did to
transfer it to the IDE disk. We believe that this is an
artifact of the NetBSD 1.2 disk drivers; we saw similar
performance with the generic NetBSD 1.2 kernel and
our cli-pessim kernel. We do not believe that this reflects
the underlying performance of the disk or the controller;
under BSD/OS 2.1 the performance of the two disks was
much closer (in fact, the SCSI disk was about 10%
faster than the IDE disk).
The mean time required to process an IDE interrupt
(1.5ms) is substantially greater than any other interrupt
processing time we measured. We discuss the possibility
of timing conflicts between IDE interrupts and other
interrupts in Section 6.4.
Table 7: Behavior of the spl-optim kernel during
device tests. We measured the frequency of device interrupts and
the time to handle an interrupt from each device. Although the spread
between minimum and maximum is large, in all cases other than floppy
the median is within 2% of the mean. Each test measured the reception
of several thousand interrupts.
------------------------------------------------------------------------------------------------------------------------
| Test              | Duration  | Total      | Interrupts | Mean time   | Min time  | Median    | Max time  |
|                   | of Test   | Interrupts | per        | to handle   | to handle | time      | to handle |
|                   | (seconds) |            | second     | interrupt   | interrupt | to handle | interrupt |
|                   |           |            |            | (std dev)   |           | interrupt |           |
========================================================================================================================
| ide disk          | 3.2s      | 1027       | 321        | 1519µs (7%) | 15µs      | 1545µs    | 2003µs    |
------------------------------------------------------------------------------------------------------------------------
| scsi disk         | 8.8s      | 553        | 61         | 65µs (15%)  | 16µs      | 64µs      | 86µs      |
------------------------------------------------------------------------------------------------------------------------
| scsi cd-rom       | 13.4s     | 3931       | 293        | 27µs (7%)   | 20µs      | 26µs      | 40µs      |
------------------------------------------------------------------------------------------------------------------------
| floppy            | 64.2s     | 506        | 8          | 169µs (57%) | 4µs       | 200µs     | 271µs     |
------------------------------------------------------------------------------------------------------------------------
| serial-300 (8K)   | 274s      | 8221       | 30         | 9µs (5%)    | 9µs       | 9µs       | 12µs      |
------------------------------------------------------------------------------------------------------------------------
| serial-38.4 (64K) | 17.4s     | 8195       | 471        | 27µs (2%)   | 19µs      | 27µs      | 42µs      |
------------------------------------------------------------------------------------------------------------------------
| serial-115 (64K)  | 5.8s      | 8176       | 1410       | 27µs (4%)   | 11µs      | 27µs      | 54µs      |
------------------------------------------------------------------------------------------------------------------------
| ethernet-small    | 2.4s      | 1003       | 418        | 78µs (25%)  | 44µs      | 78µs      | 370µs     |
------------------------------------------------------------------------------------------------------------------------
| ethernet-large    | 3.4s      | 1703       | 501        | 487µs (8%)  | 45µs      | 491µs     | 1194µs    |
------------------------------------------------------------------------------------------------------------------------
The standard deviation of floppy interrupt
processing time is quite large (57%). This is caused by
the several different types of interrupts generated by the
floppy device, from seek completion notification (4µs),
to transfer completion notification (25µs), to data
transfer (250µs). The maximum time seen, 271µs, is
short enough that we are not concerned about floppy
interrupts.
6.2 Serial Interrupts
We see that a serial-300 interrupt takes one third as
long as a serial-38.4 interrupt; we attribute this to the
fact that multiple characters are transferred on each
serial-38.4 interrupt, as evidenced by the fact that 1/8 as
many interrupts were generated per kilobyte by the
latter. (Please note that the serial-300 test transferred
8KB, where the other two tests transferred 64KB.)
When we tested a higher-rate serial connection (at
115.2Kbps) we saw roughly the same number of
interrupts as in the serial-38.4 test (8176 vs.
8195) and the same mean interrupt handling time
(27µs), implying the same number of characters
processed per interrupt, but the total test took one third
as long. The high rate of interrupts during the
serial-115.2 test may conflict with the large amount of time
required to process an ide-disk interrupt; we discuss this
possibility in Section 6.4.
6.3 Network Interrupts
As stated above, our network tests were run using
a 10Mbps Ethernet board. We ran three tests: flood
(continuous) ping, UDP, and TCP. We saw no significant
variation in interrupt processing time as a function of
protocol (ping/IP, UDP, and TCP), and hence include
only the ping/IP results.
Interrupt processing time is independent of protocol
because the higher-level protocol processing (where we
would see differences between TCP, UDP, and ping)
does not take place when the packet arrives; instead, the
packet is queued for processing by a higher-level "soft"
interrupt handler which runs once the hardware interrupt
has completed.
We ran additional tests with varying packet sizes.
We saw the expected relationship between packet size
and processing time (which more or less scaled linearly
with packet size).
Faster network technologies (e.g., 100Mbps
Ethernet and ATM) would generate interrupts much
more frequently than our 10Mbps Ethernet interface.
However, as we see below, the rate at which the serial
device generates interrupts is sufficient to preclude use
of the cli schemes. Based on this, the interrupt
generation rate of ATM is a moot point.
When examining the data gathered for each
interrupt, we found that although there was often a
significant difference between the minimum and
maximum, in most cases the standard deviation was
relatively small, and the median was close to the mean.
Our results show that in our environment it takes
little time to handle an interrupt from any of our devices.
No interrupt seen took more than 2ms to service, and in
most cases the interrupt processing time was one to two
orders of magnitude less than that.
6.4 Consequences of Delaying Interrupts
However, we see from Table 7 that interrupts can
occur very frequently. For example, the serial-115.2
interrupts arrived at roughly 1400Hz (one every 700µs).
We must examine the consequences of delaying the
processing of these interrupts.
As was discussed in Section 2, we can partition
interrupts into two classes: solicited and unsolicited.
Solicited interrupts come in response to requests from
the CPU (e.g., in response to a disk request). Unsolicited
interrupts are generated externally, and are not under the
control of the system (e.g., serial and network traffic).
The rate of solicited interrupts is under the control of
the system; when a solicited interrupt is received, the
system spends its time processing the interrupt rather
than generating more requests. The system can thus
control the rate at which solicited interrupts are
generated.
The system does not control the rate at which
unsolicited interrupts are generated. However, the
system can indirectly impact the rate at which they are
generated, either in software (by not sending
acknowledgments for network packets that are dropped
or not processed in time), or in hardware (by use of
serial-line flow control).
To block serial interrupts it is necessary for the
system to communicate with the device, and to be aware
that it is necessary to slow the interrupt rate. If serial
interrupts are overly delayed, the system will not be
aware of the possibility of overflow, and hence will not
be able to stem the tide in time.
From our measurements, the longest that an
interrupt will be delayed is 2ms (the longest time spent
handling an ide-disk interrupt). The serial-115.2 device
receives characters every 70µs (115,200 bits per second
≈ 14,400 bytes per second, or one byte every 70µs). This
means that up to 28 characters (2ms / 70µs ≈ 28) could
arrive while an IDE disk interrupt is being processed.
Although the serial port is buffered, the buffer only
holds 16 characters. At this rate, characters could easily
be lost on the serial line while processing an IDE
interrupt.
If we could eliminate programmed I/O devices, the
cli schemes would work. However, newer devices with
a high interrupt rate (e.g., ATM network controllers),
combined with the possibility of slow interrupt handlers,
shift the balance against the cli techniques. Although in
some hardware combinations the cli schemes would
work, many common configurations could easily lead to
loss of data.
The failure of the strict cli schemes leads us to
propose a new, hybrid scheme, cli-spl-pessim, which is
described in Section 7. The cli-spl-pessim scheme
combines the low cost of the cli schemes for critical
section synchronization with the interrupt prioritization
of the spl schemes.
7 Discussion
Our results show that under the benchmark workloads
the absolute cost of synchronization using the optimistic
techniques is low, less than 0.4%. The traditional
spl-pessim scheme has a much higher cost, over 6% in all
tested cases.
The performance improvement we see is less than
that seen by Stodolsky et al. with their spl-optim scheme
(14%). We attribute this in part to the differences
between our benchmarks and theirs; their test consisted
of a highly optimized null RPC, which has a single
critical section. Because the null RPC took very little
time (2140 cycles), an improvement in critical section
synchronization had a large performance impact. Under
the more varied loads that we measured, the kernel
spends a much lower percentage of its time in
synchronization code.
For the cli schemes to work, critical sections and
interrupt handlers must complete their work quickly.
These schemes completely disable interrupts during a
critical section, so a long-running interrupt handler can
delay the delivery of interrupts for a long period of time,
increasing the likelihood of data loss. The combination
of devices whose interrupt handlers require a long time
to service (e.g., IDE disks with programmed I/O) with
low latency requirements (e.g., very fast serial ports) is a
worst-case scenario for the cli scheme.
7.1 Proposal
We find the simplicity of the pessim schemes and
the low cost of the cli schemes appealing. This leads us
to propose a fifth scheme, cli-spl-pessim, which
combines the cli technique for critical sections with the
prioritization of the spl technique for interrupt handlers.
In this scheme, the cli and sti instructions are used to
mask interrupts while the kernel is in a critical section;
because critical sections are quite short, this will not
overly delay interrupt delivery. When handling an
interrupt, the kernel communicates with the off-chip
interrupt controller, disabling the delivery of equal and
lower-priority interrupts. Because
- interrupts are delivered much less frequently than
critical sections are entered,
- interrupt handlers take much longer than critical
sections, and
- on our hardware platform, when an interrupt is
delivered it is necessary to communicate with the off-
chip controller,
the additional performance impact of going off-chip will
be minimal and more than made up for by the increase
in system robustness. In the format of Section 3, we
include the following pseudo-code for cli-spl-pessim:
crit_sec_enter()
    disable_all_interrupts()

crit_sec_leave()
    enable_all_interrupts()

interrupt_handler(int intr_level)
    prev_level = cur_level
    PIC_mask(intr <= intr_level)
    handle interrupt
    PIC_mask(prev_level)
The benefit of this scheme is the low cost of
synchronization in critical sections. In addition, because
we use an spl scheme for synchronizing handlers, a low-
priority handler will be interrupted by the arrival of a
high-priority interrupt.
Because our model includes only the cost of
synchronization of critical sections, and disregards the
cost of the synchronization overhead in interrupt
handlers, using our model the cost of cli-spl-pessim is
identical to that of the cli-pessim scheme, with (as
shown in Table 6) sub-1% synchronization overhead for
the system loads that we studied.
Because the cli-spl-pessim scheme offers the
performance characteristics and simplicity of the
cli-pessim scheme without its incorrect behavior under
heavy load, we plan to use cli-spl-pessim in the
implementation of our new operating system kernel.
7.2 Related Schemes
Although we have described NetBSD as using the spl-
optim technique, it also supports a cli-optim-style tech
nique for short duration interrupt handlers.
The Linux operating system uses a scheme similar
to cli-spl-pessim. The cli and sti instructions are used to
synchronize critical sections (with the concomitant high
performance), and, like NetBSD 1.2, short-duration
interrupt handlers are run with all interrupts disabled.
Long-duration handlers are either run with no protection
from interrupts (which increases their complexity) or
wake a kernel task to handle the processing of the
interrupt, and then return. Although this scheme
efficiently synchronizes critical sections, we believe that
the increased latency and complexity of waking a
separate task argue against use of this technique.
8 Conclusions
In this paper we have shown that it is worthwhile to
rethink how interrupts and synchronization are handled
on modern hardware.
Our results have driven the design of our new
kernel, where we use the proposed cli-spl-pessim
scheme. This scheme provides us with the simplicity of
the pessimistic schemes we describe, with the low
overhead of the optimistic schemes.
Comparing our scheme to those used in earlier
systems, we are reminded of how quickly hardware
changes: the CPU and I/O bus of today are substantially
faster than those available in 1993 [Stod93], and
much, much faster than those available on the VAX or
the PDP-11 [Leff89]. Nevertheless, developers of newer
systems [Cust93] find themselves re-using old, complex
techniques to solve a problem that may no longer exist.
Status and Availability
All code discussed in this paper (benchmark tests and patch files for
a cli-pessim NetBSD 1.2) is available at
https://www.eecs.harvard.edu/~chris.
Acknowledgments
Our thanks to our advisor, Margo Seltzer, for her
invaluable feedback, wordsmithing, and grasp of the
subtleties of "that" and "which." We thank our reviewers,
especially Mike Karels, our shepherd, for providing very
helpful feedback. And thanks to Brad Chen, whose
previous work on fast interrupt priority management
helped inspire this work.
Bibliography
[Berg86] Berglund, E., "An Introduction to the V-System",
IEEE Micro 10, 8, 35-52 (August 1986).
[Bers92] Bershad, B., Redell, D., Ellis, J.,
"Fast Mutual Exclusion for Uniprocessors," ASPLOS V, 223-233,
Boston, MA (1992)
[Cust93] Custer, H., Inside Windows NT, Microsoft Press,
Redmond, WA, 218-221 (1993).
[Green96] Greenwald, M., Cheriton, D., "The Synergy
Between Non-blocking Synchronization and Operating
System Structure," Second Symposium on Operating
Systems Design and Implementation, 123-136, Seattle,
WA (1996).
[Herl90] Herlihy, M. "A Methodology for Implementing
Highly Concurrent Data Structures," Second ACM
Symposium on Principles and Practice of Parallel
Programming, 197-206 (1990).
[How88] Howard, J., Kazar, M., Menees, S., Nichols, D.,
Satyanarayanan, M., Sidebotham, R., West, M., "Scale
and Performance in a Distributed File System," ACM
Transactions on Computer Systems 6, 1, 51-81 (1988).
[Intel95] Pentium Processor Family Developer's Manual,
Volume 3: Architecture and Programming Manual, Intel
Corporation, Mt. Prospect, IL (1995).
[Leff89] Leffler, S., McKusick, M. K., Karels, M.,
Quarterman, J., The Design and Implementation of the
4.3BSD UNIX Operating System, Addison-Wesley
Publishing, Reading, MA (1989).
[Mass91] Massalin, H., Pu, C.,
"A Lock-Free Multiprocessor OS Kernel," Technical Report CUCS-005-91,
Department of Computer Science, Columbia University
(1991)
[Mogul96] Mogul, J., Ramakrishnan, K. K., "Eliminating
Receive Livelock in an Interrupt-Driven Kernel,"
USENIX Technical Conference, 99-111, San Diego, CA
(1996).
[Ouster90] Ousterhout, John, "Why Aren't Operating Systems
Getting Faster as Fast as Hardware," USENIX Summer
Conference, 247-256, Anaheim, CA (1990)
[Selt96] Seltzer, M., Endo, Y., Small, C., Smith,
K.,
"Dealing
With Disaster: Surviving Misbehaving Kernel
Extensions," Proceedings of the Second OSDI, Seattle,
WA (1996).
[SGI96] Silicon Graphics Inc., "Webstone: World-Wide Web
Server Benchmarking,"
https://www.sgi.com/Products/WebFORCE.
[Stan88] Stankovic, J., Ramamritham, K., "A New Hard Real-
Time Kernel", Hard Real-Time Systems, IEEE, 361-370
(1988).
[Stod93] Stodolsky, D., Chen, J. B., Bershad, B., "Fast
Interrupt Priority Management in Operating System
Kernels," USENIX Symposium on Microkernels and
Other Kernel Architectures, 105-110, San Diego, CA
(1993).
[Wing92] Wing, J., Gong, C., "A Library of Concurrent
Objects," CMU CS TR 90-151, School of Computer
Science, Carnegie Mellon University (1992).
This research was supported by Sun Microsystems Laboratories.