Pp. 155164 of the Proceedings | |
Improving the FreeBSD SMP implementation
Greg Lehey
IBM LTC Ozlabs
ABSTRACT
UNIX-derived operating systems have traditionally have a simplistic approach to
process synchronization which is unsuited to multiprocessor application.
Initial FreeBSD SMP support kept this approach by allowing only one process to
run in kernel mode at any time, and also blocked interrupts across multiple
processors, causing seriously suboptimal performance of I/O bound systems. This
paper describes work done to remove this bottleneck, replacing it with
fine-grained locking. It derives from work done on BSD/OS and has many
similarities with the approach taken in SunOS 5. Synchronization is performed
primarily by a locking construct intermediate between a spin lock and a binary
semaphore, termed mutexes. In general, mutexes attempt to block rather
than to spin in cases where the likely wait time is long enough to warrant a
process switch. The issue of blocking interrupt handlers is addressed by
attaching a process context to the interrupt handlers. Despite this process
context, an interrupt handler normally runs in the context of the interrupted
process and is scheduled only when blocking is required.
Introduction
A crucial issue in the design of an operating system is the manner in which it
shares resources such as memory, data structures and processor time. In the
UNIX model, the main clients for resources are processes and interrupt
handlers. Interrupt handlers operate completely in kernel space, primarily on
behalf of the system. Processes normally run in one of two different modes,
user mode and kernel mode. User mode code is the code of the program from which
the process is derived, and kernel mode code is part of the kernel. This
structure gives rise to multiple potential conflicts.
Use of processor time
The most obvious demand a process or interrupt routine places on the system is
that it wants to run: it must execute instructions. In traditional UNIX, the
rules governing this sharing are:
-
There is only one processor. All code runs on it.
-
If both an interrupt handler and a process are available to run, the interrupt
handler runs.
-
Interrupt handlers have different priorities. If one interrupt handler is
running and one with a higher priority becomes runnable, the higher priority
interrupt immediately preempts the lower priority interrupt.
-
The scheduler runs when a process voluntarily relinquishes the processor, its
time slice expires, or a higher-priority process becomes runnable. The
scheduler chooses the highest priority process which is ready to run.
-
If the process is in kernel mode when its time slice expires or a higher
priority process becomes runnable, the system waits until it returns to user
mode or sleeps before running the scheduler.
This method works acceptably for the single processor machines for which it was
designed. In the following section, we'll see the reasoning behind the last
decision.
Kernel data objects
The most obvious problem is access to memory. Modern UNIX systems run with
memory protection, which prevents processes in user mode from accessing
the address space of other processes. This protection no longer applies in
kernel mode: all processes share the kernel address space, and they need to
access data shared between all processes. For example, the fork()
system call needs to allocate a proc structure for the new process. The
file sys/kern_fork.c contains the following code:
int
fork1(p1, flags, procp)
struct proc *p1;
int flags;
struct proc **procp;
{
struct proc *p2, *pptr;
...
/* Allocate new proc. */
newproc = zalloc(proc_zone);
The function zalloc takes a struct proc entry off a freelist and
returns its address:
item = z->zitems;
z->zitems = ((void **) item)[0];
...
return item;
What happens if the currently executing process is interrupted exactly between
the first two lines of the code above, maybe because a higher priority process
wants to run? item contains the pointer to the process structure, but
z->z_items still points to it. If the interrupting code also allocates
a process structure, it will go through the same code and return a pointer to
the same memory area, creating the process equivalent of Siamese twins.
UNIX solves this issue with the rule ``The UNIX kernel is non-preemptive''.
This means that when a process is running in kernel mode, no other process can
execute kernel code until the first process relinquishes the kernel voluntarily,
either by returning to user mode, or by sleeping.
Synchronizing processes and interrupts
The non-preemption rule only applies to processes. Interrupts happen
independently of process context, so a different method is needed. In device
drivers, the process context (``top half'') and the interrupt context (``bottom
half'') must share data. Two separate issues arise here: each half must ensure
that any changes to shared data structures occur in a consistent manner, and
they must find a way to synchronize with each other.
Protection
Each half must protect its data against change by the other half. For example,
the buffer header structure contains a flags word with 32 flags, some set and
reset by both halves. Setting and resetting bits requires multiple instructions
on most architectures, so the potential for data corruption exists. UNIX solves
this problem by locking out interrupts during critical sections. Top half code
must explicitly lock out interrupts with the spl functions.
The naming goes back to the early days of UNIX on the PDP-11. The PDP-11 had a
relatively simplistic level-based interrupt structure. When running at a
specific level, only higher priority interrupts were allowed. UNIX named
functions for setting the interrupt priority level after the PDP-11 SPL
instruction, so initially the functions had names like spl4 and
spl7. Later machines came out with interrupt masks, and BSD changed the
names to more descriptive names such as splbio (for block I/O) and
splhigh (block out all interrupts).
One of the most significant sources of bugs in drivers is inadequate
synchronization with the bottom half.
Interrupt code does not need to perform any special synchronization: by
definition, processes don't run when interrupt code is active.
Blocking interrupts has a potential danger that an interrupt will not be
serviced in a timely fashion. On PC hardware, this is particularly evident with
serial I/O, which frequently generates an interrupt for every character. At
115200 bps, this equates to an interrupt every 85 µs. In the past, this has
given rise to the dreaded silo overflows; even on fast modern hardware it can be
a problem. It's also not easy to decide interrupt priorities: in the early
days, disk I/O was given a high priority in order to avoid overruns, while
serial I/O had a low priority. Nowadays disk controllers can handle transfers
by themselves, but overruns are still a problem with serial I/O.
Waiting for the other half
In other cases, a process will need to wait for some event to complete. The
most obvious example is I/O: a process issues an I/O request, and the driver
initiates the transfer. It can be a long time before the transfer completes: if
it's reading keyboard input, for example, it could be weeks before the I/O
completes. When the transfer completes, it causes an interrupt, so it's the
interrupt handler which finally determines that the transfer is complete and
notifies the process. Traditional UNIX performs this synchronization with the
functions sleep and wakeup, though current BSD no longer uses
sleep: it has been replaced with tsleep, which offers additional
functionality.
The top half of a driver calls sleep or tsleep when it wants to
wait for an event, and the bottom half calls wakeup when the event
occurs. In more detail,
-
The process issues a system call read, which brings it into kernel mode.
-
read locates the driver for the device and calls it to initiate a
transfer.
-
read next calls tsleep, passing it the address of some unique
object related to the request. tsleep stores the address in the proc
structure, marks the process as sleeping and relinquishes the processor. At
this point, the process is sleeping.
-
At some later point, when the request is complete, the interrupt handler calls
wakeup with the address which was passed to tsleep.
wakeup runs through a list of sleeping processes and wakes all processes
waiting on this particular address.
This method has problems even on single processors: the time to wake processes
depends on the number of sleeping processes, which is usually only slightly less
than the number of processes in the system. FreeBSD addresses this problem with
128 hashed sleep queues, effectively diminishing the search time by a factor of
128. A large system might have 10,000 processes running at the same time, so
this is only a partial solution.
In addition, it is permissible for more than one process to wait on a specific
address. In extreme cases dozens of processes wait on a specific address, but
only one will be able to run when the resource becomes available; the rest call
tsleep again. The term thundering horde has been devised to
describe this situation. FreeBSD has partially solved this issue with the
wakeup_one function, which only wakes the first process it finds. This
still involves a linear search through a possibly large number of process
structures, and it has the potential to deadlock if two unrelated events map to
the same address.
Adapting the UNIX model to SMP
A number of the basic assumptions of this model no longer apply to SMP, and
others become more of a problem:
-
More than one processor is available. Code can run in parallel.
-
Interrupt handlers and user processes can run on different processors at the
same time.
-
The ``non-preemption'' rule is no longer sufficient to ensure that two processes
can't execute at the same time, so it would theoretically be possible for two
processes to allocate the same memory.
-
Locking out interrupts must happen in every processor. This can adversely
affect performance.
The initial FreeBSD model
The original version of FreeBSD SMP support solved these problems in a manner
designed for reliability rather than performance: effectively it found a method
to simulate the single-processor paradigm on multiple processors. Specifically,
only one process could run in the kernel at any one time. The system ensured
this with a spinlock, the so-called Big Kernel Lock (BKL), which
ensured that only one processor could be in the kernel at a time. On entry to
the kernel, each processor attempted to get the BKL. If another processor was
executing in kernel mode, the other processor performed a busy wait
until the lock became free:
MPgetlock_edx:
1:
movl (%edx), %eax
movl %eax, %ecx
andl $CPU_FIELD,%ecx
cmpl _cpu_lockid, %ecx
jne 2f
incl %eax
movl %eax, (%edx)
ret
2:
movl $FREE_LOCK, %eax
movl _cpu_lockid, %ecx
incl %ecx
lock
cmpxchg %ecx, (%edx)
jne 1b
GRAB_HWI
ret
In an extreme case, this waiting could degrade SMP performance to below that of
a single processor machine.
How to solve the dilemma
Multiple processor machines have been around for a long time, since before UNIX
was written. During this time, a number of solutions to this kind of problem
have been devised. The problem was less to find a solution than to find a
solution which would fit in the UNIX environment. At least the following
synchronization primitives have been used in the past:
-
Counting semaphores were originally designed to share a certain number
of resources amongst potentially more consumers. To get access, a consumer
decrements the semaphore counter, and when it is finished it increments it
again. If the semaphore counter goes negative, the process is placed on a
sleep queue. If it goes from -1 to 0, the first process on the sleep
queue is activated. This approach is a possible alternative to tsleep
and wakeup synchronization. In particular, it avoids a lengthy
sequential search of sleeping processes.
-
SunOS 5 uses turnstiles to address the sequential search problem in
tsleep and wakeup synchronization. A turnstile is a separate
queue associated with a specific wait address, so the need for a sequential
search disappears.
-
Spin locks have already been mentioned. FreeBSD used to spin
indefinitely on the BKL, which doesn't make any sense, but they are useful in
cases where the wait is short; a longer wait will result in a process being
suspended and subsequently rescheduled. If the average wait for a resource is
less than this time, then it makes sense to spin instead.
-
Blocking locks are the alternative to spin locks when the wait is likely
to be longer than it would take to reschedule. A typical implementation is
similar to a counting semaphore with a count of 1.
-
Condition variables are a kind of blocking lock where the lock is based
on a condition, for example the absence of entries in a queue.
-
Read/write locks address a different issue: frequently multiple
processes may read specific data in parallel, but only one may write it.
There is some confusion in terminology with these locking primitives. In
particular, the term mutex has been applied to nearly all of them at
different times. We'll look at how FreeBSD uses the term in the next section.
One big problem with all locking primitives with the exception of spin locks is
that they can block. This requires a process context: a UNIX interrupt handler
can't block. This is one of the reasons that the old BKL was a spinlock, even
though it could potentially use up most of processor time spinning.
The new FreeBSD implementation
The new implementation of SMP on FreeBSD bases heavily on the implementation in
BSD/OS 5.0, which has not yet been released. Even the name SMPng (``new
generation'') was taken from BSD/OS. Due to the open source nature of FreeBSD,
SMPng is available on FreeBSD before on BSD/OS.
The most radical difference in SMPng are:
-
Interrupt code (``bottom half'') now runs in a process context, enabling it to
block if necessary. This process context is termed an interrupt thread.
-
Interrupt lockout primitives (splfoo) have been removed.
The low-level interrupt code still needs to block interrupts briefly, but the
interrupt service routines themselves run with interrupts enabled. Instead of
locking out interrupts, the system uses mutexes, which may be either spin locks
or blocking locks.
Interrupt threads
The single most important aspect of the implementation is the introduction of a
process or ``thread'' context for interrupt handlers. This change involves a
number of tradeoffs:
-
The process context allows a uniform approach to synchronization: it is no
longer necessary to provide separate primitives to synchronize the top half and
the bottom half. In particular, the spl primitives are no longer
needed. For compatibility reasons, the calls have been retained, but they
translate to no-ops.
-
The action of scheduling another process takes significantly longer than
interrupt overhead, which also remains.
-
The UNIX approach to scheduling does not allow preemption if the process is
running in kernel mode.
SMPng solves the latency and scheduling issues with a technique known as lazy
scheduling: on receiving an interrupt, the interrupt stubs note the PID of
the interrupt thread, but they do not schedule the thread. Instead, it
continues execution in the context of the interrupted process. The thread will
be scheduled only in the following circumstances:
-
If the thread has to block.
-
If the interrupt nesting level gets too deep.
We expect this method to offer negligible overhead for the majority of
interrupts.
From a scheduling viewpoint, the threads differ from normal processes in the
following ways:
-
They never enter user mode, so they do not have user text and data segments.
-
They all share the address space of process 0, the swapper.
-
They run at a higher priority than all user processes.
-
Their priority is not adjusted based on load: it remains fixed.
-
An additional process state SWAIT has been introduced for interrupt
processes which are currently idle: the normal ``idle'' state is SSLEEP,
which implies that the process is sleeping.
Experience with the BSD/OS implementation showed that the initial implementation
of interrupt threads was a particularly error-prone process, and that the
debugging tools were inadequate. Due to the nature of the FreeBSD project, we
considered it imperative to have the system relatively functional at all times
during the transition, so we decided to implement interrupt threads in two
stages. The initial implementation was very similar to that of normal
processes. This offered the benefits of relatively easy debugging and of
stability, and the disadvantage of a significant drop in performance: each
interrupt could potentially cause two context switches, and the interrupt would
not be handled while another process, even a user process, was in the kernel.
Experience with the initial implementation met expectations: we have seen no
stability problems with the implementation, and the performance, though
significantly worse, was not as bad as we had expected.
At the time of writing, we have improved the implementation somewhat by allowing
limited kernel preemption, allowing interrupt threads to be scheduled
immediately rather than having to wait for the current process to leave kernel
mode. The potential exists for complete kernel preemption, where any higher
priority process can preempt a lower priority process running in the kernel, but
we are not sure that the benefits will outweigh the potential bug sources.
The final lazy scheduling implementation has been tested, but it is not
currently in the -CURRENT kernel. Due to the current kernel lock
implementation, it would not show any significant performance increase, and
problems can be expected as additional kernel components are migrated from under
Giant.
Not all interrupts have been changed to threaded interrupts. In particular, the
old fast interrupts remain relatively unchanged, with the restriction
that they may not use any blocking mutexes. Fast interrupts have typically been
used for the serial drivers, and are specific to FreeBSD: BSD/OS has no
corresponding functionality.
Locking constructs
The initial BSD/OS implementation defined two basic types of lock, called
mutex:
-
The default locking construct is the spin/sleep mutex. This is similar in
concept to a semaphore with a count of 1, but the implementation allows spinning
for a certain period of time if this appears to be of benefit (in other words,
if it is likely that the mutex will become free in less time than it would take
to schedule another process), though this feature is not currently in use. It
also allows the user to specify that the mutex should not spin. If the
process cannot obtain the mutex, it is placed on a sleep queue and woken when
the resource becomes available.
-
An alternate construct is a spin mutex. This corresponds to the spin lock
which was already present in the system. Spin mutexes are used only in
exceptional cases.
The implementation of these locks was derived almost directly from BSD/OS, but
has since been modified significantly.
In addition to these locks, the FreeBSD project has included two further locking
constructs:
Condition variables are built on top of mutexes. They consist of a
mutex and a wait queue. The following operations are supported:
-
Acquire a condition variable with cv_wait(), cv_wait_sig(),
cv_timedwait() or cv_timedwait_sig().
-
Before acquiring the condition variable, the associated mutex must be held. The
mutex will be released before sleeping and reacquired on wakeup.
-
Unblock one waiter with cv_signal().
-
Unblock all waiters with cv_broadcast().
-
Wait for queue empty with cv_waitq_empty.
-
Same functionality available from the msleep function.
Shared/exclusive locks, or sx locks, are effectively read-write
locks. The difference in terminology came from an intention to add additional
functionality to these locks. This functionality has not been implemented, so
currently sx locks are the same thing as read-write locks: they allow access by
multiple readers or a single writer.
The implementation of sx locks is relatively expensive:
struct sx {
struct lock_object sx_object;
struct mtx sx_lock;
int sx_cnt;
struct cv sx_shrd_cv;
int sx_shrd_wcnt;
struct cv sx_excl_cv;
int sx_excl_wcnt;
struct proc *sx_xholder;
};
They should be only used where the vast majority of accesses is shared.
-
Create an sx lock with sx_init().
-
Attain a read (shared) lock with sx_slock() and release it with
sx_sunlock().
-
Attain a write (exclusive) lock with sx_xlock() and release it with
sx_xunlock().
-
Destroy an sx lock with sx_destroy.
Removing the Big Kernel Lock
These modifications made it possible to remove the Big Kernel Lock. The initial
implementation replaced it with two mutexes:
-
Giant is used in a similar manner to the BKL, but it is a blocking
mutex. Currently it protects all entry to the kernel, including interrupt
handlers. In order to be able to block, it must allow scheduling to continue.
-
sched_lock is a spin lock which protects the scheduler queues.
This combination of locks supplied the bare minimum of locks necessary to build
the new framework. In itself, it does not improve the performance of the
system, since processes still block on Giant.
Idle processes
The planned light-weight interrupt threads need a process context in order to
work. In the traditional UNIX kernel, there is not always a process context:
the pointer curproc can be NULL. SMPng solves this problem by
having an idle process which runs when no other process is active.
Recursive locking
Normally, if a lock is locked, it cannot be locked again. On occasions,
however, it is possible that a process tries to acquire a lock which it already
holds. Without special checks, this would cause a deadlock. Many
implementations allow this so-called recursive locking. The locking code
checks for the owner of the lock. If the owner is the current process, it
increments a recursion counter. Releasing the lock decrements the recursion
counter and only releases the lock when the count goes to zero.
There is much discussion both in the literature and in the FreeBSD SMP project
as to whether recursive locking should be allowed at all. In general, we have
the feeling that recursive locks are evidence of untidy programming.
Unfortunately, the code base was never designed for this kind of locking, and in
particular library functions may attempt to reacquire locks already held. We
have come to a compromise: in general, they are discouraged, and recursion must
be specifically enabled for each mutex, thus avoiding recursion where it was not
intended.
Migrating to fine-grained locking
Implementing the interrupt threads and replacing the Big Kernel Lock with
Giant and schedlock did not result in any performance
improvements, but it provided a framework in which the transition to
fine-grained locking could be performed. The next step was to choose a locking
strategy and migrate individual portions of the kernel from under the protection
of Giant.
One of the dangers of this approach is that locking conflicts might not be
recognized until very late. In particular, the FreeBSD project has different
people working on different kernel components, and it does not have a strong
centralized architectural committee to determine locking strategy. As a result,
we developed the following guidelines for locking:
-
Use sleep mutexes. Spin mutexes should only be used in very special cases and
only with the approval of the SMP project team. The only current exception to
this rule is the scheduler lock, which by nature must be a spin lock.
-
Do not tsleep() while holding a mutex other than
Giant. The implementation of tsleep() and cv_wait()
automatically releases Giant and gains it again on wakeup, but no other
mutexes will be released.
-
Do not msleep() or cv_wait() while holding a mutex other than
Giant or the mutex passed as a parameter to msleep().
msleep() is a new function which combines the functionality with atomic
release and regain of a specified mutex.
-
Do not call a function that can grab Giant and then sleep unless no mutexes
(other than possibly Giant) are held. This is a consequence of the previous
rules.
-
If calling msleep() or cv_wait() while holding Giant and
another mutex, Giant must be acquired first and released last. This
avoids lock order reversals.
-
Except for the Giant mutex used during the transition phase, mutexes
protect data, not code.
-
Do not msleep() or cv_wait() with a recursed mutex.
Giant is a special case and is handled automagically behind the scenes,
so don't pass Giant to these functions.
-
Try to hold mutexes for as little time as possible.
-
Try to avoid recursing on mutexes if at all possible. In general, if a mutex is
recursively entered, the mutex is being held for too long, and a redesign is in
order.
One of the weaknesses of the project structure is that there is no overall
strategy for locking. In many cases, the choice of locking construct and
granularity is left to the individual developer. In almost every case, locks
are leaf node locks: very little code locks more than one lock at a time, and
when it does, it is in a very tight context. This results in relatively
reliable code, but it may not be result in optimum performance.
There are a number of reasons why we persist with this approach:
-
FreeBSD is a volunteer project. Developers do what they think is best. They
are unlikely to agree to an alternative implementation.
-
We do not currently have enough architectural direction, nor enough experience
with other SMP systems, to come up with an ideal locking strategy. This derives
from the volunteer nature of the project, but note also that large UNIX vendors
have found the choice of locking strategy to be a big problem.
-
Unlike large companies, there is much less concern about throwaway
implementations. If we find that the performance of a system component is
suboptimal, we will discard it and start with a different implementation.
Migrating interrupt handlers
This new basic structure is now in place, and implementation of finer grained
locking is proceeding. Giant will remain as a legacy locking mechanism for code
which has not been converted to the new locking mechanism. For example, the
main loop of the function ithread_loop, which runs an interrupt handler,
contains the following code:
if ((ih->ih_flags & IH_MPSAFE) == 0)
mtx_lock(&Giant);
....
ih->ih_handler(ih->ih_argument);
if ((ih->ih_flags & IH_MPSAFE) == 0)
mtx_unlock(&Giant);
The flag INTR_MPSAFE indicates that the interrupt handler has its own
synchronization primitives.
A typical strategy planned for migrating device drivers involves the following
steps:
-
Add a mutex to the driver softc.
-
Set the INTR_MPSAFE flag when registering the interrupt.
-
Obtain the mutex in the same kind of situation where previously an spl
was used. Unlike spls, however, the interrupt handlers must also obtain
the mutex before accessing shared data structures.
Probably the most difficult part of the process will involve larger components
of the system, such as the file system and the networking stack. We have the
example of the BSD/OS code, but it's currently not clear that this is the best
path to follow.
Kernel trace facility
The ktr package provides a method of tracing kernel events for debugging
purposes. It is not intended for use during normal operation, and should not be
confused with the kernel call trace facility ktrace.
For example, the function sched_ithd, which schedules the interrupt
threads, contains the following code:
CTR3(KTR_INTR,
"sched_ithd pid %d(%s) need=%d",
ir->it_proc->p_pid,
ir->it_proc->p_comm,
ir->it_need);
...
if (ir->it_proc->p_stat == SWAIT) {
CTR1(KTR_INTR,
"sched_ithd: setrunqueue %d",
ir->it_proc->p_pid);
The function ithd_loop, which runs the interrupt in process context,
contains the following code at the beginning and end of the main loop:
for (;;) {
CTR3(KTR_INTR,
"ithd_loop pid %d(%s) need=%d",
me->it_proc->p_pid,
me->it_proc->p_comm,
me->it_need);
...
CTR1(KTR_INTR,
"ithd_loop pid %d: done",
me->it_proc->p_pid);
mi_switch();
CTR1(KTR_INTR,
"ithd_loop pid %d: resumed",
me->it_proc->p_pid);
The calls CTR1 and CTR3 are two macros which only compile any
kind of code when the kernel is built with the KTR kernel option. If
the kernel contains this option and the bit KTR_INTR is set in the
variable ktr_mask, then these events will be masked to a circular buffer
in the kernel. The ddb debugger has a command show ktr which
dumps the buffer one page at a time, and gdb macros are also available.
This gives a relatively useful means of tracing the interaction between
processes:
2791 968643993:219224100 cpu1 ../../i386/isa/ithread.c:214 ithd_loop pid 21 ih=0xc235f200: 0xc0324d98(0) flg=100
2790 968643993:219214043 cpu1 ../../i386/isa/ithread.c:197 ithd_loop pid 21(irq0: clk) need=1
2789 968643993:219205383 cpu1 ../../i386/isa/ithread.c:243 ithd_loop pid 21: resumed
2788 968643993:219190856 cpu1 ../../i386/isa/ithread.c:158 sched_ithd: setrunqueue 21
2787 968643993:219179402 cpu1 ../../i386/isa/ithread.c:120 sched_ithd pid 21(irq0: clk) need=0
This example traces the arrival and processing of a clock interrupt on
the i386 platform, in reverse chronological order. The number at the beginning
of the line is the trace entry number.
-
Entry 2787 shows the arrival of an interrupt at the beginning of
sched_ithd. The second value on the trace line is the time since the
epoch, followed by the CPU number and the file name and line number. The
remaining values are supplied by the program to the CTR3 function.
-
Entry 2788 shows the second trace call in sched_ithd, where the
interrupt handler is placed on the run queue.
-
Entry 2789 shows the entry into the main loop of ithd_loop.
-
Entries 2790 and 2791 show the exit from the main loop of ithd_loop.
Witness facility
The witness code was designed specifically to debug mutex code. It
keeps track of the locks acquired and released by each thread. It also keeps
track of the order in which locks are acquired with respect to each other. Each
time a lock is acquired, witness uses these two lists to verify that a lock is
not being acquired in the wrong order. If a lock order violation is detected,
then a message is output to the kernel console detailing the locks involved and
the locations in question. Witness can also be configured to drop into the
kernel debugger when an order violation occurs.
The witness code also checks various other conditions such as verifying that one
does not recurse on a non-recursive lock. For sleep locks, witness verifies
that a new process would not be switched to when a lock is released or a lock is
blocked on during an acquire while any spin locks are held. If any of these
checks fail, the kernel will panic.
Project status
The project started in June 2000. The major milestones in the development are:
-
June 2000: Ported the BSD/OS mutex code and replaced the Big Kernel Lock with
Giant and sched_lock.
-
September 2000: Replaced interrupt handlers with heavyweight interrupt
processors. Initial commit to the FreeBSD source tree.
-
November 2000: Made softclock MP-safe and migrate from under
Giant.
-
January 2001: Implemented condition variables.
-
March 2001: Implemented read/write locks (called ``shared/exclusive'' or
sx locks).
-
March 2001: Complete locking of enough of the proc structure to allow
signal handlers to be moved from under Giant.
The main issue in the immediate future is to migrate more and more code out from
under Giant. In more detail, we have identified the following major
tasks, some of which are in an advanced state of implementation:
-
Split NFS into client and server.
-
Add locking to NFS.
-
Make the IP stack thread-safe.
-
Create mechanism in cdevsw structure to protect thread-unsafe drivers.
-
Complete locking struct proc.
-
Cleanup the various mp_machdep.c's, unify various SMP API's such as IPI
delivery, etc.
-
Make printf() safe to call in almost any situation to avoid deadlocks.
-
Make mbuf system use condition variables instead of msleep() and
wakeup().
-
Remove the MP safe syscall flag from the system call table and add explicit
mtx_lock of Giant to all system calls which need it.
-
Use per-CPU buffers for ktr to reduce synchronization.
-
Remove the priority argument from msleep() and cv_wait().
-
Implement lazy interrupt thread switching (context stealing).
-
Lock structs filedesc, pgrp, sigio, session and
ifnet.
-
Make the virtual memory subsystem thread-safe.
-
Convert select() to use condition variables.
-
Reimplement kqueue using condition variables.
-
Conditionalize atomic operations used for debugging statistics.
-
Lock the virtual file system code.
Performance
The implementation has not progressed far enough to make any firm statements
about performance, but we are expecting reasonable scalability to beyond 32
processor systems.
Acknowledgements
The FreeBSD SMPng project was made possible by BSDi's generous donation of code
from the development version 5.0 of BSD/OS. The main contributors were:
-
John Baldwin rewrote the low level interrupt code for i386 SMP, made
much code machine independent, worked on the WITNESS code, converted
allproc and proctree locks from lockmgr locks to
sx locks, created a mechanism in cdevsw structure to protect
thread-unsafe drivers, locked struct proc and unified various SMP API's
such as IPI delivery.
-
Jake Burkholder ported the BSD/OS locking primitives for i386,
implemented msleep(), condition variables and kernel preemption.
-
Matt Dillon converted the Big Kernel spinlock to the blocking
Giant lock and added the scheduler lock and per-CPU idle processes.
-
Jason Evans made malloc and friends thread-safe, converted simplelocks
to mutexes and implemented sx (shared/exclusive) locks.
-
Greg Lehey implemented the heavy-weight interrupt threads, rewrote the
low level interrupt code for i386 UP, removed spls and ported the BSD/OS
ktr code.
-
Bosko Milekic made sf_bufs thread-safe, cleaned up the mutex API
and made the mbuf system use condition variables instead of
msleep().
-
Doug Rabson ported the BSD/OS locking primitives. implemented the
heavy-weight interrupt threads and rewrote the low level interrupt code for the
Alpha architecture.
Further contributors were Tor Egge, Seth Kingsley, Jonathan Lemon, Mark Murray,
Chuck Paterson, Bill Paul, Alfred Perlstein, Dag-Erling Sm¿rgrav and Peter Wemm.
Bibliography
Per Brinch Hansen, Operating System Principles.
Prentice-Hall, 1973.
Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S. Quarterman,
The Design and Implementation of the 4.4BSD Operating System,
Addison-Wesley 1996.
Curt Schimmel, UNIX Systems for Modern Architectures, Addison-Wesley 1994.
Uresh Vahalia, UNIX Internals. Prentice-Hall, 1996.
Further reference
FreeBSD SMP home page.
This paper was originally published in the
Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference,
June 25-30, 2001, Boston, Masssachusetts, USA
Last changed: 21 June 2001 bleu
|
|