2007 USENIX Annual Technical Conference
Pp. 87–100 of the Proceedings
Events Can Make Sense
Maxwell Krohn (MIT CSAIL),
Eddie Kohler (UCLA) and
M. Frans Kaashoek (MIT CSAIL)
{krohn,kaashoek}@csail.mit.edu,
kohler@cs.ucla.edu
Abstract
Tame is a new event-based system for managing concurrency in
network applications.
Code written with Tame abstractions does not suffer from the
“stack-ripping” problem associated with other event libraries.
Like threaded code, tamed code uses standard control flow,
automatically-managed local variables, and modular interfaces between
callers and callees.
Tame's implementation consists of C++ libraries and a
source-to-source translator; no platform-specific support or compiler
modifications are required, and Tame induces little runtime
overhead.
Experience with Tame in real-world systems, including a popular
commercial Web site, suggests it is easy to adopt and deploy.
1 Introduction
This paper introduces Tame, a system for managing concurrency in
network applications that combines the flexibility and performance of
events with the programmability of threads.
Tame is yet another design point in a crowded space, but one
that has proven successful in real-world deployments.
The system is, at heart, an event-based programming library that frees
event developers from the annoyance of “stack
ripping” [2002 (Adya et al.)].
We have implemented Tame in C++ using libraries and
source-to-source translation, making Tame deployable without compiler
upgrades.
Threads are the more popular strategy for managing
concurrency, but some situations (and programmers) still call for
events [2001 (Dabek et al.), 2004 (Freedman et al.), 1999 (Pai et al.), 2006 (Stribling et al.), 2004 (Provos), 2006 (Walfish et al.)].
Applications with exotic concurrency, such as multicast,
publish/subscribe, or TCP-like state machines, might find threads
insufficiently expressive [2003 (von Behren et al.)].
Certain contexts do not support threads or
blocking [2005 (Cunningham and Kohler), 2001 (Mazières)].
On new platforms, portability can favor events, which require only a
select call and no knowledge of hardware-specific stack or
register configuration [2006 (Dunkels et al.)].
Finally, some event-based servers perform better and use less memory
than threaded
competitors [2004 (Krohn), 1999 (Pai et al.), 2006 (Park and Pai), 2007 (Pariag et al.)].
But a key advantage of events—a single stack—is also a liability.
Sharing one stack for multiple tasks requires stack ripping, which
plagues the development, maintenance, debugging and profiling of
event code [2002 (Adya et al.)].
The programmer must manually split (or “rip”) each function that
might block (due to network communication or disk I/O), as well as
all of its ancestors in the call stack.
Ripping a function obscures its control flow [2005 (Cunningham and Kohler)]
and complicates memory management.
However, the right abstractions can capture events' expressivity while
minimizing the headaches of stack ripping [1991 (Reppy)].
The Tame system introduces powerful abstractions
with implementation techniques suitable for high-performance system
programming.
The specific contributions of the Tame system are:
- A high-level, type-safe API for event-based programming that frees it
  from the stack-ripping problem but is still backwards compatible with
  legacy event code.
- A new technique to incorporate threads and events in the same program.
- A maintainable and immediately deployable implementation in C++, using
  only portable libraries and source-to-source translation.
- An automated memory management scheme for events that does not require
  garbage collection.
Our experience with Tame has shown the interface sufficient to
build and run real systems.
Programmers other than the authors rely on Tame in
educational assignments, research projects [2006 (Tolia et al.)], and even
a high-traffic commercial Web site [16].
2 Related work
The research systems most closely related to Tame are
Capriccio [2003 (von Behren et al.)] and the work of Adya et
al. [2002 (Adya et al.)].
Capriccio is a cooperative threading package that exports the POSIX
thread interface but looks like events to the operating system:
it uses sophisticated stack management to make one stack appear as
many, saving on cycles and memory.
However, the Capriccio system strives to equal events only in terms of
performance and not in terms of expressivity; its
authors note that the thread interface is less flexible than that of
events [2003 (von Behren et al.)].
Adya et al.'s system is a way to combine event-based and threaded code
in the same address space.
The key insight is that a program's style of stack management
(automatic or manual) is orthogonal to its style of task management
(cooperative or preemptive) and that most literature on events and
threads mistakenly claims they are linked.
As in Adya's system, a Tame program can be expressed in a syntax that
has readable automatic stack management (like threads)
yet has explicit cooperative task management (like events).
Tame differs because it extends automatic stack management to
all event code, while “hybrid” event code in Adya's system
still requires manual stack management.
Other systems like SEDA [2001 (Welsh et al.)] use threads and events in
concert to achieve flexible scheduling and intraprocess concurrency.
Tame is complementary to such hybrid systems and can be used as an
implementation technique to simplify their event code.
Many other systems attempt to improve threads' scalability and efficiency.
NPTL in Linux [9] and I/O completion ports in
Windows [22] improve the performance of kernel threads; we
compare Tame with NPTL in our evaluation.
Practical user-level cooperative threading packages include
Gnu PTH, which focuses on
portability [2000 (Engelschall)], and StateThreads, which focuses on
performance [31].
Existing practical event libraries fall into several categories.
The most primitive, such as libevent [27], focus
exclusively on abstracting the interface to OS events (i.e.,
select vs. poll vs. epoll vs. kqueue),
and don't simplify the construction of higher-level events, such as RPC
completions.
The event libraries integrated with GUI toolkits, such as Motif, GTK+,
and Qt, support higher-level events,
but are of course tuned for GUIs rather than general systems programming.
The type-safe libasync event library for C++ is the basis of
our work [2001 (Mazières), 2003 (Zeldovich et al.)].
The protothreads C-preprocessor library [2006 (Dunkels et al.)] gives the
illusion of threads with only one stack.
Protothreads are useful in resource-constrained settings such as
embedded devices and sensor networks, but lacking stacks or closures,
they must use global variables to retain state
and therefore are not suited to building composable APIs.
The Tame system shares implementation techniques
with protothreads and similar C coroutine
libraries [2006 (Dunkels et al.), 10], as well
as the porch program checkpointer [1997 (Ramkumar and Strumpen)].
The Tame language semantics draw from a rich body of
previous work on parallel programming [1998 (Skillicorn and Talia)].
Like condition variables [1974 (Hoare)], Tame's events
allow signaling and synchronization between different parts of a
program, but unlike condition variables, events do not require
locks (or threads, for that matter).
Many parallel programming languages have constructs similar to Tame's
twait: Occam has PAR [1986 (Jones)], and Pascal-FC has
COBEGIN and COEND [1990 (Davies)].
Tame also borrows ideas such as closures and function currying from
functional languages like Lisp [1984 (Steele)] and
Haskell [1992 (Hudak et al.)].
Previous work in modeling threads and concurrency in functional
languages, such as Haskell and ML, has noted a correspondence between
continuations and threads.
A user-level thread scheduler essentially chooses among a set of active
continuations; blocking adds the current continuation to this set and
invokes the scheduler.
For instance, Claessen uses monads in Haskell to implement
threading [1999 (Claessen)].
Li and Zdancewic extend Claessen's technique to combine threads and
events [2007 (Li and Zdancewic)].
Concurrent ML (CML) uses continuations to build a set of concurrency
primitives much like those of Tame [1991 (Reppy)].
Tame and CML have similar events, Tame's rendezvous
shares some properties with CML's choose operator, and Tame's
twait is analogous to CML's sync.
There are differences in performance and function.
CML events are effectively continuations and preserve the equivalent of
an entire call stack, while Tame events preserve only the top-level
function's closure, and CML has no direct equivalent for Tame's
user-supplied event IDs—instead the CML user must manipulate event
objects directly.
Tame's constructs have similar power but are efficiently
implementable in conventional systems programming languages like
C++.
3 Tame Semantics
// Threads
void wait_then_print_threads() {
sleep(10); // blocks this function and all callers
printf("Done!");
}
// Tame primitives
tamed wait_then_print_tame() {
tvars { rendezvous<> r; }
event<> e = mkevent(r); // allocate event on r
timer(10, e); // cause e to be triggered after 10 sec
twait(r); // block until an event on r is triggered
// only blocks this function, not its callers!
printf("Done!");
}
// Tame syntactic sugar
tamed wait_then_print_simple_tame() {
twait { timer(10, mkevent()); }
printf("Done!");
}
Figure 1: Three functions that print Done! after
ten seconds. The first version uses threads; the second Tame version
is essentially as readable.
Tame makes easy concurrency problems easy to express in events (as they
were easy to express in threads).
Figure 1 shows three implementations of a trivial function;
the second Tame version is indeed close to the threaded version in code
length and readability.
The rest of this section describes the Tame primitives and syntactic
sugar.
We also show through examples how the full power of Tame simplifies the
expression of hard concurrency problems, and how Tame allows
users to develop composable solutions for concurrency problems (harder to
express correctly in threads).
3.1 Overview
Tame introduces four related abstractions for handling concurrency:
events, wait points, rendezvous, and safe
local variables.
They are expressed as software libraries whenever possible,
and as language extensions (via source-to-source translation)
when not.
First, each event object represents a future occurrence, such as the
completion of a network read.
When the expected occurrence
actually happens—for instance, a packet arrives—the programmer triggers the event
by calling its trigger method.
The mkevent function allocates an event of type
event<T>, where T is a
sequence of zero or more types.
This event's trigger method has the signature
void trigger(T).
Calling trigger(v) marks the event as having occurred, and
passes zero or more results v, which are called trigger values,
to whomever is expecting the event.
For example:
rendezvous<> r; int i = 0;
event<int> e = mkevent(r, i);
e.trigger(100);
assert(i == 100); // assertion will succeed
When triggered, e's int trigger value is stored in
i, whose type is echoed in e's type.
The wait point language extension, written twait, blocks
the calling function until one or more events are triggered.
Blocking causes a function to return to its caller, but the function does
not complete: its execution point and local variables are preserved in
memory.
When an expected event occurs, the function “unblocks” and
resumes processing at the wait point.
By that time, of course, the function's original caller may have returned.
Any function containing a wait point is marked with the tamed keyword, which
informs the caller that the
function can block.
The first, and more common, form of wait point is written
“twait { statements; }”.
This executes the statements, then blocks until all events
created by mkevent calls in the statements have triggered.
For example, code like “twait { timer(10, mkevent()); }”
should be read as “execute `timer(10, mkevent())', then block
until the created event has triggered”—or,
since timer triggers its event argument after the given number
of seconds has passed, simply as “block for 10
seconds”.
twait{} can implement many forms of event-driven control
flow, including serial and parallel RPCs.
The second, more flexible form of wait point explicitly names a rendezvous
object, which specifies the set of expected events relevant to the wait
point.
Every event object associates with one rendezvous.
A wait point twait(r) unblocks when any one of rendezvous
r's events occurs.
Unblocking consumes the event and restarts the blocked function.
The first form of wait point is actually syntactic sugar for the second:
code like “twait { statements; }” expands
into something like
{
  rendezvous<> __r;
  statements;   // where mkevent calls create events on __r
  while (not all __r events have completed)
    twait(__r);
}
The twait() form can also return information about
which event occurred.
A rendezvous of type rendezvous<I> accepts events with
event IDs of type(s) I.
Event IDs identify events in the same way thread IDs identify threads,
except that event IDs have arbitrary, programmer-chosen types and
values.
A twait(r, i) statement then sets i to the ID(s) of the
unblocking event.
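As a small illustration (our example, not from the paper), integer event IDs
let one rendezvous harvest two timers in whichever order they fire:
// Hedged sketch: each twait(r, which) consumes one trigger notification
// and reports the ID of the event that fired.
tamed two_timers() {
  tvars { rendezvous<int> r; int which; }
  timer(1, mkevent(r, 1));   // event with ID 1
  timer(2, mkevent(r, 2));   // event with ID 2
  twait(r, which);           // first unblocking: which == 1
  twait(r, which);           // second unblocking: which == 2
  printf("last to fire: %d\n", which);
}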
Although wait points are analogous to blocking a thread until a condition variable
is notified, blocking in Tame
has a different meaning than in threads.
A blocked threaded function's caller only resumes when the callee
explicitly returns.
In Tame, by contrast, a tamed function's caller resumes when the
called function either returns or blocks.
To allow its caller to distinguish returning from blocking, a tamed
function will often accept an event argument, which it triggers when it
returns. This trigger signals the function's completion.
Here is a function that blocks, then returns an integer, in
threads and in Tame:
Classes

event<>
    A basic event.
event<T>
    An event with a single trigger value of type T. This value is set when
    the event occurs; an example might be a character read from a file
    descriptor. Events may also have multiple trigger values of types
    T1 ... Tn.
rendezvous<I>
    Represents a set of outstanding events with event IDs of type I.
    Callers name a rendezvous when they block, and unblock on the
    triggering of any associated event.

Keywords & Language Extensions

twait(r[,i]);
    A wait point. Block on explicit rendezvous r, and optionally set the
    event ID i when control resumes.
tamed
    A return type for functions that use twait.
tvars { ... }
    Marks safe local variables.
twait { statements; }
    Wait point syntactic sugar: block on an implicit rendezvous until all
    events created in statements have triggered.

Functions & Methods

mkevent(r,i,s);
    Allocate a new event with event ID i. When triggered, it will awake
    rendezvous r and store trigger value in slot s.
mkevent(s);
    Allocate a new event for an implicit twait{} rendezvous. When triggered,
    store trigger value in slot s.
e.trigger(v);
    Trigger event e, with trigger value v.
timer(to,e); wait_on_fd(fd,rw,e);
    Primitive event interface for timeouts and file descriptor events,
    respectively.

Figure 2: Tame primitives for event programming in C++.
// Threads
int blockf() {
  ... block ...
  return 200;
}
i = blockf();

// Tame
tamed blockf(event<int> done) {
  ... block ...
  done.trigger(200);
}
twait { blockf(mkevent(i)); }
In Tame, the caller uses twait to wait for
blockf to return, and so must become tamed itself.
Waiting for events thus trickles up the call stack until a caller
doesn't care whether its callee returns or blocks. This property is related to
stack ripping, but much simpler, since functions do not split into pieces.
Threaded code avoids any such change at the cost of blocking the entire call stack
whenever a function blocks.
Single-function blocking gives Tame its event flavor, increases its
flexibility, and reduces its overhead (only the relevant parts of the call
stack are saved).
We return to this topic in the next section.
When an event e is triggered, Tame queues a trigger
notification for e's event ID on e's rendezvous
r. This step also unblocks any function blocked on
twait(r).
Conversely, twait(r) checks for any queued trigger notifications
on r.
If one exists, it is dequeued and returned.
Otherwise, the function blocks at that wait point; it will unblock and
recheck the rendezvous once someone triggers a corresponding event.
The top-level event loop cycles through unblocked functions, calling
them in round-robin order when unblocking on file descriptor I/O and
first-come-first-served order otherwise.
More sophisticated queuing
and scheduling techniques [2001 (Welsh et al.)] are possible.
Multiple functions cannot simultaneously block on the same rendezvous.
In practice, this restriction isn't significant since most
rendezvous are local to a single function.
A Tame program that needs two functions to wait on the same condition
uses two separate events, triggering both when the condition occurs.
Tame-based read locks (see Section 7.5) are an
example of such a pattern.
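As a hedged sketch of this pattern (the condition type and its wait and
signal_all methods are illustrative names, not part of the Tame API), each
waiting function supplies its own event, and the notifier triggers them all:
#include <vector>
// Illustrative broadcast helper: each waiter registers a separate event,
// and signal_all() triggers every registered event when the condition holds.
struct condition {
  std::vector<event<> > waiters;
  void wait(event<> e) { waiters.push_back(e); }
  void signal_all() {
    for (size_t i = 0; i < waiters.size(); i++)
      waiters[i].trigger();
    waiters.clear();
  }
};
tamed waiter(condition *c, const char *name) {
  twait { c->wait(mkevent()); }   // block on this function's own event
  printf("%s woke up\n", name);
}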
Finally, safe local variables, a language extension, are variables
whose values are preserved across wait points.
The programmer marks local variables as safe by enclosing them in a
tvars {} block, which preserves their values in a heap-allocated
closure.
(Function parameters are always safe.)
Unsafe local variables have indeterminate values after a wait point.
The C++ compiler's uninitialized-variable warnings tell a Tame
programmer when a local variable should be made safe.
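A brief illustrative example (ours, not from the paper): n survives the wait
point because it is declared in tvars, while scratch does not and must not be
read afterwards.
tamed print_later(event<> done) {
  tvars { int n(42); }          // safe: stored in the heap-allocated closure
  int scratch = 7;              // unsafe: indeterminate after the wait point
  printf("%d\n", scratch);      // OK: used before any wait point
  twait { timer(1, mkevent()); }
  printf("%d\n", n);            // prints 42; reading scratch here would be a bug
  done.trigger();
}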
Type signatures
Events reflect the types of their trigger values, and rendezvous reflect
the types of their event IDs. The compiler catches type mismatches and
reports them as errors.
Concretely, rendezvous is a conventional C++ template type,
defined in a library.
All events associated with a rendezvous of type rendezvous<I> must
have event IDs of type I.
The mkevent function has type:
event<T1,T2,...> mkevent(rendezvous<I> r, const I &i,
T1 &s1, T2 &s2, ...);
The arguments are a rendezvous,
an event ID i, and slot references s1, s2,
... that will store trigger values when the event is later triggered.
C++'s template machinery deduces the appropriate event ID and slot type(s)
from the arguments, so mkevent can unambiguously accommodate
optional event IDs and arbitrary trigger slot types.
The event::trigger method has type:
void event<T1,T2,...>::trigger(const T1 &v1,
const T2 &v2, ...);
When called, this method assigns the trigger values v1,
v2, ... to the slots given at allocation time, then unblocks
the corresponding rendezvous.
Wait points have type twait(rendezvous<I> r, I &i); when the wait
point unblocks, i holds the ID of the unblocking event.
Primitive events
Three library functions provide an interface to low-level operating system
events: timer(), wait_on_fd(), and
wait_on_signal().
Each function takes an event<> e and one or more extra
parameters.
timer(to,e) triggers e after to seconds
have elapsed;
wait_on_fd(fd,rw,e) triggers e once the
file descriptor fd becomes readable or writable (depending on
rw); and
wait_on_signal(sig,e) triggers e when signal
sig is received.
The base event loop that understands these functions is implemented in
terms of select() or platform-specific alternatives such as
Linux's epoll or FreeBSD's kqueue [2001 (Lemon)].
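For illustration (a hedged sketch; the READ constant stands in for whatever
value the rw parameter takes, and the non-blocking setup of fd is assumed), a
tamed function can wait until a descriptor becomes readable before issuing a
non-blocking read:
// Block only this function, not its caller, until fd becomes readable,
// then hand the number of bytes read to whoever waits on done.
tamed read_when_ready(int fd, char *buf, size_t len, event<int> done) {
  tvars { int n; }
  twait { wait_on_fd(fd, READ, mkevent()); }  // READ is a placeholder constant
  n = read(fd, buf, len);                     // fd is assumed non-blocking
  done.trigger(n);
}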
Like all programs based on events or cooperative threads,
a tamed program will block entirely if any portion of it calls a
blocking system call (such as open) or takes a page fault.
Tame inherits libasync's non-blocking substitutes
for blocking calls in the standard library (such as open
and gethostbyname).
For tamed programs to perform well in concurrent settings,
they should use only non-blocking calls and should not induce swapping.
Figure 2 summarizes Tame's primitive semantics.
3.2 Control Flow Examples
Common network flow patterns like sequential calls, parallel calls,
and windowed calls [2003 (von Behren et al.)] are difficult to express in
standard event libraries but much simplified with Tame.
As a running example, consider a function that resolves addresses for
a set of DNS host names.
An initial design might use the normal blocking resolver:
1 void multidns(dnsname name[], ipaddr a[], int n) {
2 for (int i = 0; i < n; i++)
3 a[i] = gethostbyname(name[i]);
4 }
Of course, this function will block all other computation until all lookups
complete.
An efficient server would allow other progress during the
lookup process.
The event-based solution would use a nonblocking resolver, with a
signature such as:
tamed gethost_ev(dnsname name, event<ipaddr> e);
This resolver uses nonblocking I/O when contacting
local and/or remote DNS servers.
(Alternately, Tame's threading support makes it easy to adapt a
blocking resolver for nonblocking use; see Section 4.)
Since gethost_ev's caller can regain control before the lookup
completes, the lookup result is returned via a trigger value:
once the address a is known, the resolver calls
e.trigger(a).
The trigger simultaneously exports the result and unblocks anyone
waiting for it.
Here's how to look up a single name with gethost_ev:
tvars { ipaddr a; }
twait { gethost_ev(name, mkevent(a)); }
print_addr(a);
Without Tame, adapting multidns to use
gethost_ev is an exercise in stack-ripping frustration;
for the gory details, see Figure 3.
Tame, however, makes it simple:
void multidns_nasty(dnsname name[], ipaddr a[], int n,
event<> done) {
if (n > 0) {
// When lookup succeeds, gethost_ev will call
// "helper(name, a, n, done, RESULT)"
gethost_ev(name[0], wrap(helper, name, a, n, done));
} else // done, alert caller
done.trigger();
}
void helper(dnsname *name, ipaddr *a, int n,
event<> done, ipaddr result) {
*a = result;
multidns_nasty(name+1, a+1, n-1, done);
}
Figure 3: Stack-ripped libasync code for looking up n DNS
names without blocking. A simple for loop has expanded into two
interacting functions, obscuring control flow; all callers must likewise
split.
1 tamed multidns_tame(dnsname name[], ipaddr a[],
int n, event<> done) {
2 tvars { int i; }
3 for (i = 0; i < n; i++)
4 twait { gethost_ev(name[i], mkevent(a[i])); }
5 done.trigger();
6 }
multidns_tame keeps all arguments and the local variable
i in a closure.
Whenever gethost_ev looks up a name,
it triggers the event allocated on line 4.
This stores the address in a[i] and unblocks
multidns_tame, after which
the loop continues.
Though the code somewhat resembles threaded code, the semantics are
still event-driven: multidns_tame can return
control to its caller before it completes. Thus, it signals completion
via an event, namely done.
Any callers that depend on completion must use Tame primitives to block
on this event, and thus become tamed themselves.
The tamed return type then bubbles up the call stack,
providing the valuable annotation that multidns_tame and
its callers may suspend computation before completion.
multidns_tame allows a server to use the CPU more effectively
than multidns, since other server computation can take place as
multidns_tame completes.
However, multidns_tame's lookups still happen in series: lookup
i does not begin until lookup i−1 has completed.
The obvious latency improvement is to perform lookups in parallel.
The tamed code barely changes:
1 tamed multidns_par(dnsname name[], ipaddr a[],
int n, event<> done) {
2 twait {
3     for (int i = 0; i < n; i++)
4 gethost_ev(name[i], mkevent(a[i]));
5 }
6 done.trigger();
7 }
The only difference between the serial and parallel versions is the
ordering of the for and twait statements (and that
i doesn't need to be in the closure).
Since both versions have the same signature, the programmer can
switch implementation strategies without changing caller code.
With threads, however,
the serial version could
use one thread to do all lookups, while the parallel version would use as
many threads as lookups.
Tame preserves events' flexibility while providing much of threads'
readability.
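For instance, a hypothetical caller (resolve_and_report is our name, not from
the paper) blocks on the completion event and works unchanged with either
implementation:
// Hedged sketch: the caller is oblivious to whether lookups happen
// serially or in parallel; it only waits for the done event.
tamed resolve_and_report(dnsname name[], ipaddr a[], int n, event<> done) {
  twait { multidns_par(name, a, n, mkevent()); }  // or multidns_tame
  for (int i = 0; i < n; i++)
    print_addr(a[i]);
  done.trigger();
}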
A generalization of serial and parallel control flow is windowed
or pipelined control flow, in which n calls are made in
total, and at most w ≤ n of them are outstanding at any time.
For serial flow, w = 1; for parallel, w = n.
Intermediate values of w combine the advantages of serial and
parallel execution, allowing some overlapping without blasting
the server.
With Tame, even windowed control flow is readable,
although the simplified twait{} statement no longer suffices:
1 tamed multidns_win(dnsname name[], ipaddr a[],
int n, event<> done) {
2 tvars { int sent(0), recv(0); rendezvous<> r; }
3 while (recv < n)
4 if (sent < n && sent - recv < WINDOWSIZE) {
5 gethost_ev(name[sent], mkevent(r,a[sent]));
6 sent++;
7 } else {
8 twait(r);
9 recv++;
10 }
11 done.trigger();
12 }
The loop runs until all requests have received responses (recv ==
n).
On each iteration, the function sends a new request (lines 5–6) whenever a
request remains (sent < n) and the window has room
(sent - recv < WINDOWSIZE).
Otherwise, the function harvests an outstanding request (lines 8–9).
Again, the signature is unchanged, and the implementation is short and clear.
We have not previously seen efficient windowed control flow expressed this simply.
3.3 Typing and Composability
Tame's first-class events and rendezvous, and its distinction between
event IDs and trigger values, improve its flexibility, composability, and
safety.
First, Tame preserves safe static typing without compromising flexibility
by distinguishing event IDs from trigger
values.
Event IDs are like names. They identify events, and are known when the
event is registered; all events on the same rendezvous must have the same
event ID type.
Trigger values, on the other hand, are like results: they are not known
until the event actually triggers. Examples include characters read from
a file descriptor, RPC replies, and so forth.
Event IDs and trigger values are related, of course; when a twait
statement returns event ID i, the programmer knows that event i has
triggered, and therefore its associated trigger values have been set.
In contrast, several other systems return trigger values as part of an event
object; the twait equivalent returns the object, and extracting
its values requires a dynamic cast.
The Tame design avoids error-prone casts while still letting a
single rendezvous handle events with entirely different trigger value
types.
To demonstrate Tame's composability, we'll add timeouts to the following
event-based DNS lookup:
tvars { ipaddr a; }
twait { gethost_ev(name, mkevent(a)); }
We want to cancel a lookup and report an error if a name fails to resolve in
ten seconds.
The basic implementation strategy is to wait on two events, the
lookup and a ten-second timer, and check which event happens first.
tvars { ipaddr a; rendezvous<bool> r; bool ok; }
timer(10, mkevent(r, false));
gethost_ev(name, mkevent(r, true, a));
twait(r, ok);
if (!ok) printf("Timeout");
r.cancel();
The event ID false represents timeouts, while
true represents successful lookup.
The twait statement sets ok to the ID of the event that triggers first,
so ok is false if and only if the lookup timed out.
The r.cancel() call cleans up state associated with the
event that did not trigger.
Figure 4 diagrams the relevant objects.
Figure 4: Relationships between events (boxes) and rendezvous (round box) for
a DNS lookup with timeout.
This code is verbose and hard to follow.
Supporting timeouts on every lookup, or on other types of event,
would require adding rendezvous and timer calls across the
program and abandoning the twait{} syntactic sugar.
Tame can do better.
Its programmers can write an adapter
that can add a cancellation timeout to any event.
The adapter relies on C++'s template support and on Tame's first-class
events, and resembles adapters from higher-level thread packages such as
CML.
Many other event libraries could not express this kind of composable
abstraction, which was
a main motivator for Tame's design.
The adapter simplifies the lookup code to:
tvars { ipaddr a; bool ok; }
twait { gethost_ev(name, add_timeout(mkevent(ok, a))); }
if (!ok) printf("Timeout");
The caller generates an event with
two trigger slots, one for the base trigger value and one for a boolean that indicates success or failure.
Either a successful lookup or a timeout will trigger the event.
Success will set the boolean trigger value to true,
timeout will set it to false.
Thus, after waiting for the event, callers can examine the boolean to check
for timeout. This pattern works with twait{}
statements as well as explicit rendezvous.
Unfortunately, the gethost_ev function requires an event that
takes a single trigger value, namely the IP address.
It will not supply an extra trigger
value unless we change its signature and implementation, which would make
it specific to our timeout adapter—something we'd like to avoid.
But Tame's abstractions let us
transparently interpose between
gethost_ev and its “caller”.
The adapter will set the extra trigger
value.
1 template <typename T> tamed
__add_timeout(event<T> &e_base, event<bool, T> e) {
2 tvars { rendezvous<bool> r; T result; bool rok; }
3 timer(TIMEOUT, mkevent(r, false));
4 e_base = mkevent(r, true, result);
5 twait(r, rok);
6 e.trigger(rok, result);
7 r.cancel();
8 }
9 template <typename T> event<T> add_timeout(event<bool, T> e) {
10 event<T> e_base;
11 __add_timeout(e_base, e);
12 return e_base;
13 }
Figure 5: Code and object relationships for a composable timeout event
adapter, and its use in a DNS lookup.
Figure 5 shows the code and a diagram of the object
relationships.
The real work takes place in __add_timeout, which
creates two events: e_base, which is returned
(and eventually passed to gethost_ev), and an internal
event passed to the timer function on line 3.
The two created events associate
with the rendezvous r local to
__add_timeout.
This is the interposition.
When the timeout triggers, or when e_base triggers (due to
a successful DNS lookup), __add_timeout
will unblock, set the ok slot appropriately, and then trigger
e.
Only this last step unblocks the caller.
The caller observes that ok
and a have been set, but is oblivious to
__add_timeout's intercession; it is
as if gethost_ev set ok itself.
It would be trivial to add other types of “timeout”, such as signal receipt, to
add_timeout; its signature would not change, and neither would
its callers.
Similarly, one-line changes could globally track how many events time out.
We've added
significant additional concurrency semantics with only local changes:
the definition of composability.
3.4 Future Work
Tamed processes do not currently run on more than one
core or CPU.
The production Tame-based applications we know of
consist of multiple concurrent processes cooperating
to achieve an application goal.
OkCupid.com, for instance, uses exclusively multicore and SMP machines.
Its Web front-ends run no fewer than fifty site-specific Tame-based
processes, all of which simultaneously answer Web requests.
When traffic is high, all CPUs (or cores) are in use.
Nevertheless, few changes to Tame would be required for true simultaneous
threading support.
Tame already supports event-based locks to protect data structures from
unwanted interactions (Section 7.5).
As in async-mp [2003 (Zeldovich et al.)], multiple kernel threads
could draw from a shared pool of ready tasks, as restricted by
Tame's current atomicity assumption:
at most one thread of control can be active in any given closure at a time.
Locks enforced by the kernel, or any equivalent technique, could ensure
this invariant.
Tame does not currently interact well with C++ exceptions: an exception
raised in a tamed function might be caught by the event loop.
Some of Tame's limitations are not implementation-dependent but
rather consequences of its approach and semantics.
As mentioned in Section 3, changing a function from
a regular C++ function to a tamed function involves signature
changes all the way up the call stack.
Some developers might object to this limitation, especially those who
export libraries with fixed interfaces.
4 Threads
Tame can interoperate with threads when a thread package is
available, suggesting that the Tame abstractions (wait points,
events and rendezvous) apply to both programming models.
With thread support, Tame simplifies the transition between
threaded and event-style programming, for instance allowing
event-based applications to use threaded software in the C
library (e.g. gethostbyname) and database client
libraries (e.g. libmysqlclient [23]).
We have only experimented with cooperative user-level threading
packages, though kernel-level threads that support SMPs are also
compatible with our approach.
The key semantic difference between threaded and event-based operation
is how functions return.
In event-based Tame, functions can either return a useful value
via a return statement
or block via twait, but not both.
Threaded functions can both block and return a value, since
the caller regains control only when the computation is done.
With thread support, Tame exposes both event and thread return
semantics.
In Tame, a threaded function is one that calls twait but does not
have a tamed return type.
When such a function encounters twait(r), it checks for
queued triggers in r as usual; if none are present, it asks
for a wakeup notification when a trigger arrives in r, and
then yields to another thread.
During the yield, the threading package preserves the function's entire
call stack (including all of its callers), while running other, more
ready computations.
When the trigger arrives, the blocked thread awakes at the twait
call and can return to its caller.
A trivial example using threads in Tame is a reimplementation of
the sleep call:
1 int mysleep(int d) {
2 twait { timer(d, mkevent()); }
3 return d;
4 }
As usual, the call to timer registers an event that will be
triggered after a d second delay.
The function then calls twait on an implicit rendezvous at line 2,
yielding its thread.
After d seconds have elapsed, the main thread triggers the
event allocated on line 2, waking up mysleep
and advancing control to the return statement on
line 3.
Since mysleep is threaded (i.e., does not have a
tamed return value), it returns an actual value
to its caller.
Blocking the current thread uses Tame's existing twait syntax, but
starting a new thread requires a new tfork
function:
tfork(rendezvous<I> r, I i, threadfunc<V> f, V &v);
The semantics are:

1. Allocate e = mkevent(r, i).
2. Fork a new thread. In the new thread context:
   (a) Call f() and store its return value in v.
   (b) Trigger event e.
   (c) Exit the thread.
When the function f completes, the rendezvous r
receives a trigger with event ID i.
This unifies the usually separate concepts of event “blocking”
and joining on a thread.
Code like the following uses tfork and twait{} syntactic sugar to call a blocking library function
from an event-based context:
tamed gethost_ev(const char *name,
event<struct hostent *> e) {
tvars { struct hostent *h; }
twait { tfork(wrap(gethostbyname, name), h); }
e.trigger(h);
}
This starts gethostbyname(name) in a new thread, then
blocks in the usual event-driven way until that thread exits.
At that point, the caller is notified via an event trigger of the
struct hostent result.
5 Memory Management
Tame hides most details of event memory management from
programmers, protecting them at all costs from wild writes and
catching most memory leaks.
For the large majority of Tame code that uses the twait{}
environment, correct program
syntax guarantees correct leakless memory management.
For more advanced programs that use explicit
rendezvous,
Tame uses
reference counting to enforce key invariants at runtime.
The invariants are:

I1. A function's closure lives at least until control exits the function
    for the last time.
I2. Some of an event's trigger slots may be safe local variables, and
    triggering it assigns values to those variables. Thus, a function's
    closure must live at least until events created in the function have
    triggered.
I3. Events associated with a rendezvous r must trigger exactly once before
    r is deallocated. The programmer must uphold I3 by correctly managing
    rendezvous lifetime and triggering each event exactly once.
I4. A closure should be deallocated as soon as doing so does not violate
    I1 or I2.
Of these invariants, I3 depends the most on program correctness.
Some cases are easy to handle.
Tame ignores attempts to trigger an event multiple times (or
aborts, depending on runtime options), and
forgetting to trigger an event in a twait{} environment
will cause a program hang and is thus easily observable.
The difficult case involves managing the lifetimes of explicit rendezvous.
Consider the function broken:
1 tamed broken(dnsname nm) {
2 tvars { rendezvous<> r; ipaddr res; }
3 gethost_ev(nm, mkevent(r, res));
4 // Whoops: forgot to twait(r)!
5 }
The event created on line 3 uses the trigger slot res , a
safe variable in broken 's closure.
The function then exits without waiting for r or
examining res .
This is a bug—an event leak in violation
of I3.
If Tame deallocated broken 's closure eagerly, right after it
exited, then the event's eventual trigger would write its value
into the deallocated memory where the closure used to be.
Tame's answer is a careful reference-counting scheme;
its runtime keeps track of events and closures with C++ “smart
pointer” classes.
For example, event<> objects are actually smart pointers;
the event is stored elsewhere, in a private
object only accessible by Tame code.
If necessary, Tame keeps the event around even
after the user frees it.
There can be circular references among these three types of
objects—for example, a closure contains a local event, which
names a different closure-local variable as a trigger slot.
Tame uses two different types of reference counts to break the
circularity: strong references, which are conventional
reference counts, and weak references, which allow access to
the object only if it hasn't been deallocated.
Figure 6: Memory references for the broken function: (a) after the event
allocation on line 3; (b) after control exits broken on line 5. Weak
references are shown as dotted lines, and strong references as solid lines.
Solid fill indicates a function exit, and striped fill indicates
cancellation.

In outline, Tame keeps the following reference counts:

R1. Entering a tamed function for the first time adds a strong reference
    to the corresponding closure, which is removed only when the function
    exits for the last time. This preserves I1.
R2. Each event created inside a closure holds a strong reference to that
    closure, preserving I2. The reference is dropped once the event is
    triggered.
R3. A rendezvous and its associated events keep weak references to each
    other. The references a rendezvous keeps to its events allow it to
    cancel events that did not trigger before the rendezvous's
    deallocation. Canceling an event clears its R2 reference; any future
    trigger attempt on the event will be ignored, preserving I3. The weak
    references the other way enable an event to update its rendezvous upon
    a trigger.
Figure 6a shows these references in
the broken function following line 3's event allocation.
The most important problem introduced by this reference counting scheme is
due to R2: an untriggered event can cause a closure leak.
Such a leak can be caught by checking the associated rendezvous upon
deallocation for untriggered events.
A rendezvous' deallocation is up to the programmer, but there is an
important and common case in which Tame can intervene.
If a rendezvous was declared as a local variable in some closure,
and that closure has exited for the last time, then no future code
will call twait on the rendezvous, even if the closure
cannot be deallocated yet because of some stray reference.
Thus, Tame amends the reference counting protocol as follows:

R4. Exiting a tamed function for the last time cancels any rendezvous
    directly allocated in that function's closure. Canceling a rendezvous
    cancels all events associated with it.
Actual deallocation occurs only when the closure is deallocated, which
might be some time later.
Figure 6b shows how Tame's reference counting
protocol solves broken's leak.
Control exits the function immediately, forcing r's
cancellation by R4.
Upon cancellation, r checks that all of its events
have triggered.
In this case, the event allocated on line 3 has not triggered, but
Tame cancels it, clearing its R2 reference, which releases the
closure and, in turn, releases r.
Any eventual trigger of the event is ignored.
6 Implementation Details
Tame is implemented as a C++ preprocessor (or source-to-source
translator).
The difficulty of parsing C++ is well known [2].
Tame
avoids as much C++ parsing as possible at the cost of several
semantic warts, which could be avoided with fuller compiler
integration.
6.1 Closures
Each tamed function has one closure with a flat namespace,
restricting C/C++'s scoping.
Internally, the Tame translator writes a new C++ structure for each
tamed function, containing its parameters and its tvars
variables.
This structure gets an opaque name, discouraging the programmer
from accessing it directly.
Programmers are free to use arbitrary C++-stack allocation, as long
as no wait points come between the declaration and use of stack-allocated
variables.
When they do, the underlying C++ compiler generates a warning due to
goto branches in the emitted code (see the next section).
Maintaining Section 5's R4
requires that each closure know
which rendezvous
it directly contains, so it can cancel
them appropriately.
This knowledge is unavailable without fully parsing C++: a closure
might contain an object of type foo, that contains an object
of type bar, that contains a rendezvous, which will
in turn share fate with the closure.
As a first-order heuristic, Tame marks
the beginning and the end of the new closure in memory using simple
pointer arithmetic and associates with the closure all
rendezvous that fall between the two fence posts.
6.2 Entry and Exit Translation
The translation of a tamed function adds to the function one new
entry and exit point per twait statement.
A translated twait statement first checks whether a trigger is pending
on the corresponding rendezvous.
If so, control flow continues past the twait statement as usual; but
if not, the function records the current wait point, adds a function
pointer for this wait point to the rendezvous, and returns to its
caller.
Later, a trigger on an event in the rendezvous invokes the
recorded function pointer, which forces control to reenter the
function and jump directly to the recorded wait point.
The Tame translation shifted the function's parameters and safe local
variables to a closure structure, so the
function can access these values even after reentry.
The Tame preprocessor adds an extra “closure pointer” parameter to each
tamed function.
The closure pointer is null when the function is called normally, causing the
translated function body to allocate and initialize a new closure.
The closure pointer is non-null when the function is reentered at a later
wait point.
The names of parameters and safe local variables are changed to opaque
identifiers to hide them from the function body; instead, local variables
with reference types make these names point into the closure.
This strategy reduces the extent to which Tame must understand C++ name
lookup, since the translation preserves the function implementation's original
namespace.
Multiple entry points are simulated with a switch statement at the
beginning of the function; each case in the switch jumps to a different
label in the function.
There is one label for the initial function entry and one for
each wait point.
Internally, mkevent is a macro that fetches some
specially named variables (such as the current closure, or the
current implicit rendezvous in the case of a twait
environment).
An input of mkevent(rv, w, t1, t2) generates a call of the
form:
_mkevent(__cls, rv, w, t1, t2);
for some closure __cls.
_mkevent heap-allocates a new event object, packing
it with references to all of the supplied arguments.
The resulting event object provides one method,
trigger, which takes trigger value parameters with the types
of t1 and t2.
All of these operations are type-safe through use of C++ templates.
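A rough sketch of the idea (simplified; closure_base and the
rendezvous-notification step are placeholders, and the real Tame classes
differ): the heap-allocated event object captures references to its trigger
slots, so trigger() assigns them with matching types.
// Hedged sketch of a one-slot event implementation.
template <typename T1>
class event_impl {
public:
  event_impl(closure_base *cls, T1 &slot) : _cls(cls), _slot(slot) {}
  void trigger(const T1 &v1) {
    _slot = v1;        // store the trigger value into the caller's slot
    // ... notify the associated rendezvous, then drop the closure reference
  }
private:
  closure_base *_cls;  // strong reference keeping the closure alive (cf. R2)
  T1 &_slot;           // reference to a safe local variable or other slot
};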
Putting these pieces together, the translation of:
1 tamed A::f(int x) {
2 tvars { rendezvous<> r; }
3 a(mkevent(r)); twait(r); b(); }
looks approximately like:
1 void A::f(int __tame_x, A__f__closure *__cls) {
2 if (__cls == 0)
3 __cls = new A__f__closure(this, &A::fn, __tame_x);
4 assert(this == __cls->this_A);
5 int &x = __cls->x;
6 rendezvous<> &r = __cls->r;
7 switch (__cls->entry_point) {
8 case 0: // original entry
9 goto __A__f__entry__0;
10 case 1: // reentry after first twait
11 goto __A__f__entry__1; }
12 __A__f__entry__0:
13 a(_mkevent(__cls,r));
14 if (!r.has_queued_trigger()) {
15 __cls->entry_point = 1;
16 r.set_reenter_closure(__cls);
17 return; }
18 __A__f__entry__1:
19 b();
20 }
Lines 5–6 set up the function body so that references to
x and r refer to closure-resident values.
Lines 7–11 direct traffic as it enters and reenters the
function after twait points.
Lines 12–19 are the translation of the user code.
Lines 14–17 are the translation of the twait(r) call
from the original function.
If no trigger is queued on r, the translation bumps the entry
point (line 15) and tells the rendezvous to reenter
__cls via the method A::fn when a trigger
arrives (line 16).
Once that happens, f will jump to entry point 1 (line 18) and call
b().
6.3 Backwards Compatibility
Our implementation of Tame borrows its event loop and event objects
from the libasync event library [2001 (Mazières)].
The key compatibility feature is to implement events as libasync
callbacks, allowing legacy functions to interface with
tamed functions, and consequently, legacy projects to
incrementally switch over to tamed code.
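For instance (a hypothetical sketch: legacy_read_int and its callback-taking
signature are our assumptions, not part of libasync), a tamed caller can hand
mkevent() to a legacy function wherever a libasync callback taking an int is
expected:
// Hedged sketch: events are libasync callbacks, so a legacy API that
// expects a callback taking an int also accepts an event<int>.
tamed use_legacy(int fd, event<> done) {
  tvars { int value; }
  twait { legacy_read_int(fd, mkevent(value)); }  // assumed legacy function
  printf("read %d\n", value);
  done.trigger();
}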
The Tame prototype implements thread support with the Gnu PTH
library [2000 (Engelschall)].
PTH supplies stubs for blocking network calls such as select,
read and write.
Thus the select call in libasync's select loop transparently
becomes a call to PTH's scheduler.
Similarly, blocking network calls in third party libraries
like libmysqlclient drop into the scheduler and later
resume when the operation completes.
We also had to make libasync call a modified, Tame-aware
select. This select returns early when another thread
in the same process triggers an event that should wake up the current
thread (something that never happens in single-threaded Tame).
7 Experience With Tame
Like any other expressive synchronization system, Tame requires
some mental readjustment and ramp-up time.
In most cases, developers need only the twait{} environment,
which is designed to be
simple to learn and comparable to thread programming.
With only this subset of Tame, programmers become much more
productive relative to vanilla-event coders, and hopefully as
productive as thread programmers.
7.1 Web Server
The latest version of OKWS [2004 (Krohn)], a lightweight
Web server for dynamic Web content, uses the Tame system.
Its most obvious applications are serial chains of
asynchronous function calls, such as startup sequences that involve
IPC across cooperating processes.
These chains are common in OKWS; Tame lets them occupy a single
function body, making the code easier to read.
A more specialized Tame application is in OKWS's templating system,
which allows OKWS Web developers to separate their application logic
from the HTML presentation layer.
In a manner similar to Flash [1999 (Pai et al.)], OKWS uses blocking
helper processes to read templates from the file system; the main server
calls the helper processes asynchronously.
However, since templates can be arbitrarily nested, reading one template
may require many helper calls.
The previous version of OKWS, written without Tame, sacrificed
expressiveness for programmability.
Web site developers had to request all template files they
would ever need when their Web service started up, so that a call
to publish a template in response to a Web request would not block
and force a stack rip.
In the new version of OKWS, publishing a template is an asynchronous
operation, and site developers can therefore publish any file in
the htdocs directory, at any time.
Tame saves developers from the stack ripping problem that previously
discouraged this feature.
7.2 An Event-Based Web Site
OkCupid.com [16] is a dating Web site that uses OKWS as
its Web server.
For several years, its programmers wrote code in the libasync
idiom to manage concurrency, but in early 2006 switched over to
Tame to simplify debugging and to improve productivity.
Currently seven programmers, none of whom are the authors, depend on
Tame for maintaining and developing site features.
The system is easy enough to use that the first programming project
new employees receive is to convert code from the old event-based
system to Tame syntax.
OkCupid.com has found Tame's parallel dispatch particularly useful when
programming a Web site.
When a user logs into the site, the front-end Web logic requests
data from multiple databases to reconstruct the user's preferences
and server-resident state.
To minimize client-perceived latency from disk accesses, these queries
can happen in parallel.
With just libasync primitives, parallelism was hidden in stack-ripped
code and caused bugs.
Tame's solution is the parallelism inherent
in the twait environment.
To call f and g in parallel, then call h once they both complete, a
Tame programmer simply writes:
1 twait { f(mkevent()); g(mkevent()); }
2 twait { h(mkevent()); }
7.3 An NFS Server
A graduate distributed systems class at MIT requires its students to
write a simple Frangipani-inspired [1997 (Thekkath et al.)] file
server that implements the NFS Version 3 protocol [1995 (Callaghan et al.)].
In spring 2006, the students had the option to write their assignment with the
Tame tool.
Four out of 22 students used Tame, most successfully on the source
file that implements the file system semantics (about two thousand
lines long).
Consider, for example, the CREATE RPC, for creating a new
file on the server.
When given this RPC, the server must acquire a lock, lookup a file
handle for the target directory, read the contents of the directory,
write out new directory contents, then write out the file, and
finally release the lock, all the while checking for various error
conditions.
The solutions with legacy libasync involve code split up over no
fewer than five functions, with a stack rip at every blocking point.
Students who used Tame accomplished the same semantics with just one
function.
Quantitatively, the students who used Tame wrote 20% less code
in their source files, and 50% less code in their header files.
Qualitatively, the students had positive comments about the Tame
system and semantics, and strongly preferred writing in the Tame
idiom to writing libasync code directly.
7.4 Debugging
Tame's preprocessor implements source-code line translation,
so debuggers and compilers point the programmer to the line of code in
the original Tame input file.
The programmer need only examine or debug autogenerated code when
Tame itself has a bug.
Programmers can disable line-translation and view human-readable output
from the Tame preprocessor.
Relative to a tamed function in the input file, a tamed function
in the output file differs only in its Tame-generated preamble, at
twait points, and at return statements.
The rest of the code is passed through untouched.
Tame also has debugging advantages over legacy libasync
with unmodified debugging tools.
With legacy libasync, a developer must set a breakpoint at every
stack rip point.
With Tame, a logical operation once again fits inside a single
function body.
As a result, a programmer sets a break point at the suspect function,
and can trace execution until a blocking point (i.e., twait).
After the blocking point, control returns to the same breakpoint at
the top of the same function, and then jumps to the code directly
after the twait statement.
Future work calls for a Tame debugger and profiler.
In both cases, the runtime nesting of closures is Tame's
analogue of the call stack in a threaded program.
Slight debugger modifications could allow walking this graph to
produce a “stacktrace”-like feature, and similarly, measuring
closure lifetime can yield a gprof-style output for
understanding which parts of a program induce latency.
Even under the status quo, programmers can access safe local variables
in debuggers by simply examining the members of a function's closure
and can walk the closure-chain manually if desired.
7.5 Locks and Synchronization
Programmers using events or cooperative threading often falsely
convince themselves that they have “synchronization for free.”
This is not always the case.
Global data on one side of a yield or block point might look different
on the other side, if another part of the program manipulated
that data in between.
With threaded Tame programs, or threaded programs in general, any
function invocation can result in a yield, hiding concurrency
assumptions deep in the call stack.
In practice, a programmer cannot automatically know when
to protect global data structures [2002 (Adya et al.)].
Event-only Tame programs make concurrency assumptions explicit,
since they never yield; they just return to the main event loop
(allowing other computations to run) on either side of a twait
statement or environment.
When Tame programs require atomicity guarantees on either side of
a twait (or yield in the case of threads), they can use a simple
lock implementation based on Tame primitives.
A basic lock class exposes the methods:
tamed lock::acquire(event<> done);
void lock::release();
The acquire method checks the lock to see if it's currently
acquired; if so, it queues the given event, and if not, it triggers
done immediately.
The release method either triggers the head of
the event queue, or marks the lock as available if no
events were queued.
An example critical section in Tame now looks like:
1 tamed global_data_accessor() {
2 twait { global_lock->acquire(mkevent()); }
3 ... touch global state, possibly blocking ...
4 global_lock->release();
5 }
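A minimal sketch of how such a lock might be built from Tame primitives (ours,
not the paper's implementation; here acquire is an ordinary method because
this simple version never blocks internally, whereas the interface above
declares it tamed):
#include <deque>
// Hedged sketch: a FIFO lock where acquire() either proceeds immediately
// or queues the caller's event; release() hands the lock to the next waiter.
class lock {
public:
  lock() : _held(false) {}
  void acquire(event<> done) {
    if (!_held) { _held = true; done.trigger(); }
    else _waiters.push_back(done);
  }
  void release() {
    if (_waiters.empty()) { _held = false; return; }
    event<> next = _waiters.front();
    _waiters.pop_front();
    next.trigger();              // the next waiter now holds the lock
  }
private:
  bool _held;
  std::deque<event<> > _waiters;
};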
We have also built shared read locks with Tame, in which a writer's
release of a lock can cause all queued readers to unblock.
8 Performance Measurements
The Tame implementation introduces potential performance costs
relative to threaded code and traditional event-driven software.
Unlike cooperative-threaded code, and more so than traditional
event libraries (e.g. libasync), Tame makes heavy use of
heap-allocated data structures, such as closures and one-time
events.
Tame also uses synchronization primitives (namely rendezvous
and events) that are potentially costlier than the lower level
primitives in threading packages or libasync.
We investigate the end-to-end cost of Tame relative to a comparable
high-performance system, and conclude that Tame incurs no
performance penalties and makes better use of memory.
8.1 End-to-End Performance
A logical point of comparison for the Tame system is the
Capriccio thread package [2003 (von Behren et al.)].
Like Tame, it provides automatic memory management
and cooperative task management; it is also engineered for
high performance.
The Capriccio work focuses its measurements on the simple
“Knot” web server.
We compared the performance of the original Capriccio Knot server
with a lightly modified, tamed version of Knot.
In selecting a workload, we factored out the subtleties of disk I/O
and scheduling that other work has addressed in
detail [2007 (Pariag et al.)] and focused on memory and CPU
use.
We ran a SpecWeb-like benchmark but used only the smallest files in
the dataset, making the workload entirely cacheable and avoiding
link saturation.
For all experiments, the server was a 2-CPU 2.33 GHz Xeon 5140 with 4GB
memory, running Ubuntu Linux with kernel 2.6.17-10, code compiled
with GCC version 4.1.2, optimization level -O2.
Because Capriccio does not compile with more recent
compilers, it was compiled with GCC version 3.3.5.
Glibc and NPTL were both version 2.4.
Though the machine has four cores, only one was needed in our
experiments (neither Tame nor the other systems tested use
multiple CPUs).
Tame supports Linux's epoll, but its event loop was
configured to use select in our benchmarks.
Capriccio uses the similar poll call in its loop.
We used an array of six clients connected through a gigabit switch, each
making 200 simultaneous requests to the server.
The servers were given a thirty-second “warm-up” time in which they
pulled all of the necessary files from disk into cache, and then
ran for a one-minute test.
The results are shown in Figure 7.
                                  Capriccio      Tame
  Throughput (connections/sec)       28,318    28,457
  Number of threads                     350         1
  Physical memory (kB)                6,560     2,156
  Virtual memory (kB)                49,517    10,740
Figure 7: Measurements of Knot at maximum throughput. Throughput is
averaged over the whole one-minute run. Memory readings are taken
after the warm-up period, as reported by ps.
The high-level outcome is that under this workload, the Capriccio
and Tame versions of Knot achieve the same throughput, but Tame
Knot uses one-third the physical memory and one-fifth the virtual memory.
We note that neither Knot server in this scenario ever blocks: both servers
use 100% of available CPU, even when idle.
A version of the Tame server that blocks when there is no work to be done
achieves a surprising 4,000 fewer connections per second on our benchmark
machine.
Another important optimization was to avoid dropping into the select
loop when outstanding connection attempts could be
accepted [2004 (Brecht et al.)].
Microbenchmarks in Section 8.2 show the select loop
is expensive relative to other Tame primitives.
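As an illustration of this strategy (a sketch in the spirit of Brecht et
al., not Tame Knot's actual code), a server can drain the accept queue
whenever select reports the listening socket readable; the non-blocking
socket and the handle_connection() helper below are our own assumptions.

#include <sys/socket.h>

void handle_connection(int fd);      // hypothetical per-connection setup

void accept_all_ready(int listen_fd) {
    // listen_fd is non-blocking; keep accepting until the queue is empty,
    // so we pay for one select wakeup per batch of connections rather
    // than one wakeup per connection.
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0)
            break;                   // EAGAIN/EWOULDBLOCK (or error): back to the event loop
        handle_connection(fd);
    }
}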
Optimizing Capriccio Knot's performance required manual tuning.
The size of the thread pool must be sufficiently large
(about 350 threads) before Capriccio Knot can achieve maximal throughput.
Threads' stack sizes must also be set correctly—stacks that are too small
risk overflow, while stacks that are too big waste virtual memory—but
the default 128 kB per stack sufficed for these experiments.
Capriccio can automate these parameter settings, but the Knot server
in the Capriccio release does not use automatic stack sizing,
and manual thread settings were more stable in our tests.
Further work could bring Capriccio's memory usage more in line with Tame's,
but we note that Tame achieves its memory usage automatically without
changes to the base compiler.
Memory allocation in Tame Knot happens mainly on
the heap, in the form of event and closure allocations.
In our test cases, we noted 12 closure allocations and
12.6 event allocations per connection served.
We experimented with “recycling” events of common types (such as
event<>s) rather than allocating and freeing them each time.
Such optimizations had little impact on performance, suggesting that
Linux's malloc already handles Tame's allocation pattern efficiently.
8.2 Microbenchmarks
Operation                                       Min   Median    Mean
  Simple function call                           63       63      66
  Simple function call with int allocation      182      196     196
  Tame call (nullfn())                          399      455     463
  wrap()                                        217      224     231
  gettimeofday()                              2,618    2,660   2,781
Figure 8: Cost of system calls and libasync and Tame
primitive operations, measured in cycles.
We performed microbenchmarks to get a better sense of how Tame spends
its cycles in the web benchmark, and to provide baseline
statistics for other applications.
A first cost of Tame relative to thread programming is
closure allocation.
We measured closure costs with the most basic tamed function that uses
a closure:
static tamed nullfn()
{ tvars { int i(0); } i++; }
For comparison, we also measured a trivial function, a function that
performs a small heap allocation, libasync's closure-approximating
function (i.e., wrap), and a trivial system call
(gettimeofday). In each case, an experiment consisted of executing the primitive
10,000 times, bracketed by cycle counter checks.
We ran each experiment 10 times and report averaged results over the 10
experiments, and the median results over all 100,000 calls.
In all cases, the standard deviation over the 10 experiments was within
5% of the mean.
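For concreteness, the following is a sketch of such a measurement harness,
using the x86 cycle counter via __rdtsc and a placeholder primitive; the
names here are illustrative and not taken from the Tame sources.

#include <x86intrin.h>   // __rdtsc()
#include <stdint.h>
#include <stdio.h>

static void trivial_call() { asm volatile(""); }   // stand-in for the primitive under test

static double cycles_per_call(void (*fn)(), int iters) {
    uint64_t start = __rdtsc();
    for (int i = 0; i < iters; i++)
        fn();
    uint64_t end = __rdtsc();
    return (double) (end - start) / iters;
}

int main() {
    for (int run = 0; run < 10; run++)              // 10 experiments of 10,000 calls each
        printf("run %d: %.1f cycles/call\n", run, cycles_per_call(trivial_call, 10000));
    return 0;
}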
Figure 8 summarizes our results:
entering a tamed function is about 2.2 times the cost of a simple
function with heap allocation, and 1.8 times the cost of a
wrap invocation.
A second function, benchfn, measures Tame overhead when
managing control flow:
1 static tamed benchfn(int niter, event<> done) {
2   tvars { int i; }
3   for (i = 0; i < niter; i++)
4     twait { timer(0, mkevent()); }
5   done.trigger();
6 }
Line 4 of benchfn performs Tame's version of a
thread fork and join.
A call to mkevent and later twait is required to
launch a potentially blocking network operation, and to harvest
its result.
Unlike a libasync version of benchfn, the tamed version
must manage closures and an implicit rendezvous, and must jump into and
out of the function once per iteration.
We compare three versions of benchfn: with an implicit
rendezvous, with an explicit rendezvous, and with only
libasync features.
We ran all versions with niter=100, and repeated the
experiment one thousand times.
The results are presented in Figure 9.
All experiments spend a majority of cycles in the core select loop.
The benchfn that uses an implicit rendezvous is only
slightly more expensive, performing within 2% of the native
libasync code.
Tame's low-level implementation special-cases the implicit
rendezvous, reducing memory allocations and virtual method calls
along the critical path.
Hence, the benchfn version that uses an explicit
rendezvous runs about 6% slower still.
We also experimented with replacing libasync's native scheduler with
that of the PTH thread library, as is required when running Tame
with thread support; these appear as the “with PTH core loop” rows in Figure 9.
Based on these benchmarks, we can estimate how Tame Knot's CPU time is
spent.
Tame Knot uses 81,877 cycles for each request. Assuming the microbenchmark
results hold, and given Tame Knot's use of 12.6 events and 12 closure
allocations per request, roughly 7.6% of these cycles are spent on event
management and 6.8% on closure management, with the remainder going
towards system calls and application-level processing.
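As a rough consistency check (our own arithmetic, inferred from Figures 8
and 9 rather than stated directly), taking the mean nullfn entry cost of
463 cycles as the per-closure cost:

  12 × 463 ≈ 5,560 cycles ≈ 0.068 × 81,877 cycles,

and the 7.6% attributed to event management implies roughly
0.076 × 81,877 / 12.6 ≈ 494 cycles per event allocated and triggered.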
Figure 9 also gives similar benchmarks for a version of
benchfn written in pure thread abstractions,
static void *noop(void *) { pthread_exit(NULL); }
void benchfn_thr(int niter) {
  for (int i = 0; i < niter; i++) {
    pthread_t t;
    pthread_create(&t, NULL, noop, NULL);
    pthread_join(t, NULL);
  }
}
and a version using Tame's thread wrappers:
static void noop_tame() {}
tamed benchfn_tame_thr(int niter) {
  tvars { int i; }
  for (i = 0; i < niter; i++)
    twait { tfork(wrap(noop_tame)); }
}
A thread allocation and join is five to seven times as expensive
as an event allocation and join in Tame when using standard
Linux libraries like PTH and NPTL.
Tame's thread wrappers add further overhead relative to native
PTH, since they require locks and condition variables.
Capriccio's thread creation and join is much faster than PTH's or NPTL's,
and competitive with Tame's event primitives.
In practice, threaded programs allocate threads less cavalierly than
benchfn_thr does; they often use thread-pooling techniques to accomplish
more than one operation per thread.
However, examples like those in Sections 3.2
and 3.3 and those in real-world Web site programming
(Section 7.2) benefit greatly from repeated thread
creation and destruction.
Tame primitives are certainly fast enough to support this.
Function                        Cycles   In Core   Outside
  benchfn without Tame           5,251     4,906       344
  benchfn with twait{}           5,331     4,840       491
    with PTH core loop           6,010     5,476       534
  benchfn with twait(r);         5,642     4,887       755
    with PTH core loop           6,565     5,678       887
  benchfn_thr in PTH            37,540         -         -
    in NPTL                     28,803         -         -
    in Capriccio                 7,892         -         -
  benchfn_tame_thr              64,957         -         -
Figure 9: Results from running the benchfn code in 1,000
experiments, with niter=100. Costs shown are the average
cycles per iteration, averaged over all experiments. These costs are
broken down into cycles spent in the core event loop, and time spent
outside.
In sum, Tame's primitive operations are marginally more expensive
than libasync's and roughly equivalent to those of a good thread
package.
The observed costs are cheap relative to real workloads in network
applications.
9 Summary
Tame confers much of the readability advantage of threads
while preserving the flexibility of events,
and modern thread packages have good performance:
the clichéd performance/readability distinction between events and
threads no longer holds.
Programmers should choose the abstraction that best
meets their needs.
We argue that event programming with Tame is a good fit
for networked and distributed systems.
The Tame system has found adoption in real event-based systems, and
the results are encouraging: fewer lines of code, simplified
memory management, and simplified code maintenance.
Our hope is that Tame can solve the software maintenance problems
that plague current event-based systems, while making events
palatable to a wider audience of developers.
Acknowledgments
We thank the following people for helpful comments on the Tame
system and earlier drafts of this paper:
Jeff Fischer,
Jon Howell,
Rupak Majumdar,
Todd Millstein,
Michael Walfish,
Russ Cox, Jeremy Stribling, the other
members of the PDOS group, and the anonymous reviewers.
Russ and Tim Brecht helped us in benchmarking Capriccio, and we thank
Vivek Pai for his “flexiclient” workload generator.
Chris Coyne inspired the authors to think about simplifying events.
Thanks to him and the developers at OkCupid.com who have adopted the system.
We also thank David Mazières for creating the libasync system on which
Tame is based.
Eddie Kohler's work on Tame was supported by the National Science
Foundation under Grant No. 0427202.
The Tame system (with minor syntactic differences) is distributed
along with the libasync libraries, all under a GPL version 2
license, at http://www.okws.org/.
References
-
2002 (Adya et al.)
-
A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur.
Cooperative task management without manual stack management.
In Proc. 2002 USENIX Annual Tech. Conference, June 2002.
- [2]
-
A. Birkett.
Parsing C++.
http://www.nobugs.org/developer/parsingcpp.
- 2004 (Brecht et al.)
-
T. Brecht, D. Pariag, and L. Gammo.
accept()able strategies for improving Web server performance.
In Proc. 2004 USENIX Annual Tech. Conference, June 2004.
- 1995 (Callaghan et al.)
-
B. Callaghan, B. Pawlowski, and P. Staubach.
NFS version 3 protocol specification.
RFC 1813, Network Working Group, June 1995.
- 1999 (Claessen)
-
K. Claessen.
A poor man's concurrency monad.
Journal of Functional Programming, 9(3), May 1999.
- 2005 (Cunningham and Kohler)
-
R. Cunningham and E. Kohler.
Making events less slippery with eel.
In Proc. HotOS-X, June 2005.
- 2001 (Dabek et al.)
-
F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica.
Wide-area cooperative storage with CFS.
In Proc. 18th SOSP, October 2001.
- 1990 (Davies)
-
G. L. Davies.
Teaching concurrent programming with Pascal-FC.
SIGCSE Bulletin, 22(2), 1990.
- [9]
-
U. Drepper.
The native POSIX thread library for Linux.
http://people.redhat.com/drepper/nptl-design.pdf.
- [10]
-
T. Duff.
Duff's device.
http://www.lysator.liu.se/c/duffs-device.html.
- 2006 (Dunkels et al.)
-
A. Dunkels, O. Schmidt, T. Voigt, and M. Ali.
Protothreads: Simplifying event-driven programming of
memory-constrained embedded systems.
In Proc. 2006 SenSys, Nov. 2006.
- 2000 (Engelschall)
-
R. S. Engelschall.
Portable multithreading: The signal stack trick for user-space thread
creation.
In Proc. 2000 USENIX Annual Tech. Conference, June 2000.
- 2004 (Freedman et al.)
-
M. J. Freedman, E. Freudenthal, and D. Mazières.
Democratizing content publication with Coral.
In Proc. 1st NSDI, March 2004.
- 1974 (Hoare)
-
C. A. R. Hoare.
Monitors: an operating system structuring concept.
Communications of the ACM, 17(10), 1974.
- 1992 (Hudak et al.)
-
P. Hudak et al.
Report on the programming language Haskell: a non-strict, purely
functional language version 1.2.
SIGPLAN Notices, 27(5), 1992.
- [16]
-
Humor Rainbow, Inc.
OkCupid.com.
http://www.okcupid.com.
- 1986 (Jones)
-
G. Jones.
Programming in Occam.
Prentice Hall International (UK) Ltd., 1986.
- 2004 (Krohn)
-
M. Krohn.
Building secure high-performance web services with OKWS.
In Proc. 2004 USENIX Annual Tech. Conference, June 2004.
- 2001 (Lemon)
-
J. Lemon.
Kqueue: A generic and scalable event notification facility.
In Proc. 2001 FREENIX, 2001.
- 2007 (Li and Zdancewic)
-
P. Li and S. Zdancewic.
Combining events and threads for scalable network services.
In Proc. 2007 PLDI, Jun 2007.
- 2001 (Mazières)
-
D. Mazières.
A toolkit for user-level file systems.
In Proc. 2001 USENIX Annual Tech. Conference, June 2001.
- [22]
-
Microsoft.
Inside I/O completion ports.
http://www.microsoft.com/technet/sysinternals/information/IoCompletionPorts.mspx.
- [23]
-
MySQL AB.
MySQL.
http://www.mysql.com.
- 1999 (Pai et al.)
-
V. S. Pai, P. Druschel, and W. Zwaenepoel.
Flash: An efficient and portable Web server.
In Proc. 1999 USENIX Annual Tech. Conference, June 1999.
- 2007 (Pariag et al.)
-
D. Pariag, T. Brecht, A. Harji, P. Buhr, and A. Shukla.
Comparing the performance of web server architectures.
In Proc. 2007 EuroSys, March 2007.
- 2006 (Park and Pai)
-
K. Park and V. S. Pai.
Connection conditioning: Architecture-independent support for simple,
robust servers.
In Proc. 2006 NSDI, May 2006.
- [27]
-
N. Provos.
libevent — an event notification library.
http://www.monkey.org/~provos/libevent.
- 2004 (Provos)
-
N. Provos.
A virtual honeypot framework.
In Proc. 13th USENIX Security Symposium, Aug 2004.
- 1997 (Ramkumar and Strumpen)
-
B. Ramkumar and V. Strumpen.
Portable checkpointing for heterogeneous architectures.
In Proc. 27th International Symposium on Fault-Tolerant
Computing, June 1997.
- 1991 (Reppy)
-
J. H. Reppy.
CML: A higher-order concurrent language.
In Proc. 1991 PLDI, June 1991.
- [31]
-
Silicon Graphics, Inc.
State threads for Internet applications.
http://state-threads.sourceforge.net/docs/st.html.
- 1998 (Skillicorn and Talia)
-
D. B. Skillicorn and D. Talia.
Models and languages for parallel computation.
ACM Computing Surveys, 30(2), 1998.
- 1984 (Steele)
-
G. Steele.
Common Lisp.
Digital Press, Jun 1984.
- 2006 (Stribling et al.)
-
J. Stribling, J. Li, I. G. Councill, M. F. Kaashoek, and R. Morris.
OverCite: A distributed, cooperative CiteSeer.
In Proc. 2006 NSDI, May 2006.
- 1997 (Thekkath et al.)
-
C. Thekkath, T. Mann, and E. K. Lee.
Frangipani: A scalable distributed file system.
In Proc. 16th SOSP, October 1997.
- 2006 (Tolia et al.)
-
N. Tolia, M. Kaminsky, D. G. Andersen, and S. Patil.
An architecture for Internet data transfer.
In Proc. 2006 NSDI, May 2006.
- 2003 (von Behren et al.)
-
R. von Behren, J. Condit, and E. Brewer.
Why events are a bad idea (for high-concurrency servers).
In Proc. HotOS-IX, May 2003.
- 2003 (von Behren et al.)
-
R. von Behren, J. Condit, F. Zhou, G. C. Necula, and E. Brewer.
Capriccio: scalable threads for Internet services.
In Proc. 19th SOSP, October 2003.
- 2006 (Walfish et al.)
-
M. Walfish, M. Vutukuru, H. Balakrishnan, D. Karger, and S. Shenker.
DDoS Defense by Offense.
In Proc. 2006 NSDI, May 2006.
- 2001 (Welsh et al.)
-
M. Welsh, D. Culler, and E. Brewer.
SEDA: An architecture for well-conditioned, scalable Internet
services.
In Proc. 18th SOSP, October 2001.
- 2003 (Zeldovich et al.)
-
N. Zeldovich et al.
Multiprocessor support for event-driven programs.
In Proc. 2003 USENIX Annual Tech. Conference, June 2003.
Footnotes
1. In other libraries we have examined, such as CML, the twait
function would return an opaque, system-chosen ID that the
programmer would compare with return values from mkevent.
Though this works, event IDs are far more convenient, particularly when
many events are outstanding.
2. threadfunc<R> is an event whose
trigger method yields a return value of type R. Given the
function int f(), we can create a threadfunc<int>
from a function pointer to f. From the function int
g(int a), we can create a threadfunc<int> by
wrapping g with an integer argument,
as in function currying [1992 (Hudak et al.)].