

3rd Symposium on Operating Systems Design and Implementation (OSDI '99)

February 22-25, 1999

Keynote Address
Session: I/O
Session: Resource Management
Panel: Virtual Machine-based Operating Systems
Session: Kernels
Session: Real-Time
Session: Distributed Systems
Session: Virtual Memory
Session: Filesystems
Works-in-Progress Session
Photos from Conference

Overview by David Sullivan

The third OSDI was, in the words of program chairs Margo Seltzer of Harvard University and Paul Leach of Microsoft Research, designed to "define the charter of operating-systems research for the coming decade" and to address whether OS researchers should be "embracing more divergent areas." The keynote about the World Wide Web by Jim Gettys and a lively panel on virtual-machine-based systems touched on some of these other areas, but the conference also showcased excellent work in the core areas of OS research.

Veteran attendees of conferences like this one remarked on the extremely high quality of the authors' presentations. The talks were clear, well-structured, and engaging, and they provoked a number of thoughtful questions from the audience, which we have attempted to capture in the session summaries. The conference featured a well-attended works-in-progress session, a number of evening BOF sessions, and ample opportunities for attendees to socialize and exchange ideas while enjoying the conference-sponsored receptions, as well as the cuisine and local color of New Orleans's renowned French Quarter.

If the papers presented at the conference can be considered a foretaste of what is to come, there is an abundance of important work to be done during the coming decade of operating-systems research. And OSDI, which in its third instantiation was declared an established tradition by the program chairs, will be there to continue to showcase that work.


The Blind Men and the Elephant

Jim Gettys, Compaq Computer Corp. and the World Wide Web Consortium

Summary by Keith Smith

Jim Gettys is a senior consultant engineer for Compaq Computer Corporation's Industry Standards and Consortia Group and a visiting scientist at the World Wide Web Consortium (W3C) at M.I.T. He is the chair of the HTTP/NG Protocol Design Working Group of W3C.

Gettys's talk took its title from the John Godfrey Saxe poem of the same name, in which a group of blind men encounter an elephant and each man, touching a different part of the elephant, draws a completely different conclusion about what manner of beast they've met. By analogy, Gettys suggested that any attempt to understand or to optimize the Web by considering only one component is probably doomed to failure. The "elephant'' of the Web consists of many components with strong interactions between them. To further complicate matters, all of these components are changing.

The two most significant parts of the Web are what happens on the wire (i.e., HTTP), and the content—HTML, style sheets, images, Java applets, etc. There are numerous interactions between these parts, for example, between Web content and the protocols that are used to access it, or between content and caching. Legal and social interactions are also interesting.

Gettys described HTTP as a "grungy'' protocol: verbose, making poor use of TCP, and failing to separate metadata from content. Version 1.1 of HTTP, now being deployed, addresses many of these problems. It allows for persistent connections and the pipelining of requests, and it supports content encoding for compression. When HTTP 1.1 is well implemented, one TCP connection carrying HTTP 1.1 outperforms four concurrent connections carrying HTTP 1.0. HTTP 1.1 also has better support for caching.
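The win from persistent connections and pipelining is easy to picture: a client can write many requests back-to-back on one reused TCP connection instead of paying connection setup for each. The toy Python sketch below (the host and paths are hypothetical, and this only builds the request bytes rather than talking to a real server) illustrates the idea.

```python
def pipelined_http11_requests(host, paths):
    """Build a single byte stream of pipelined HTTP/1.1 GET requests.

    With HTTP 1.1, all of these requests can be written back-to-back
    on one persistent TCP connection before any response arrives; an
    HTTP 1.0 client would typically open a new connection per request.
    """
    requests = []
    for path in paths:
        requests.append(
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "\r\n"
        )
    return "".join(requests).encode("ascii")

# one byte stream, one TCP connection, three requests
payload = pipelined_http11_requests("example.org", ["/", "/style.css", "/logo.png"])
```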

Naturally, the changes in HTTP 1.1 lead to some interesting interactions. As an example, Gettys discussed the interaction between TCP and data compression, observing that on a high-speed network the CPU cost of compression can grow faster than the savings from sending fewer bytes. He also noted that cache validation performs 2-5 times better in HTTP 1.1 and speculated that this improvement would change how applications are written.

Web content is also changing. The advent of style sheets offers a variety of benefits, reducing the need for repetitive markup and ad hoc images. Style sheets will also reduce overhead by decreasing both the raw number of bytes that need to be transmitted and the number of HTTP requests, since many design elements that are now embedded images can be described more tersely with style sheets.

Other changes in Web content include the move from GIF to PNG images, new content types such as vector formats, and content types that are seldom used at present because of bandwidth constraints (e.g., audio and video).

The next topic Gettys addressed was the caching and proxy infrastructure. He observed that much of the so-called "dynamic content'' in the Web could be cached, as the databases from which it is generated are only updated periodically. Gettys cited data that shows, contrary to some predictions, that the fraction of dynamic content in Web traffic is not increasing.

Currently most caching is done at the periphery of the Net, near the clients. Gettys argued that caching would be more effective if it also occurred in the middle of the Net; the closer a cache is to the server, the more of that server's load the cache can offload. Web caching also has interesting interactions with intellectual-property issues. Can servers trust a proxy not to give their data to the wrong people? This has obvious importance for pay-per-view content.

Another area of interest is the increasing use of HTTP as a transport. More and more metaservices are being implemented on top of HTTP. Frequently, this involves using forms to invoke functionality on remote servers. Gettys pointed out that posting a form is equivalent to a method invocation on a remote object, or like an ioctl() call, without a procedure signature. This is a hole you can drive an elephant through. Current object-oriented technology is too brittle. In the Internet, either end must be able to change independently. In particular, there is no way for such a metaprogram to know when the underlying form has changed. As a result, the metaprogram might inexplicably stop working, or it could start doing undesirable things such as ordering thousands of dollars of products you don't want.

People frequently think that HTTP can be used to tunnel through a firewall. While this might work sometimes, firewall administrators weren't born yesterday. A firewall can look at content before passing on a request, and if its administrators can't tell what is in a request (e.g., because it is encrypted with SSL), they won't let it through.

HTTP is getting extended in all sorts of ways. CORBA, DCOM, Java RMI, and a variety of other protocols are now being run on top of HTTP. The result is frequently poor performance. DCOM and CORBA were originally designed for use on a local network and are even more verbose than HTTP.

Changes in the technology used for local loops will also have an impact on the future of the Web. With the advent of DSL and cable modems, traditional modem technology is dead. A variety of other technologies may also come into play in providing the final link to the user: satellites (e.g., Direct TV), data over 110/220 volts (power companies already have a right of way to your house), and noncellular wireless. Gettys pointed out that the explosion of wireless devices means that bandwidth will still be a concern for the foreseeable future, as these devices often have less bandwidth than today's dial-up modems.

The many changes in the components of the Web, and the complex interactions among them, lead to some questions. Will application developers optimize for speed, or will they be content to keep download time constant? As new facilities become available, will site designers use them? Will future improvements lead to faster sites or to more junk on the page? Gettys feels that tools are the key to the future. Current tools are terrible, frequently generating excessively verbose and invalid HTML. Many current tools don't support important new technologies, such as caching.

In closing, Gettys observed that we are all neophytes. Researchers working with the Web, himself included, are starting to get a sense of the shape of the elephant, but still need to understand the interactions of the various parts before optimizing any single part in isolation.

In the Q&A session, Paul Leach of Microsoft suggested that Gettys's HTTP/NG work suggests that he must have some opinions about the answers to the questions that he closed his talk with. Gettys replied that fundamentally, it's about metaservices. They are being used by more and more programs and make safe extensibility vital. As the Internet evolves, things need to break at the right times.

Greg Minshall of Siara Systems asked where pressure can be applied to make tools better, and whether there is an economic pressure on the tool suppliers to provide better ones. Gettys replied that there should be economic benefits to making tools easier to use and to making the end-user experience better. He also observed that it is difficult now for tools to support cascading style sheets (CSS), as neither Netscape nor Internet Explorer supports them fully, and they implement different subsets of CSS.

Session: I/O

Summary by Keith Smith

Automatic I/O Hint Generation Through Speculative Execution [Best Student Paper]

Fay Chang and Garth A. Gibson, Carnegie Mellon University

Fay Chang presented this work, one of two winners of the award for Best Student Paper. It was one of those wonderfully novel papers that presents a seemingly bizarre idea that turned out to work surprisingly well.

The research was performed in the context of the Transparent Informed Prefetching (TIP) system that Hugo Patterson presented at the 1995 SOSP. In that system, applications were manually modified to provide file-prefetching hints to the kernel. Chang and Gibson's work eliminates the need for manual modification by providing the prefetching hints automatically through speculative execution of the application. The basic idea is that when an application blocks on a read request, a second thread in the same application (the "speculative thread'') continues executing, only instead of issuing read requests, it issues prefetching hints.

One of the major concerns about adding this speculative thread to an existing application is ensuring program correctness. Chang addressed this concern by noting that their system does not allow the speculative thread to execute any system calls except the hint calls, as well as fstat() and sbrk(). Exceptions generated by the speculative thread are ignored. To prevent the speculative thread from modifying data values that the main application thread needs, Chang and Gibson use "software-enforced copy-on-write.'' Before each store, the code determines whether the target memory region has been copied yet. If not, a copy is made. In either case, the store (and subsequent loads) are redirected to the copy. To insulate the main thread from the performance impact of these extra checks, they make a complete copy of the application's text and constrain the speculative thread to execute in that copy.
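The copy-on-write discipline described above can be modeled in a few lines. The following toy Python sketch is mine, not the paper's (which instruments the binary itself): before a speculative store, check whether the target region has been copied, copy it if not, and redirect the store, and all subsequent accesses to that region, to the private copy, leaving the main thread's data untouched.

```python
class SpeculativeMemory:
    """Toy model of software-enforced copy-on-write.

    `shared` stands for the memory the main thread sees. The
    speculative thread accesses memory through this wrapper: its
    first store to a region triggers a copy, and later loads and
    stores to that region are redirected to the private copy.
    """

    def __init__(self, shared, region_size=4):
        self.shared = shared          # main thread's memory (never written here)
        self.copies = {}              # region index -> private copy
        self.region_size = region_size

    def _region(self, addr):
        return addr // self.region_size

    def load(self, addr):
        r = self._region(addr)
        if r in self.copies:          # redirected: region was copied
            return self.copies[r][addr % self.region_size]
        return self.shared[addr]

    def store(self, addr, value):
        r = self._region(addr)
        if r not in self.copies:      # first write: copy the region
            base = r * self.region_size
            self.copies[r] = list(self.shared[base:base + self.region_size])
        self.copies[r][addr % self.region_size] = value

mem = [0, 1, 2, 3, 4, 5, 6, 7]
spec = SpeculativeMemory(mem)
spec.store(1, 99)                     # speculative write goes to a copy
```

The real system avoids the wrapper's per-access cost on the main thread by giving the speculative thread its own copy of the program text, as Chang described.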

They maintain a log of the hints generated by the speculative thread. When a real I/O request is generated by the application, they check that the request matches the next hint in the log. If it doesn't match, they halt the speculative thread, tell the operating system to ignore any outstanding hints, and restart the speculative thread from the current location of the application. This technique allows the speculative thread to catch up if it falls behind and also allows it to get as far ahead of the main thread as possible as long as it is generating accurate hints.
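A minimal sketch of that log discipline (the class and method names here are invented for illustration): the main thread compares each real read against the head of the hint log, and a mismatch discards outstanding hints and triggers a restart of the speculative thread.

```python
from collections import deque

class HintLog:
    """Toy version of the hint log that keeps the speculative
    thread honest: real reads must match the hints, in order."""

    def __init__(self):
        self.hints = deque()
        self.restarts = 0

    def add_hint(self, block):
        """Called by the speculative thread in place of a real read."""
        self.hints.append(block)

    def real_read(self, block):
        """Called on each real read by the main thread. Returns True
        if the next hint matched; on a mismatch, discards all
        outstanding hints and counts a restart of the speculator
        from the main thread's current location."""
        if self.hints and self.hints[0] == block:
            self.hints.popleft()
            return True
        self.hints.clear()            # tell the OS to ignore stale hints
        self.restarts += 1            # speculative thread restarts here
        return False

log = HintLog()
for b in ("a", "b", "c"):
    log.add_hint(b)
ok1 = log.real_read("a")              # matches: speculator is on track
ok2 = log.real_read("x")              # mismatch: flush and restart
```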

Chang next discussed SpecHint, a tool that she and Gibson developed to generate the speculating binary by rewriting the original binary, thus avoiding the need to access application source.

Finally, she presented the results of experiments conducted to evaluate the system. They used three of the test programs from the TIP benchmark suite—XDataSlice, agrep, and gnuld. For all of these programs, the speculating version showed improved performance when compared to the original nonhinting version. In addition, two of the speculating versions showed performance improvements comparable to those achieved by programs with manually inserted hints from the original TIP work.

To measure the overhead of speculating, they ran the speculating versions of their test programs on a system with prefetching disabled. They saw a 1-4% slowdown compared to the unmodified versions of the test programs.

In the Q&A session, Fred Douglis of AT&T Research asked Chang to elaborate on what happens when the speculating thread wants to execute a disallowed system call. Chang replied that for most calls, the system call stub is replaced by code that returns success.

Jochen Liedtke of IBM's T.J. Watson Research Center asked why Chang had chosen to use a software-based copy-on-write scheme rather than a traditional hardware-based approach. Chang replied that they had tried forking a speculative version of the program, but they found the restart costs (i.e., fork()) prohibitive.

IO-Lite: A Unified I/O Buffering and Caching System [Best Paper]

Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel, Rice University

This paper won the conference's Best Paper award.

Vivek Pai started by observing that network-server throughput shapes many people's perceptions of computing speed, because it is a key determinant of end-user response time.

A problem with current operating systems is that they contain many independent buffers and caches in different layers of the system: the filesystem buffer cache, VM pages, network mbufs, application-level buffers, etc. The interactions between these buffers and caches cause two problems—data copying and multiple buffering—both of which degrade overall system performance.

The goal of IO-Lite is to unify all caching and buffering in the system, allowing applications, the network, the filesystem, and IPC to use a single copy of the data safely and concurrently. Concurrent access to the data is accomplished using immutable shared buffers. Programmers manipulate the buffers using a data structure called a buffer aggregate, which provides a level of indirection to the physical data. This technique is similar to Fbufs, which were presented at the 1993 SOSP by Peter Druschel.
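One way to picture a buffer aggregate (this sketch is mine, not IO-Lite's actual API): an aggregate is an ordered list of (buffer, offset, length) slices, so concatenating or trimming data manipulates only the indirection layer, never the immutable bytes underneath.

```python
class BufferAggregate:
    """Toy buffer aggregate: an ordered list of slices into
    immutable buffers. Mutating operations rewrite the slice
    list only; the underlying bytes are shared, never copied."""

    def __init__(self, slices=None):
        self.slices = list(slices or [])   # (buffer, offset, length)

    def append(self, buf, offset=0, length=None):
        if length is None:
            length = len(buf) - offset
        self.slices.append((buf, offset, length))

    def concat(self, other):
        """Logical concatenation: no data is copied."""
        return BufferAggregate(self.slices + other.slices)

    def tobytes(self):
        """Materialize, only when a consumer really needs flat bytes."""
        return b"".join(buf[off:off + ln] for buf, off, ln in self.slices)

hdr = b"HTTP/1.1 200 OK\r\n\r\n"       # immutable shared buffers
body = b"hello, world"
resp = BufferAggregate()
resp.append(hdr)
resp.append(body, offset=0, length=5)  # just the first 5 bytes of body
```

Because the filesystem cache, the network, and the application all hold slices into the same immutable buffers, a Web server built this way can send a cached file without a single copy.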

IO-Lite was implemented in a general-purpose operating system, FreeBSD, as a loadable kernel module. The IO-Lite API includes two new system calls, iol_read and iol_write, which are similar to the generic read and write system calls, except that they operate on buffer aggregates. The API also includes routines for manipulating buffer aggregates.

Pai pointed out that since IO-Lite's buffers are immutable, the combination of the physical address of a buffer and its generation number gives you a unique identifier for the data in the buffer. This UID can be used to cache information about the buffer. The network system, for example, uses this technique to cache the checksum for a buffer.
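Since a buffer's contents can never change while it is referenced, a (buffer address, generation) pair is a stable key, and derived values can be cached against it. This hypothetical sketch shows the checksum-caching trick with names of my own choosing.

```python
import zlib

checksum_cache = {}                    # (buffer id, generation) -> checksum

def checksum(buf_id, generation, data):
    """Return the checksum for an immutable IO-Lite-style buffer,
    computing it at most once per (buffer, generation) pair. A
    reused buffer gets a new generation number, so stale cached
    checksums can never be returned for new data."""
    key = (buf_id, generation)
    if key not in checksum_cache:
        checksum_cache[key] = zlib.crc32(data)
    return checksum_cache[key]

data = b"immutable payload"
c1 = checksum(0x1000, 7, data)         # computed once
c2 = checksum(0x1000, 7, data)         # served from the cache
```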

In comparison tests, Flash-Lite, the authors' Flash Web server ported to IO-Lite, outperformed the original Flash, typically by 40-80%.

Kevin Van Maren from the University of Utah asked whether the IO-Lite buffers are pageable, and if so, what happens when the network tries to DMA to/from one. Pai replied that the IO-Lite buffers are pageable, but for network access they pin the pages in memory. Jose Brustoloni from Lucent's Bell Labs observed that some applications assume a specific layout of data in memory, and asked whether IO-Lite would support such applications without performing a data copy. Pai replied that in such cases they would need to perform one copy.

An attendee from Sandia National Labs asked how IO-Lite handles cache replacement. Pai said that IO-Lite buffers are reference counted. If there are no references to a buffer (other than from the cache), it can be replaced. The VM system pages things out normally, using LRU. An attendee from Veritas Software asked how IO-Lite would work on a system in which file data and file metadata coexist in the same cache. Pai replied that in their implementation platform, FreeBSD, a separate static file cache is used only for metadata. He didn't see any problem, however, using IO-Lite on systems that use the same cache for both, although you would probably want to pin metadata pages down separately.

Virtual Log Based File Systems for a Programmable Disk

Randolph Y. Wang, University of California, Berkeley; Thomas E. Anderson, University of Washington, Seattle; David A. Patterson, University of California, Berkeley

Randolph Wang opened his talk with a simple question, "How long does it take to write a small amount of data to a disk?'' An optimist would consider the time to transfer the data from the head to the disk and answer, "20 microseconds.'' A pessimist would take into account the costs of seek and rotational latencies and answer, "Several milliseconds.'' The goal of this work was to deliver microsecond write performance to applications and make it scale with disk bandwidth.

Wang explained that the problem with traditional filesystems is that the interface between the host filesystem and the disk controller is limited in expressive power, and the I/O bus doesn't scale up. Their solution is to move part of the filesystem into the disk, taking advantage of the CPU power available on today's disks and exploiting the free bandwidth there.

The authors minimize the latency of small synchronous writes by writing them to free sectors or blocks near the current location of the disk head. They call this technique "eager writing.'' To make it work, the disk maintains a table mapping logical blocks to their physical locations. This table is also written using eager writing. They handle recovery by threading the different pieces of the table together into a "virtual log.'' The log is a backward chain, with each record containing a pointer to the previous log record. In the event of power failure, all the system needs to do is to write the tail of the log to disk. Wang said that engineers at disk vendors had indicated that it would be easy to modify the disk firmware to perform this write to a fixed location prior to parking the heads.
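A toy model can make eager writing and the virtual log concrete. In this Python sketch (a simulation under my own naming, not the authors' simulator), each write lands on the free sector nearest the head, each log record points back at the previous one, and recovery rebuilds the mapping table by walking backward from the tail.

```python
class VirtualLogDisk:
    """Toy model of eager writing with a virtual log. A write goes
    to the free sector nearest the head; each log record points
    back at the previous one, so recovery needs only the tail."""

    def __init__(self, n_sectors):
        self.free = set(range(n_sectors))
        self.head = 0
        self.mapping = {}              # logical block -> physical sector
        self.tail = None               # last log record written

    def eager_write(self, logical, data):
        # pick the free sector closest to the current head position
        phys = min(self.free, key=lambda s: abs(s - self.head))
        self.free.discard(phys)
        self.head = phys
        # the record carries a pointer to the previous record,
        # threading the mapping updates into a backward chain
        record = {"logical": logical, "phys": phys,
                  "data": data, "prev": self.tail}
        self.tail = record
        self.mapping[logical] = phys
        return phys

    def recover(self):
        """Rebuild the mapping table by walking the log backward from
        the tail (the one location written at power-down); the first
        record seen for a logical block is its newest version."""
        mapping, rec = {}, self.tail
        while rec is not None:
            mapping.setdefault(rec["logical"], rec["phys"])
            rec = rec["prev"]
        return mapping

disk = VirtualLogDisk(16)
disk.head = 5
disk.eager_write("A", b"x")            # lands near sector 5
disk.eager_write("A", b"y")            # newer version, new nearby sector
disk.eager_write("B", b"z")
```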

Since disk support for eager writing does not yet exist, the authors used a disk simulator to evaluate their system. Wang presented a comparison of a standard implementation of the UNIX File System (UFS) to UFS running on a virtual-logging disk (VLD), as well as of LFS and LFS running on a VLD. The results show a substantial improvement in the performance of small-file creation and deletion when the VLD was used.

Since the performance of eager writing depends on the availability of free disk space near the disk head, the authors evaluated the performance of the different test systems for a variety of disk utilizations. Both UFS systems showed a slight performance degradation as the disk filled. Although LFS performed excellently at lower utilizations, its performance degraded much more quickly as the disk filled.

Sean O'Malley of Network Appliance observed that virtual logging moves a piece of the filesystem onto the disk. As a result, you have two filesystems, one on the disk and one on the host machine. O'Malley asked whether the two filesystems wind up fighting with each other. Wang replied that he had really only moved a piece of information to the disk—knowledge of which blocks are free. If you want to move more functionality onto the disk, you probably need to redefine the disk interface.

Peter Chen from the University of Michigan wanted to know how reliable writing the tail of the log to disk during powerdown is. Wang replied that the people he spoke with thought it was reasonable to write as much as a track of data at powerdown.

Session: Resource Management

Summaries by Xiaolan Zhang

Resource Containers: A New Facility for Resource Management in Server Systems [Best Student Paper]

Gaurav Banga and Peter Druschel, Rice University; Jeffrey C. Mogul, Compaq Western Research Laboratory

This paper was the other winner of the Best Student Paper award. Gaurav Banga began by discussing the motivation behind the work: the fact that general-purpose operating systems provide inadequate resource-management support for server applications. In current systems, processes serve as both resource principals and protection domains. The joining of these two roles is often inappropriate for server applications. In a Web server, for example, many independent tasks share the same process, and much of the processing associated with HTTP connections happens inside the kernel and is unaccounted for and uncontrolled. Banga and his co-authors developed the resource container abstraction to separate the notions of protection domain and resource principal, and thus enable fine-grained resource management.

Resource containers encompass all of the resources used by an application for a particular independent activity. The system associates scheduling information with a resource container, not a process, thus allowing resources to be provided directly to an activity, regardless of how it is mapped to threads. To be effective, resource containers require a kernel-execution model in which kernel processing can be performed in the context of the appropriate container.
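The separation of resource principal from process can be modeled in a few lines. All names in this sketch are illustrative, not the paper's API: scheduling state hangs off the container, and any thread, including in-kernel packet processing, charges its work to whichever container it is bound to.

```python
class ResourceContainer:
    """Toy resource principal: accumulates CPU charges
    independently of any particular process or thread."""

    def __init__(self, name, share):
        self.name = name
        self.share = share             # guaranteed fraction of the CPU
        self.usage = 0.0

    def charge(self, cpu_ms):
        self.usage += cpu_ms

def pick_next(containers):
    """Decay-usage-style choice: run the container that is
    furthest below its guaranteed share."""
    total = sum(c.usage for c in containers) or 1.0
    return min(containers, key=lambda c: (c.usage / total) - c.share)

high = ResourceContainer("high-priority client", share=0.7)
low = ResourceContainer("low-priority clients", share=0.3)

# kernel packet processing is charged to the owning container,
# not to an anonymous kernel thread or the server process
for _ in range(8):
    low.charge(10)                     # burst of low-priority packets
high.charge(10)
```

Because the low-priority burst is charged to its own container, the scheduler still favors the high-priority activity, which is the behavior Banga's first experiment demonstrated.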

The authors implemented resource containers in Digital UNIX 4.0. They modified the CPU scheduler to implement a hierarchical decay-usage scheduler that treats resource containers as its resource principals. They also modified the network subsystem to associate received network packets with the correct resource container, allowing the kernel to charge the processing of each packet to its container.

Banga presented results showing that resource containers are quite lightweight, and he discussed two experiments designed to test their effectiveness. In the first, one high-priority client and several low-priority clients request documents from a Web server. Without resource containers, the response time for the high-priority client increases greatly as the number of low-priority clients increases because of added networking processing in the kernel. With resource containers, the response time also increases, but in a much more controlled way.

In the second experiment, a Web server's throughput for static documents was measured in the face of an increasing number of CGI requests. Without resource containers, the throughput of the static requests decreases dramatically as the number of CGI requests increases. But resource containers can be used to create a "resource sandbox" around the CGI connections, allowing the static throughput to remain constant.

Banga concluded by emphasizing that resource containers are purely a mechanism and are general-purpose in nature. A lot of recent scheduling work can be used in conjunction with resource containers.

In questions following the talk, Eric Eide of the University of Utah sought to confirm that resource containers do not provide an authorization scheme. Banga said that this is indeed the case; resource containers are orthogonal to protection. Eide then asked if, when issuing a read from a file, you need to build a new container or can just use the default container with which you're associated. Banga said that either approach could be used. Timothy Roscoe of Sprint, addressing a point also raised by Mike Jones of Microsoft Research, pointed out that schemes like this one traditionally encounter problems when a server (such as an X server) and its clients are on the same machine and thus share the same resources. He asked how resource containers would be used in such cases. Banga replied that resource containers just provide a mechanism, and application-specific policies need to be built on top of them. Michael Scott of the University of Rochester said that he was puzzled by the criteria used to decide what goes into a resource container; things seem to be grouped together that are not logically coherent. He wondered if resource containers could be used for logically coherent things and then given different amounts of resources using something like lottery scheduling. Banga agreed that this could be done.

Defending Against Denial of Service Attacks in Scout

Oliver Spatscheck, University of Arizona; Larry L. Peterson, Princeton University

Oliver Spatscheck started by asking why denial-of-service (DoS) attacks are a concern. The short answer is the Internet, and the situation is getting worse since routers now allow third-party code to run. Spatscheck outlined a three-step process to defend against DoS attacks: (1) account for all resource usage; (2) detect violations of your security policy; and (3) revoke resources when violations occur.

Spatscheck presented the Scout system, which allows this three-step process to be implemented in the context of special devices that attach to the Internet, such as routers, firewalls, WWW servers, and other network appliances. Since Scout was designed with such devices in mind, it addresses the need to support soft realtime, but does not provide complete, general-purpose OS support.

Scout builds a network appliance by combining modules, such as a disk module, a filesystem module, HTTP, TCP and IP modules, and a network device-driver module. Scout also introduces the concept of a path, to which distributed resources are bound. Paths contain global state pertaining to a particular I/O data flow, and each of them has its own thread pool. Paths provide automatic mapping of shared memory, as well as automatic thread migration along the modules contained in the path.

Spatscheck also described Escort, the security architecture for Scout. Escort allows the modules that have been configured into a system to be isolated into separate protection domains. A configurable resource monitor is responsible for resource protection and accounting. All resource usage is charged to an owner, which can be either a path or a protection domain. A DoS attack is detected by a violation of the configurable security policy. The resource monitor can deal with it in one of three ways, depending on how the policy has been configured: (1) suspend data delivery; (2) deny resources; (3) destroy the owner.
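The three-step defense maps naturally onto a small policy loop. This sketch (entirely my own naming, not Escort's interface) charges every resource to an owner, checks a configured limit, and applies the configured response on a violation.

```python
SUSPEND, DENY, DESTROY = "suspend", "deny", "destroy"

class ResourceMonitor:
    """Toy Escort-style monitor: account, detect, revoke."""

    def __init__(self, limits, response):
        self.limits = limits           # owner -> allowed resource units
        self.response = response       # configured response to violations
        self.usage = {}
        self.actions = []

    def charge(self, owner, units):
        """Step 1: account for all resource usage by its owner
        (a path or a protection domain)."""
        self.usage[owner] = self.usage.get(owner, 0) + units
        # Step 2: detect a violation of the configured policy.
        if self.usage[owner] > self.limits.get(owner, float("inf")):
            # Step 3: revoke, per the configured response.
            self.actions.append((self.response, owner))
            self.usage[owner] = 0
            return False
        return True

mon = ResourceMonitor(limits={"syn-path": 100}, response=DENY)
for _ in range(150):
    mon.charge("syn-path", 1)          # a SYN flood hammers one path
```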

The authors implemented a Web server in Scout with two test configurations—one protection domain per module, and all modules in the same protection domain. Accounting overhead is about 8% for Scout with a single protection domain, and a factor of four for six protection domains. All interrupts and almost all cycles were correctly accounted for. With a SYN-flood attack of 1,000 SYNs per second, regular clients slowed down by only 5-15%.

At the conclusion of the presentation, one audience member asked if Spatscheck had any ideas or mechanisms for defending against domain-directed attacks, as opposed to path-directed ones. Spatscheck said that one method would be to have multiple instantiations of modules. For example, you could have two different IP networks; if one is corrupted, you can replace that protection domain with another one.

David Black from EMC asked how their system can distinguish between good and bad packets in situations in which IP sources are forged. Spatscheck provided three ways of addressing this: use firewalls to block forged IPs; authenticate IP addresses; let it go and see if it violates the policies in place. The last approach might have a larger performance impact than the other two, but you can still revoke all of the resources after detection.

Greg Minshall from Siara Systems asked how their I/O buffers compared with IO-Lite. Spatscheck explained that one writer creates an I/O buffer and, once a reader locks it, the writer loses its privileges. The advantage over IO-Lite is that Scout's path abstraction tells you the protection domains into which a buffer should be mapped.

Self-Paging in the Nemesis Operating System

Steven M. Hand, University of Cambridge

Steven Hand began by explaining that the goal of his work is to support simultaneously both continuous media/soft realtime applications and more traditional, "standard" applications. He noted that conventional operating systems offer poor support for several reasons. First, they schedule the CPU using priority schemes which tell you who should get the processor, but not when or how much. Second, contention for other resources is arbitrarily arbitrated. Third, there can be "QoS crosstalk" as a result of the kernel's performing a significant amount of work on behalf of applications. In particular, applications that repeatedly cause memory faults will degrade overall system performance.

Hand's work requires every application to deal with all of its own memory faults using its own concrete resources. "Self-paging" involves three principles: (1) control—resource access is multiplexed and resources are guaranteed over medium-term time scales; (2) power—high-level abstractions are not imposed on the underlying resources, giving applications greater flexibility; and (3) responsibility—applications must carry out their own virtual-memory operations.

More specifically, self-paging requires that the system grant/allocate physical frames explicitly, dispatch all memory faults to the faulting application, allow applications to map/unmap their own pages, and provide low-latency protected access to the backing store.
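A toy rendering of those requirements (all names invented for illustration): the kernel's only jobs are to grant frames explicitly and to dispatch each fault to the faulting application, which must resolve it from its own allocation.

```python
class SelfPagingKernel:
    """Toy self-paging kernel: it grants frames and dispatches
    faults; each application resolves its own faults using only
    the frames it was explicitly granted."""

    def __init__(self):
        self.grants = {}               # app -> list of free frames

    def grant_frames(self, app, frames):
        self.grants[app] = list(frames)

    def fault(self, app, page):
        # dispatch to the faulting application, never to a global
        # kernel pager: no QoS crosstalk between applications
        return app.handle_fault(page, self.grants[app])

class App:
    def __init__(self):
        self.page_table = {}

    def handle_fault(self, page, my_frames):
        if not my_frames:
            # out of frames: evict one of *our own* pages, so the
            # cost of our paging activity falls only on us
            victim, frame = self.page_table.popitem()
            my_frames.append(frame)
        frame = my_frames.pop()
        self.page_table[page] = frame  # map the page ourselves
        return frame

kernel = SelfPagingKernel()
app = App()
kernel.grant_frames(app, [0, 1])       # explicit grant of two frames
kernel.fault(app, "p1")
kernel.fault(app, "p2")
kernel.fault(app, "p3")                # forces an eviction within the app
```

A thrashing application in this model only ever evicts its own pages; it cannot degrade other applications' memory performance, which is precisely the crosstalk Hand set out to eliminate.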

The fourth requirement is fulfilled by the user-safe backing store (USBS). The USBS is composed of the swap filesystem, which is responsible for admitting an application into schedule and allocating it some region of the disk for swap space, and the user-safe disk, which schedules I/O requests according to their associated QoS.

Hand mentioned that one lesson he learned in doing this work is that exposing low-level details works better on RISC architectures. The results on an x86 are very poor, in part because of the higher kernel/user boundary crossing overhead. In addition, he mentioned that while granting each application its own disk time slice allows applications to optimize their own access patterns, large seeks may be required every time you switch between applications. It's possible to do better globally if you allow applications to interleave. It's an open research question as to what the tradeoffs are between global performance and local optimizations, and how to get the best of both worlds.

Miche Baker-Harvey from Equator Technologies asked about how truly shared resources (e.g., a page fault on a shared library) are handled. Hand said that applications involved in sharing might need to deal with a third party. For shared libraries, a system service charges everyone a fair share. I/O can also involve a lot of sharing; when DMA is used, you need to pin down the memory. Jay Lepreau of the University of Utah asked if you could have several protection domains cooperating on a task and sharing the paging responsibilities among them. Hand said that they could and pointed out that Nemesis's activation domains are orthogonal to protection domains, and they can be assigned to multiple accountable entities.

Eric Cota-Robles of Intel asked about the level of granularity used for reservations in the system, as well as how reservations for different resources interact. Hand pointed out that in choosing the level of granularity, there is a tradeoff between quality and overhead. He chose 250ms because that works as well as 100ms and 500ms; for anything smaller than 100ms, the overhead of context switches becomes unacceptable. As far as different resources are concerned, he mentioned that Nemesis uses EDF for both disk and CPU. Allocating resources independently and making sure they interact well is tricky; it is something they are still working on.

Finally, Bruce Lindsay from IBM pointed out that we've talked about external pagers for almost 10 years (about 45 Web years). He asked if, with all this talk, there had been any commercial utilization of external pagers. Hand gave as an example any system based on Mach.

Sources and documents for this system are available at <>.

Panel: Virtual Machine-based Operating Systems

Summary by Zheng Wang

Panelists: Ken Arnold, Sun Microsystems Inc.; Thorsten Von Eicken, Cornell University; Wilson Hsieh, University of Utah; Rob Pike, Bell Labs; Patrick Tullmann, University of Utah

Moderator Paul Leach introduced the topic of the panel discussion: what's new and what's not in virtual-machine-based operating systems. Today, when someone talks about a virtual machine (VM), it is usually based on the Java language. In days of yore, however, the language could be Smalltalk, Lisp, or Pascal, among others. Is Java just another programming language, or has it introduced something new to OS research?

Ken Arnold defined a virtual machine as a system with a uniform instruction set, uniform system calls, uniform libraries, uniform semantics, and a uniform platform. He then presented his view of a computer as a system hooked to the network instead of a system based on local disks. With network connections, the Internet can become "my computer." Compared to "your computer," "my computer" is bigger; it grows geometrically and gets better exponentially. Meanwhile, "my computer" breaks more often (when the remote resource fails) and gets fixed more easily (when a similar resource is found elsewhere on the network). Also, "my computer" can be "our computer." For this situation, VM-based OSes are the only solution. To take advantage of "my computer," each node should provide the code for using its service, that is, the ways to talk to this node. This is not particular to Java, but a Java VM is an example of a homogeneous system over a network. Finally, Arnold claimed that "everything else is wasted research." He qualified the statement by saying that "wasted" in this case does not mean "useless." His point was that there are only a small group of people working on VMs compared to the number of questions to be answered, and that is a wasted opportunity.

Thorsten Von Eicken started by comparing the traditional, virtual-memory-based OSes with new, virtual-machine-based OSes. He noted that the concepts of page-level memory protection, user/kernel privilege, and hardware primitives in virtual-memory-based OSes are comparable with the concepts of object-level protection, module/class privilege, and type-system primitives in virtual-machine-based OSes. "What's new" in virtual-machine-based OSes includes the Java language, real protection (against malicious code), resource management (e.g., enforcing limit, revocation, termination), and safe-language research. Von Eicken said what's interesting here is the balance between sharing and resource management. This introduces a lot of trade-off possibilities among program compile time, link time, load time, and run time. Pitfalls ("showstoppers") include Java's speed, garbage collection, debugging, and the need to design around the Java class loader. In summary, Von Eicken stated that virtual-machine-based OSes complement (but don't replace) virtual-memory-based OSes and provide a new playground for revisiting old OS issues.

Wilson Hsieh started by drawing a comparison between virtual machines and operating systems. Since both VMs and OSes are layers above the hardware on which the applications run, Hsieh concluded that an OS is already a VM. So what are the distinctions between VMs and OSes? According to Hsieh, one issue is multiple-language support. OSes usually support many languages, while VMs typically don't. Another is whether to do certain tasks, such as protection and resource management, in software or hardware. OSes generally provide a way for users to talk about resources. This is exemplified by the functionality of the C language. VMs and their associated languages typically hide resource management from users. VMs can either let the underlying OS do the work or handle it themselves. In the latter case, could the VM be smarter than most of today's OSes?

During Hsieh's talk, questions were raised about the performance of VMs, especially that of the existing Java VM, compared to native OSes. Ken Arnold argued that the performance of the Java VM is sufficient, considering that most people have fast computers. He pointed out that Java bytecode can now run within a factor of two of the speed of native code. Some audience members suggested that Java's speed problems come from memory-hierarchy performance and application load time.

Rob Pike of Bell Labs had a fairly different point of view. Pike was a leader on the Inferno project, in which a virtual machine was integrated with the operating system underneath. Pike listed several advantages of the VM approach: It provides real portability; programs can be compact (the VM itself is large, however); it allows a single address space with the VM providing protection; integrating the VM and OS eliminates the overhead of the VM entering the kernel; and the OS becomes a runtime library for the VM and provides resource management. However, Pike's conclusion after the project was that we should not merge OSes and VMs. He cited a number of reasons: there is no possibility of compartmentalization since the memory, resources, and execution of the VM and OS are intermingled in horrific ways; debugging is a nightmare; and the storage models of the two are incompatible, as are their process models (scheduling schemes will be either separated and hard or mingled and messy).

The last panelist, Patrick Tullmann, observed that a VM runtime is comparable to that of an OS. So the question is, why do we bother with VMs, and if we structure VMs as OSes, where is the win over hardware-based OSes? Tullmann gave two answers. One is fine-grained sharing, where computers share not only data but also code. The second is optimal resource accounting. One example is a malloc-less server/kernel, where the server resources are allocated by the client and passed down to the server.

During the Q&A session, David Black of EMC Corporation commented that the panelists' talks sounded like a "solution seeking a problem." He asked the panelists to name the real problem. Tullmann's answer was fine-grained sharing (and the fact that graduate students need something to work on). Arnold answered with "plug and play." He used a real-life example of having to look for the Windows CD-ROM in order to install a new printer, suggesting that the printer should be able to handle itself. Pike said the problem is supposed to be portability. Von Eicken pinpointed security problems from running untrusted code, but Pike disagreed, questioning the necessity of downloading methods through a VM. Hsieh said the problem lies in structuring software.

Session: Kernels

Summary by Rasit Eskicioglu

Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System

Ben Gamsa, University of Toronto; Orran Krieger, IBM T.J. Watson Research Center; Jonathan Appavoo and Michael Stumm, University of Toronto

Orran Krieger began by pointing out that the performance of a simple multithreaded counter operation (increment/decrement) on today's modern shared-memory multiprocessors is orders of magnitude worse than on older systems, and argued that OSes for modern systems should be designed in a fundamentally different way. The main goal of Tornado is to maximize locality and thus reduce this performance problem.

Tornado uses an object-oriented design; every physical and virtual resource in the system is represented by an object. These objects encapsulate all the data structures and locks necessary to manage the resources. This approach has two key benefits. First, encapsulation eliminates any sharing in the underlying operating system. Second, an object-oriented approach allows multiple implementations of objects, enabling the system to choose the best implementation for a given situation at runtime. Additional innovations in Tornado that help maximize locality and concurrency include clustered objects, protected procedure calls, and semi-automatic garbage collection.
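The locality benefit of per-resource encapsulation can be illustrated with a small sketch. This is not Tornado's implementation (which is in C++ with clustered objects); it is a minimal, hypothetical Python model showing the design point: each resource object carries its own lock and counters, so threads working on different resources never touch shared data.

```python
# Illustrative sketch only: each resource is an object that encapsulates
# its own lock and state, so there is no globally shared kernel data to
# contend on. The Region class and its methods are hypothetical.
import threading

class Region:
    """One virtual-memory region; its lock covers only its own state."""
    def __init__(self):
        self.lock = threading.Lock()
        self.faults = 0

    def handle_fault(self):
        # No global lock and no shared counter: threads faulting on
        # different regions proceed entirely in parallel.
        with self.lock:
            self.faults += 1
```

Two threads faulting on two different Region objects acquire two different locks, which is exactly the kind of independence that a single global page-fault lock (or a single shared counter) would destroy on a shared-memory multiprocessor.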

Krieger next described the experimental platform. The reported results are based on a 16-processor NUMAchine prototype currently being developed at the University of Toronto, as well as the SimOS simulator from Stanford. Further experiments were performed on some modern commercial multiprocessor systems. Tornado demonstrated large performance gains on the microbenchmarks. Future research will address scalability and running real applications. The core technology has already been incorporated into the Kitchawan OS from IBM Research, which runs on PowerPCs and x86 processors.

Finally, Krieger summarized some of the lessons they had learned. If used carefully, an object-oriented strategy is good, and the cost of this approach is more than compensated for by the locality achieved. Also, indirection with translation tables is a useful tool. Furthermore, an object-oriented strategy, in conjunction with clustered objects, allows fine-tuning of the system using incremental optimization.

Marvin Theimer from Microsoft Research asked Krieger to give a sense of the percentage of the Tornado kernel that they found impossible to parallelize in the manner described in the talk. Krieger replied that they hadn't encountered anything that couldn't be parallelized. He pointed out that there was an important tradeoff here. In Tornado, they get a big win for policies that need only local information to make decisions, but there may be areas where they will lose because they can't make global policy decisions.

Interface and Execution Models in the Fluke Kernel

Bryan Ford, Mike Hibler, Jay Lepreau, Roland McGrath, and Patrick Tullmann, University of Utah

Jay Lepreau discussed the implementation and API of Fluke, a microkernel-based operating system motivated by nested virtual machines. A process is able to implement some OS services for its children and have the rest taken care of by whoever provides those services to the parent.

Lepreau described two models of execution, the "process model" and the "interrupt model." In the process model of execution, each thread of control has a kernel stack, and a blocking thread's state is implicitly stored on its stack. Most monolithic kernels, such as Linux, UNIX, and Windows NT, fall into this category. On the other hand, in the interrupt model of execution, there is only one kernel stack per processor, and the required thread state is explicitly stored in the thread control block (TCB) when the thread blocks. This category includes systems such as V, QNX, and the Exokernel implementations.

All conventional kernel APIs belong to the process model. They support long-running operations and maximize work per kernel call. In this model, thread states are inexact or unobtainable. In the interrupt model (a.k.a. "atomic" APIs), thread states are always well defined and visible, and per-thread kernel state is minimized.
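The contrast between the two models can be sketched in a few lines. This is an illustrative analogy, not the Fluke API: the process model is modeled with a Python generator, whose loop state lives implicitly in its suspended frame (the "kernel stack"), while the interrupt model stores every piece of resumable state explicitly in a thread-control-block object, so each step is atomic and the thread's state is always well defined.

```python
# Hypothetical sketch of the two execution models; names (TCB, copy_op,
# copy_step) are illustrative and not taken from any real kernel.

# Process model: a blocked operation's state lives implicitly in its
# stack frame -- modeled here with a generator per thread.
def copy_op(src, dst):
    for item in src:      # the loop index lives on the "kernel stack"
        dst.append(item)
        yield             # "block"; state is implicit in the frame

# Interrupt model: one shared "kernel stack"; all resumable state is
# stored explicitly in a thread control block (TCB).
class TCB:
    def __init__(self, src, dst):
        self.src, self.dst, self.pos = src, dst, 0

def copy_step(tcb):
    """Run one atomic step; everything needed to resume is in the TCB."""
    if tcb.pos < len(tcb.src):
        tcb.dst.append(tcb.src[tcb.pos])
        tcb.pos += 1
        return True       # more work remains
    return False          # operation complete
```

In the second form the thread's progress (`tcb.pos`) is always visible and exact between steps, which is what makes services like checkpointing and migration straightforward; in the first form that state is buried in a suspended stack frame.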

The Fluke kernel exports an atomic API while also supporting long-running operations. It can support both the process and interrupt execution models through a build-time configuration option. The basic properties of the Fluke API include promptness, correctness, and completeness, as well as interruptible and restartable kernel calls. These properties greatly facilitate services such as user-level checkpointing or process migration. They also simplify permission revocation and facilitate application development. Unfortunately, the atomic API has some disadvantages: extra effort is needed to design it, intermediate system calls are needed, and there is extra overhead from restarting system calls.

Concerning performance, Lepreau addressed preemption latency, rollback overhead, and speed. As expected, a fully preemptive kernel (only possible in the process model) always allows much smaller and more predictable latencies. Non-preemptive kernels for both models exhibit highly variable latencies, causing a large number of missed events. On the other hand, even with only a single preemption point, the preemptible interrupt model fares well on the benchmarks he discussed. Similarly, the rollback cost during a page fault is very reasonable compared to the already high cost of page-fault handling. As an unoptimized, experimental kernel, Fluke does not show any major slowdowns for any of the five execution model/preemptibility combinations.

Lepreau concluded that an atomic API is easy to implement and that OS folks can do as well as "those hardware guys" who provide fully interruptible "long" instructions such as block move.

After the talk, Margo Seltzer of Harvard said that she was reminded of the Lauer and Needham paper that discussed the equivalence of message-passing and shared-memory OS architectures, so she was expecting to see similar conclusions about the interrupt and process execution models. Lepreau replied that if the API and implementation are done right, there is an equivalence between these two models. However, the two models can have performance differences, as he and his colleagues discovered.

Jon Shapiro of IBM observed that both EROS and KeyKOS have an atomic API and that 25% of the restart cost for those systems was the user-to-supervisor reentry. He was curious to know where Fluke restarted (i.e., in user mode or kernel mode). Lepreau said that Fluke restarts in user mode, just on the other side of the system-call boundary.

The Fluke sources are available at <>.

Fine-Grained Dynamic Instrumentation of Commodity Operating System Kernels

Ariel Tamches and Barton P. Miller, University of Wisconsin

Ariel Tamches described his research on runtime kernel instrumentation. His vision for future operating systems is that they should provide a dynamic, unified infrastructure that allows fine-grained runtime code instrumentation for purposes of performance measurement, tracing, testing and debugging, optimization, and extensibility. Such an infrastructure would provide measurement primitives, such as simple counters, cycle timers, and cache-miss counters, that could be used to instrument the kernel as it runs. Similarly, it would allow functions on certain code paths to be analyzed using predicates. Using these measurements and runtime code-insertion technology, the kernel would be able to be optimized dynamically.

KernInst is a fine-grained, dynamic kernel-instrumentation tool that allows users to insert runtime-generated code into an unmodified Solaris kernel. Using code splicing, the machine-code instruction at an instrumentation point is overwritten with a jump to a code patch that consists of the runtime-generated code, the overwritten instruction, and finally a jump back to the following instruction.

Code splicing has some inherent problems. For example, the jump to a patch cannot safely be written as a two-instruction sequence: a context switch at the wrong time could lead to the execution of the original first instruction followed by the new second instruction. The splice must therefore be a single instruction, but the address range reachable within a single branch instruction on a SPARC is limited to ±8 megabytes. This problem is solved by introducing an intermediate branch location, called a springboard. When the patch code is far away, the instrumented instruction is replaced with a branch to a springboard, which contains a longer jump sequence with as many instructions as needed to reach the code patch. In general, any scratch space located close to the splice point in the kernel is suitable for a springboard.
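The splicing mechanism itself can be shown with a toy simulation. This is purely illustrative: real KernInst rewrites SPARC machine code in a live kernel, whereas the sketch below patches a list of symbolic instructions and runs them through a tiny interpreter, showing that the patched code behaves like the original plus the instrumentation.

```python
# Toy model of code splicing: overwrite the instruction at the
# instrumentation point with a single jump to a patch containing the
# instrumentation code, the displaced instruction, and a jump back.
# Instructions are (op, arg) tuples; all names are hypothetical.

def splice(code, point, instrumentation):
    """Return a patched copy of code; code[point] becomes a jump."""
    patch_addr = len(code)                 # patch area appended at the end
    displaced = code[point]
    patched = list(code)
    patched[point] = ("jmp", patch_addr)   # the single overwritten insn
    patched += instrumentation             # runtime-generated code
    patched.append(displaced)              # the displaced instruction
    patched.append(("jmp", point + 1))     # jump back to the successor
    return patched

def run(code):
    """Tiny interpreter: 'jmp' transfers control, 'inc' counts an event,
    'ret' stops. Returns the event counters."""
    counters, pc = {}, 0
    while pc < len(code):
        op, arg = code[pc]
        if op == "jmp":
            pc = arg
        elif op == "ret":
            break
        else:  # "inc"
            counters[arg] = counters.get(arg, 0) + 1
            pc += 1
    return counters
```

For example, splicing an `("inc", "calls")` counter onto a function's entry instruction yields exactly the call-counting primitive described in the talk, without modifying any other instruction of the original code.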

Tamches next described a simple kernel-measurement tool that he built on top of kerninstd as a proof of concept. This simple tool counts the number of calls to any kernel function as well as the number of kernel threads executing within a kernel function. The tool was used to analyze the performance of the Squid v1.1.22 proxy server. Since it seemed that the disk, which serves as Squid's second-level object cache, was not performing well, the performance of the I/O routines in the kernel was analyzed. Analysis revealed that the open function, which was called 20-25 times per second, took 40% of the time and was the real bottleneck. Within open, the name-lookup and file-create functions were the two sub-bottlenecks. Squid creates one file per cached HTTP object in a fixed hierarchy of cache files. It also reuses stale files to eliminate file-deletion overhead. However, before overwriting the files, Squid was truncating them first, which caused UFS to synchronously change the metadata. Two simple modifications eliminated these bottlenecks: the size of the directory-name-lookup cache in the Solaris kernel was increased, and the Squid code was modified to truncate a file only when needed.

Steve Pate from Veritas Software asked what the runtime overhead of the instrumentation extensions was. Tamches replied that it was the (additional) overhead of adding two extra branches and one cache miss to your code. Bruce Lindsay from IBM Research asked how an optimized version of a routine would work once installed. Tamches replied that an optimized routine could be downloaded as just another code patch. The initial routine would then be instrumented at its entry point to check for the proper conditions, and to jump to the optimized version if the conditions are satisfied.

Marvin Theimer from Microsoft Research asked how complicated it was to write the patch code. Tamches observed that it could be difficult for complicated patches. So far they haven't done anything more complicated than the counters and timers presented in this work, although they do have some experience with doing other types of patches using Paradyn, a tool that performs the same types of operations on user-level applications.

Marianne Lent from Veritas Software asked whether it was possible to unload instrumentation points after they were spliced into the kernel. Tamches replied that this could be done by restoring the instructions that were overwritten in installing the instrumentation points.

Session: Real-Time

Summary by David Sullivan

ETI Resource Distributor: Guaranteed Resource Allocation and Scheduling in Multimedia Systems

Miche Baker-Harvey, Equator Technologies, Inc.

Miche Baker-Harvey described the ETI Resource Distributor (ETI RD), a scheduler designed for use on multimedia processors, which emulate fixed-function hardware such as MPEG video encoders and decoders, audio devices, and modems. Since the system must maintain the illusion that real hardware is present, the scheduler must support what Baker-Harvey termed "firm" deadlines, which are harder than conventional soft realtime guarantees.

Baker-Harvey characterized the types of applications that the ETI RD was designed to support. They: (1) are primarily periodic; (2) can shed load if the system becomes overloaded; and (3) have mainly discrete resource requirements that are known long in advance (e.g., processing an MPEG frame to a given resolution requires a known amount of CPU time). She argued that existing approaches for soft realtime scheduling are inadequate for this type of environment. Reservation-based schedulers provide firm guarantees but do not allow for graceful load shedding; constraints and best-effort schedulers handle overload but do not provide strong guarantees.

She next outlined the three components of the ETI RD. First, a Resource Manager performs admission control and grant control, deciding whether a task can be given scheduling guarantees and what percentage of each resource will be given to each task. An application requests admission by giving the Resource Manager a resource list consisting of the resource requirements for each level of quality of service (QoS) that it can provide. Each entry includes a function that can be called to provide that level of QoS. The Resource Manager admits a task if the sum of its minimal requirements and the minimal requirements of all currently admitted tasks can be simultaneously accommodated.

The Resource Manager also computes a grant set that gives applications the largest possible resource share that the system can accommodate. This grant set is passed to the second component of the system, the Scheduler, which uses an EDF algorithm. In overload, the Resource Manager calls the third component of the system, the Policy Box, which passes back a policy that determines which applications should shed load and by how much. Together, these three components guarantee that an admitted task will be scheduled every period until it exits, and that its allocations will be from among the ones defined by the task.
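The admission test and grant computation described above can be sketched as follows. This is a simplified, hypothetical model, not the ETI RD implementation: each task's resource list is reduced to a list of CPU fractions (one per QoS level), admission checks that all minimal requirements fit, and grants are upgraded greedily in task order (the real system's Policy Box chooses which tasks get upgraded or shed).

```python
# Illustrative sketch of reservation-based admission and grant control.
# A "resource list" here is just a list of CPU fractions, one per QoS
# level; all names and the greedy upgrade policy are hypothetical.

def admit(tasks, new_task, capacity=1.0):
    """Admit a task iff everyone's *minimal* requirements still fit."""
    minimal = sum(min(levels) for levels in tasks) + min(new_task)
    return minimal <= capacity

def grant_set(tasks, capacity=1.0):
    """Start every task at its minimal level, then upgrade each task to
    the highest QoS level that still fits within capacity."""
    grants = [min(levels) for levels in tasks]
    used = sum(grants)
    for i, levels in enumerate(tasks):
        for req in sorted(levels):     # try progressively higher QoS
            if used - grants[i] + req <= capacity:
                used += req - grants[i]
                grants[i] = req
    return grants
```

Because admission is judged against minimal requirements only, an admitted task can always be scheduled at *some* level it defined, which is the guarantee the summary describes; overload then degrades grants rather than breaking guarantees.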

Baker-Harvey went on to explain some of the finer details of the system. Finally, she addressed performance. Context switches take a reasonable amount of time and are only taken when necessary. Admissions and grant control are done in the context of the task that needs the computations to occur, so that other tasks' guarantees are not impacted. She also described an experiment in which a set of threads with various possible levels of QoS gradually request admission to the system. As each additional thread is added, the system becomes more and more overloaded, and the grant set is adjusted to give each thread a smaller grant.

A Feedback-driven Proportion Allocator for Real-Rate Scheduling

David C. Steere, Ashvin Goel, Joshua Gruenberg, Dylan McNamee, Calton Pu, and Jonathan Walpole, Oregon Graduate Institute

Ashvin Goel presented a scheduler designed for "real-rate" applications like software modems, Web servers, and speech recognition tasks, whose throughput requirements are driven by real-world demands. Current priority-based schedulers are inflexible and ill suited to fine-grained allocation, whereas reservation-based schedulers require the correct specification of the proportions needed by each application and fail to provide adequate dynamic responsiveness. The system that Goel discussed addresses these problems by using a feedback-based scheme to dynamically estimate the proportion and period needed by a given job, based on observations of its progress.

Goel described how their system is able to estimate application progress through what he termed "symbiotic interfaces," which link application semantics to system metrics. For example, a queue shared by a producer and a consumer could use a symbiotic interface that exposes the queue's size and fill level and the role of each thread. The kernel can then monitor the queue's fill level and adjust the allocations given to the producer and the consumer as needed.

Goel then explained the role of the feedback controller, which first computes a pressure for each real-rate thread based on its progress metrics. The pressure is fed to a proportional-integral-derivative (PID) controller, which determines the allocation given to the thread. Realtime threads with known reservations can specify allocation and period directly, and miscellaneous jobs are treated as if they had a constant positive pressure. When the allocations determined by the controller lead to overload, it "squishes" the allocations of real-rate and miscellaneous jobs using a weighted fair share approach where the weighting factor is an importance associated with each thread. Realtime jobs with specified reservations and real-rate jobs that are driven externally are given a "quality exception" so that their resource reservations can be renegotiated.
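The pressure-to-allocation loop can be sketched with a textbook PID controller. This is a minimal illustration under assumed definitions, not the paper's controller: the pressure here is simply the queue fill fraction minus a 50% setpoint, and the gains are arbitrary placeholder values.

```python
# Minimal sketch of the feedback loop: a symbiotic-interface metric
# (queue fill level) yields a "pressure", and a PID controller turns
# pressure into an allocation adjustment. Gains, the setpoint, and the
# pressure definition are all hypothetical.

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev = 0.0

    def update(self, pressure):
        """Return an allocation adjustment for one controller period."""
        self.integral += pressure
        derivative = pressure - self.prev
        self.prev = pressure
        return (self.kp * pressure + self.ki * self.integral
                + self.kd * derivative)

def pressure_from_queue(fill, size, setpoint=0.5):
    """A queue filling above the setpoint means the consumer is starved;
    positive pressure means 'give this thread more CPU'."""
    return fill / size - setpoint
```

Each controller period, the consumer's CPU proportion would be nudged by `pid.update(pressure_from_queue(...))` and clamped to [0, 1]; the integral term is what lets the loop converge on the producer's actual rate rather than oscillating around it.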

In the performance section of his talk, Goel discussed two experiments that tested the controller's responsiveness by having a producer (with a fixed reservation) oscillate between two rates of production, one double the other. The controller succeeded in adjusting the consumer's allocation so that its rate of progress closely matched that of the producer. Goel also mentioned tests that show that the controller overhead is linear in the number of threads, but with a small slope. He concluded with a brief discussion of related work and future directions.

Jose Brustoloni of Lucent/Bell Labs asked about the sensitivity of the system to the choice of parameters. Goel replied that a dispatch interval of 1 ms and a controller period of 10 ms seem to work well, and that more work needs to be done on setting the PID parameters and on determining if one unique set of parameters works for all applications.

Gopalakrishnan of AT&T Labs raised the possibility of application-specified progress metrics being used in denial-of-service attacks by tasks that fraudulently claim to have made no progress at all. Goel agreed, but pointed out that quality exceptions provide a mechanism for applications to figure out that such attacks are occurring, and policies can manipulate job importance to insulate a thread from an attack.

Carl Waldspurger of Compaq Systems Research Center asked how the user-specified notion of importance interacts with the progress metric in determining allocations. Goel responded that their current system provides a set of dials that users can adjust to dynamically control the importance of various tasks, and that other approaches (such as economic models) could also be used.

A Comparison of Windows Driver Model Latency Performance on Windows NT and Windows 98

Erik Cota-Robles and James P. Held, Intel Architecture Labs

Erik Cota-Robles began with the motivation for this work: Realtime and multimedia applications are becoming increasingly common on personal computers, and general-purpose operating systems are thus being asked to provide temporal correctness, which depends on the time-complexity of the calculation, hardware and OS latency, and concurrent resource utilization. Throughput metrics are insufficient for such applications. He defined the notion of an application's latency-tolerance, which depends on the amount of buffering that it does (since before a deadline is missed, all buffered data must be consumed) and is orthogonal to how processor-intensive the application is. He also pointed out that on Windows, in addition to interrupt and thread latency, there is an additional latency from deferred procedure calls (DPCs), which are used by interrupt service routines (ISRs) to defer compute-intensive processing.

Cota-Robles went on to describe the design goals used in developing their microbenchmarks to measure latency. They wanted to achieve near-zero measurement overhead to avoid embedding the benchmarks in loops, and they wanted to cover a variety of behaviors with a few tests. The Pentium timestamp register allowed them to achieve a single-instruction measurement cost. They used the programmable interval timer as the source of hardware interrupts, and they measured the latency to the software ISR, to the DPC, and to the kernel-mode thread. On NT, the first of these measurements cannot be made, since the timer ISR cannot be modified. While their methodology was developed for Windows, Cota-Robles mentioned that it could also be applied to UNIX.

Next, Cota-Robles covered the measurement methodology used in this work. Since latency measurements are uninteresting on a quiescent system, they used a spectrum of application stress loads. For repeatability, they used the Winstone97 benchmarks, a number of 3-D games, and a set of Web-browsing benchmarks that included viewing files of various sizes as well as audio/video playback using RealPlayer.

The resulting distributions had very large tails. As a result, Cota-Robles and Held focused on the median, and they also characterized the thickness of the tail in terms of hourly, daily, and weekly worst-case values.

Cota-Robles pointed out that for Windows NT there is almost no distinction between DPC latencies and thread latencies for threads at high realtime priority. On Windows 98, on the other hand, there is an order of magnitude reduction in the worst-case latencies that a driver obtains by using DPCs as opposed to realtime high-priority kernel-mode threads.

Cota-Robles presented some additional data on Windows 98 thread latency. Finally, he briefly looked at an example involving a soft modem to illustrate how the latency numbers gathered by their tools can be used to reason about quality of service even before an application is available.

Victor Yodaiken of New Mexico Tech commented that it is not always the case that OS latency swamps the hardware effects; on realtime Linux, the effect of going to the timer over the ISA bridge is the dominant effect. He also asked why external timing wasn't used in this work. Cota-Robles said that the latency tolerance of the applications they were interested in was much greater than the timer tick rate; they didn't care about anything under a millisecond.

Session: Distributed Systems

Summary by Xiaolan Zhang

Practical Byzantine-Fault Tolerance

Miguel Castro and Barbara Liskov, M.I.T.

Byzantine-fault-tolerant systems are hacker-tolerant—they can continue to provide correct service even when some of their components are controlled by an attacker. Hacker-tolerance is important, because industry and the government increasingly rely on online information systems, and current systems are extremely vulnerable to malicious attacks.

Research on Byzantine-fault tolerance is not new, but most of it has demonstrated only theoretical feasibility and cannot be used in practice. The few techniques that have been developed for practical application make unrealistic assumptions such as synchrony and are too slow to be useful.

Miguel Castro presented a Byzantine-fault-tolerant replication algorithm that doesn't rely on synchrony assumptions, performs faster than previous implementations, and is resistant to denial-of-service attacks. He also discussed a replication library based on this algorithm that he and Barbara Liskov used to implement BFS, a Byzantine-fault-tolerant NFS service. Using the Andrew benchmark, they showed that BFS is only 3% slower than the standard NFS implementation on Digital UNIX.

Castro pointed out that the algorithm can be used to implement any deterministic replicated service. It provides two main properties. First, it ensures that the replicated system behaves like a correct, centralized implementation that executes operations atomically one at a time—a strong safety property called linearizability. Second, it provides liveness, which ensures that the service remains available despite faults. The algorithm relies on several assumptions, including the existence of at least 3f+1 replicas to tolerate f Byzantine faults, and the presence of strong cryptography.

The algorithm is a form of state-machine replication. The authors use a primary-backup mechanism to maintain a total order of the requests to the service. Replicas move through a succession of configurations called views. In each view, one replica is the primary and the others are backups. The primary assigns a sequence number to every request, and the backups check on the primary and ensure that it is behaving correctly. When the primary misbehaves, the backups trigger a view change to select a different primary.
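The replica arithmetic behind these guarantees is worth making concrete. The sketch below is a set of small helpers, not the replication library itself; it encodes the bounds stated above (n = 3f + 1 replicas to tolerate f faults, quorums of 2f + 1), plus PBFT's rotating choice of primary as the view number modulo the number of replicas.

```python
# Helpers for the Byzantine-fault-tolerance arithmetic described above.
# Function names are illustrative; only the formulas come from the talk.

def replicas_needed(f):
    """Minimum replicas to tolerate f simultaneous Byzantine faults."""
    return 3 * f + 1

def quorum(f):
    """Matching replies/acknowledgments required. With n = 3f + 1, any
    two quorums of 2f + 1 intersect in at least f + 1 replicas, hence in
    at least one correct replica."""
    return 2 * f + 1

def primary(view, n):
    """PBFT rotates the primary deterministically with the view number,
    so a view change always selects a new, predictable primary."""
    return view % n
```

The quorum intersection property is what makes the read-only optimization mentioned below safe: 2f + 1 matching replies must include at least f + 1 correct replicas, so the result cannot be fabricated by the f faulty ones.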

Castro also discussed some optimizations that they implemented. To reduce the cost of large replies, only one client-designated replica sends the full result; the others just send a digest of the result. In addition, replicas can tentatively execute a request as soon as the request prepares, which improves latency. For read-only requests, clients multicast requests to all replicas and wait for 2f + 1 replies with the same result, retransmitting as necessary. Finally, the performance of message authentication was improved using message-authentication codes for all messages except view-change and new-view messages, which still use slower digital signatures.

The authors are working on a number of extensions, including how to recover faulty replicas and support for fault-tolerant privacy.

Marvin Theimer from Microsoft Research said that they were comparing something that writes synchronously to disk with something that doesn't. He asked what the overhead would be for a replicated service that didn't need the disk. Castro said that the paper also presents a comparison with an unreplicated system that did not write to disk, and that the overhead was 26%. Theimer then asked if turning the power off on all replicas risked corrupting their filesystem. Castro concurred, and Barbara Liskov suggested that using a UPS could prevent this. Peter Chen from the University of Michigan asked if Castro could give an example of the class of faults their system is targeting. Castro said any class of attacks, as well as nondeterministic software errors. Burton Rosenberg from Citrix asked if, when the public key is replaced by the MAC secret key, the secret keys have to remain secret even though the faulty nodes know them all and can broadcast them. Castro replied that there is a secret session key between every active client/replica pair.

The Coign Automatic Distributed Partitioning System

Galen C. Hunt, Microsoft Research; Michael L. Scott, University of Rochester

Hunt classified distributed systems by what he called the 5PM of distributed systems: people, protection, peripherals, persistence, processes, and memory. If an application needs any two of these and they're located on different computers, then it needs distributed software. A fundamental problem of distributed computing is to decompose the distributed software into pieces and to decide where to place them. Smart programmers can do this partitioning statically, but static partitioning is expensive: the optimal partition can be user- or data-dependent, and it changes with the network topology.

Coign is an automatic distributed partitioning system for applications built out of COM components. The authors use scenario-based profiling to quantify the communication between the application's components, and they analyze that information to partition and distribute the application with the goal of minimizing distributed communication. All this is achieved without access to source code.

Hunt explained that the process takes four steps: (1) take the binary of the application and find the objects inside of it; (2) use scenario-based profiling to identify the interfaces between those objects; (3) quantify the communication across the interfaces; and (4) build a graph and cut the graph to produce an optimal distribution.
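Step 4 boils down to a graph-cut problem: components are nodes, profiled communication volumes are edge weights, and the placement that minimizes the weight crossing machines wins. A toy sketch under assumed names (Coign itself uses a Min-Cut/Max-Flow algorithm; exhaustive search is only workable on tiny graphs like this one):

```python
from itertools import product

def best_partition(objs, comm, pinned):
    """Try every 2-machine placement and keep the cheapest cut.

    objs:   component names
    comm:   {(a, b): bytes exchanged over the interface} (undirected)
    pinned: {obj: machine} for components fixed in place, e.g. the GUI
            on the client (machine 0) and storage on the server (machine 1)
    """
    free = [o for o in objs if o not in pinned]
    best_cut, best_place = None, None
    for bits in product((0, 1), repeat=len(free)):
        place = dict(pinned, **dict(zip(free, bits)))
        # Cut weight: traffic between components on different machines.
        cut = sum(w for (a, b), w in comm.items() if place[a] != place[b])
        if best_cut is None or cut < best_cut:
            best_cut, best_place = cut, place
    return best_cut, best_place

objs = ["ui", "cache", "db"]
comm = {("ui", "cache"): 100, ("cache", "db"): 5}
cut, place = best_partition(objs, comm, pinned={"ui": 0, "db": 1})
print(cut, place["cache"])  # -> 5 0  (the cache stays with the chatty UI)
```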

Hunt then showed a live demo of Coign using PhotoDraw (an image composition application that will ship with Office 2000). First, he instrumented the binary to intercept every COM call. Next, he ran the application on a training dataset. There was a 45% overhead for the instrumented version of the software. In the worst case, it could be as high as 85%. Next he took the profiling information, combined it with the network statistics, and created a graph for the program. The graph was a giant circle, with each dot on the circumference representing a COM object. Lines connecting the dots were COM interfaces. The idea was to put objects that communicate heavily on the same machine. The graph algorithm computed a distribution model. Finally, a distributed version of PhotoDraw was created. Hunt pointed out that when the distributed version runs on two machines, it's 25% faster than the original version. Since the source code of PhotoDraw is 1.8 million lines, it would be hard to analyze manually.

Hunt also examined another application, MSDN Corporate Benefit, and was able to reduce its communication time by 35%. Ironically, this application was originally designed as a model of good distributed-programming techniques.

One key issue in Coign is identifying similar transient objects across multiple executions. When the application creates an object, it needs to figure out which object it corresponds to from the profiled scenarios, so that it can decide where it should be located. This has to be done dynamically. The key insight is that similar objects across executions have similar instantiation histories. The algorithm walks the stack and creates an instantiation signature consisting of callers and objects. The goal of the classifier is to identify as many unique objects as possible, since the more unique objects it can identify, the more distribution choices it has.
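That stack-walking classifier can be approximated in a few lines: at creation time, record the chain of callers and use the chain as the object's signature. This is a loose analogue using Python's traceback module, not Coign's binary instrumentation; all names here are invented:

```python
import traceback

def instantiation_signature(depth=8):
    """Signature of the current call path (innermost `depth` frames)."""
    stack = traceback.extract_stack()[:-1]      # drop this helper's own frame
    return tuple(frame.name for frame in stack[-depth:])

class Component:
    def __init__(self):
        # Classify the object by who created it, as Coign's classifier does.
        self.signature = instantiation_signature()

def make_toolbar():
    return Component()

def make_canvas():
    return Component()

# Objects created along the same call path get the same signature across
# runs; objects created along different paths can be told apart.
a, b = make_toolbar(), make_toolbar()
c = make_canvas()
print(a.signature == b.signature, a.signature == c.signature)
```

The more distinct signatures the classifier can resolve, the more placement choices the partitioner has.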

Hunt pointed out that the idea of automatic distributed partitioning is not new. The contribution of Coign is its application of automatic partitioning techniques to dynamic, user-driven applications. Several open questions remain, including how to handle distributed error handling, how to merge vertical load balancing across an application with horizontal load balancing, and how to achieve live repartitioning.

Christopher Small from Bell Labs asked if they had thought about implementing different metrics for deciding how to do partitioning, since, for example, minimizing network communication isn't the same as minimizing overall latency, because machines run at different speeds. Hunt replied that it would be simple to add processing speed to the Min-Cut/Max-Flow graph-cut algorithm that he used, but that including memory consumption (and anything that cannot be converted directly to time) is an open research question.


Session: Virtual Memory

Summary by Rasit Eskicioglu

Tapeworm: High-Level Abstractions of Shared Accesses

Peter Keleher, University of Maryland

Distributed shared-memory (DSM) protocols support the abstraction of shared memory to parallel applications running on networks of workstations. The DSM abstraction facilitates programming, allowing users to avoid worrying about data movement, and it allows applications to become portable across a broad range of platforms. Unfortunately, the DSM abstraction does not allow applications to improve performance by directing data movement. The proposed extensions to overcome this problem are usually protocol-specific and not portable across multiple platforms.

To address this problem, Peter Keleher developed the "tape" mechanism. Tapes are objects that encapsulate accesses to shared data. They allow applications to hide or eliminate data-access latency and to get all the benefits of data aggregation by moving data in advance of demand. The tapes can be used in various ways once they are created ("recorded"). For example, the tapes can be added to messages and applied ("played back") when they are received. Tapeworm is a library of such tape operations.

Keleher explained that a tape consists of a set of events, each of which is described by an ID number identifying an interval of a process's execution and the set of page IDs that were accessed during that interval. Several operations can be applied to a tape, including adding and subtracting pages and "flattening" it into an extent of pages mentioned by the tape.
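The events-and-pages model can be sketched as a small object; this is an illustrative model only, not Tapeworm's actual interface:

```python
class Tape:
    """A recorded set of shared accesses: {interval_id: set(page_ids)}."""

    def __init__(self, events=None):
        self.events = {i: set(p) for i, p in (events or {}).items()}

    def add_pages(self, interval_id, pages):
        self.events.setdefault(interval_id, set()).update(pages)

    def subtract_pages(self, pages):
        for accessed in self.events.values():
            accessed -= set(pages)

    def flatten(self):
        """Collapse the tape into the set of all pages it mentions."""
        out = set()
        for accessed in self.events.values():
            out |= accessed
        return out

# Record two intervals, edit the tape, then flatten it for playback.
t = Tape({1: {10, 11}, 2: {11, 12}})
t.add_pages(3, {13})
t.subtract_pages({11})
print(sorted(t.flatten()))  # -> [10, 12, 13]
```

A flattened tape attached to a message tells the receiver which pages to fetch ahead of demand.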

Keleher next described how the tape mechanism interacts with the underlying DSM system. Hooks into the underlying consistency protocol allow tapes to be generated by capturing shared accesses, while hooks into the message subsystem are used for such things as adding tapes to messages or extracting them at the other end. These hooks are used to construct the Tapeworm library, which provides three types of synchronization mechanisms: (1) update locks, (2) record-replay barriers, and (3) producer-consumer regions.

Keleher indicated that the emphasis of Tapeworm was on being able to create high-level descriptions, rather than on actual performance gains. Nevertheless, Keleher reported performance improvements enabled by the tape abstraction. For the six applications in the test suite that he used, speedup improvements averaged close to 30%. On average, the number of messages sent was reduced by 50%. Furthermore, remote misses improved by an average of 80%. Keleher concluded that the tape mechanism is cheap, effective, and easy to use, especially as part of a run-time library. Future work includes encapsulating consistency semantics into objects.

Margo Seltzer of Harvard University mentioned that this work reminded her of the Eraser system presented at the last SOSP, and she asked if the tape abstraction could be used as a debugging aid for multithreaded programs. Keleher agreed with the suggestion and indicated that one of his students used the underlying data-movement mechanism to detect data races in distributed applications.

MultiView and Millipage: Fine-Grain Sharing in Page-Based DSMs

Ayal Itzkovitz and Assaf Schuster, Technion-Israel Institute of Technology

Ayal Itzkovitz briefly talked about the first software DSM system, IVY, and identified two major performance limitations: page size (false sharing) and protocol overhead (high number of messages). Over the years, there have been many attempts to overcome the false-sharing problem. One common approach is to relax the consistency requirements. This approach lessens the problem but does not totally eliminate it. It changes the semantics of memory behavior, introduces memory-access and processing overhead, and is still a coarse-grain solution. A newer approach is to provide compiler support and do code instrumentation. This technique solves false sharing, but it isn't portable, it has substantial run-time overhead, and, in some cases, it requires additional hardware support.

Close examination of these problems identified two improvement possibilities: (1) reducing the sharing unit (i.e., page size), and (2) using ultra-fast networking and thus minimizing protocol overheads. Itzkovitz described a new technique called MultiView for implementing small-sized pages (minipages). MultiView is highly efficient and does not require a special compiler. In addition, it offers application-tailored granularity and resolves false sharing with practically zero overhead. It is built as a "thin protocol layer" on top of the operating system. The basic idea is to map each shared variable to a different, nonoverlapping virtual-memory region, called a view. In reality, these variables may reside on the same physical page, yet they need not be shared as a unit. Furthermore, each variable may have different access privileges. MultiView lets the application manage each of these views independently as needed, with separate access policies.
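The one-variable-per-view idea can be approximated with two shared mappings of the same backing page, each with its own access mode. This is a rough analogue through file mappings; MultiView manipulates virtual-memory protections directly, beneath the application:

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE
backing = tempfile.TemporaryFile()
backing.truncate(PAGE)          # one zero-filled "physical" page

# Two independent views of the same underlying page, one writable,
# one read-only: distinct virtual regions, distinct privileges.
writable = mmap.mmap(backing.fileno(), PAGE, access=mmap.ACCESS_WRITE)
readonly = mmap.mmap(backing.fileno(), PAGE, access=mmap.ACCESS_READ)

writable[0:5] = b"hello"
print(readonly[0:5])            # same page: the store is visible here

try:
    readonly[0:1] = b"x"        # the read-only view rejects stores
except TypeError:
    print("write to read-only view rejected")
```

A DSM built this way can fault on exactly the variable touched, instead of on a whole 4KB page shared by unrelated data.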

Itzkovitz next described Millipage, a software DSM system based on the MultiView technique. Millipage is implemented as a user-level shared library on top of Windows NT 4.0. The programming model of Millipage is sequential consistency, which also allows application-tailored fine access control. It employs a static-home approach for each minipage. Fast Messages from the University of Illinois is the underlying communication mechanism on a Myrinet LAN. The malloc() library call is wrapped to allocate a new minipage each time it is called.

Basic costs in Millipage are quite low. For example, fetching a remote page takes less than 300 microseconds, including all software processing. A suite of five applications was used to test the performance of the Millipage DSM system on a testbed cluster of 8 PCs with Pentium II processors. All applications showed encouraging speedups, although there was a tradeoff between false sharing and aggregation for some applications.

Itzkovitz concluded that the MultiView approach performs as well as any other relaxed model. Future work will investigate using the compiler to do the minipage allocation and mapping, improving access locality with the help of the operating system, revisiting relaxed consistency models, and applying the MultiView approach to other services, such as global memory systems, garbage collection, and data-race detection.

After the talk, Michael Scott from the University of Rochester admitted that he was wrong when he claimed about ten years ago that false sharing was the problem for software DSM. He added that nobody wants to use software DSM for applications with 8MB datasets, but, rather, for applications with gigabyte datasets in which fine-grain sharing is impractical. If there is no significant false sharing, then aggregation becomes a big issue for performance. He suggested that the minipage approach is very promising for addressing pathological cases where only a small portion of the data requires fine-grain access.

Optimizing the Idle Task and Other MMU Tricks

Cort Dougan, Paul Mackerras, and Victor Yodaiken, New Mexico Institute of Technology

Dougan indicated that this project grew out of an effort to optimize the PowerPC port of the Linux operating system. The major constraint was to get good performance without breaking Linux compatibility. This implied that the efforts should be concentrated on the architecture-specific components of Linux. Memory management was the obvious starting point.

Dougan and his colleagues discovered that the OS was occupying one-third of the TLB entries on average. The PowerPC offers an alternative translation mechanism, called block address translation (BAT), from logical to physical addresses that bypasses the TLB. The idea of using superpages to reduce TLB contention was not practical for a straightforward implementation using BAT, because there were only 8 BAT registers. Since user processes are generally ephemeral, only the kernel memory was mapped using the BAT mechanism. This approach reduced the percentage of TLB slots used by the kernel to nearly zero. Also, a 10% decrease in TLB misses and a 20% decrease in hashtable misses were observed during the benchmarks. The real measure of this improvement was a 20% decrease in complete kernel compilation time.

Since the PowerPC uses a hashed page table (HTAB), the next optimization attempt was to improve the efficiency of this HTAB. A three-level table is used to back the HTAB, and it is searched on a HTAB miss. The next idea was to reduce hashtable hot spots by scattering the entries in the hashtable. A different virtual segment identifier (VSID) generation technique was used to reduce collisions.
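The effect of scattering VSIDs shows up even in a toy model of the hashed page table. The hash shape below is PowerPC-like (XOR of VSID and page index), but both generators are invented for illustration; neither is the kernel's actual algorithm:

```python
GOLDEN = 0x9E3779B9   # multiplicative scatter constant (illustrative choice)
NBUCKETS = 64

def bucket(vsid, page):
    # PowerPC-style primary hash: XOR the VSID with the page index.
    return (vsid ^ page) % NBUCKETS

def naive_vsid(k):
    return k * NBUCKETS           # stride-aligned allocation: low bits identical

def scattered_vsid(k):
    return (k * GOLDEN) & 0xFFFFF # multiplicative scatter varies the low bits

def max_load(vsid_of, nprocs=256, page=5):
    """Worst-case bucket occupancy when every process maps the same page."""
    load = [0] * NBUCKETS
    for k in range(nprocs):
        load[bucket(vsid_of(k), page)] += 1
    return max(load)

# Aligned VSIDs pile every process onto one bucket; scattered VSIDs
# spread the same entries evenly.
print(max_load(naive_vsid), max_load(scattered_vsid))  # -> 256 4
```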

Dougan next indicated that their original conjecture, that TLB reload speed was not as important as reducing TLB misses, turned out to be incorrect. This suggested yet another improvement opportunity. The HTAB miss-handling code was rewritten in assembly and executed with the MMU turned off. Also, the assembly code was optimized to reduce pipeline stalls. As part of this optimization, the HTAB is completely eliminated on the PowerPC 603 by performing searches only on Linux tables. These efforts yielded major performance improvements: a 33% reduction in context switching and a 15% reduction in communication latency as measured with lmbench. This optimization strongly reduced the effect of BAT mapping on both the 603 and 604.

The best optimization was achieved by tuning the idle task. Since the OS must provide cleared pages to users, this functionality was moved to the idle task to be performed during its execution. Large performance gains were obtained when the cache was turned off during this operation. The full kernel compilation was done in 5.5 minutes. lmbench also showed 15% across-the-board latency improvements. Dougan concluded that memory management is extremely sensitive to small changes: small operations in the kernel can destroy locality. Also, intuition is not reliable, microbenchmarks are needed, and repeatable sets of microbenchmarks like lmbench are invaluable. He added that OS designers need to learn from processor designers about quantitative measures. As the second part of his conclusion, Dougan said that some version of superpages can help, and that page-fault-handling details are critical for good performance. More work on cache preloading is needed, and greater use of dynamic code generation will be pursued.

One attendee from Bell Labs asked if one can generalize by saying that it is better to have a software-based TLB and cache management than to have hardware-managed versions. Dougan replied yes, because a software approach usually gives greater flexibility.

Session: Filesystems

Summary by Zheng Wang

Logical vs. Physical File System Backup

Norman L. Hutchinson, University of British Columbia; Stephen Manley, Mike Federwisch, Guy Harris, Dave Hitz, Steven Kleinman, and Sean O'Malley, Network Appliance, Inc.

Norman Hutchinson began by giving a brief motivation for this work. Since filesystems are getting bigger and disks are getting bigger still, ensuring the reliability of data stored on filesystems is becoming more and more difficult.

Hutchinson outlined two basic backup approaches they studied and compared: logical backup and physical backup. Logical backup is a file-oriented approach, such as that employed by the UNIX dump and tar commands. The advantage of this approach is portability, because the formats used are platform-independent. In physical backup, on the other hand, the data of one physical medium is replicated on another physical medium. Although the physical strategy is nonportable, it is very fast, because the medium is accessed sequentially. Also, the output is an exact copy of the original. UNIX's dd command and Plan 9's filesystem backup strategy are examples of this approach.

Before going into the details of their implementations of these two approaches, Hutchinson described Network Appliance's Write Anywhere File Layout (WAFL) filesystem. WAFL is a log-based filesystem that uses NVRAM to reduce latencies. WAFL stores metadata in files, the most important of which is the block-map file, a generalization of the free bitmap that indicates what blocks are being used for what purposes. WAFL also uses copy-on-write techniques to provide snapshots, read-only copies of the entire filesystem. A disk block referenced by a snapshot is never overwritten.

Hutchinson described their logical backup implementation as an evolution of the BSD dump utility. After finding the files to be backed up, the program does a sequential i-node traversal, because the files are written to backup media in increasing i-node order. The cost of this restrictive ordering is mitigated in their implementation by a custom prefetch policy at the kernel level. Their logical restore first creates an in-core virtual directory tree and then uses this tree to resolve any name lookup that needs to be done, thus avoiding the creation of unnecessary directories and increasing the speed of lookups. Their physical backup implementation first writes a snapshot of the filesystem and then simply walks through the block-map file sequentially and writes all blocks that are in use (i.e., referenced by the snapshot). For speed, the physical-backup process bypasses the filesystem. With physical restore, some cleanup is necessary.
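The physical-backup pass is essentially a sequential sweep of the block map. A toy sketch, where `blockmap`, `read_block`, and the in-memory "disk" are stand-ins for WAFL's on-disk structures:

```python
def physical_backup(blockmap, read_block):
    """Stream every in-use block to the backup medium in disk order.

    blockmap:   list of booleans, True if block i is referenced by the
                snapshot being backed up
    read_block: function mapping a block number to its bytes
    """
    for blkno, in_use in enumerate(blockmap):
        if in_use:
            yield blkno, read_block(blkno)

# Three blocks in use out of eight; the sweep copies exactly those,
# in strictly increasing (i.e., sequential-access) order.
disk = {0: b"superblock", 2: b"inode", 5: b"data"}
blockmap = [i in disk for i in range(8)]
tape = list(physical_backup(blockmap, disk.__getitem__))
print([blkno for blkno, _ in tape])  # -> [0, 2, 5]
```

The strictly sequential access pattern is what lets physical backup run at near-media speed with almost no CPU cost.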

The performance of these implementations was measured using an aged filesystem, basically a physical copy of the company's live filesystem, to reflect real-life situations. The first set of measurements was collected by doing a single-tape (both logical and physical) backup and restore of a 188GB filesystem. Logical backup took 7.4 hours at roughly 7.8MB/sec, while logical restore took 8 hours at 8.8MB/sec. Physical backup and restore have only a single stage of simply reading and writing the disk blocks, which took 6.2 hours for backup and 5.9 hours for restore. Interestingly, logical backup and restore require significant amounts of CPU, whereas the physical operations have marginal CPU requirements.

In order to improve performance for extremely large filesystems, it is necessary to add more tapes to increase backup bandwidth. This is tricky for logical backup, because the format of a dump tape is fixed and cannot easily be spread across several tapes. Therefore, multiple dumps are performed in parallel. On the other hand, spreading the dumps across multiple tapes is easy for physical dump, since all the disk blocks are independent. Using a four-tape implementation, logical dump requires about 3 hours and logical restore requires 4 hours, while physical dump and restore operations each completed in a little under 2 hours.

Hutchinson concluded by pointing out that the physical strategy provides better performance, whereas the logical strategy is more flexible. Therefore, two remaining areas of research are making logical backup faster and making physical backup more flexible.

Rob Pike of Bell Labs asked if Hutchinson was familiar with the Plan 9 papers on filesystem-backup strategies, and he further asked why one couldn't restore a single file on a standalone file server. Hutchinson replied that he was familiar with the Plan 9 papers, and that the operation is tricky because the metadata kept on the backups is totally independent of the disks, and thus additional information is needed to interpret the metadata.

Masoud Sadrolashrafi of Veritas Software asked if snapshots are sufficient for restoring data in different situations. Hutchinson replied that snapshots are sufficiently self-contained for it to be possible to rebuild a filesystem from any snapshot.

The Design of a Multicast-based Distributed File System

Bjorn Gronwall, Assar Westerlund, and Stephen Pink, Swedish Institute of Computer Science and Luleå University of Technology

Bjorn Gronwall introduced JetFile, a distributed filesystem targeted at personal-computing environments and designed for ubiquitous file access over local- and wide-area networks with different physical characteristics. The system relies on a model in which clients are also the servers in the system. Taking a protocol-centric approach to distributed-filesystem design, JetFile's major goals are to hide the effects of delays induced by propagation, retransmission, and limited bandwidth, and to minimize and localize the traffic. In order to achieve these goals, JetFile uses optimistic algorithms to increase update availability of files and to hide update delays. It also uses replication and multicast to increase read availability of files. Using clients as servers also decreases update latencies and improves scalability. Other methods to reduce latencies, such as hoarding and prefetching, are planned for future work.

JetFile uses Scalable Reliable Multicast (SRM) layered on top of IP multicast. SRM's communication paradigm consists of two kinds of messages: request and repair. SRM is a receiver-oriented protocol, in which the receiver makes a multicast request when it needs data, and those who are able to respond to the request (i.e., have the data requested) send a repair message containing the data. In order to prevent an implosion of repair messages from many hosts, each host that is able to respond sets an internal timer and waits. If a repair message for the same data is received, the node cancels the timer. Otherwise, the host sends the repair message when the timer expires. The value of the timer is randomized, with a bias toward the closer hosts.

Files in JetFile are named by tuples of organization, volume, file number, and file-version number. All components of the tuple except the version number are hashed to map files to multicast channels. Therefore, a particular file always uses the same multicast channel. The basic JetFile protocol deals with data units, which can be either status objects (carrying file attributes) or data objects (carrying actual file contents). Correspondingly, SRM messages include status-request, status-repair, data-request, and data-repair, plus version-request and version-repair for retrieving file-version numbers.
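The name-to-channel mapping can be sketched as a stable hash over everything but the version number; the channel count and hash function here are illustrative choices, not JetFile's:

```python
import hashlib

NCHANNELS = 4096   # assumed size of the multicast-channel space

def multicast_channel(org, volume, file_no):
    """Map a file name (minus its version number) to a channel index.

    Because the version number is excluded, every version of a file,
    and every request or repair for it, lands on one predictable channel.
    """
    key = f"{org}/{volume}/{file_no}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % NCHANNELS

c1 = multicast_channel("sics.se", "src", 42)
c2 = multicast_channel("sics.se", "src", 42)   # any version: same channel
print(c1 == c2, 0 <= c1 < NCHANNELS)
```

Interested hosts need only join the channels for the files they cache, rather than hearing all traffic.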

To retrieve file contents in JetFile, the receiver node multicasts an initial data-request message, and the source node responds with a multicast data-repair message. Since the receiver now knows the source of the data, the remaining data objects are transferred using the same protocol but with unicast request and repair messages. File updates use write-on-close semantics. Since the client acts as a server for the new file version, a lot of write-through is avoided. As long as a file is not shared, there is no communication over the network.

New file-version numbers are generated by a versioning server. If two different updates are made to the same file, the change with the higher version number will shadow the change with the lower version number. However, no change is lost. The system detects update conflicts and signals the user to invoke an application-specific resolver that retrieves conflicting versions and merges them to create a new version. When other nodes see a request for a new version number or its corresponding repair message, they can mark the corresponding file in their caches as stale. To deal with situations in which both messages are lost, JetFile maintains a Current Table that contains all the current file-version numbers for a particular volume. It has a limited lifetime, which limits the time a host may access stale data. When the network doesn't drop many packets, file consistency can be as good as in AFS. However, in the worst case, it will only be as good as in NFS.

Gronwall presented performance-measurement numbers using the Andrew benchmark. For the hot-cache case, the performance of JetFile over a LAN is similar to that of a local UFS filesystem. For the cold-cache case, Gronwall compared the performance of JetFile over a LAN to its performance over an E-WAN with a round-trip time of 0.5 seconds. The time for the CopyAll operation increases dramatically in the E-WAN case, because CopyAll requires synchronous communication.

The first questioner asked about security, and what trust relationship can be expected from filesystem peers. Gronwall answered that no trust is assumed between the hosts. As in other systems, files can carry signatures or be encrypted. The performance measurements indicate that there is rarely redundant communication. Therefore, the amount of work for encryption and verification should be similar to or less than that in most conventional designs.

The second question addressed the issue of limiting the number of filename-to-multicast-channel hashes that routers need to store. Gronwall suggested some possible ways of dealing with this. One is to limit the range of the hash function. What is perhaps more interesting is that you can use "wakeup messages" for volumes that are not referenced in a long time. State for such files need not be stored in routers until a wakeup message brings the servers back from idle status.

David Anderson of the University of Utah asked about JetFile's scalability in terms of the number of clients, because the SRM message volume will scale up quickly. Gronwall agreed that the system needs some form of locality. You can use IP multicast scope to make a smaller request first and only go beyond the local scope if necessary.

More information on the project is available at <>.

Integrating Content-based Access Mechanisms with Hierarchical File Systems

Burra Gopal, Microsoft Corporation; and Udi Manber, University of Arizona

Manber, who now works at Yahoo!, started by emphasizing that this work was done when both authors were at the University of Arizona, and should not be seen as an indication of a mysterious collaboration between Microsoft and Yahoo!. Since both authors have moved on to other areas, Manber hopes that people will realize this is an important area and take over the work.

Manber claimed that one of the main challenges for operating systems in the future will be providing convenient access to vast amounts of information. Here the word "convenient" refers not only to speed, but also to the ability to find the right information. Filesystems today use the same paradigm as 30 years ago: a hierarchical naming system in which the user has to do the naming and remember the names. This paradigm does not scale for gigabyte or terabyte systems, because users will not remember where everything is. Users should be able to easily access files with certain attributes, such as "anything I changed on Tuesday" or "anything with foo AND bar."

Gopal and Manber started from the Semantic File System (SFS) of Gifford et al. SFS builds virtual directories in which the name of the directory corresponds to a query, and the contents of the directory are the files that match that query (represented by symbolic links).

Gopal and Manber combined SFS's features with a regular UNIX filesystem and allowed users to add semantic directories in addition to regular directories. Each semantic directory acts like a regular directory: users can add, remove, and modify files in it as usual. Query results can be modified automatically or manually. Regular file operations are preserved, while semantic-directory operations are added, such as smkdir (create a query), smv (change the query), ssync (reevaluate the query), and smount (define a semantic mount point).

When a user specifies the query associated with a semantic directory, the system puts symbolic links to all files matching the query into the directory. The symbolic links are called transient links because users can add or remove them. If a user adds a file, the link becomes permanent. If a user removes a file, the link becomes prohibited.
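The three link states behave like this small model (the class and method names are invented for illustration; the real system exposes operations such as smkdir and smv):

```python
class SemanticDirectory:
    """A directory whose contents are query results plus user edits."""

    def __init__(self, query):
        self.query = query         # predicate over a file's contents
        self.permanent = set()     # user-added names: always listed
        self.prohibited = set()    # user-removed names: never listed

    def evaluate(self, files):
        """List the directory: transient links from the query,
        minus prohibited names, plus permanent ones."""
        transient = {name for name, data in files.items()
                     if self.query(data) and name not in self.prohibited}
        return sorted(transient | self.permanent)

    def add(self, name):           # promote a link to permanent
        self.permanent.add(name)
        self.prohibited.discard(name)

    def remove(self, name):        # mark a link prohibited
        self.prohibited.add(name)
        self.permanent.discard(name)

files = {"a.txt": "foo bar", "b.txt": "foo", "c.txt": "baz"}
d = SemanticDirectory(lambda data: "foo" in data)
print(d.evaluate(files))           # -> ['a.txt', 'b.txt']
d.remove("b.txt"); d.add("c.txt")
print(d.evaluate(files))           # -> ['a.txt', 'c.txt']
```

User edits thus survive later reevaluations of the query, which is what makes the semantic directory behave like a regular one.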

Manber next addressed the issue of scope consistency. Each query has a scope, and a subdirectory is a refinement to the query. For example, for semantic subdirectory foo/bar, bar evaluates only items within foo. When the user changes foo or moves bar to somewhere else, bar has to reevaluate the query because of the different context. Such changes can cause scope-consistency problems or even cycles of dependency. The paper presents a reasonable way of handling scope consistency. By design, the system does not handle data consistency. Queries are evaluated only periodically or when instructed by the user, not all the time.

Another interesting idea is semantic mount points, which allow users to connect to different query systems (such as Yahoo!). This allows sharing not only of data but also of classifications or other ways the data are organized. The result is that users can treat the Web or other filesystems as part of their own directories.

Manber briefly talked about the implementation of the system. It was built on top of SunOS using the search engine Glimpse, and it comprises about 25,000 lines of C code. A major design decision was not to make kernel modifications, so that people could adopt the system easily. This most likely had a large impact on performance, with 30-50% time overhead and 10-15% space overhead for directory operations. In closing, Manber reiterated that the problem will not go away and will only become harder. He said this work has shown that the approach is feasible, but not that it is the right approach. In particular, a user study would be needed.

David Steere of OGI referred to his SOSP paper on improving the performance of search engines. In that work, the results of queries were made immutable. He asked whether they had other ways to get around the possible usability problems caused by query mutations. Manber answered that while their design tried to be as "natural" as possible, he didn't really know what users will find to be "natural." This is a new issue, and a lot of work needs to be done, Manber said.

Steere then noted the similarity between Gopal and Manber's filesystem and a database. He asked if Manber thought filesystems will still be around 10 years from now, or if we will just be using databases. Manber said you can view filesystems as databases, and the question is how structured those databases will be. His guess is that there will be all kinds, and he said he hopes the dominant ones will be less structured than the filesystems of today.

Works-in-Progress Session

Summaries by Xiaolan Zhang

Multi-Resource Lottery Scheduling in VINO

David Sullivan, Robert Haas, and Margo Seltzer, Harvard University

Lottery scheduling's ticket and currency abstractions can be used to manage multiple resources (CPU, memory, disk, etc.). Sullivan described extensions to the lottery-scheduling framework designed to increase its flexibility while preserving the insulation properties that currencies provide. Ticket exchanges allow applications to modify their resource allocations by trading resource-specific tickets with one another, and they do so without affecting the resource rights of nonparticipants. VINO's extensibility mechanism can be used to install resource negotiators that initiate exchanges; currency brokers that provide flexible access controls for currencies; and specialized, per-currency scheduling policies.
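A ticket exchange on top of basic lottery scheduling can be sketched as follows; this is an illustrative model only, and the names and exchange mechanics are not VINO's API:

```python
import random

def hold_lottery(tickets, rng):
    """Pick the next thread to run, with probability proportional to
    its ticket count (Waldspurger-style lottery draw)."""
    draw = rng.randrange(sum(tickets.values()))
    for thread, count in tickets.items():
        if draw < count:
            return thread
        draw -= count

# A ticket exchange trades resource-specific tickets between two tasks,
# leaving every other task's share of the resource untouched.
cpu = {"A": 60, "B": 30, "C": 10}
cpu["A"] -= 20; cpu["B"] += 20       # A trades 20 CPU tickets to B

rng = random.Random(1)
wins = {t: 0 for t in cpu}
for _ in range(10_000):
    wins[hold_lottery(cpu, rng)] += 1
print(wins["B"] > wins["A"] > wins["C"])  # shares track the new ticket counts
```

Note that C's expected share (10%) is the same before and after the exchange, which is the insulation property the extensions aim to preserve.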

Quality of Service Support in the Eclipse Operating System

John Bruno, Jose Brustoloni, Eran Gabber, Banu Ozden, and Avi Silberschatz, Bell Labs, Lucent Technologies

Brustoloni described Eclipse/BSD, a system derived from FreeBSD to provide the QoS support required by an increasing number of applications. Eclipse uses resource reservations to guarantee that a given client receives the QoS that it requests. Resource reservations enable hierarchical proportional sharing of all resources in the system. Using separate resource reservations, servers can guarantee that the requests of a given client are isolated from the influence of overloads caused by other clients. Applications specify resource reservations using a new /reserv filesystem API. Results show that Eclipse/BSD can improve the isolation between Web sites hosted on the same system.
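Hierarchical proportional sharing of the kind Eclipse/BSD provides can be modeled as a weighted reservation tree: each node's guaranteed fraction of a resource is the product of its weight fractions along the path from the root. The sketch below is an illustrative model only; the class and the site/server names are invented, and the real /reserv API is a filesystem interface, not a Python one.

```python
# Toy model of hierarchical proportional sharing, in the spirit of
# Eclipse/BSD's reservation hierarchy (names are invented).
class Reservation:
    def __init__(self, weight, children=None):
        self.weight = weight
        self.children = children or {}

    def share(self, path, parent_fraction=1.0):
        """Guaranteed fraction of the resource for the node at `path`."""
        if not path:
            return parent_fraction
        total = sum(c.weight for c in self.children.values())
        child = self.children[path[0]]
        return child.share(path[1:], parent_fraction * child.weight / total)

root = Reservation(1, {
    "siteA": Reservation(3, {"httpd": Reservation(1), "cgi": Reservation(1)}),
    "siteB": Reservation(1),
})
# siteA is guaranteed 3/4 of the resource; each of its servers gets half of that.
assert root.share(["siteA", "httpd"]) == 0.375
```

Because siteB's guarantee depends only on the weights at its own level, an overload inside siteA's subtree cannot erode siteB's share, which is the isolation property the Web-hosting results demonstrate.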

Long-Term File System Read Performance

Drew Roselli and Jeanna Neefe Matthews, University of California, Berkeley; Tom Anderson, University of Washington

Neefe Matthews described studies based on traces of long-term file behavior showing that, even with large caches, read performance is still significantly affected by disk seeks. They have therefore examined the impact of different layout policies on reads, including a new history-based policy that outperforms FFS and LFS on all workloads they have studied but requires many disk reorganizations. They are working on ways to limit the number of reorganizations required and to quantify their overhead.

High-Performance Distributed Objects over a System Area Network

Alessandro Forin, Galen Hunt, Li Li, and Yi-Min Wang, Microsoft Research

Wang described optimization techniques to improve DCOM performance over a system-area network with user-level networking. In particular, he and his colleagues are interested in the performance of distributed applications on top of the Virtual Interface Architecture (VIA). They applied both runtime and transport optimizations, and they removed an extra copy at the marshaling layer, yielding significant improvements in latency and throughput. Wang summarized by noting that fast networks push the bottleneck to protocol stacks, user-level networking pushes the bottleneck to the distributed infrastructure, and their optimization techniques push the bottleneck to transactions and security.


The Pebble Component-Based Operating System

Eran Gabber, John Bruno, Jose Brustoloni, Avi Silberschatz, and Christopher Small, Bell Labs, Lucent Technologies

Gabber presented the Pebble operating system, which allows system programmers to mix and match dynamically replaceable, user-level components. It includes a minimal, privileged-mode nucleus. IPC is done via portals, which are synthesized dynamically by a portal manager. Because each portal is specific to a particular caller and callee, it can be optimized to run fast. Each component is a protection domain and contains its own portals. Pebble is intended as a platform for high-end embedded applications.
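Because a Pebble portal is synthesized for one specific caller/callee pair, the generated code can omit generality that a shared IPC path would need. The closure below is a very loose analogy (in a real system portals are generated machine code that switches protection domains; the names here are invented) showing how specializing a call path to a known caller removes per-call lookup work.

```python
# Rough analogy to portal synthesis: build a call path specialized
# to a particular caller/callee pair (all names are invented).
def make_portal(caller_domain, callee_entry):
    def portal(*args):
        # A real portal would switch protection domains here; this
        # sketch just binds the caller's identity into the call.
        return callee_entry(caller_domain, *args)
    return portal

def fs_read(caller, path):
    return f"{caller} read {path}"

portal = make_portal("app1", fs_read)  # synthesized once, reused per call
assert portal("/etc/motd") == "app1 read /etc/motd"
```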

Cellular Disco: Resource Management Using Virtual Clusters on Scalable Multiprocessors

Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, and Mendel Rosenblum, Stanford University

Govil noted that system software that fully utilizes the features of large-scale multiprocessors is still not available, since most commercial operating systems do not provide efficient management of hardware resources or fault containment on such machines. Govil and his colleagues address this problem by running multiple instances of an off-the-shelf OS on top of a virtual machine monitor. The OS instances talk to one another using a distributed-systems protocol and form a virtual cluster. A prototype implementation on a 16-processor SGI Origin 2000 shows that faults can be isolated to a single OS instance and that the performance overhead is less than 10%.

PerDiS: A Persistent Distributed Store for Cooperative Engineering

Xavier Blondel, INRIA

Blondel described the PerDiS architecture, which was developed to support data sharing by members of the construction industry involved in a common project. PerDiS provides a unique combination of features, including persistent shared memory, shared objects, security, and transparent persistence through a garbage-collection mechanism. The system is intended to be used in a large-scale WAN environment such as the Internet.

Resource Management in a Multi Computer System (MCS)

Dejan S. Milojicic and the MCS Team, Hewlett-Packard

Milojicic introduced MCS, a shared-memory machine running multiple copies of NT on multiprocessor nodes, and the mechanisms and policies needed to manage MCS resources. These mechanisms include global memory-management support and global schedulers for initiating processes on the nodes and scheduling I/O on the devices. Policies are used to make decisions based on resource usage. The design is notable for several simplifying assumptions that make resource management easier to build, e.g., a limited single system image, distributed memory management based on hardware, and intelligent I/O processors.

ISTORE: Introspective Storage for Data-Intensive Network Services

Aaron Brown and David Oppenheimer, University of California, Berkeley

Oppenheimer described ISTORE, a hardware/software architecture that enables the rapid construction of self-monitoring, adaptive single-purpose systems. A system built using ISTORE couples LEGO-like plug-and-play hardware with application-specific, programmer-specified policies for "introspection" (continuous self-monitoring and adaptation) in the face of changes in workload and unexpected system events such as hardware failure. It can thereby provide high availability, performance, and scalability while reducing the cost and complexity of administration. Adaptability is enabled by a combination of intelligent self-monitoring hardware components, a virtual database of system status and statistics, and a software toolkit that uses a domain-specific declarative language for specifying application-specific monitoring and adaptation policies.

SafeThreads: New Abstraction of Control and Protection

Masahiko Takahashi and Kenji Kono, University of Tokyo

SafeThreads, a mechanism that provides fine-grained protection domains for multiprocessor systems, allows threads to execute safely in the presence of malfunctioning external components. Takahashi described an efficient implementation of this mechanism based on "multi-protection" page tables that allow each virtual memory page to have multiple protection modes at the same time. At any moment, one of the protection modes is effective on each processor. Context switches involve a simple change of the effective protection mode without other high-latency operations such as TLB flushes. The implementation doesn't require special hardware support.
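The key idea of multi-protection page tables can be shown with a toy model (the mode names and table layout below are invented for illustration, not the actual implementation): every page carries permissions for several protection modes simultaneously, so a context switch only selects which mode is effective, with no per-page updates or TLB flush.

```python
# Toy model of "multi-protection" page tables: each page holds
# permissions for several protection modes at once; switching
# domains just changes which mode is effective.
page_table = {
    0x1000: {"mode0": "rw", "mode1": "r-"},
    0x2000: {"mode0": "--", "mode1": "rw"},
}
effective_mode = "mode0"

def switch_to(mode):
    """A 'context switch': O(1), touches no page-table entries."""
    global effective_mode
    effective_mode = mode

def perms(page):
    return page_table[page][effective_mode]

assert perms(0x1000) == "rw"
switch_to("mode1")       # cheap: no TLB-flush-like invalidation needed
assert perms(0x2000) == "rw"
```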

Fast and Predictable Automatic Memory Management for Operating Systems

Godmar Back, Jason Baker, Wilson Hsieh, Jay Lepreau, John McCorquodale, Sean McDirmid, Alastair Reid, and Joseph Zachary, University of Utah

Reid and his colleagues are writing significant parts of operating systems in modern languages such as Java. To improve the performance and predictability of such languages, the authors are developing techniques to "stack allocate" objects to avoid heap allocation and garbage collection. First, they are measuring object lifetimes through system tracing. In addition, they have developed a static analyzer that approximates the lifetimes of objects and determines at load time the activation record in which an object may be allocated.
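The core question their static analyzer answers is an escape check: an object may live in its creator's activation record only if no use lets it outlive that frame. The sketch below is a drastically simplified illustration of that rule (the use categories and function are invented; the real analysis works on Java bytecode and activation records).

```python
# Highly simplified escape check in the spirit of stack allocation:
# an object is stack-allocatable only if no use lets it escape the
# creating frame (the use-category names here are invented).
ESCAPING_USES = {"returned", "stored_in_heap", "passed_to_unknown"}

def may_stack_allocate(uses):
    """True if every recorded use of the object is non-escaping."""
    return not any(u in ESCAPING_USES for u in uses)

assert may_stack_allocate(["read_field", "passed_to_local"])
assert not may_stack_allocate(["returned"])  # escapes via return value
```

Objects that pass this check avoid both heap allocation and later garbage-collection work, which is the source of the performance and predictability gains.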

File System Fingerprinting

Drew Roselli and Jeanna Neefe Matthews, University of California, Berkeley; Tom Anderson, University of Washington

Roselli pointed out that current filesystem implementors face a dilemma in layout decisions: the filesystem has more information about the likely access patterns of files, while the storage system has more information about the performance of the storage media. She proposes an enriched filesystem/storage-system interface that allows the filesystem to provide abstract rather than absolute file positions to the storage system. The enriched interface can improve performance as follows: if the next access to a block is predicted to be a write, the storage system can retain the data in its write buffer; if it is predicted to be a read, the storage system can choose a read-optimized layout. The filesystem can also provide relative block-placement information that conveys abstract data relationships.
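One way to picture the enriched interface is as a write call that carries an access-pattern hint, leaving the placement decision to the storage system. The sketch below is purely illustrative; the class, hint strings, and method names are invented, not the proposed interface itself.

```python
# Illustrative model of a hint-carrying filesystem/storage interface
# (all names are invented for this sketch).
class StorageSystem:
    def __init__(self):
        self.write_buffer = {}      # blocks expected to be rewritten soon
        self.read_optimized = set() # blocks laid out for fast reads

    def write(self, block, data, hint=None):
        if hint == "predict_next_write":
            self.write_buffer[block] = data   # keep hot data buffered
        elif hint == "predict_next_read":
            self.read_optimized.add(block)    # choose read-friendly layout
        # unhinted writes would fall through to a default layout policy

store = StorageSystem()
store.write(7, b"log entry", hint="predict_next_write")
store.write(8, b"index page", hint="predict_next_read")
assert 7 in store.write_buffer and 8 in store.read_optimized
```

The point of the design is that the filesystem expresses *what it knows* (likely future accesses, block relationships) without dictating absolute positions it cannot place well.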

Agile: A Hierarchically Extensible Security Policy Architecture

Dave Anderson, Ray Spencer, Mike Hibler, and Jay Lepreau, University of Utah

Anderson observed that contemporary diverse operating environments call for hierarchically extensible and flexible security policies. Agile is a security-policy architecture that provides policy-independent, hierarchical extensibility. Agile borrows techniques from network-packet routing; in particular, it uses integer-namespace routing: children are assigned SIDs, and parents route decisions to their children using a routing algorithm such as the Patricia algorithm.
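Routing a decision to a child policy by SID amounts to longest-prefix matching over an integer namespace, which Patricia-style tries make efficient. The sketch below illustrates the matching rule only (the route table, prefixes, and policy names are invented; a real implementation would use a trie rather than a linear scan).

```python
# Sketch of SID-based routing of security decisions: pick the child
# policy whose bit-prefix matches the SID most specifically.
# The routes and policy names below are invented for illustration.
ROUTES = {  # (prefix_value, prefix_bits) -> child policy
    (0b10, 2): "child_filesystem",
    (0b1011, 4): "child_network",
}

def route(sid, width=8):
    """Longest-prefix match of `sid` (a width-bit integer) against ROUTES."""
    best = None
    for (prefix, bits), child in ROUTES.items():
        if sid >> (width - bits) == prefix and (best is None or bits > best[0]):
            best = (bits, child)
    return best[1] if best else "default"

assert route(0b10110001) == "child_network"    # 4-bit prefix 1011 wins
assert route(0b10000001) == "child_filesystem" # only the 2-bit prefix matches
assert route(0b01000001) == "default"          # no prefix matches
```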

Photos from Conference

Margo Seltzer & Tynan

Jim Gettys

Fay Chang (Best Student Paper Award)

Vivek Pai (Best Paper Award)

Gaurav Banga (Best Student Paper Award)

Steven M. Hand

Paul Leach

L. to R.: Ken Arnold, Rob Pike, Wilson Hsieh, Patrick Tullman, Thorsten von Eicken


Last changed: 7 March 2005