Invited Talk (scribe: David Oppenheimer)
Operating System Reliability: Does Anyone Care Any More?
Andrew Hume, AT&T Labs–Research
Andrew Hume of AT&T Research gave the HotOS keynote talk, entitled
"Operating System Reliability: Does Anyone Care Any More?" He
explained that his perspective comes from having designed,
implemented, and delivered large-data applications for more than ten
years. The problems he discussed in the talk were that operating
systems have gone "from a help to a hindrance," that even users'
lowered expectations for operating systems have not been met, and that,
as a result, applications have to be designed around OS quirks. Hume
pointed out that this situation hasn't always been the case, citing
WORM-based backup systems in research versions of UNIX and a
cluster-based billing system that AT&T built using Plan 9 as
examples of systems that were highly reliable, even under load.
The first problematic system Hume described was Gecko, a large-scale
(250GB/day) billing system implemented in 1996 on Solaris 2.6. AT&T
required 1 GB/s of filesystem throughput and predictable use of
memory. Among the problems encountered were that Solaris crashed every
few days for the first six months that the system was in production, that
Solaris limited file throughput to about 600 MB/sec, that reading
large files sequentially crashed the VM, and that a "VM roller
coaster" developed when a large chunk of memory was allocated
(resulting in all of physical memory being paged out and then almost
all of the same data being paged back in, over and over again in a
cycle, rather than the system just paging out the amount of new memory
needed).
The second problematic system Hume described was a replacement for
Gecko that required six times the capacity of the original. This system
was implemented on a cluster running Linux. The architecture was a
"Swiss canton" model of loosely affiliated independent nodes with a
single locus of control, data replication among nodes, and a single
error path so that software could only halt by crashing (there was no
explicit shutdown operation). Hume described eight problems the Gecko
implementors experienced with Linux (versions 2.4.18 through 2.4.20),
including Linux's forcing all I/O through a filesystem buffer cache
with highly unpredictable performance scaling (30 MB/s to write to one
filesystem at a time, 2 MB/s to write to two at a time), general I/O
flakiness (1-5% of the time corrupting data read into gzip), TCP/IP
networking that was slow and behaved poorly under overload, lack
of a good filesystem, nodes that didn't survive two reboots,
and slow operation of some I/O utilities such as df. In general, Hume
said, he has concluded that "Linux is good if you want to run Apache or
compile the kernel. Every other application is suspect."
Hume proposed the following definition of OS reliability: "[The OS]
does what you ask, or it fails within a modest bounded time." He noted
that FreeBSD has comparable functionality to Linux, better
performance, and higher reliability, and he speculated that this might
stem from BSD's (and other "clean, lean, effective systems") having
been built using "a small set of principles extensively used, and a
sense of taste of what is good practice, clearly articulated by a
small team of mature, experienced people." Hume took Linux to task for
not demonstrating these characteristics, in particular for being too
bloated in terms of features, and for having been developed by too
large a team. Further, he singled out the Carrier Grade Linux effort
for special condemnation for "addressing zero of the [type of]
problems" he has had.
Discussion
During Q&A, Margo Seltzer asked if perhaps part of the problem is that
the dependable systems community has traditionally been isolated from
the operating systems community. Hume agreed, and he suggested that
the dependability community may understand some things that the OS
community does not yet, such as the end-to-endness of the reliability
issue. Armando Fox asked whether the problem is not the sheer number of
developers who contribute to projects that have turned out imperfect
but, in many users' eyes, "good enough" (e.g., Linux and HTTP), but
rather the need for invariants to which all the developers would
program, which would help make such systems more
reliable. Fox suggested Hume's rule of "It does what you ask or it fails
within a modest bounded time" as one possible invariant. Hume agreed,
saying that he is working hard on educating the Linux community about
how to make the OS more reliable, particularly in large-scale HA
configurations like that used by Gecko. Jeff Mogul suggested that
perhaps what's missing is repeatable, easy-to-run OS dependability
benchmarks based on the kinds of failure anecdotes Hume was
describing. Jay Lepreau suggested that instead of taking a large
general-purpose OS like Linux and stripping it down to a reliable
core, an alternative approach would be to start with a reliable
application-specific OS and to add just the extra functionality that
might be necessary to run Gecko. Hume disagreed, suggesting that he
had more hope for making existing general-purpose OSes more
reliable. Tal Garfinkel pointed out that Hume had talked primarily
about performance scalability rather than absolute reliability, and
Hume agreed that for his application absolute single-node reliability
is less important than the ability to get predictable performance
scalability when more nodes are added and predictable performance
degradation when nodes fail. Ethan Miller asked if product liability
lawsuits might help the OS reliability problem, but Hume disagreed,
stating that he is "not keen on litigation to direct OS research."
Finally, Val Henson asked if OS reliability, while bad, is perhaps
"good enough" for most people. Hume agreed, but suggested that we
should raise our expectations.
10:45 a.m.–12:55 p.m.: The Emperor's Clothes (Chair: Steve Gribble; scribe: Matt Welsh)
High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two
Charles Blake and Rodrigo Rodrigues, MIT Laboratory for Computer Science
Charles Blake opened the first technical session with a talk on the
overhead of "maintenance bandwidth"network bandwidth consumed to
maintain a given level of replication or redundancyin a peer-to-peer
storage system. The basic argument is that maintenance bandwidth across
the WAN, not
the aggregate local disk space, is the fundamental limit to scalability in these systems. Given the dynamics of nodes joining and
leaving the system, Charles presented a conservative estimate of the
maintenance bandwidth that scales with the WAN bandwidth and average
lifetime of nodes in the system. Under a typical scenario (100 million
cable modems with a certain bandwidth available for replication, one
week average lifetime, and 100 GB storage per node), only 500 MB of
space per node is usable, only 0.5% of the total.
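As a rough sketch of the scaling argument (my notation, not necessarily the authors'): if each node can devote cross-WAN bandwidth B to redundancy maintenance, remains in the system for an average lifetime T, and the system keeps r copies of every block, then every byte a departing node held must be re-created elsewhere within about one lifetime, so the sustainable storage served per node is bounded roughly by

    \[ S \;\lesssim\; \frac{B\,T}{r}. \]

Short lifetimes or thin upstream links therefore cap usable space at a tiny fraction of the attached disks, which is how a 100 GB disk shrinks to roughly 500 MB of usable space in the scenario above.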
To try to address these problems, Charles looked at alternatives such as
admission control (only admitting "reliable" nodes) or incentivizing
nodes to have long lifetimes. It turns out that a small core of reliable
nodes (such as a few hundred universities with a single reliable machine
dedicated to hosting data) yields as much maintenance bandwidth reduction
as millions of home users with flaky connections. The talk concluded
with a number of open issues in organizing WAN-based storage systems,
such as whether it is appropriate to assume millions of flaky users and
whether the requirement of aggregate data availability should be
reconsidered.
Discussion
Armando Fox raised the point that, rather than high availability,
millions of home users really bring lack of accountability. The question
is whether there is any benefit other than pirating movies, and Charles
answered that perhaps circumventing legal requirements should be a top
design goal of these systems.
Eric Brewer argued that once a system becomes popular, it will be
centralized, but that P2P systems are really useful for prototyping.
Charles responded that, while there are benefits to collaboration,
one may as well use a small number of available,
well-connected/well-motivated friends to do a prototype.
Margo Seltzer wondered whether P2P could be useful to the Grid community
and their interest in harvesting cycles rather than storage. Charles agreed.
Andrew Hume pointed out that the Grid community is talking about needing
gigabits of networking between sites.
Mike Swift wondered about not providing a disconnect button in the user
interface, as a way of creating an incentive for users to offer long lifetimes.
Timothy Roscoe asked whether the tradeoff between storage space and bandwidth
usage was inherent, or whether it was a transient condition resulting from current
economic conditions. Charles replied that telcos may not have the
motivation to increase bandwidth upstream from home users if most
applications (e.g., Web surfing) require only a small amount.
One Hop Lookups for Peer-to-Peer Overlays
Anjali Gupta, Barbara Liskov, and Rodrigo Rodrigues, MIT Laboratory for Computer Science
Anjali Gupta presented a talk on the use of one-hop lookups in
peer-to-peer systems, avoiding the high latency associated with the
typical log(N) lookup paths required by most systems. The challenge is
keeping up with membership change information on all hosts. For example,
the UW Gnutella study in 2002 showed an average node session time of 2.9
hours, implying 20 membership changes per second in a system with 100,000
hosts. Anjali presented a hierarchical scheme, in which the address
space (forming a ring) is subdivided into slices, each with a slice
leader that is the successor to the midpoint in the slice. Slices are
further subdivided into units.
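As a quick sanity check of that churn figure (my arithmetic, not the speaker's): each session contributes one join and one leave, so N nodes with a mean session time T generate about

    \[ \frac{2N}{T} \;=\; \frac{2 \times 10^{5}}{2.9 \times 3600\ \mathrm{s}} \;\approx\; 19 \]

membership events per second, which matches the roughly 20 changes per second cited above.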
The basic approach is for nodes to
exchange frequent keep-alive messages with their predecessor and
successor nodes. A change to a node's successor is an event that is
propagated by piggybacking a recent event log onto keep-alive messages.
A node change event is relayed to the slice leader, which periodically
(every 30 seconds) notifies other slices of the updates. Internally
to a slice, slice leaders periodically (every 5 seconds) notify unit
leaders of node change information. Given some reasonable assumptions on
the size of the system, all nodes can be updated within 45 seconds of a
node leaving or joining the system, which permits a 99% "hit rate" for
an address lookup. In this scheme, it is important to choose good slice
leaders that are well-provisioned. Anjali concluded with a summary of
ongoing implementation and experimentation work, noting that systems
larger than a million nodes will require two-hop lookups.
Discussion
Mike Jones noted that there would most likely be a burst of traffic to and
from a new node as it comes up, and he asked how a node just joining
could communicate when it would take the 45-second update propagation
delay before other nodes would know how to reach the new node.
Anjali responded that a node starts participating only after it
completely downloads a routing table from a unit or slice leader.
Dan Wallach asked about correlated failures, where a node and its
successor fail at the same time. Anjali responded that the odds of this
occurring are low and that this only introduces a bit more delay for the
update. Dan asked about malicious nodes, and Anjali responded that this
is not addressed by this work.
Ethan Miller asked about how the system scales up to a million-node
system; Anjali said that this system reaches its upper limit at that
size and does not scale beyond it.
Timothy Roscoe asked whether this system assumes that every node can
talk to every other node in the network directly, and whether this was
realistic. The response was that they were indeed making the
assumption that every node can talk to its neighbors and unit and
slice leaders. There are cases when this assumption may be invalid,
and this question will be addressed in future work.
Eric Brewer pointed out that there is a lot of work in industry on
hierarchical caching, allowing a big win since updates need not be
propagated unless an object is hot. Anjali said that they were
looking at designs where information was propagated for all objects
and had not looked at schemes that would do bookkeeping for object
popularity, but she agreed it was a good idea and worth looking at.
An Analysis of Compare-by-hash
Val Henson, Sun Microsystems
Val Henson presented one of the most controversial papers of the
conference, admonishing systems that decide whether two blocks of data are
identical by comparing cryptographic hashes of the data. Many systems (such as rsync,
Venti, Pastiche, LBFS, and OpenCM) use this technique, but it is not yet
widely accepted by OS researchers, due to little characterization of the
technique and many unanswered questions. The risk of collision using (say)
a 160-bit SHA-1 hash is just 2^-160, far lower than the probability of a
hardware failure or of an undetected TCP error. So why the controversy?
First, these techniques assume that data is random, but real data is not
random and has a fair amount of commonality (think about ELF headers
and English text). Second, cryptographic hashes were designed for
authentication and care about "meaningful" collisions, such as two
contracts with the same text but different dollar amounts that happen to
collide in the hash space. Third, hash algorithms are short-lived and
obsolescence is inevitable; systems need an upgrade strategy. Finally,
collisions are deterministic (two blocks that collide always collide) rather than a transient error such as a hardware fault. Hash collision
is therefore a silent error in those systems that rely on compare-by-hash
techniques. Val claimed that we should be striving for correctness in
systems software, not introducing "known bugs." It is OK to rely on
compare-by-hash when the address space is not shared by untrusted parties,
and when the user knows and expects the possibility of incorrect behavior; she cited rsync as an example. Note that "double hashing" is not an
acceptable solution, as this results in just another hash function,
albeit one with a lower collision probability.
Some alternatives to compare by hash were discussed, such as
content-based addressing that checks for collisions; compression that
maintains state so identical blocks are sent or stored only once (as in LBFS);
sending diffs instead of an entire block; or using universal IDs for
common blocks.
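For concreteness, here is a minimal sketch (hypothetical code, not taken from any of the systems named above) of the compare-by-hash pattern under discussion, together with the collision-checking variant mentioned among the alternatives:

    import hashlib

    class HashAddressedStore:
        """Toy content-addressed block store illustrating compare-by-hash.

        Blocks are indexed solely by their SHA-1 digest; two blocks with
        the same digest are assumed to be identical and stored only once.
        A collision, however unlikely, would silently return the wrong
        data, which is the deterministic failure mode Henson warns about.
        """

        def __init__(self):
            self.blocks = {}  # digest -> block bytes

        def put(self, block):
            digest = hashlib.sha1(block).digest()
            # Compare-by-hash: a matching digest is treated as "same block."
            self.blocks.setdefault(digest, block)
            return digest

        def get(self, digest):
            return self.blocks[digest]

    def put_with_collision_check(store, block):
        """The 'check for collisions' alternative: also verify the bytes,
        turning a silent collision into a detectable error."""
        digest = hashlib.sha1(block).digest()
        existing = store.blocks.get(digest)
        if existing is not None and existing != block:
            raise ValueError("hash collision detected")
        store.blocks[digest] = block
        return digest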
Discussion
Peter Druschel argued that Val was taking a purist view of correctness,
and that optimizations in general lead to complexity which reduces
reliability in systems. Val responded that this has to do with
tradeoffs, and that the tradeoffs taken in compare-by-hash were not the
right ones.
Tal Garfinkel claimed that Val's arguments were based on "math FUD" and
that paranoia is unwarranted; after all, all of cryptography is
based on this technique. Val responded that we should not assume that
hash functions designed for cryptography are appropriate for these
applications. Tal further explained that similar data does
not produce similar hashes, contrary to her assumption.
Andrew Hume pointed out that checksums work fine as long as there is a
uniform distribution across the hash space. He also commented that,
while he is skeptical of Val's birthday paradox argument that with
2^80 documents there is a 50% chance of a collision with a 160-bit hash,
if it is true it is a very scary probability, and a lot of people would
be much happier with a much smaller threshold like one in a million or
one in a billion. Val answered that this is a difficult
number to calculate (due to the infinitesimal quantities involved),
and indeed, that teams of people and weeks of time could not produce
the standard "M = sqrt(2 * p * N)" formula for the address space
pressure needed to produce a collision with probability p, but that
they believe it.
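For reference, the standard birthday-bound approximation behind the disputed figure (textbook math, not specific to the paper): hashing M items uniformly into N possible values gives a collision probability of about

    \[ p \;\approx\; 1 - e^{-M^{2}/(2N)} \;\approx\; \frac{M^{2}}{2N} \quad (p \ll 1), \]

so the load needed to reach collision probability p is M ≈ sqrt(2pN). With a 160-bit hash (N = 2^160), p = 1/2 requires about 2^80 documents, and even a one-in-a-billion collision probability still requires roughly 2^65 documents.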
Dirk Grunwald asked at what point the hash collision probability
becomes infinitesimal compared to everything else, such as TCP or ECC
bit errors. Val again answered that the seriousness of the undetected
collision made it important to consider the possibility.
Why Events Are a Bad Idea (for High-Concurrency Servers)
Rob von Behren, Jeremy Condit, and Eric Brewer, UC Berkeley
Rob von Behren raised the argument of thread-based versus event-driven
concurrency in high-concurrency servers, claiming that thread-based
approaches are far better due to their ease of programming. To counter
the arguments that threaded systems have inherently higher overhead than
event-driven ones, Rob presented early results from a lightweight user-level thread
system that performed as well as an event-driven system on a Web server
benchmark. Furthermore, threads have better programming and debugging
tools, leading to increased productivity. To address the problem of high
overhead for per-thread stacks, Rob proposed the use of compiler support
to automatically compress stacks, for example, by moving "dead" elements
off the stack across a blocking operation. Using cooperative
scheduling avoids the overhead of generic thread synchronization;
however, there are some issues to address here such as fairness, the use
of multiprocessors, and how to handle spontaneous blocking events such
as page faults.
Rob pointed out that events have the advantage of permitting very complex
control flow structures; however, very few programmers use these
structures, and threads can capture the vast majority of scenarios.
Another problem with thread schedulers is that they are "generic" and
have little knowledge of application structure. To permit greater
cache locality, Rob proposed "2D" batch scheduling, in which the compiler
annotates the application code to indicate to the scheduler system where
the various stages of the thread's execution are located.
Rob presented some measurements of a simple Web server benchmark based
on his user-level threads package, capable of supporting over 100,000
threads, implemented in about 5000 lines of C code. The server
outperforms a Java-based event-driven Web server, probably due to the
large number of context switches in the event-driven system.
Rob concluded that it may be possible to achieve higher performance
using threads than events, in part because events require dynamic
dispatch through function pointers, which makes it difficult to perform
inlining and branch prediction.
Discussion
Timothy Roscoe shared a vignette from his upbringing: "When I was young,
I was surrounded by big tall men with beards, who talked about something
called continuations." He asked whether this is what Rob was proposing
and whether we shouldn't be looking at what the functional languages
community is doing in this area. Rob was brief: "Yes."
Margo Seltzer asked about the background of this work, to which Rob
explained that the UC Berkeley Ninja project had first gone down the
path of using threads to build a large-scale system, and then switched
to events to address some of the scalability problems. Both approaches
"kind of sucked," and when Rob started writing tools to fix this problem
he realized that a lean, user-level threading approach was the right
solution.
Jeff Chase asked about writing tools to make it easier to write in an
explicit event-based model, rather than a thread-based model. His
motivating example was distributed systems using callbacks, for which an
event-driven style of programming seems inevitable. Rob responded that
one does not need to use events to handle callbacks: with cheap enough
threads, one can spawn a new thread to wait for the callback.
General Discussion
Charles Blake kicked it off by asking why Val's birthday paradox
probability was so hard to compute. She responded that essentially it
comes down to the infinitesimal numbers involved. Charles later
contended that, whatever the difficulty, the correction hardly matters,
since the next-order term in M = sqrt(2pN) is proportional to
O((M/N)^2), which is negligible on any of the address-space pressure
scales in question.
Eric Brewer suggested that systems should use a CRC, not MD5: since a CRC
is no good for preventing malicious collisions, there can be no illusion
that it is. One should also use a random salt with the checksum, which
should help with the non-randomness of real data. Val responded that if
one has to recompute the checksum across the actual data, then one loses
the benefits of this technique.
George Candea raised the point that, although a P2P client that
prevents a user from disconnecting appears less desirable at first, it
would lead to higher availability for the service as a whole. This
makes the service more valuable and hence provides greater incentive
to use it (i.e., download the client).
Ethan Miller asked whether people are really comfortable with the
concept of probabilistic storage. Val agreed that the notion of dynamic,
unreliable storage systems makes her uncomfortable.
Ranjita Bhagwan pointed out that Charles's calculations don't push P2P
out of the picture, asking whether there may be a cost benefit to a
peer-to-peer approach versus a centralized approach. Charles said that
fundamentally his argument was economic, concerning the bandwidth versus
storage requirement for these systems. Andrew Hume said that the best
nodes are professionally managed and that high-bandwidth connections and
support are expensive, so the economics of the two approaches are more
similar than they are different.
Mohan Rajagopalan said that compiler optimizations actually perform very
well for event-based systems, and that implementation is what really
matters. Event-based systems permit a decoupling between the caller and
callee, so it is easier to write an event-based "adaptive" program than one
in threads. Isn't this a fundamental benefit? Rob responded that events
do make it easier to perform composition and interpositioning, but that
this can also be done in the thread model. Eric Brewer mentioned that Click
is very configurable and runs as a single large thread.
Peter Druschel said he was skeptical that we can do P2P storage based on
home-connected desktops, but argued that the alternative is not centralized systems.
For example, one can reap the benefits of unused desktop systems
within a large organization. Charles did not disagree with that.
This was followed by an exchange between Peter and Rodrigo Rodrigues
about using so-called "scalable lookup" (Pastry/Chord/CAN/...)
vs. some other organization for P2P file storage. Basically Rodrigo
pointed out that in a scenario where the individual nodes are very
available/reliable and the network isn't giant, there is no need for
scalable lookup and other considerations should take higher priority.
Peter responded that having a large number of nodes, together with security
concerns, implies the need for optimizations that keep per-node lookup state small.
2:30 p.m.–4:15 p.m.: Popping & Pushing the Stack (Chair: Geoff Voelker; scribe: Ranjita Bhagwan)
TCP Offload Is a Dumb Idea Whose Time Has Come
Jeffrey C. Mogul, Hewlett Packard Laboratories
TCP offload in the traditional sense violates performance
requirements, has practical deployment issues, and targets the
wrong applications. TCP offload engines (TOEs) impose complex interfaces
and cause suboptimal buffer management. Moreover, with lots of small
connections, connection-management costs overwhelm the savings. Event
management is a problem, and lots of virtual resources need to be
managed. Also, one of the main motivations
for TOE has been that TCP implementation in the OS is bad.
However, it is no longer a dumb idea, because now we are offloading higher-level protocols onto hardware. The justification for offloading TCP
is simply that you can't offload the higher-level protocols without
also offloading TCP. The sweet spot for TCP offload is when
the application uses very high bandwidth and has relatively low end-to-end
latency, long connection durations, and relatively few connections;
examples include storage server access and graphics. Also, several economic
trends are favorable to TCP offload. One would like to replace
special-purpose hardware with cheap commodity parts, such as 1-gigabit or
10-gigabit Ethernet. With these in place, operators will have
only one kind of fabric to provision, connect, and manage. Still, many
challenges remain. Data copy costs still dominate, and buses are too
slow. Zero-copy and single-copy schemes seem too hard to adopt in commercial OSes.
However, with the advent of RDMA, vendors want to ship RNICs in volume,
allowing one kind of chip for all applications. It would mean cheaper
hardware. Also, there are several upper-level protocols available,
such as NFSv4 and DAFS. Still, many problems
remain. There are security concerns, and so far the benefits have been
elusive. The new networking model may require changes to traditional
OS APIs. Systems people need to give this due consideration.
Discussion
Margo Seltzer observed that by coupling RDMA with TCP in hardware, some of
these issues might be resolved, since now you would get a generic version of
RDMA. Jeff said that offloading even higher-level protocols is possible
but difficult.
Rob von Behren asked what the write interface should be like. Jeff
responded that standardization of APIs is not a priority. Currently,
the concentration is on standardization of primitives. However, he does
not know if that is sufficient.
Val Henson said that Linux has a commercially viable implementation of
zero-copy that uses the TCP checksum implementation in the NIC.
TCP Meets Mobile Code
Parveen Patel, University of Utah; David Wetherall, University of Washington; Jay Lepreau, University of Utah; Andrew Whitaker, University of Washington
The authors address the problem of deployment of transport protocols
by proposing an extensible transport layer called XTCP. The main
argument is that transport protocols such as TCP need a self-upgrade
mechanism, and untrusted mobile code can help build such a mechanism.
Several modifications to TCP, as well as alternative transport
protocols, have been proposed. However, as with any new protocol,
deployment is an issue. Currently, it takes many years before a new
protocol or an extension can be used by applications: a new protocol or
extension has to be approved by standards committees, implemented by
OS vendors, and finally enabled by default at both ends of
communication.
In the proposed solution, untrusted peers can upgrade each other with
new transport protocols using mobile code. A typical usage scenario is
that of a Web server. A Web server can download a high-performance
version of TCP, after which it tells every client to download the same
version from it. Then the client and the server can speak the upgraded
version of TCP. This solution avoids all the steps of the deployment
process that need approval and support from third parties, such as
standards committees and OS vendors.
There are several challenges to building such an extensible layer,
especially of host and network safety. The presenter contrasted XTCP
with "active networking" and argued that the domain of transport
protocols is restricted enough that host and network safety
challenges can be met without degrading performance.
Host safety is assured by providing memory protection and resource
control. Memory protection is achieved by using Cyclone, a typesafe
C-like language. The stylized memory usage pattern of TCP
extensions (no shared state between extensions and predictable
ownership of data buffers) makes resource control possible using
traditional runtime methods. XTCP uses the well understood notion of
TCP-friendliness as a measure of network safety. All extensions
written using the XTCP framework are forced to conform to
TCP-friendliness using the ECN nonce mechanism. In contrast, active
networking had no such well-defined notion of network safety, and host
safety in the face of arbitrary code was costly.
XTCP has been implemented in FreeBSD 4.7. Support for user-level
transports is being developed currently.
Discussion
Constantine Sapuntzakis asked how many extensions could be implemented on the
XTCP API. Parveen said that 25 of the 27 extensions they considered
could be; the other two were not TCP-friendly.
Peter Steenkiste asked who builds the headers. Parveen responded that
XTCP builds the IP header, while the extension builds the transport
header. Peter then asked, if the definition of TCP-friendliness
changes, how would XTCP work? The speaker said that, currently, changing
the definition of TCP-friendliness would require a regular upgrade of
the operating system.
Val Henson said that she was not convinced that XTCP could be secure.
She thought extensions could be malicious or non-performing
in more ways than stated, although she lacked specific examples.
Parveen responded that non-performance of
extensions is not a threat to XTCP. This is because extensions can hurt or benefit only the code provider. For example, if Yahoo accepts
code from a client, then that extension will be used only in sessions
with that client, so the extension provider has an incentive to
provide good code.
Geoff Voelker asked what would happen in the case of worm spread. One
can compromise a server such as yahoo.com and then use this mechanism to
spread worms. Each worm would behave in a TCP-friendly manner, but many
worms could collaborate to attack a remote host. Parveen responded that
such a worm could not be implemented using XTCP alone, because the IP
header is built by the XTCP layer, and the extensions cannot affect
it. Therefore extensions cannot send data to an arbitrary host unless
an application explicitly opens a connection to it and sends
data. Furthermore, an extension received from yahoo.com would be used
for communication with yahoo.com alone. That makes XTCP as harmless, or
as harmful, as the currently deployed trusted versions of TCP.
Exploiting the Synergy between Peer-to-Peer and Mobile Ad Hoc Networks
Y. Charlie Hu, Saumitra M. Das, and Himabindu Pucha, Purdue University
There appear to be a number of similarities between the problems
addressed by research in peer-to-peer and in mobile ad hoc networking.
One such area is routing. The speaker showed how ad hoc routing
resembles the problem solved by Pastry and how Pastry's techniques can
be applied to ad hoc networking. He described a new protocol, DPSR, which stores
routing state in a manner similar to Pastry's. This reduces routing
state per node from O(N) to O(log N). DPSR handles node ID assignment,
node state, routing, node joins, and node failure or unreachability
in much the same manner as Pastry, inherits all DSR
optimizations on source routes, and contains a number of additional
optimizations related to Pastry's routing structures and operations.
Simulations of DPSR for a 50-node system show that the routing overhead
of DPSR scales better than that of DSR. In short, DPSR outperforms DSR
when the number of connections per source is greater than 1. It ties
DSR otherwise.
Discussion
Timothy Roscoe asked how many routes were going through the Pastry
routing for more than one hop. Charlie said that though he did not
specifically measure this number, the average number of overlay hops
for the 50-node simulation is around 1.5.
Srini Seshan asked why not add a low-TTL flooding optimization
to DSR. Charlie said this does not help in general.
General Discussion
Bogdan Popescu asked Parveen (XTCP) whether a signing mechanism could be
used to detect unresponsive connections. Parveen said that the nice thing
about XTCP is that it works well without one. Bogdan said that then you
could have DoS attacks.
Rob von Behren said that it would be very easy to do DoS
on XTCP, such as malloc'ing large amounts of memory, using a lot of CPU
time, etc. Parveen said that each malloc call is accounted for. Rob
responded that there is the problem of DDoS. With a considerable
number of nodes using a little too much memory, one could perform a DDoS
attack. Parveen said that this is possible; the only way to avoid it is
strict admission control.
Jay Lepreau brought up Geoff
Voelker's question again, and said that the code is identified by a
hash. One could develop a history-based code-rating scheme, so that if
a piece of code known to be "bad" were offered, the server
would disallow it. But then the attacker could keep changing the
code slightly.
Jeff Mogul said that you have to make sure that XTCP itself is not
subvertible. Because if it is, then it is a very rich environment for
spreading worms.
Peter Steenkiste said that in the early '90s, after six months of effort,
he had decided that TCP offloading was no good. In general, enthusiasm for
TCP offload seems lukewarm. He asked Jeff if it would take off. Jeff
responded that he does believe that it will take off, mainly for
commercial reasons. Having only one fabric to manage for datacenters
seems good. Peter said that there appears to be a contradiction somewhere
here. Earlier on, we wanted to move things to the software level, and
now attempts are being made to move them to hardware. Jeff said that
switches are clearly a larger investment than NICs, so commoditizing
the NICs would be good.
Geoff Voelker noted that the benefits of Charlie's approach depend on the
number of shared source routes and asked whether he had a sense of the
minimum sharing needed for DPSR to pay off. Charlie answered
that so far, the sharing was small, but even in this scenario, DPSR does
not do worse than DSR, so it seems overall to be a gain over DSR.
4:45 p.m.–5:35 p.m.: Distributed Systems (Chair: Jeff Chase; scribe: Amit Purohit)
* Scheduling and Simulation: How to Upgrade Distributed Systems
Sameer Ajmani and Barbara Liskov, MIT Laboratory for Computer Science; Liuba Shrira, Brandeis University
In the first talk of the session Sameer presented a solution for
upgrading distributed software automatically with minimal service
disruption. He presented a technique that uses a combination of
centralized and distributed components. The infrastructure consists
of three main components: scheduling functions tell a node when to
upgrade; simulation objects enable communication among nodes running
different versions; and transform functions convert a node's persistent
state from one version to the next.
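A minimal interface sketch of those three components (hypothetical names and signatures, not the authors' actual API) might look like this:

    # Hypothetical sketch of the three upgrade-infrastructure components
    # described above; names and signatures are illustrative only.
    from abc import ABC, abstractmethod

    class SchedulingFunction(ABC):
        @abstractmethod
        def should_upgrade_now(self, node_id, current_time):
            """Decide when this node should install the new version."""

    class SimulationObject(ABC):
        @abstractmethod
        def handle_cross_version_call(self, request):
            """Service a request from a node still running a different
            version, translating between old and new interfaces."""

    class TransformFunction(ABC):
        @abstractmethod
        def transform(self, old_persistent_state):
            """Convert a node's persistent state from one version to the next."""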
Discussion
Timothy Roscoe pointed out that this method is expensive. He
claimed that instead of upgrading each node and preserving the state,
it is possible to ignore the state. Sameer agreed that one can ignore
the state and allow the system to recover it, but said that
this reduces the rate at which upgrades can happen and may require
additional state transfer over the network. A. Veitch asked Sameer to
compare his work with the existing work on on-line upgrades. Sameer
said they are mainly targeting distributed systems.
* Development Tools for Distributed Applications
Mukesh Agrawal and Srinivasan Seshan, Carnegie Mellon University
In the second talk of the session Mukesh explained the motivation
for his current research. He claimed that there are few distributed
applications, not because they are not useful, but
because they are very hard to implement. He cited
routing table upgrading for distributed applications as one of the
hard problems. He mentioned the ns-2 simulator as a tool that helps to
compare design choices, and noted that work on DHTs is developing
building blocks to help in implementing distributed systems. Mukesh then
pointed out some inherent flaws in the current approaches: existing
research concentrates mainly on the initial stages of an application's
life-cycle, whereas his work mainly addresses the later stages.
Discussion
George Candea asked whether there is any good reason to build a distributed
application instead of a centralized one. Mukesh answered that control over
the application is the key. George argued it is not safe to give up
control, to which Mukesh replied that a single trusted entity is not
necessary.
* Virtual Appliances in the Collective: A Road to Hassle-free Computing
Constantine Sapuntzakis and Monica S. Lam, Stanford University
In this talk Constantine envisioned a computing utility that runs
not only Internet services but highly interactive applications that
are commonly run on desktop computers. Patches arrive at a high rate.
Multiple applications share a lot of elements, such as the OS and shared
libraries. Hence, an upgrade to one application can break another.
He argued that it is possible to borrow ideas from appliances to
improve the manageability and usability of computers. He then
described the architecture of their framework, in which appliances are
maintained by their makers without user involvement. Cheaper hardware
has made virtualization an attractive option.
Discussion
Jay Lepreau was concerned about wide deployment of this idea and was
skeptical about whether people would give up their autonomy. He said it
may be advantageous to separate applications into sets, but even that is
not problem-free. Constantine replied that an appliance could be fairly
small, making its deployment easier and more reliable. David Oppenheimer
asked how this would differ from application service providers;
Constantine said it is more flexible than such centralized approaches.
Andrew Hume joined the discussion at the end, saying it is very important
that the interface be extremely simple, which is a real challenge, since
appliances are dumb. Constantine responded that they are offering a
simple user interface rather than a complex mechanism, in the hope that
it will be easy to use and even fun for the user.
* POST: A Secure, Resilient, Cooperative Messaging System
Alan Mislove, Ansley Post, Charles Reis, Paul Willmann, Peter Druschel, and Dan S. Wallach, Rice University; Xavier Bonnaire, Pierre Sens, Jean-Michel Busca, and Luciana Arantes-Bezerra, Université Paris VI
The presenter described a P2P platform that supports a wide range of
collaborative services on a single serverless infrastructure. It
provides three basic services to applications: secure single-copy
message storage; an event notification service that alerts users to
events such as the arrival of a message; and single-writer logs, which
allow applications to maintain metadata. The authors claim that these
features are sufficient to support a variety of collaborative
applications.
Discussion
Mike Jones argued that application-level multicast would achieve the
same goals. The presenter said the main motivation is to prevent spam.
Armando Fox asked what the motivation was for choosing email as the
primary application. The presenter replied that email has great locality
properties, and that if such a demanding application can be supported,
POST can be extended to more complex applications.
Panel Session
In the panel session, Mike Swift asked Constantine about the cost of
complexity. He also noted that device drivers talk to hardware, so they
cannot easily be virtualized and, moreover, can crash the system if they
are buggy. Constantine said that if, in the future, device drivers were
written in user space, that problem could be solved; but if the
application itself crashes, not much can be done.
Eric Brewer stated the view that the hard part is sharing information:
having separate virtual appliances for everything only works if they
don't share any information, which means that the user must replicate
all "shared" information by hand (as we do now with real appliances,
e.g., setting the clock). The path of safe sharing leads you to shared
segments as in Multics, including layers (rings) and call gates for
protected calls into shared resources. Constantine replied that Multics
had some problems, which they are planning to address as well.
8:30 p.m.–10:30 p.m.: Outrageous Opinions Session (Chair: Jeff Mogul; scribes: David Oppenheimer and Matt Welsh)
Scribe Notes by David Oppenheimer
The Outrageous Opinions session featured 13 speakers.
Val Henson kicked off the session by suggesting that end-to-end
checksums replace component-to-component checksums. Andrew Hume
disagreed with this philosophy, pointing out that
component-to-component checks are vital for pointing the finger at the
component that has caused the problem.
Matt Welsh presented "a brief history of computing systems research,"
in which he urged computer scientists to think about how their
research can help to address social problems. He pointed out that
computer scientists have always worked on improving life for computer
scientists, focusing on improving their own day-to-day tasks. He
suggested that computer scientists should think more about social
problems rather than "How can I download porn and pirate music more
efficiently?" In particular, he cited education and health care,
particularly outside the United States and Europe, as social
problems that computer scientists could help tackle, for example by
empowering local government and remote communities. Specific
technologies he cited were peer-to-peer wireless networks for emergency
and disaster response systems; censorship-proof electronic
publishing of dissenting political opinions; sensor networks for
environmental monitoring and precision agriculture; inexpensive
medical diagnostics; highly reliable power environments; and
maintenance-free PCs.
Mendel Rosenblum talked about the similarities between microkernels
and virtual machine monitors (VMMs) for running multiple untrusted
environments on the same hardware. Both provide isolation,
resource management, and a small operating environment with simple
interfaces, leading to high assurance. The key difference is what runs
on top of each: for a VMM you don't need to port applications to run
on it, whereas for a microkernel you do. He pointed out that despite
this advantage for VMMs, academics and industry researchers are often
interested in microkernels, because by not leveraging existing
software, microkernels provide academics an opportunity to write lots
of papers and industry an opportunity to define APIs, leading to an
industry control point.
Mike Chen presented "two easy techniques to improve MTTR": redirect
users' attention to what's still available when something becomes
unavailable (e.g., "Please read this new privacy policy"), and blame it
on the user (e.g., "Did you forget your password? Please try
again."). He pointed out that by tricking the user, perceived
availability can easily be increased.
Timothy Roscoe presented a short talk entitled "Scalability Considered
Harmful." The core of his argument was a rejection of the traditional
dogma that maximum scalability is always a good thing. Instead, he
argued that systems should scale as far as required, but no
further. He argued that extra scalability is bad because it
facilitates abuse and concentrates power. Moreover, he claimed that
non-scalable systems are always more efficient than scalable
systems. As an example, he pointed out that email scales globally, but
it doesn't need to: the real social communication graph is cliquey,
but because email scales beyond the group of people with whom a user
really wants to communicate, we have the problem of spam. A second
example he cited was one he credited to Chris Overton: an unscalable
mechanism that some societies use to solve trust problems is reputation
and referral, which works, while a scalable mechanism that some
societies use to solve trust problems is lawyers, which doesn't work. A
third example Roscoe gave was that of email whitelists versus blacklists: whitelists don't scale but are highly effective at preventing spam,
while blacklists do scale but are much less effective. A final example
he cited was Google, a highly scalable, indispensable system. He
argued that the downsides of Google's success are that they have a lot
of power, it's not possible for someone else to deploy a competing
search engine anymore, and paper-based publishing and libraries have
declined. The non-scalable alternative he argued for instead of Google
was traditional libraries. Roscoe concluded by summarizing that
scalable systems benefit the system owners to the detriment of the
system users. He argued that by designing scalable systems, we as
researchers collude with the concentration of power in the hands of
the few. He therefore suggested that researchers stop designing
scalable systems.
Dan Wallach argued that the logical extension of the virtual machine
is the process.
Eric Brewer issued a call for IT research for developing regions. He
made 5 claims: (1) there is a need for a direct attack by developing
new devices rather than giving developing regions hand-me-down
machines, which are a bad fit on cost, power, user knowledge,
administration, and user literacy; (2) there is a need for
infrastructure to support thousands of projects which are currently
not sharing any infrastructure; (3) building IT for developing regions
is economically viable, in that there is a market of 4 billion people,
but IT must create income for its users (e.g., by offering a franchise
model akin to that used to provide cell phone service to rural
Bangladesh) because users do not have disposable income; (4) the time
is right with the availability of $5 802.11 chipsets, systems on a
chip, low-power designs, and solar panels; and (5) this work can have
a big impact by reducing poverty and disease and improving the
environment, by providing developing regions with a source of income
that they can in turn use to improve their standard of living,
stability, and security. In terms of concrete systems research areas,
Brewer suggested wireless networks, spoken user interfaces for regions
with low literacy rates, screens, proxies, and sensors. He indicated
that he is starting to work on the problem of IT research for
developing countries, and he urged the systems community to consider
it to be a part of their research agenda.
George Candea discussed why he believes that wide-area decentralized
systems such as peer-to-peer networks are "a good idea whose time has
passed." He argued that such systems are hard to build, test, debug,
deploy, and manage, and that they have little economic incentive
beyond "lack-of-accountability" applications. He suggested that the
principles learned from building wide-area distributed systems,
e.g., strong componentization, using open protocols, loose coupling,
reducing correlated faults through diversity, and writing components
while keeping emergent behaviors in mind should be used to build
highly dependable "distributed" systems within the datacenter. He
summarized by saying, "Don't distribute centralizable apps into the
wide-area network (WAN); take the good ideas from distributed systems
and apply them in the system-area network (SAN)."
Geoff Voelker parodied the evolution of data prediction schemes from
value prediction (cache previously loaded value and speculatively
execute on next load) to "result prediction" (don't bother running the
program, just guess the results). He presented several possible
applications of result prediction: 5% of the time have gzip just exit
with "not in gzip format"; occasionally have the Web browser return
random data from its cache, forcing the user to hit reload;
occasionally tell the user that he or she has new mail with the
"African bank spam"; andthe killer application for result
predictionhave SETI@HOME predict no ETs.
Emmett Witchel argued for a "thin malloc" that does not spread data
around the address space, but instead keeps it compacted into the
smallest memory region possible.
Ethan Miller proposed SCUBA: Scalable Computing for Underwater Biota
Assessment, in which 802.11 networks would be deployed on coral
reefs.
Sameer Ajmani suggested that systems researchers consider work in
computational biology: "We help biologists, then they help thousands
of people through pharmaceuticals, genetics, etc.; it's easier than
sending computers to Africa." As specific examples of computational
biology problems that can directly apply well-known computer science
algorithms, he cited string alignment (dynamic programming) and
database searches for genes (hash table with 2-tuples and
3-tuples). Andrew Hume added that the National Institutes of Health
also has more money than does the National Science Foundation.
Armando Fox called for a bet as to whether a peer-to-peer application
would exist before the next HotOS that would make more sense
(economically and technically) to deploy as a peer-to-peer system than
as a centralized service. 7 people in the audience said yes, 17 said
no. Armando offered to bet someone (Armando taking the "no" side) for
a case of alcohol valued less than the conference registration fee in
2005, but there were no takers.
Scribe Notes by Matt Welsh
In classic HotOS tradition, the Outrageous Opinions session consisted of
a stream of short presentations, some serious, some mundane, some
hilarious. Jeff Mogul tempted potential participants with a "local
delicacy" with a "value of up to $1000" concealed in a paper bag.
The level of disappointment in the room was palpable when it was
revealed to be a can of Spam.
Val Henson argued against the use of checksums at all levels in a
storage system versus end-to-end checksums at the application. Andrew
Hume countered that it's good to have accountability at each level when
something goes wrong in the system.
Matt Welsh presented a brief but accurate history of systems research
goals, culminating in the current trend towards supporting faster
downloads of porn and pirating of MP3 files. He argued that systems
researchers should be focusing on social problems, such as education
and healthcare in developing countries.
Mendel Rosenblum contrasted two approaches to running two untrusted
environments on the same hardware: microkernels and VMMs. He wondered
aloud who would want to build a microkernel that doesn't
leverage existing software, then answered his own question with two
possibilities: academics who want to publish papers, or corporate
vendors who want to exercise a "control point." Mike Jones asked who
these vendors were that were pushing new microkernels, and Mendel
responded that Microsoft's Palladium system would be built this way.
Much more was said on Palladium in Tuesday's evening session.
Mike Chen argued the opposite of Val Henson's talk, claiming that one
should always use compare by hash, since in compare-by-data the error
rate is proportional to the data size and is much larger than the hash
collision rate. His real talk, though, was about the Berkeley/Stanford
Recovery Oriented Computing project, which aims to improve MTTR in
systems. He raised two ways to do this: first, when something fails,
redirect the user's attention to something else ("read our NEW privacy
policy"), or simply blame it on the user. This buys you tens of seconds
to several minutes of wait time.
Timothy Roscoe railed against the construction of scalable systems,
claiming that systems should only scale as far as needed and no further.
For example, email does not need to scale globally; after all, who
needs to send an email message to everybody? And whitelisting your email
to reduce spam doesn't scale, but it works. Mothy used Google as an
example of a system where extra scalability only concentrates power.
He cited libraries as a non-scalable alternative to Google.
Dan Wallach talked about all of the recent work on virtualizing
resources and getting multiple virtual machines to share resources, such
as shared libraries or the OS kernel. He proposed an alternative to
these approaches, a radical concept called a "process."
Eric Brewer brought the proceedings back to a more serious note with a
call for IT research for developing regions, claiming that we need a
direct approach rather than hand-me-downs from previous technology.
For these applications we need systems with very low cost and power that
make few assumptions about user knowledge or literacy. He pointed out that
there are thousands of ongoing IT projects in developing countries but
with little sharing. Developing an infrastructure that all these
projects could use would be very helpful. Lots of research
directions here, including very low-power and low-cost wireless
communications, new speech-based user interfaces, network proxies, and
sensors.
George Candea argued that WAN P2P distributed systems are a good idea
whose time has passed. They are hard to build, test, debug, deploy,
and manage, and they offer little economic incentive. He argued that the good
distributed systems ideas should be used within the data center instead.
Geoff Voelker presented a novel idea based on the notion of value
prediction from hardware architecture: "result prediction."
Rather than running the program, we can simply guess the results,
leading to excellent speedup potential! Examples: 5% of the time,
gunzip could spontaneously report, "Not in gzip format." Web browsers could
simply return random stuff from the browser cache. When retrieving new
mail, your inbox could simply predict that it's an African Bank spam.
The killer app for this technique is clear: SETI@Home.
Emmett Witchel asked whether there is a use for anti-optimization.
One application he suggested was to allow users to specify in advance the
amount of resources a computation takes, possibly eliminating covert
channels. This would be accomplished by intentionally adding delay
loops to code to use all available CPU and by spreading allocated
memory all over the address space.
More important, Emmett suggested that the last session of the
conference be moved forward to replace the breakout session, so that
those who had early flights on Wednesday could attend. Most agreed.
Ethan Miller expressed concern about the environment, particularly the
disappearance of many
species of fish, since we're eating them all. He proposed a sensor-based
approach to tracking fish populations, called Scalable Computing for
Underwater Biota Assessment. This project can't use 802.11, as
it would boil the fish. The main value of this project would be the
ability to check the box on the grant proposal form explaining that SCUBA diving
is being used in the project.
Sameer Ajmani upheld Matt and Eric's calls for socially relevant
applications, pointing out that computational biology has the potential
to help thousands of people, and there are real systems problems here
(such as managing enormous databases).
And the SPAM award goes to Geoff Voelker!
Tuesday, May 20, 2003
9:00 a.m.–11:15 a.m.: When Things Go Wrong (Chair: Dan Wallach; scribe: Amit Purohit)
Crash-Only Software
George Candea and Armando Fox, Stanford University
The session began with a talk from George Candea in which he explained
how to build Internet services that can be safely and cheaply
recovered by crash-rebooting minimal subsets of components. He
pointed out that most downtime-causing bugs are transient and
intermittent, and it is not feasible to guarantee that an application
can never crash. Even with recovery-safe applications, recovery can take too
long. Crash-only software achieves crash safety and fast recovery by
putting all important nonvolatile state outside the application
components, into crash-only state stores. For
components to be crash-only, they must be decoupled from
each other, from the resources they use, and from the requests they
process. He conceded that steady-state performance of crash-only
systems may suffer, but argued that (a) the overall goal is to
maximize the number of requests successfully served, not to serve some
quickly and then be unavailable for a long time, and (b) that techniques
will evolve that will improve performance of crash-only systems, the
way compilers improved the performance of programs written in high-level languages.
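A tiny illustration of the crash-only discipline described above (hypothetical code, not from the paper): all durable state lives in an external crash-safe store, and the component's only way to stop is to crash, so recovery is always just a restart.

    import os, signal

    class CrashOnlyWorker:
        """Illustrative crash-only component: keeps no important state in
        process memory and has no clean-shutdown path."""

        def __init__(self, state_store):
            # state_store stands in for a crash-only state store, e.g., a
            # transactional database or replicated session-state service.
            self.state_store = state_store

        def handle(self, request_id, payload):
            # Every request's effects go straight to the external store.
            self.state_store.write(request_id, payload)

        def stop(self):
            # "Stop" means "crash": kill the process; a supervisor restarts
            # it, and it resumes from the state store.
            os.kill(os.getpid(), signal.SIGKILL)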
Discussion
Andrew Hume asked how to set timeouts for crashing the systems
deliberately. George said that timeouts are set in
application-specific ways, at the entrance to the system (e.g., in the
Web tier, based on URL). Certain known thresholds can be exploited
(e.g., respond within the TCP timeout, or before the HCI-established
8-second distraction threshold for humans). Margo Seltzer asked George
to compare their work with that of Stonebraker. George replied that
Stonebraker's pioneering ideas were geared mostly toward data
management, while their work extends this to recovery management of
Internet services. Somebody from Colorado was concerned about
the possibility of DoS attacks that maliciously exploit this technique.
George answered that a stall proxy holds up new requests during
recovery; this form of admission control allows the system to quiesce.
The Phoenix Recovery System: Rebuilding from the Ashes of an Internet Catastrophe
Flavio Junqueira, Ranjita Bhagwan, Keith Marzullo, Stefan Savage, and Geoffrey M. Voelker, University of California, San Diego
This presentation explained the design of a cooperative, distributed,
remote backup system called Phoenix. Operating systems and user
applications have vulnerabilities. Large numbers of hosts may share
vulnerabilities, resulting in major outbreaks. Phoenix uses a strategy
based on attributes and cores. By replicating data on a set of hosts
with different values for each attribute, it is possible to reduce
the probability of error to a negligible number. In the Phoenix system
there is no single point of failure; copying with a large number of
requests is achieved by exponential backoff.
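As a hypothetical sketch of the attribute idea (the selection heuristic below is invented, not Phoenix's actual algorithm), a host can pick a backup "core" whose members all run operating systems different from its own and from one another, so that a single OS-specific outbreak is unlikely to destroy every copy of its data.

    import random

    # Hypothetical membership list: (host name, OS attribute).
    HOSTS = [("a", "windows"), ("b", "linux"), ("c", "freebsd"),
             ("d", "windows"), ("e", "macos"), ("f", "linux")]

    def choose_core(my_os, hosts, copies=3):
        """Pick backup hosts whose OS differs from ours and from one another,
        so no single shared vulnerability can take out all replicas."""
        candidates = [h for h in hosts if h[1] != my_os]
        random.shuffle(candidates)
        core, seen = [], set()
        for name, os_attr in candidates:
            if os_attr not in seen:
                core.append(name)
                seen.add(os_attr)
            if len(core) == copies:
                break
        return core

    print(choose_core("windows", HOSTS))   # e.g., ['b', 'c', 'e']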
Discussion
Somebody from MIT was concerned about Phoenix's flexibility. The answer
was that the system offers enough flexibility to tolerate different
kinds of failures. Popescu from Vrije Universiteit was concerned about
the imbalance in backup storage, since a Linux machine will always
store about 10 times more data than a Windows machine.
Using Runtime Paths for Macroanalysis
Mike Chen, University of California, Berkeley; Emre Kiciman, Stanford University; Anthony Accardi, Tellme Networks; Armando Fox, Stanford University; Eric Brewer, University of California, Berkeley
The authors emphasized the
benefits of macroanalysis, namely latency profiling, failure
handling, and failure detection and diagnosis. They introduced the concept of runtime
path analysis, in which paths are traced through software
components and then aggregated to understand global system behavior
via statistical inference. Runtime paths are also used for failure
handling and for diagnosing problems, all in an application-generic
fashion. The group explained that their work could be extended to
p2p message paths, event-driven systems, forks, and joins.
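One way to picture the statistical step (a hedged sketch; the authors' actual inference procedure is not spelled out in this summary) is to record which components each request's path touched and whether the request failed, then rank components by how much more often they appear in failed paths than in successful ones.

    # Each runtime path: (list of components the request passed through, failed?).
    # The trace below is made up for illustration.
    paths = [(["web", "auth", "db"], False),
             (["web", "cart", "db"], True),
             (["web", "cart", "db"], True),
             (["web", "auth", "db"], False)]

    def suspect_components(paths):
        """Score each component by P(in path | failure) - P(in path | success)."""
        failures = sum(1 for _, failed in paths if failed) or 1
        successes = sum(1 for _, failed in paths if not failed) or 1
        fail_counts, ok_counts = {}, {}
        for comps, failed in paths:
            bucket = fail_counts if failed else ok_counts
            for c in set(comps):
                bucket[c] = bucket.get(c, 0) + 1
        comps = set(fail_counts) | set(ok_counts)
        scores = {c: fail_counts.get(c, 0) / failures
                     - ok_counts.get(c, 0) / successes for c in comps}
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(suspect_components(paths))   # 'cart' ranks as the likeliest culprit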
Discussion
To Margo's query about the future of this approach, the authors
answered that similar techniques are already being used by designers of
large projects, which suggests the approach is here to stay.
Magpie: Online Modelling and Performance-aware Systems
Paul Barham, Rebecca Isaacs, Richard Mortier, and Dushyanth Narayanan, Microsoft Research Ltd, Cambridge, UK
Paul Barham described a modeling service that collates
traces from multiple machines, extracts request-specific audit
trails, and constructs probabilistic models of request behavior.
He noted that a workload description and a hardware model could be
used to predict performance, and that the system can be augmented with
feedback from past models. To understand request behavior across all
scenarios and make use of that knowledge, behavioral clustering is employed.
Discussion
A researcher from MIT asked whether the Markov model could be useful. Paul
said that some of the Markov solutions could prove useful. Somebody from
Washington asked Paul to prove that an unusual activity is always a
fault and not a genuine activity. Paul said detection of unusual activity is the
basis of intrusion detection systems.
* Using Computers to Diagnose Computer Problems
Joshua A. Redstone, Michael M. Swift, and Brian N. Bershad, University of Washington
Joshua talked about building a global-scale automated problem
diagnosis system that captures the workflow of
system diagnosis and repair. The computer generates search terms and
locates problem reports. It stores problem reports in a canonical
global database. When a problem occurs, the computer detects symptoms
and searches the problem database. Joshua argues that more
effort building the database could be a good trade-off, resulting in cheaper diagnoses.
He said the main challenge lies in creating a database
structure so that searching meets user expectations.
Discussion
Mike Jones commented that Windows XP has an automated bug-reporting
facility. Joshua said that the XP facility only reports application
crashes. Timothy Roscoe said that the main challenge lies in the
development of a formal language. Joshua responded that the structure
is simple. He agreed that it would be difficult to design the system if
problems were disjoint, but argued that that would not always be the case.
Val Henson said Google works perfectly fine for identifying error
messages. Joshua argued that his system can give more information,
such as the version, which will be more helpful in solving the problem.
General Discussion
In the general discussion Armando Fox asked a question regarding the
deployment-of-diversity problem. Another question was how this
system impacts the time needed to recover. The author said he didn't have the
final answer, but it should be feasible without much trouble. Jeff
Mogul remained concerned about time to recovery. The author said there
aren't critical recovery-time requirements, as it is more important
to get the data back. He also pointed out that it is possible
to make it faster by adding redundancy. But it is a trade-off
between storage and time.
Somebody from MIT asked about the deployment of Phoenix. The author said it
seems reasonable if there are enough hosts running diversified OSes.
Peter Druschel compared the Phoenix approach with the randomized replica
approach, and questioned whether, with a small number of attribute values,
Phoenix really performs better than replicas. The author said that in the randomized
replica approach, it is very hard to implement the recovery solution.
Ranjita from UCSD asked Joshua what
would be the motivation for people to fill the database. Joshua
said organizations tend to use external support, so people would find
the increased ease of resolving their problem worth the effort. Mike Jones
proposed an alternative scheme in which the user submits a request
and, once it is solved, the solution is posted. Joshua was not sure how
to maintain the database. Mike said you could save the entire request
and response sequence and then process it offline.
Jay Lepreau asked George
(Crash-only software) about his project's impact on throughput. George
said it's likely to be lower, because of the inherently distributed
approach (loose coupling, explicit communication, etc.) to building
the system, but with time we will learn how to improve performance, citing as
an example going from assembly language to high-level
languages. High-level languages enabled a qualitative jump in the
types of software written, even if the actual writing of code was slower
than in assembly. He also argued that "goodput" (throughput of
successfully handled requests) is more important than absolute
throughput. There was a follow-up question regarding the
throughput of applications that do not fail. George said
that it's true nobody needs to recover applications that never fail,
but that such apps do not exist.
11:45 a.m.–12:30 p.m.: Performance Optimization (Chair: Jeff Mogul; scribe: Ranjita Bhagwan)
* Using Performance Reflection in Systems Software
Robert Fowler and Alan Cox, Rice University; Sameh Elnikety and Willy Zwaenepoel, EPFL
The main idea of this work is to use application-independent measures
such as hardware instrumentation mechanisms and general system
statistics to adapt system behavior. Performance indicators such as
TLB misses and cache misses can be used to measure overhead, while measures of productivity include
bytes sent to a network card, as well as flop rate. Productivity and overhead are used to determine whether the
system needs to be tuned. Sameh Elnikety showed how server throttling of MySQL under the TPC-W workload succeeded
in keeping the throughput at the maximum level while load increased,
whereas, without using reflection, the throughput dropped at higher
loads.
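A rough sketch of the reflection loop (the specific indicators and thresholds are invented for illustration; the authors' controller is not described in this summary): sample an overhead measure and a productivity measure each interval, and tighten the server's admission limit when overhead rises without a matching gain in productivity.

    def reflect(samples, limit, min_limit=8, max_limit=512):
        """samples: (overhead, productivity) pairs per interval, e.g.
        (TLB misses/sec, bytes sent to the NIC/sec). Returns the new limit
        on admitted requests."""
        prev = None
        for overhead, productivity in samples:
            if prev is not None:
                overhead_up = overhead > 1.2 * prev[0]
                productivity_up = productivity > 1.05 * prev[1]
                if overhead_up and not productivity_up:
                    limit = max(min_limit, limit // 2)   # throttle admissions
                elif productivity_up:
                    limit = min(max_limit, limit + 4)    # cautiously admit more
            prev = (overhead, productivity)
        return limit

    # Overhead climbs while productivity stalls, so the admission limit shrinks.
    print(reflect([(1e6, 50e6), (2e6, 51e6), (4e6, 50e6)], limit=256))   # 64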
There was no Q/A.
* Cassyopia: Compiler Assisted System Optimization
Mohan Rajagopalan and Saumya K. Debray, University of Arizona; Matti A. Hiltunen and Richard D. Schlichting, AT&T Labs–Research
Cassyopia combines program analysis and OS
design to do performance optimizations. Whereas the compiler offers a local
perspective on optimizations, the OS provides a more general view. Cassyopia tries to bring about a symbiosis between the OS
and the compiler. An example of this symbiosis is system call optimization. Cassyopia
profiles system call sequences, then clusters them using compiler
techniques. The clustered system calls are called multi-calls.
Mohan described preliminary results that showed that
significant performance improvement can be obtained using multi-calls.
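A conceptual sketch of a multi-call (Cassyopia generates these with compiler support and executes the whole batch in one kernel crossing; the Python stand-in below only illustrates the interface, and the example file path is arbitrary):

    import os

    def multi_call(ops):
        """Execute a batch of (syscall name, args) pairs as one unit; in
        Cassyopia the whole batch would cross into the kernel once."""
        table = {"open": os.open, "read": os.read,
                 "write": os.write, "close": os.close}
        return [table[name](*args) for name, args in ops]

    # A profiled hot sequence such as open-read-close becomes one multi-call.
    fd = multi_call([("open", ("/etc/hosts", os.O_RDONLY))])[0]
    data, _ = multi_call([("read", (fd, 4096)), ("close", (fd,))])
    print(len(data), "bytes read via a multi-call")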
Discussion
Margo Seltzer asked how one would do error-checking with clustered system
calls. How would you know which system call generated the error? Mohan said that they have a means of
determining that.
Constantine Sapuntzakis asked for an example of a case in which multi-calls
would be useful. Mohan said that Web servers
employ a repetitive system call behavior. Another example would
be database systems.
* Cosy: Develop in User-Land, Run in Kernel-Mode
Amit Purohit, Charles P. Wright, Joseph Spadavecchia, and Erez Zadok, Stony Brook University
User applications are allowed only restricted access to kernel
resources, which causes many crossings of the user-kernel boundary. To
reduce these crossings, Amit Purohit proposed a compound system call
mechanism, named Cosy, with which a segment of user-level code can be
executed in the kernel. The authors have a modified version of gcc that
supports Cosy. Kernel safety is ensured by limiting kernel execution
time; x86 segmentation and sandboxing techniques can also be used. The
performance benefits of Cosy have been reported at 20–80%.
Discussion
Matt Welsh said that there is enough stuff in the OS already that adding more would
not gain much. No performance improvement is justified by the loss of
reliability that this technique would introduce. Amit said
that, at some level, the OS has to be trusted. If it isn't trustworthy,
there are lots
of other problems that need to be dealt with.
Andrew Hume said that people from Sun Microsystems have said that code
executes best in the kernel. Matt Welsh reiterated his statement that
reliability was still more important than performance.
Constantine Sapuntzakis asked where the performance improvement
came from. Amit said that the savings came from avoiding
context switches and from read and write system calls.
Constantine asked if the authors had compared their solution to
zero-copy solutions. Amit said that zero-copy solutions are applicable
only to certain applications; the Cosy solution is more general.
Ethan Miller asked what operations this is useful for, other
than zero-copy. Amit said that zero-copy is the primary use right now, since the
project is still in its initial stages. They plan to look at I/O next. Andrew Hume said that context switches were
important too. Amit said that using compiler techniques and C,
it is possible to exploit new avenues for performance improvement.
General Discussion
Mike Swift said that combining the last two papers might be one way to proceed:
take the compiler techniques from Cassyopia to
use for kernel execution. Mohan
clarified that they do not plan to move any user-level
code into the kernel, since he believes that interpretation in the
kernel has high overhead. They primarily wanted to reduce the boundary
crossing cost, so the two ideologies are not the same. Amit said that
interpretation in the kernel is a bottleneck, and so they use a small
interpreter. Checking pointers would cost a lot, and so they are using
TLBs. This has some overhead, but it's a one-time check. Mohan brought
up the point that the aim of Cassyopia is to apply optimizations that
are quite obvious, but have not yet been done.
Ethan Miller said that the TLB overhead in Cosy could be
unavoidably high. Amit said that is true, but the savings they are getting
more than make up for the TLB overhead.
Margo Seltzer drove the nail home by saying that what matters
finally is the kernel-user API. All this work on extensible OSes and
moving code across the boundary is probably done because the kernel-user
API needs to be revisited. So let's fix the API. Applause.
Andrew Hume said that for the applications he has looked at, apart
from zero-copy and I/O, there is not much to be gained by putting code
into the kernel. Apart from the stated scenarios, the chances of using
a multi-call are small. Sometimes reassurance outweighs performance
benefits. Mohan said that they are also looking at smaller devices, such
as cell phones. All devices are resource-constrained. In these cases,
there is also the issue of energy savings.
Eric Brewer asked why, for small devices, you would even want a kernel boundary.
Mohan answered that cell phones and iPAQs can now have JVMs running on
them. They are not targeting reprogrammable devices.
Matt Welsh asked why one would care so much about performance on an iPAQ.
Timothy Roscoe said that since there are different kinds of devices,
there should be different OSes for them. And then the question would
be where, if anywhere, to put the kernel boundary on them. Whether there is a generic
answer to that question is still unclear.
1:45 p.m.–3:30 p.m.: Storage 1 (Chair: Armando Fox; scribe: David Oppenheimer)
Why Can't I Find My Files? New Methods for Automating Attribute Assignment
Craig A. N. Soules and Gregory R. Ganger, Carnegie Mellon University
Craig Soules described new approaches to automating attribute
assignment to files, thereby enabling search and organization tools
that leverage attribute-based names. He advocates context analysis to
augment existing schemes based on user input or content analysis. Context
analysis uses information about system state when the user creates and
accesses files, using that state to assign attributes to the
files. This is useful because context may be related to the
content of the file and may be what a user remembers when searching
for a file. The Google search engine has proven the usefulness of
context analysis; it chooses attributes for a linked site by examining the
text associated with the link, and it considers user actions after a search
to decide the user's original intent in the search. However, the kind
of information Google's context analysis relies on cannot be applied
directly to filesystems. In particular, information such as links
between pages does not exist in traditional filesystems, and
individual filesystems do not have enough users or enough "hot
documents" to make Google-like context statistics useful.
Soules described access-based context analysis, which exploits
information about system state when a user accesses
a file, and
inter-file context analysis, which propagates attributes among related
files. The former relies on application assistance or existing user
input (e.g., filenames), and the latter relies on observing temporal
user access patterns and content similarities and differences between
potentially related files and versions of the same file. Based on a
trace analysis of usage of a single graduate student's home directory
tree over a one-month period, Soules concluded that a combination of
the techniques he proposes could be useful for automatically assigning
attributes. For example, a Web browser can relate search terms to the
document the user ultimately downloads as a result of the search,
files created and accessed in a single text-editor session can be
considered related, and attributes of documents used as input to a
distiller such as LaTeX or an image manipulation program can be
propagated as attributes of the output file. Soules also
found that examining temporal relationships between file accesses in
the trace successfully grouped many related files.
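A minimal sketch of the temporal-grouping idea (the window size and trace format are assumptions, not values from the paper): accesses that fall within a short window of the previous access are treated as part of the same task, and attributes can then be propagated within each group.

    def temporal_groups(accesses, window=30.0):
        """accesses: (timestamp in seconds, path) pairs sorted by time. Files
        accessed within `window` seconds of the previous access share a group."""
        groups, current, last_time = [], [], None
        for t, path in accesses:
            if last_time is not None and t - last_time > window:
                groups.append(current)
                current = []
            current.append(path)
            last_time = t
        if current:
            groups.append(current)
        return groups

    trace = [(0, "paper.tex"), (5, "figure1.eps"), (12, "paper.bib"),
             (300, "budget.xls")]
    print(temporal_groups(trace))
    # [['paper.tex', 'figure1.eps', 'paper.bib'], ['budget.xls']]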
Soules stated that he will be investigating larger user
studies, mechanisms for storing attribute mappings, appropriate user
interfaces, and how to identify and take advantage of user context
switches, i.e., a user's moving from using one program to another.
Discussion
During the question and answer period, Timothy Roscoe observed that tools such as the old AltaVista Personal Edition can
index filesystems reasonably effectively. Andrew Hume pointed out that
Soules' system does not address situations in which there are multiple
access hierarchies for the same underlying files: the original
hierarchy is effectively locked in, making it hard to handle multiple
filesystem views simultaneously. Soules responded that the focus of
this paper was how to get the keywords, and that he has not yet
considered how to present the keywords or present multiple
views. Hume also suggested looking into software that produces
high-quality indexes for printed books.
Secure Data Replication over Untrusted Hosts
B.C. Popescu, B. Crispo, and A.S. Tanenbaum, Vrije Universiteit, Amsterdam, The Netherlands
B. C. Popescu described a system architecture that allows arbitrary
queries on data content that is securely replicated on untrusted
hosts. This system represents an improvement over systems based on
state signing, which can support only semi-static data and predefined
queries, and systems based on state machine replication, which require
operations to be replicated across multiple machines. In the authors'
system, every data item is associated with a public/private key pair.
The private key is known only to the content owner, and the public key
is known by every client that uses the data. There are four types of
servers: master servers, which hold copies of content and are run by the
content owner; slave servers, which hold copies of data content but are
not controlled by a content owner and thus are not completely trusted;
clients, which perform read/write operations on content; and an
auditor server, described later. The master servers handle client
write requests and lazily propagate updates to slave servers. Master
servers also elect one of themselves to serve as an auditor, which
performs background checking of computations performed by slaves,
taking corrective action when a slave is found to be acting
maliciously. Slave servers handle client read requests; they may use
stale data to handle requests, but clients are guaranteed that once the
time parameter maxLatency has passed since a write was committed by a
master, no other client will accept a read that is not dependent on
the write. All content in the system is versioned; the content version
of a piece of data is initialized to zero when it is created and is
incremented each time the data item is updated.
The key challenge in building this system is to enable clients to feel
safe having their queries handled by untrusted slave hosts. This is
accomplished probabilistically, by allowing clients at any time to send the same
request to a (trusted) master and (untrusted) slave
and to compare the results. When a slave returns the result of a read,
it attaches a signed "pledge" packet containing a copy of the request,
the content version timestamped by the master, and the secure hash of
the result computed by the slave. If the slave returns an incorrect
answer, the "pledge" packet can be used as proof of the slave's
malfeasance. This probabilistic checking mechanism is augmented by an
auditing mechanism in which, after a client accepts a result from a
slave, it forwards the slave's "pledge" packet to a special auditor
server. The auditor server is a trusted server that does not have a
slave set but just checks the validity of "pledge" packets by
reexecuting the requests and verifying that the secure hash of the
result matches the secure hash in the packet. The auditor is expected
to lag behind when executing write requests, executing writes only
after having audited all the read requests for the content version
preceding the write.
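A hedged sketch of the pledge check (key handling, the wire format, and the use of an HMAC in place of the slave's public-key signature are all simplifications): the slave binds the request, the master-stamped content version, and a hash of its answer into a signed packet, and the auditor later re-executes the request at that version and compares hashes.

    import hashlib, hmac

    SLAVE_KEY = b"slave-secret"   # stand-in for the slave's signing key

    def make_pledge(request, version, result):
        digest = hashlib.sha256(result).hexdigest()
        body = "%s|%d|%s" % (request, version, digest)
        sig = hmac.new(SLAVE_KEY, body.encode(), hashlib.sha256).hexdigest()
        return {"request": request, "version": version,
                "result_hash": digest, "sig": sig}

    def audit(pledge, reexecute):
        """Verify the signature, re-run the request at the pledged content
        version, and flag the slave if the result hashes disagree."""
        body = "%s|%d|%s" % (pledge["request"], pledge["version"],
                             pledge["result_hash"])
        expected = hmac.new(SLAVE_KEY, body.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, pledge["sig"]):
            return "bad signature"
        good = hashlib.sha256(reexecute(pledge["request"],
                                        pledge["version"])).hexdigest()
        return "ok" if good == pledge["result_hash"] else "slave cheated"

    versions = {7: b"hello world"}                 # trusted copy, by version
    pledge = make_pledge("read item", 7, versions[7])
    print(audit(pledge, lambda req, ver: versions[ver]))   # ok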
Discussion
During the question and answer period, Peter Druschel asked what
happens to a slave that has been detected to have cheated. Popescu
replied that the action is application-specific; once the slave has
been taken out of the system, out-of-band means such as a lawsuit can
be employed. Rob von Behren asked why you wouldn't need as many
auditors as masters. Popescu responded that auditors should be more
efficient than masters, because they don't have to communicate back to
clients and they can take advantage of techniques such as query
optimization, batching, and result reuse. Rob von Behren suggested
that the system might be designed to let auditors be untrusted, with
numbers of them used in a Byzantine-agreement-type
configuration. Popescu agreed that this would be a stronger
guarantee, but he pointed out that the only weakness in the system as
it stands is that an auditor could collude with a slave.
Palimpsest: Soft-Capacity Storage for Planetary-Scale Services
Timothy Roscoe, Intel Research at Berkeley; Steven Hand, University of Cambridge Computer Laboratory
Timothy Roscoe described Palimpsest, a "soft-capacity storage service
for planetary-scale applications." Palimpsest is designed to serve as
a storage service for ephemeral data from planetary-scale applications
running on a shared hosting platform such as PlanetLab, XenoServers, or
Denali. Examples of the type of data to be stored are static code and
data for services, application logs of various sorts, and ephemeral
system state such as checkpoints. Despite the temporary nature of the
data, it must be highly available during its desired lifetime, thus
making single-node local disk storage unsuitable. Traditional
filesystems such as NFS and CIFS provide facilities unnecessary for
planetary-scale applications and don't meet service provider
requirements such as space management, billing, and security
mechanisms that allow users to store their data on a shared
infrastructure without having to trust the infrastructure
provider. Palimpsest aims to provide high data availability for
limited periods of time, data protection and resistance to Denial of
Service attacks, flexible cost/reliability/performance tradeoffs,
charging mechanisms that make sense for service providers, capacity
planning, and simplicity of operation and billing. To achieve these
goals it uses soft capacity, congestion-based pricing, and automatic
space reclamation.
To write a file, a Palimpsest client erasure-codes the file, encrypts
each resulting fragment, and sends each encrypted fragment to a
fixed-length FIFO queue at the Distributed Hash Table node
corresponding to the hash of the concatenation of the filename and the
fragment identifier. To retrieve a file, a client generates a
sufficient number of fragment IDs, requests them from the block
stores, waits until a sufficient number of the requested fragments have been
returned, decrypts and verifies them, and recreates the original
file. Because the queues are fixed-length, all files stored in
Palimpsest are guaranteed to disappear eventually. To keep a file alive
the client periodically refreshes it, and to delete the file the
client simply abandons it.
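A small sketch of the naming rule described above (the DHT membership, fragment count, and the stubbed-out erasure coding and encryption are all assumptions): each fragment of a file is routed to the block store responsible for the hash of the filename concatenated with the fragment identifier.

    import hashlib

    NODES = ["node-%d" % i for i in range(16)]     # stand-in DHT membership

    def responsible_node(fragment_id):
        """A real DHT would use consistent hashing; modulo suffices here."""
        h = int(hashlib.sha256(fragment_id.encode()).hexdigest(), 16)
        return NODES[h % len(NODES)]

    def write_plan(filename, n_fragments=8):
        """Where each (erasure-coded, encrypted) fragment would be sent. Each
        target keeps fragments in a fixed-length FIFO, so data that is not
        refreshed eventually falls off the end and disappears."""
        return {"%s:%d" % (filename, i): responsible_node("%s:%d" % (filename, i))
                for i in range(n_fragments)}

    print(write_plan("checkpoint-42", n_fragments=4))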
The key to predictable file lifetimes is
the time constant (T) associated with each block store. T measures
how long it takes a fragment to reach the end of the queue and
disappear. Clients piggyback requests for information about T on read
and write requests, and block stores provide a recent estimate of T by
piggybacking on responses. Palimpsest providers charge per
(fixed-length) write transaction. Clients can use information about
each block store's T value and fee per write transaction to flexibly
trade off data longevity, availability, retrieval latency, and
robustness. Clients pay providers using anonymous cash
transactions. DoS attacks are discouraged by charging
for writes. Providers can perform traffic engineering by advertising
values of T that deviate from the true value. Congestion pricing is
used to encourage users to attain an efficient write rate.
Discussion
During the question and answer period, David Wetherall asked about
the possibility of providing a callback to clients when a block they
have stored is about to disappear. Roscoe responded that this would add the
complexity of having to know about the blocks that are being
stored. Sameer Ajmani observed that a load spike might change T
significantly. Roscoe agreed; it is up to clients to factor in their
belief as to how much the time constant will change when they store
their data. Ethan Miller observed that for enough money, a malicious
hacker could make people's data disappear. Roscoe replied that one way
around that situation would be to dampen the rate at which the time constant
could change, by limiting the write rate. Benjamin Ling asked what would
happen if a provider lied, advertising a larger T than was true.
Roscoe responded that third-party auditors can monitor the time
constant and provide a Consumer Reports-like service to evaluate
providers' claims. Finally, Jay Lepreau asked whether nonmalicious
traffic might cause temporary load spikes, for example, an intrusion
detection service using Palimpsest at the time a distributed worm such
as SQL Slammer was unleashed. Roscoe replied that this would be a
problem, but that hopefully the Palimpsest pricing scheme could help
here.
General Discussion
During the general question and answer period following the session's
talks, Andrew Hume asked Timothy Roscoe whether a simpler scheme than
Palimpsest could be used to store ephemeral files just like regular
files and cycle them using a generational scheme, eventually deleting
the files that graduated from the oldest generation. Roscoe responded
that Palimpsest provides an easy charging and pricing mechanism, while
standard network filesystems such as NFS do not. Val Henson asked Popescu
what he thought of the "High Availability, Scalable Storage, Dynamic
Peer Networks: Pick Two" talk, in light of the fact that his system is
targeted toward storing data on untrusted hosts. Popescu responded
that they're more interested in environments in which hosts are less
transient than in standard peer-to-peer networks. Jeff Chase observed
that Palimpsest clients have no control over the placement of their
data, and he asked Roscoe whether he thought that was
significant. Roscoe responded that the ability to select specific nodes on which
to store data could be provided by an orthogonal mechanism. Benjamin
Ling asked whether widespread adoption of Palimpsest would be hindered
by the lack of hard guarantees about data longevity. Roscoe responded
that legal contracts akin to service level agreements could be layered
on top of Palimpsest to ease users' concerns about the inherent risk
of data loss. Furthermore, this is really a futures market:
third parties can charge premiums for providing guarantees and taking
on the risk of data loss themselves.
4:00 p.m.–5:45 p.m.: Trusting Hardware (Chair: Dan Wallach; scribe: Matt Welsh)
This session turned out to be the most controversial of the conference,
as two of the three talks discussed the use of secure hardware and
systems such as Microsoft's Palladium architecture.
Certifying Program Execution with Secure Processors
Benjie Chen and Robert Morris, MIT Laboratory for Computer Science
Benjie Chen is motivated by potential uses for trusted computing
hardware other than digital rights management. Since all PCs may
include this hardware in the future, he is interested in exploring the
hardware and software design for such systems. His running example was
secure remote login, such as from a public terminal at an Internet cafe,
where the client machine can attest to the server that it is running
unadulterated software (OS, ssh client, etc.). Of course, this does not
preclude low-tech attacks such as a keyboard dongle that captures
keystrokes, but that is outside the immediate problem domain.
Benjie presented an overview of the Microsoft Palladium, or Next
Generation Secure Computing Base, architecture, which uses a secure
"Nexus" kernel and a secure chip that maintains the fingerprints of the
BIOS, bootloader, and Nexus kernel. A remote login application would
send an attestation certificate, generated by Nexus and the secure chip,
to the server. The issues here are how to keep the Nexus kernel small
and how to verify OS services such as memory paging or the network
stack. Some ways to improve Palladium's security and verifiability were
discussed, such as using a small microkernel that allows attestation of
all OS modules above it, as well as a flexible security boundary (where
some, not all, of the OS modules are verified). There is a connection
with the XOM secure processor work, which prevents physical attacks on
DRAM by storing data in encrypted form in memory and only decrypting it
into a physically secure cache. Borrowing some of these ideas, one could
run the microkernel within the secure processor, which would authenticate all
data transfers to DRAM, and the application, hardware drivers, network
stack, etc., could all be encrypted in DRAM.
Discussion
Constantine Sapuntzakis asked whether Nexus was not a microkernel.
Benjie answered that Nexus implements many drivers inside, which raises
the problem of how to verify them all and still keep the security boundary
small. Constantine noted that one difficulty with XOM is that replay
attacks are possible, and Benjie agreed.
Mendel Rosenblum said that secure remote login is not a good example
application, since it doesn't solve the overall security problem (for
example, a camera looking over the user's shoulder can capture
keystrokes). Benjie agreed that not everyone believes in this
application, but there are other applications that drive their design.
Jay Lepreau argued that the real application for this work is
DRM. Benjie said they will continue to support DRM, but they are
interested in exploring other applications as well.
Constantine pointed out that XBOX is supposedly secure hardware, but a
buffer overrun exploit was recently discovered. Benjie replied that
they haven't tried to address this problem; using attestation just
tells you that the code you thought was running is indeed running. It also
turns out that attestation can't deal with the code being modified at
runtime.
Hardware Works, Software Doesn't: Enforcing Modularity with Mondriaan Memory Protection
Emmett Witchel and Krste Asanovic, MIT Laboratory for Computer Science
Emmett proposed efficient, word-level memory protection
to replace page- or segment-based protection mechanisms.
This is motivated by the use of fine-grained modules in software, with
narrow interfaces both in terms of APIs and memory sharing. He argued
that safe languages are not the answer, in part because it is difficult
to verify the compiler and runtime system. Rather, allowing hardware
protection to operate at the word level is much simpler and permits a
wide range of sharing scenarios. For example, when loading a new device
driver into the kernel, the MMP hardware would be configured to permit
the module to access its own code and data, as well as to make calls to
other modules and share very fine-grained memory (e.g., part of a larger
data structure in the kernel). Some of the challenges involve
cross-domain calls through call gates; dealing with the stack (since no
single protection domain "owns" the stack); and automatically
determining module boundaries through information already present in
code, such as symbol import/export information in kernel modules.
Other potential uses include elimination of memory copies on system
calls, specialized kernel entry points, and optimistic compiler optimizations
(e.g., write-protecting an object and running cleanup code if a write fault
occurs).
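A toy model of the permission-table idea (the addresses, domain layout, and Python representation are purely illustrative; MMP keeps these tables in hardware-walked structures): each protection domain maps word-aligned address ranges to permissions, and every load, store, or call is checked against the current domain's entries.

    NONE, READ, WRITE, EXECUTE = 0, 1, 2, 4

    class ProtectionDomain:
        def __init__(self, name):
            self.name = name
            self.ranges = []             # (start, end, perms) at word granularity

        def grant(self, start, end, perms):
            self.ranges.append((start, end, perms))

        def check(self, addr, needed):
            return any(start <= addr < end and (perms & needed) == needed
                       for start, end, perms in self.ranges)

    # A loaded driver gets its own code and data plus one shared word of a
    # larger kernel structure, rather than the whole kernel address space.
    driver = ProtectionDomain("example-nic-driver")
    driver.grant(0x1000, 0x2000, READ | EXECUTE)   # its own code
    driver.grant(0x8000, 0x8400, READ | WRITE)     # its own data
    driver.grant(0x9ff8, 0xa000, READ | WRITE)     # one shared kernel word

    print(driver.check(0x9ff8, WRITE))   # True: fine-grained sharing is allowed
    print(driver.check(0xa000, WRITE))   # False: the rest of the structure is not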
Discussion
Mike Swift asked for clarification of the situation where a cross-module call
might have multiple targets, as with dynamic dispatch. Emmett replied
that the call gate is on the first instruction of the target routine,
not the call site itself.
George Candea asked whether the difference between this technique and
software fault isolation is simply performance. Emmett answered that SFI
is great if you have a very stylized problem, but it has high overhead and
has issues with protecting multiple discontiguous regions
of data. Also, SFI is just not fast: for Emmett's tests, Purify yields
a slowdown of 10–50x, but with MMP it's less than 10%.
Timothy Roscoe asked about the relationship to the SOSP '99 paper on the
EROS system, which had a fast capability system running on normal PCs
with revocation. Emmett replied that EROS had a nice design, and that
the big difference is the level of indirection and how to make it really
efficient.
Eric Brewer asked how this approach relates to Multics. Emmett responded
that essentially this is a way to do Multics for multi-GHz processors in
a modern context. Mendel Rosenblum noted that the people who designed x86
segmentation paid attention to the Multics design, but what they built
was the unusable-to-most-software x86 segmentation hardware.
Furthermore, he believed that if they looked at Mondriaan, they would
pervert it into something bad. Emmett didn't find this a substantive
criticism, believing that Intel engineers are more careful these days.
Jay Lepreau argued that MMP does not solve the larger protection domain
issue. For example, a thread may call into another module which ends up
spinning in an infinite loop or deadlocking. Emmett responded that this
is not a problem that he was attempting to address; he was focusing
on the memory protection issue alone.
Flexible OS Support and Applications for Trusted Computing
Tal Garfinkel, Mendel Rosenblum, and Dan Boneh, Stanford University
Tal's talk returned to the question of using secure hardware for
applications other than DRM, and shared much of the motivation and
background of Benjie's talk. The core problem with open platforms (as
opposed to closed platforms, such as ATMs and cell phones) is that
applications can be deployed across a wide range of existing hardware
but it is difficult to manage trust. Tal proposed the use of
virtual machine monitors to provide a trusted
OS environment. The VMM can run either an "open box" VM
(such as a standard OS) or a "closed box" VM (a trusted system).
For example, one closed box VM might be a virtual Playstation game
console, which uses attestation to prevent cheating in a multiplayer game.
Another potential application is a customized distributed firewall
pushed into the VMM to protect the
network from the host, e.g., by preventing port scanning or IP
spoofing or by enforcing connection rate limits. Tal also discussed
applications to reputation systems and third-party computing (à la SETI@Home).
He concluded the talk with a review of current efforts in this area,
including TCPA, Palladium, and LaGrande.
Discussion
Mike Swift admitted that he is not a convert to the VMM concept, claiming
that this approach requires that the OS has no bugs.
Tal's response was that attestation tells you what the software stack is,
allowing you to have more confidence in the system, but Mike countered
that this does not make it inherently more robust. Mendel raised
the point that the idea behind VMMs is not to have one secure OS
(for example, so you don't need to use Nexus necessarily), but to allow
different people to use different secure OSes. Jay Lepreau didn't buy the
argument that VMMs are simpler, and wondered how much code there is in
VMware. Mendel explained that for architectures that are virtualizable
(which, unfortunately, does not include the x86), you can get away with
about 10,000 lines of code, not including device drivers. Mendel would
expect future architectures to be more virtualization-friendly.
General Discussion
The political issues surrounding trusted computing platforms raised a
number of interesting (and heated) questions from the audience.
Dan Wallach started off by describing the recent XBOX hack, where that
system was supposedly trusted hardware. Tal and Andrew Hume countered that
there are no guarantees of correctness, just tradeoffs in terms of risk
assessment. Mendel suggested that cheating at Quake was not a big
concern for industry, so this was not of paramount concern in the XBOX.
Timothy Roscoe was disturbed that the three speakers seemed to be too
much in agreement, and raised the question of big protection domains
(e.g., VMs) versus tiny protection domains (e.g., Mondriaan memory
protection). They didn't take the bait, though; Tal said that these
approaches were not mutually exclusive.
Jay Lepreau jumped in at this point and raised the concern that
nobody has yet demonstrated an entire (real) operating system based on
the microkernel model. He again argued that MMP does not solve the whole
domain protection issue, since control flow protection is just as important
as memory protection. Emmett admitted that there is some complexity
involved, but by making memory protection domains both finer-grained and
more explicit, programmers have to think about them more carefully and
will document them in the code. Jay argued that chopping up a monolithic
system in a "half-assed way" makes it more complex, but Emmett countered
that most systems software is already making use of well-defined
modules, simply without strong protection between them.
Val Henson returned to the trusted computing discussion, arguing that
these platforms are more useful for restricting rights (as with DRM)
than for giving us more rights. Benjie argued that the bleak view is
that this hardware is going to get pushed out regardless, so we should
be considering how to use it for something other than DRM. Tal concurred
and said that this work was also about intellectual honesty.
Jay Lepreau argued that Tal and Benjie were really just giving cover to
the real commercial driver for this technology, which is DRM.
Mike Swift wanted to know where the line between research and the
corporate world should be drawn, and whether our community should
support this work at all. Tal mentioned again that their work is just
bringing more information to the table, but Mike drew an analogy with
nuclear weapons research, claiming that expanding knowledge is not the
only side effect of this work.
Dirk Grunwald wondered whether calling this "trusted computing" was like
the rebranding of "out of order execution" to "dynamic execution": trusted computing cannot deal with hardware attacks, so consumers may be
misled into believing it's safer. Tal pointed out that this is no
different from calling a protocol "secure."
Jeff Mogul made the point that technology tends to reinforce existing
power structures, and that the issue with trusted computing is not
security but, rather, whether the technology reinforces existing power
relationships or diffuses them. He summarized, "Are you
advocating a system that helps the powerful or that helps the weak?"
Eric Brewer claimed that he didn't want "trusted" software on his PC,
since it only enforces a particular kind of trust relationship between
two parties. He would rather have control over his trust relationships,
but trusted computing platforms remove that control. Tal admitted that
there is a real concern about using Palladium as an industry control point.
Timothy Roscoe wrapped up the session by pointing out that this discussion
had been rather "sterile" and had not touched on any of the issues of
politics, lawmaking, or market forces surrounding the DRM and trusted
technology debate.
Wednesday, May 21, 2003
9:00 a.m.–10:25 a.m.: Pervasive Computing (Chair: Geoff Voelker; scribe: Ranjita Bhagwan)
Sensing User Intention and Context for Energy Management
Angela B. Dalton and Carla S. Ellis, Duke University
Angela Dalton talked about employing low-power
sensors to monitor user behavior and then using this information to
reduce system energy consumption. She described a case study called
faceoff, which uses a camera to capture images; face detection is then
run on each image, and the result is fed into a control loop that
performs system-level energy management by turning off the display.
The authors have built a prototype that uses image capture and skin
detection. A feasibility study to determine the best-case energy
savings faceoff can provide showed that significant savings are
possible. Responsiveness is also an important issue; the paper
discusses various ways of increasing responsiveness and describes
several optimizations with that objective.
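The sensing and timing details below are invented; the sketch only shows the shape of the control loop described above: sample the camera, look for a user, and power the display down after a period with no one detected.

    import time

    def energy_loop(capture_frame, user_present, set_display,
                    sample_period=1.0, idle_timeout=10.0, iterations=30):
        """Hypothetical faceoff-style loop: capture_frame grabs an image,
        user_present runs skin/face detection on it, and set_display powers
        the screen on or off."""
        last_seen = time.time()
        for _ in range(iterations):
            if user_present(capture_frame()):
                last_seen = time.time()
                set_display(True)              # user is there: keep the display on
            elif time.time() - last_seen > idle_timeout:
                set_display(False)             # nobody for a while: save energy
            time.sleep(sample_period)

    # Example wiring with stubs standing in for the camera and the detector.
    energy_loop(capture_frame=lambda: None,
                user_present=lambda frame: False,
                set_display=lambda on: print("display", "on" if on else "off"),
                sample_period=0.0, idle_timeout=0.0, iterations=3)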
Discussion
Geoff Voelker asked if the authors had considered use of the same
techniques for security, such as face recognition rather than face
detection. Angela agreed that the system offers other benefits,
such as privacy, but they have not yet explored them.
Mike Swift asked how robust skin detection is, and whether the detector
can be confused. Angela said that the system should detect different skin
colors, but it is not a good way of doing face detection. Ideally, you
would use something more than skin detection for this purpose. However,
it sufficed for the case study.
Access Control to Information in Pervasive Computing Environments
Urs Hengartner and Peter Steenkiste, Carnegie Mellon University
Locating people in a pervasive computing environment can be done at
many levels and many granularities. An access control
mechanism is required. However, access control mechanisms for
conventional information cannot be used as is for such environments.
Urs Hengartner described a people locator service that uses
digital certificates to express location policies. The authors follow
three design principles: identify information early, design policies
at the information level, and exploit information relationships. Their
approach is to use a step-by-step model, where validation of a client
node is done at every server node. They also use a policy description
language and an information description language in their people
locator service.
Discussion
Sameer Ajmani brought up the issue of certificate systems being an
administrative nightmare. He said that this is going to create lots
of privacy issues, and that lots of trust issues are assumed. Most
individuals are not going to even look at the policies. In response,
Urs said that policy generation is done behind the scenes. The front end
can be designed depending on the user. It could be a Web browser for an
individual, while it could be something else for service developers.
Andrew Hume said that these systems break the notion of
practical obscurity. Urs said that you could filter data so
that a single node could not do serious damage.
Margo Seltzer asked if the speaker could compare this access control
mechanism to that of the mobile people architecture at Stanford. Urs
said he was not aware of the Stanford system.
Tal Garfinkel said that the information in the system would be regulated
by many different entities. The speaker responded that there are not going
to be that many entities. For example, a university runs the service,
and you have to trust them anyway. Tal asked if there is any way to
shield information in this scenario. The speaker said this could be
done only in a legal way.
* Privacy-Aware Location Sensor Networks
Marco Gruteser, Graham Schelle, Ashish Jain, Rick Han, and Dirk Grunwald, University of Colorado at Boulder
Marco Gruteser gave a brief description of a system that
uses sensor networks to gather information, while making sure that
some degree of anonymity is maintained. Describing the problem, Marco
said that sensor networks could be used to identify the precise location
of certain entities, which could be undesirable. With an anonymous
data collection scheme, you do not need to negotiate a privacy policy
and the risk of accidental data disclosures is reduced, since databases
no longer store the information. Marco described
k-anonymity in database systems and said that the concept could be applied
to location information.
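A small sketch of the k-anonymity rule for location (the grid, cell sizes, and crowd are made up): a sensed position is only released at a granularity coarse enough that at least k people share the reported region.

    def k_anonymous_region(person_xy, all_positions, k=10, cell=10):
        """Coarsen the reporting grid (cell size doubles each step) until the
        cell containing person_xy covers at least k of all_positions."""
        while True:
            region = (person_xy[0] // cell, person_xy[1] // cell)
            occupants = sum(1 for x, y in all_positions
                            if (x // cell, y // cell) == region)
            if occupants >= k:
                return region, cell        # safe to report at this granularity
            cell *= 2                      # otherwise report something coarser

    crowd = [(i * 7 % 100, i * 13 % 100) for i in range(40)]   # synthetic people
    print(k_anonymous_region((23, 42), crowd, k=10))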
Discussion
Ethan Miller asked how this interacts with finer-grained data
than location information. Would k-anonymity
still be maintained? Marco said that the current work focuses
on location information at this granularity. If you look at different
pieces of information, the problem becomes more complex.
Margo Seltzer asked whether k-anonymity can be preserved when you
aggregate location information obtained from different sensor
networks. Marco said that with multiple sensor networks, maintaining
k-anonymity could be a problem.
Andrew Hume said that k-anonymity is not sufficient
to preserve anonymity. He suggested that Marco consult
statisticians about a better means for doing this. Andrew also asked, for
reasonable values of k, what applications would this have? Marco
responded that for k=10, accuracy is on the order of 100 meters. For many
location-based services, that would be sufficient, for instance,
detecting whether a meeting room is busy.
Ethan Miller asked how one decides a good value of k. Marco said that
it depends on user preferences and on circumstances.
Sameer Ajmani brought up a problem: where one trusts sensor
networks, for example, in an office, one usually does not care about anonymity. On the other hand, outside the office, one would not
trust a sensor network. Marco suggested a certification
process, in which you trust the company building the sensor network, or
you trust the government using the sensor network.
Essentially, you need to trust the people who deploy the sensor
network.
Bogdan Popescu asked whether the sensors need to talk to each other.
Marco said yes, they assume a standard multi-hop wireless
sensor network, such as Berkeley motes.
General Discussion
Urs asked Angela whether, when there are multiple windows open and only
one is active, she had thought about turning off only parts of the screen.
Angela said that there is related work that deals with energy adaptive
displays. Presently this cannot be done. But you could use some form
of gaze tracking to do this. Andrew Hume asked whether the partial
screen turn-off works at the hardware level. Angela responded that
currently there is no hardware that does that. New kinds of displays
are assumed. But you would still need to control it through the system.
Andrew Hume commented that this could be used for covert operations. For example, if a laptop screen shuts down, it means that no one is close to
it. Is there a security aspect to that, because you can detect location
to some extent? Angela said that larger networks could do this, and, yes,
there are lots of security implications.
Val Henson asked what the trend is for built-in sensors. Angela replied
that most devices can now have built-in cameras. In general, these
sensors are low-power, are cheap, and are becoming pervasive, especially
the cameras.
Andrew Hume asked what kind of resolution of the camera is needed to
make this work well. Angela said that detection currently
is skin-color-based, and you could do this even with a low-resolution
black-and-white camera.
Geoff Voelker asked Urs if he had thought of foreign users coming into
an administrative domain. Urs responded that since they use SPKI/SDSI
digital certificates, they do not require a PKI and
could give access to anyone.
Val Henson asked Marco how you decide the shape of the location. Marco said that the current assumption is that the sensors are
preconfigured in a room with a preset room ID; the same applies to
a building, and so on.
Jay Lepreau asked Angela whether she had data on how people use laptops,
so that she could evaluate how well her scheme would work with
user behavior patterns. Angela said that more and more people are
using laptops as their primary system. Moreover, energy awareness is
important generically. However, this question is valid for cell phones,
PDAs, etc., which could have widely varying usage patterns. Jay said
that the power management on his laptop is frustrating, because it
is stupid. It would be good to have a diagnostic tool, with the user
being able to guide the system to some extent. Has Angela considered
providing the user with a diagnostic tool? Angela said that you could imagine
a user interface, which could be used to measure the annoyance factor
of the power management, and she agreed that things kicking
in at the wrong time can cause problems.
10:30 a.m.–11:15 a.m.: Storage 2 (Chair: Margo Seltzer; scribe: David Oppenheimer)
* FAB: Enterprise Storage Systems on a Shoestring
Svend Frølund, Arif Merchant, Yasushi Saito, Susan Spence, and Alistair Veitch, Hewlett Packard Laboratories
Alistair Veitch of HP Labs presented "FAB: Federated Array of Bricks." He
described a project that is aimed at making a set of low-cost storage
bricks behave, in terms of reliability and performance, like an
enterprise disk array, but at lower cost and with greater scalability.
The FAB array is built from bricks each of which consists of a CPU,
memory, NVRAM, a RAID-5 array of 12 disks, and network cards. Clients
connect to the array using a standard protocol such as iSCSI, Fibre
Channel, or SCSI, and the storage bricks communicate among
themselves using a special FAB protocol running on Gigabit
Ethernet. The requirements of the array are zero data loss, continuous
availability, competitive performance, scalable performance and
capacity, management as a single entity, online upgrade and
replacement of all components, and higher-level features such as
efficient snapshots and cloning. The principal research challenges are
failure tolerance without losing data or delaying clients,
asynchronous coordination that does not rely on timely responses from
disks or the operating system, and the ability to maximize performance
and availability in the face of heterogeneous hardware. The techniques
FAB incorporates to achieve these goals are a quorum-based replication
scheme, dynamic load balancing, and online reconfiguration.
After outlining FAB's goals and the high-level techniques used to
achieve those goals, Veitch described the quorum-based replication scheme
used to achieve reliability. It uses at least three replicas for each
piece of data, and reads and writes a majority of replicas. It
survives arbitrary sequences of failures, achieves fast recovery, and
can be lazy about failure detection and recovery, as opposed to needing
to do explicit failover. The array configuration that the designers
envision achieves a mean time to data loss of about a million years,
which Veitch described as "at the bottom end of what's acceptable."
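A simplified sketch of majority-quorum replication (the timestamping, logging, and recovery machinery of the real FAB protocol are omitted): a write succeeds only when a majority of a block's replicas accept it, and a read consults a majority and takes the newest version, so every read quorum overlaps every successful write quorum.

    def write_block(replicas, block_id, value, version):
        """Write to every reachable replica; succeed only with a majority."""
        acks = 0
        for r in replicas:
            if r.get("up", True):
                r.setdefault("store", {})[block_id] = (version, value)
                acks += 1
        return acks > len(replicas) // 2

    def read_block(replicas, block_id):
        """Read from a majority and return the highest-versioned value."""
        answers = [r["store"][block_id] for r in replicas
                   if r.get("up", True) and block_id in r.get("store", {})]
        if len(answers) <= len(replicas) // 2:
            raise RuntimeError("no read quorum")
        return max(answers)[1]

    bricks = [{}, {}, {}]
    assert write_block(bricks, "blk7", b"new data", version=2)
    bricks[2]["up"] = False                  # one brick fails afterwards
    print(read_block(bricks, "blk7"))        # the majority still returns the data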
Discussion
George Candea asked whether Veitch could estimate the cost savings of FAB
over traditional enterprise disk arrays. Veitch estimated a 20–30%
cost savings but emphasized that it depends on the
development time and cost. Dan Wallach asked whether today's enterprise
disk arrays are truly expensive to develop and produce, or whether the
prices of arrays are really just a huge markup over the prices of
disks. Veitch replied that disks really are expensive, especially the
kinds used in enterprise arrays, such as dual-ported Fibre Channel
disks. He added that customized hardware ASICs and integration with
firmware contribute to the six-year development cycle between
product inception and delivery, requiring work by hundreds of
people. FAB is trying to save money by doing everything possible in
software.
* The Case for a Session State Storage Layer
Benjamin C. Ling and Armando Fox, Stanford University
Benjamin Ling presented "SSM, a
recovery-friendly, self-managing session state store." SSM is specialized for storing session state typically
associated with user interactions with e-commerce systems. The system
assumes a single user making serial access to semi-persistent data. Ling
explained that existing solutions such as filesystems, databases, and
in-memory replication exhibit poor failure behavior or recovery
performance, or both, and that they are difficult to administer and
tune. SSM is designed to be recovery-friendly, that is, it can recover
instantly without data loss, and it is self-managing in terms of handling
overload and performance heterogeneity of components.
SSM is based on a redundant, in-memory hash table distributed across
nodes, called bricks, and stateless client stubs, linked with
application servers, that communicate with the bricks. Bricks consist
of RAM, CPU, and a network interface (no disk). SSM uses a redundancy
scheme similar to, but slightly different from, quorums. On writes, a
stub writes to some N random nodes (4, in Ling's example) and waits for
the first M of the writes to complete (2, in Ling's example); this
scheme is used to avoid performance coupling. The remaining writes are
ignored. Reads are issued to a single brick. SSM is recovery-friendly in that no data is lost so long as no more than M − 1 bricks in
a write quorum fail. Moreover, state is available for reading and
writing during a brick failure. SSM is crash-only, using no
special-case recovery code: when a brick is added to the system or
returns to service after a failure, it is simply started up without
any need to run a recovery procedure. SSM is self-managing in that
it dynamically discovers the performance capabilities of each brick,
as follows. Each stub maintains a count of the maximum allowable
inflight requests to each brick, using additive increase to grow this
window upon each successful request and multiplicative decrease to
shrink the window when a brick operation times out. If an insufficient
number of bricks are available for writing, either due to failures or
due to a full window, the stub refuses the request issued by the
client of the stub, thereby propagating backpressure from bricks to
clients.
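A hedged sketch of the write path and the self-tuning windows (N, M, and the additive-increase/multiplicative-decrease rule follow the description above; everything else, including counting acks from all N targets rather than truly waiting for the first M, is a stand-in):

    import random

    class Stub:
        def __init__(self, bricks, n=4, m=2):
            self.bricks, self.n, self.m = bricks, n, m
            self.window = {b: 8 for b in bricks}   # max in-flight per brick

        def write(self, key, value, send):
            """send(brick, key, value) -> True on ack, False on timeout."""
            eligible = [b for b in self.bricks if self.window[b] > 0]
            if len(eligible) < self.n:
                return False                   # backpressure: refuse the request
            acks = 0
            for b in random.sample(eligible, self.n):
                if send(b, key, value):
                    acks += 1
                    self.window[b] += 1                          # additive increase
                else:
                    self.window[b] = max(1, self.window[b] // 2) # multiplicative decrease
            return acks >= self.m              # M acknowledgments suffice

    stub = Stub(bricks=["b%d" % i for i in range(6)])
    ok = stub.write("session:42", b"cart-contents",
                    send=lambda brick, k, v: random.random() > 0.1)
    print(ok, stub.window)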
Discussion
Matt Welsh asked about the capacity of the system, and in particular
the expected lifetime of the data to be stored in the system,
especially in light of the absence of disks. Ling replied that session
state generally lives a few hours to a few days. It is usually about
3KB–200KB per user, and Ling estimated that 10–15 bricks would be able
to support 400,000 users. Andrew Hume asked what is done with the N − M
requests that are not among the first M writes to complete. Ling
explained that these requests are just forgotten, and that they are
handled using a garbage-collection-like mechanism.
* Towards a Semantic-Aware File Store
Zhichen Xu and Magnus Karlsson, HP Laboratories; Chunqiang Tang, University of Rochester; Christos Karamanolis, HP Laboratories
Zhichen Xu described pStore.
He explained that storage systems are in some ways an extension to
human memory, but that computer-based storage systems have been "dumb"
because, unlike humans, they do not associate meanings (semantics) with
the stored data. For example, when humans search, they
consider abstract properties and relationships among objects, and when
they store data, they may group objects into categories or record
only the differences between objects. Moreover, humans discover the
meanings of objects incrementally. These observations motivate the
incorporation of versions, deltas, and dependencies among objects
stored in the semantic-aware file store.
The semantic-aware file store is intended to create a framework
that captures and uses semantic information for fast searching and
retrieval of data; stores data efficiently by using data compression
based on semantic relationships of data; provides high performance
through semantic data hoarding, data placement, replication, and
caching; and enables highly available data sharing by balancing
consistency and availability according to the data semantics. The
semantic-aware file store uses a generic data model based on RDF to
capture semi-structured semantic metadata. The challenges Xu described
were identifying the common semantic relations of interest, finding
ways to capture semantic information, handling the dynamic evolution of
semantics, and defining the tools and APIs that users and applications
require.
Discussion
George Candea asked Xu to compare his system to existing products that
store and track metadata about files. Xu said the main difference is
that those systems store metadata in a database, which has a fixed
schema, whereas his system uses semi-structured data and therefore
allows evolution of semantics. Val Henson asked whether the
semantic-aware file store can do better than Citeseer: if she had a
number of research papers in her home directory, would the
semantic-aware file store be able to relate them to each other as well
as Citeseer does? Xu said the answer depends on the tools
used to extract the information from the papers; pStore is a framework
into which you could plug any number of data-extraction tools.
During the discussion that followed the talks in this session, Margo
Seltzer asked Veitch whether in ten years HotOS will have a paper
explaining why FABs are just as expensive as today's storage
arrays. Veitch answered, "I hope not, but who knows?" Andrew Hume
asked Veitch whether the FAB design makes it difficult to predict
performance, as compared to a disk connected to a single machine,
especially in the face of failures due to the large number of
potential interactions among bricks in the FAB. Veitch responded that
the performance can in fact be modeled as a function of factors such
as the overhead of doing a disk I/O, the number of nodes, the desired mean
time to data loss, and so on. He added that even today no storage
system gives you hard-and-fast performance guarantees and that the
more difficult question is how to model overall system reliability and
the optimal tradeoffs among design parameters. Rob von Behren asked
Veitch whether he had considered using non-RAID disks instead of RAID
arrays inside each brick. Veitch responded that by using RAID, the
mean time to data loss is controlled by the failure rate of each brick
rather than the failure rate of individual disks, thereby increasing
reliability. He added that it's very difficult to get good numbers
on actual failure rates of system components, so estimating overall
reliability is difficult.