MobiSys '05 Paper

Slingshot: Deploying Stateful Services in Wireless Hotspots


Ya-Yunn Su and Jason Flinn
Department of Electrical Engineering and Computer Science
University of Michigan

Abstract

Given a sufficiently good network connection, even a handheld computer can run extremely resource-intensive applications by executing the demanding portions on a remote server. At first glance, the increasingly ubiquitous deployment of wireless hotspots seems to offer the connectivity needed for remote execution. However, we show that the backhaul connection from the hotspot to the Internet can be a prohibitive bottleneck for interactive applications. To eliminate this bottleneck, we propose a new architecture, called Slingshot, that replicates remote application state on surrogate computers co-located with wireless access points. The first-class replica of each application executes on a remote server owned by the handheld user; this offers a safe haven for application state in the event of surrogate failure. Slingshot deploys second-class replicas on nearby surrogates to improve application response time. A proxy on the handheld broadcasts each application request to all replicas and returns the first response it receives. We have modified a speech recognizer and a remote desktop to use Slingshot. Our results show that these applications execute 2.6 times faster with Slingshot than with remote execution.

1 Introduction

Creating applications that execute on small, mobile computers is challenging. On one hand, the size and weight constraints of handheld and similar computers limit their processing power, battery capacity, and memory size. On the other hand, users' appetites are driven by the applications that run on desktops; these often require more resources than a handheld provides. A solution to this challenge is remote execution using wireless networks to access compute servers; this combines the mobility of handhelds and the processing power of desktops.

Although Internet connectivity is increasingly ubiquitous due to widespread deployment of wireless hotspots, the backhaul connections between hotspots and the Internet are communication bottlenecks. The uplink bandwidth from a wireless hotspot can be quite limited (e.g. 1.5 Mb/s for a T1 line). Further, this bandwidth must be shared by all hotspot users. The network round-trip time between a hotspot and a remote server may be large due to the use of firewalls and other middleboxes, as well as the vagaries of Internet routing. For interactive applications such as speech recognition and remote desktops, the combination of high latency and low bandwidth is prohibitive; mobile users cannot achieve acceptable response times when communicating with remote servers.

In this paper, we describe Slingshot, a new architecture for deploying mobile services at wireless hotspots. Slingshot replicates applications on surrogate computers [2] located at hotspots. A first-class replica of each application executes on a remote server owned by the mobile user. Slingshot instantiates second-class replicas on surrogates at or near the hotspot where the user is located. A proxy running on a handheld broadcasts each application request to all replicas; it returns the first response it receives to the application. Second-class replicas improve interactive response time since they are reachable through low-latency, high-bandwidth connections (e.g. 54 Mb/s for 802.11g). At the same time, the first-class replica is a trusted repository for application state that is not lost in the event of surrogate failure.

Slingshot also simplifies surrogate management. It uses virtual machine encapsulation to eliminate the need to install application-specific code on surrogates. Further, replication prevents the loss of application state when a surrogate crashes or even permanently fails. The performance impact of surrogate failure is mitigated by other replicas, which continue to service client requests.

The harnessing of surrogate computation is a multi-faceted problem with many challenges. This paper addresses several of these challenges, including improving interactive response time, hiding the perceived cost of migration, recovering from surrogate failure, and simplifying surrogate management. It also presents concrete results that measure the potential benefit of surrogate computation for stateless and stateful applications. Other challenges remain to be addressed. Slingshot does not yet address privacy concerns, provide protocols for secure replica management, manage surrogate load, or decide when to instantiate and destroy replicas.

We have implemented two Slingshot services: a speech recognizer and a remote desktop. Our results show that instantiating a second-class replica on a surrogate lets these applications run 2.6 times faster. Our results also show that replication lets Slingshot move services between surrogates with little user-perceived latency and recover gracefully from surrogate failure.

2 Design principles

We begin by discussing the three design principles we have followed in the design of Slingshot.

2.1 Location, location, location

Server location can be critical to the performance of remote execution. Consider a handheld connected to the Internet at a wireless hotspot. If the handheld executes code on a remote server, its network communication not only passes through the wireless medium; it also traverses the hotspot's backhaul connection and the wide-area Internet link. In a typical hotspot, the backhaul connection is the bottleneck. For instance, the nominal bandwidth of an 802.11g network (54 Mb/s) is more than an order of magnitude greater than that of a T1 connection. If the handheld could instead execute code on a server located at the hotspot, it could avoid the communication delay associated with the bottleneck link. For interactive applications that require sub-second response time, server location can make the difference between acceptable and unacceptable performance.

Network latency is also a concern. A server that is nearby in physical distance can often be quite distant in network topology due to the vagaries of Internet routing. Firewalls, VPNs, and NAT middleboxes add additional latency when connections cross administrative boundaries. For mobile users, a journey of only a few hundred yards can dramatically increase the round-trip time for communication with a remote server. In contrast, a server located at the current hotspot is only a network hop away.

2.2 Replicate rather than migrate

The desire to locate services near mobile users implies that services need to move over time. When a handheld user moves to a new location, a surrogate at the new hotspot will often offer better response time than a surrogate at the previous hotspot.

What is the best method to move functionality? One option is migration: suspend the application on the previous surrogate, transmit its state to the new surrogate, and resume it there. This approach has a substantial drawback: the application is unavailable while it is migrating. Slingshot uses an alternative strategy that instantiates multiple replicas of each service. While a new replica is being instantiated, existing replicas continue to serve the user.

Slingshot replication is a form of primary-backup fault tolerance; i.e. it tolerates the failure of any number of surrogates. For each application, Slingshot creates a first-class replica on a reliable server known to the mobile user - this server is referred to as the home server. Slingshot ensures that all application state can be reconstructed from information stored on the client and the home server. This allows all state on a surrogate to be regarded as soft state. Even if all surrogates crash, Slingshot continues to service requests using the first-class replica on the home server. In contrast, a naive approach that migrates applications between surrogates might lose state when a surrogate fails.

We note that Slingshot handles both stateful and stateless applications. The result of a remote operation for a stateful application depends upon the operations that have previously executed. Slingshot assumes that applications are deterministic; i.e. that given two replicas in the same initial state, an identical sequence of operations sent to each replica will produce identical results. As we discuss in Section 4.2, Slingshot adopts an approach similar to that of Rodrigues' BASE [24] in eliminating non-determinism with wrapper code. Slingshot instantiates a new replica by checkpointing the first-class replica, shipping its volatile state to a surrogate, and replaying any operations that occurred after the checkpoint. Instantiation of a new replica takes several minutes since the volatile state must travel through the bandwidth-constrained backhaul connection. However, existing replicas mitigate the perceived performance impact. Until the new replica is instantiated, existing replicas service application requests.

2.3 Ease of maintenance

We see the business case for deploying a surrogate as being similar to that of deploying a wireless access point. Desktop computers (without monitors) cost only a few hundred dollars today, not much more than an access point. Further, our results show that surrogates can provide significant value-add to wireless customers in terms of improved interactive performance.

However, surrogates must be easy to manage if they are to be widely deployed. Since we envision surrogates at hotspots in airport lounges, coffee shops, and bookstores, they must require little to no supervision. They should be appliances that require little configuration; most problems should be fixable with a reboot.

To make surrogates easy to manage, Slingshot:

  • minimizes the surrogate computing base. Each replica runs within its own virtual machine, which encapsulates all application-specific state such as a guest OS, shared libraries, executables, and data files. The surrogate computing base consists of only the host operating system (Linux), the virtual machine monitor (VMware), and Slingshot. No configuration or setup is needed to enable a surrogate to run new applications - each VM is self-contained.
  • uses a heavyweight virtual machine. While para-virtualization and other lightweight approaches to virtualization offer scalability and performance benefits [4, 23, 29], they also restrict the type of applications that can run within a VM. In contrast, by using a heavyweight VMM (VMware), Slingshot runs the two applications described in Section 4 without modifying source code, even though their guest OS (Windows XP) differs substantially from the surrogate host OS (Linux).
  • places no hard state on surrogates. Because surrogates have only soft state, a reboot does not lead to incorrect application behavior or data loss. If a surrogate crashes or is rebooted, the only impact a user sees is that performance declines to the level that would have been available had the surrogate never been present.

3 Slingshot implementation

3.1 Overview

Figure 1 shows an overview of Slingshot. For simplicity of exposition, this figure assumes that the mobile client is executing a single application and that a single surrogate is being used. In practice, we expect a Slingshot user to run only one or two applications concurrently, with each service replicated two or three times.


       Figure 1. Slingshot architecture

Each Slingshot application is partitioned into a local component that runs on the mobile client and a remote service that is replicated on the home server and surrogates. Ideally, we partition the application so that resource-intensive functionality executes as part of the remote service; the local component typically contains only the user interface. This partitioning enables demanding applications to be run on clients such as handhelds that are highly portable but also resource-impoverished. The applications that we have studied so far (speech recognition and remote desktops) already had a client-server partitioning that fit this model. For some applications, the best partitioning may not be immediately clear - in these cases, we could leverage prior work [2, 12, 18] to choose a partition that fits our model.

In Figure 1, a first-class replica executes on the home server and a second-class replica executes on the surrogate. The home server, described in Section 3.2, is a well-maintained server under the administrative control of the user, e.g. the user's desktop or a shared server maintained by the user's IT department. In contrast, surrogate computers, described in Section 3.3, are co-located with wireless access points. They are administered by third parties and are not assumed to be reliable.

Slingshot creates the first-class replica when the user first starts the application - this replica is required for execution of stateful services. As the application runs, Slingshot dynamically instantiates one or more second-class replicas on nearby surrogates. These replicas improve interactive performance because they are located closer to the user and respond faster than the first-class replica on the distant home server. If no second-class replicas are instantiated, Slingshot's behavior is identical to that of remote execution.

Each replica executes within its own virtual machine. Replica state consists of the persistent state, or disk image of the virtual machine, and the volatile state, which includes its memory image and registers. To handle persistent state, we use the Fauxide and Vulpes modules developed by Intel Research's Internet Suspend/Resume (ISR) project [20]. These modules intercept VMware disk I/O requests. On the home server, we redirect these requests to a service database that stores the disk blocks of every remote service. On a surrogate, VMware reads are first directed to a service cache - if the block is not found in the cache, it is fetched from the service database on the home server.

The client proxy is responsible for locating surrogates, instantiating second-class replicas, and managing communication with all replicas. It presents the local component with the illusion that it is using a single remote service by broadcasting each request to all replicas and forwarding the first reply it receives to the local component. Later replies from other replicas are checked for consistency, as described in Section 3.4.

If a mobile computer has a high-capacity storage device such as a flash card or a microdrive, Slingshot reduces the time to instantiate replicas by storing checkpoints on the mobile computer. As described in Section 3.6, the client logs all operations that occur after the checkpoint, and replays them to bring a new surrogate up-to-date.

3.2 The home server

A Slingshot user defines a single, well-known home server that stores and executes the first-class replicas for all of her remote services. Each service is uniquely identified by a serviceid string assigned by the user when the service is created. The service database on the home server manages the persistent and volatile state associated with each service. The director instantiates and terminates first-class replicas. We describe these components in the next two sections. The home server also runs the VMware virtual machine monitor. Each Slingshot service runs within a separate VM that is dynamically instantiated when a user starts its associated application on the client.

3.2.1 The service database

The home server stores the state of every service under its purview in its service database. Previous research in virtual machine migration by Sapuntzakis [25] and Tolia [26] has shown that content-addressable storage is highly effective in reducing the storage costs of virtual machine disk images. We adopt their approach by dividing the disk image of each virtual machine into 4 KB chunks and indexing each chunk by its SHA-1 hash value. As shown in Figure 2, each service has a chunk table that maps the chunks in its virtual disk image to the SHA-1 hash of the data stored at each location.


       Figure 2. Reading a chunk from the service database

The service database assumes that any two blocks that hash to the same value are identical. It maintains a hash table of the SHA-1 values of all chunks that it currently stores. When it receives a request to store a new chunk whose SHA-1 value matches that of a chunk it already has stored, it increments a reference count on the existing chunk. This method of eliminating duplicate storage has been shown to substantially reduce disk usage [10] due to similarities between the disk state of different computers. We expect such similarities to be common in our environment, since a single user may create many remote services from a generic OS image. For example, we created the speech recognizer and VNC services discussed in Section 4 from the same Windows XP image.

As shown in Figure 2, when a first-class replica reads a block from its virtual disk, the Fauxide/Vulpes ISR modules intercept the request and pass the associated logical block number to the service database. The database looks up the block number in the service's chunk table to determine the SHA-1 value of the chunk stored at that location. It then looks up the SHA-1 value in the hash table to find the location of the chunk in the database.

Requests that modify blocks follow a similar path. The database locates the chunk in the service's chunk table. It then indexes on the old SHA-1 value and decrements the reference count associated with the chunk in its hash table. If the reference count drops to zero, it deletes the chunk. The service database next looks up the new SHA-1 value of the modified block in its hash table. If the modified chunk is a duplicate of an existing chunk, the service database increments the reference count of the existing chunk. Otherwise, it stores the chunk and adds its SHA-1 value to its hash table.
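
The read and write paths above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class name `ServiceDatabase`, its method names, and the use of Python dictionaries for the chunk tables and hash table are hypothetical and are not taken from the Slingshot implementation.

```python
import hashlib

CHUNK_SIZE = 4096  # 4 KB chunks, as described in the text


class ServiceDatabase:
    """Toy in-memory model; class and method names are hypothetical."""

    def __init__(self):
        self.chunks = {}        # SHA-1 hex digest -> [chunk bytes, reference count]
        self.chunk_tables = {}  # serviceid -> {logical block number: SHA-1 hex digest}

    def read(self, serviceid, block):
        # The chunk table yields the hash; the hash table yields the chunk.
        digest = self.chunk_tables[serviceid][block]
        return self.chunks[digest][0]

    def write(self, serviceid, block, data):
        table = self.chunk_tables.setdefault(serviceid, {})
        old = table.get(block)
        if old is not None:
            self._release(old)              # drop the reference to the old content
        table[block] = self._acquire(data)  # then reference the new content

    def _acquire(self, data):
        digest = hashlib.sha1(data).hexdigest()
        entry = self.chunks.get(digest)
        if entry is None:
            self.chunks[digest] = [data, 1]  # new content: store it
        else:
            entry[1] += 1                    # duplicate: count another reference
        return digest

    def _release(self, digest):
        entry = self.chunks[digest]
        entry[1] -= 1
        if entry[1] == 0:
            del self.chunks[digest]          # no chunk table refers to it any more


db = ServiceDatabase()
blank = bytes(CHUNK_SIZE)
db.write("speech", 0, blank)  # two services created from the same image
db.write("vnc", 0, blank)     # share a single stored chunk
```

In this sketch, two services that write identical blocks consume the storage of one chunk, and a chunk disappears only when the last chunk-table entry that refers to it is overwritten.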

Since the volatile state is likely unique to each service, content-addressable storage offers little benefit. Thus, the service database stores the volatile state of each remote service in a file named by its serviceid.

3.2.2 The home server director

When a mobile user starts a Slingshot application, the client proxy asks the director on the home server to instantiate the first-class replica. The director uses the serviceid provided by the client proxy to retrieve the volatile state from the service database. It starts a VMware process, resumes the virtual machine with the volatile state, and replies to the client proxy. The persistent state is retrieved on demand from the service database as the first-class replica executes.

When the user terminates the application, the client proxy tells the director to halt the replica. The director suspends the virtual machine, which causes VMware to write its volatile state to disk. It then terminates the virtual machine.

The volatile state is large (e.g. 128 MB) because it contains the entire memory image of the virtual machine. We use Waldspurger's ballooning technique [28] to reduce its size. When we create a new service, we place a script in the guest OS that allocates a large number of highly compressible (e.g. almost entirely zeroed-out) memory pages. When VMware suspends the virtual machine, this script runs to force unused memory pages to disk and replace them with more compressible pages. The director compresses the volatile state with gzip before storing it in the service database - this reduces storage and network costs.
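
To illustrate why replacing unused pages with zeroed pages pays off once gzip is applied, the following minimal sketch compresses two buffers of equal size. The 1 MB buffer size and the use of random bytes as a stand-in for stale, poorly compressible page contents are assumptions made for illustration only.

```python
import gzip
import os

PAGE = 4096
PAGES = 256  # 1 MB of guest memory in this toy example

zeroed = bytes(PAGE * PAGES)             # pages filled with zeros
random_like = os.urandom(PAGE * PAGES)   # stand-in for stale page contents

zeroed_size = len(gzip.compress(zeroed))
random_size = len(gzip.compress(random_like))

print(f"zeroed pages compress to {zeroed_size} bytes")
print(f"random pages compress to {random_size} bytes")
```

The zeroed buffer shrinks to a small fraction of its original size, while the random buffer does not shrink at all. Real memory images fall somewhere between these two extremes.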

3.3 Surrogates

The surrogate architecture is similar to that of the home server, except that we replace the service database with the service cache described in Section 3.3.2. The director also plays a slightly different role on a surrogate.

3.3.1 The surrogate director

The surrogate director currently accepts connections from any client that wishes to instantiate a second-class replica. Potentially, the director could enforce access-control policies similar to those enforced by access points today. The client proxy passes the director the IP address of its home server and the serviceid of the remote service it wishes to instantiate. The director contacts the home server and requests the volatile state and chunk table for the requested service.

Usually, the home server is already executing the first-class replica of the service in question. For a stateful service, this means that Slingshot must generate a coherent checkpoint that represents the current execution state of that replica. The home server creates this checkpoint by suspending and resuming the virtual machine containing the replica; this causes a new volatile state and chunk table to be written to the service database.

The home server ships copies of the volatile state and chunk table to the surrogate. Even after compression, this information is quite large (e.g. 32 MB for a VNC service) - thus, it can take several minutes to transfer. Next, the director starts a new virtual machine and resumes it using the volatile state. As the replica executes, its disk I/O is intercepted by the ISR modules and redirected to the service cache described below.

When the client disconnects from the surrogate, the director terminates the virtual machine. Since surrogate replicas are second-class, the service state is logically discarded at this point. However, persistent state chunks remain in the service cache until they are evicted due to storage limitations. This improves response time if the service is later restarted on the surrogate.

3.3.2 The service cache

The service cache is a content-addressable store of data chunks. As with the service database, each chunk is indexed by its SHA-1 hash value, and storage of duplicate chunks is eliminated. This lets users benefit from similarities among the disk images of their replicas. For instance, two people using Windows-based services may have similar disk images. Chunks cached by one user can be used by the other.

When the service cache receives a request to read a chunk, it first tries to service it locally. If the chunk is not cached, it asks the service database associated with the replica for the chunk.

A subtle problem occurs because Slingshot enforces determinism at the application level. There is no guarantee that two different replicas will write the same data to the same location on disk. A naive implementation might ask the database for an uncached chunk, only to find that it had been over-written by a store performed by the first-class replica. We therefore need to ensure that the service database keeps all chunks that might potentially be requested by second-class replicas.

Slingshot uses a copy-on-write approach for stateful services. When a surrogate starts a second-class replica, the database copies the service's chunk table - the new copy increments the reference count for each of its entries. When the second-class replica is terminated, the database deletes its chunk table and decrements the reference count for each entry. Thus, even if the first-class replica modifies or deletes a chunk, that chunk is not deleted until after the second-class replica terminates.
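
A minimal sketch of this reference-counting scheme follows. The names `ChunkStore`, `start_second_class_replica`, and `stop_second_class_replica` are hypothetical, and chunk hashes are represented by opaque strings rather than real SHA-1 values.

```python
from collections import Counter


class ChunkStore:
    """Hypothetical, simplified model: chunks are keyed by an opaque hash string."""

    def __init__(self):
        self.data = {}             # hash -> chunk bytes
        self.refcount = Counter()  # hash -> number of chunk-table entries using it

    def add_ref(self, digest, data=None):
        if digest not in self.data:
            self.data[digest] = data
        self.refcount[digest] += 1

    def drop_ref(self, digest):
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.data[digest]


def start_second_class_replica(store, first_class_table):
    """Copy the chunk table; every entry in the copy holds its own reference."""
    snapshot = dict(first_class_table)
    for digest in snapshot.values():
        store.add_ref(digest)
    return snapshot


def stop_second_class_replica(store, snapshot):
    """Delete the copied table and release its references."""
    for digest in snapshot.values():
        store.drop_ref(digest)
    snapshot.clear()


store = ChunkStore()
table = {0: "h-old"}                 # first-class replica's chunk table
store.add_ref("h-old", b"old")
snapshot = start_second_class_replica(store, table)

# The first-class replica overwrites block 0 while the surrogate is running.
store.drop_ref(table[0])
store.add_ref("h-new", b"new")
table[0] = "h-new"
# "h-old" is still stored here, because the snapshot holds a reference to it.
```

Only when `stop_second_class_replica(store, snapshot)` runs does the count for the old chunk reach zero, at which point the sketch deletes it.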

A similar concern arises for chunks modified by the second-class replica. The modified chunks may not be written to the service database by the first-class replica due to non-determinism at the disk I/O level. The surrogate cache therefore pins modified chunks for the duration of a replica's execution - this ensures that they will never need to be fetched from the service database.

The surrogate cache uses an LRU eviction algorithm that exempts chunks that are currently pinned. Since chunks remain cached even after a service is terminated, it is likely that the chunks of a frequent visitor to a hotspot will remain cached between visits.
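
One way to picture LRU eviction with pinned chunks is the sketch below. It makes several simplifying assumptions: capacity is counted in chunks rather than bytes, a single set of pins is shared by all replicas, and the class name `PinnedLRUCache` is hypothetical.

```python
from collections import OrderedDict


class PinnedLRUCache:
    """Hypothetical sketch: capacity counts chunks, not bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.chunks = OrderedDict()  # hash -> chunk bytes, least recently used first
        self.pinned = set()          # hashes that must not be evicted

    def get(self, digest):
        if digest not in self.chunks:
            return None
        self.chunks.move_to_end(digest)  # mark as most recently used
        return self.chunks[digest]

    def put(self, digest, data, pin=False):
        self.chunks[digest] = data
        self.chunks.move_to_end(digest)
        if pin:
            self.pinned.add(digest)
        self._evict()

    def unpin(self, digests):
        # Called when a replica terminates: chunks stay cached but become evictable.
        self.pinned.difference_update(digests)
        self._evict()

    def _evict(self):
        # Walk from least to most recently used, skipping pinned chunks.
        for digest in list(self.chunks):
            if len(self.chunks) <= self.capacity:
                break
            if digest not in self.pinned:
                del self.chunks[digest]


cache = PinnedLRUCache(capacity=2)
cache.put("a", b"A", pin=True)  # chunk modified by the running replica
cache.put("b", b"B")
cache.put("c", b"C")            # evicts "b": "a" is older but pinned
```

If every cached chunk is pinned, this sketch lets the cache exceed its capacity rather than evict a pinned chunk; how a real surrogate should handle that case is not specified in the text.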

3.4 The client proxy

The client proxy is a stand-alone process responsible for surrogate discovery, instantiating and destroying replicas, and coordinating communication with each replica. It uses UPnP [22] to discover new surrogates in its surrounding network environment. Currently, the decision to instantiate a new second-class replica is made by the user. In the future, we plan to add heuristics for monitoring network performance and automatically deciding when new replicas are needed.

On startup, the local component sends the client proxy its serviceid. The proxy immediately instantiates a first-class replica on the home server. It subsequently instantiates second-class replicas on nearby surrogates when requested by the user.

The client proxy maintains an event log of requests sent by the local application component. The client proxy spawns a thread for each replica; the thread sends logged events to the replica in the order they were received. Events may optionally have application-specific preconditions that must be satisfied before they can be sent to a replica. For instance, our VNC application specifies a precondition that ensures that the remote desktop is ready to accept each key stroke and mouse click event before that event is sent. Services that must process events sequentially to ensure determinism specify that the previous event must complete before an event is sent.

The client proxy records the replies received from each replica in the event log. When the first reply is received, it is returned to the local component. Later replies are compared with the first reply to ensure that the replicas are behaving deterministically. If the reply from a second-class replica differs significantly (as determined by an application-specific function) from the reply from the first-class replica, the second-class replica is terminated. Such divergence could be due to a bug in the wrapper code enforcing determinism, or it could be the result of a faulty or malicious surrogate. Note that the client proxy may already have received a reply that is later determined to be faulty. In this case, the application is notified via an upcall so that corrective action can be taken. This strategy is similar to those employed in the SUNDR file system [21] and in Brown's operator undo [7]. Alternatively, we could try to prevent malicious surrogate behavior using a trusted computing architecture [15].
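
The broadcast-and-compare behavior can be sketched with a thread per replica. This is an illustration only: the function `broadcast`, the replica callables, the convention that the first-class replica is registered under the name "home", and the default equality check standing in for the application-specific comparison are all assumptions. For brevity the function returns only after every replica has replied, although the first reply is handed to the `deliver` callback as soon as it arrives.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def broadcast(request, replicas, deliver, same=lambda a, b: a == b):
    """Send request to every replica and deliver the first reply at once.

    replicas maps a name to a callable; the first-class replica is "home".
    Returns the names of second-class replicas whose reply differs from home's.
    """
    replies = {}
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        pending = {pool.submit(send, request): name
                   for name, send in replicas.items()}
        for future in as_completed(pending):
            replies[pending[future]] = future.result()
            if len(replies) == 1:
                deliver(future.result())  # first reply goes to the local component
    return [name for name, reply in replies.items()
            if name != "home" and not same(reply, replies["home"])]


def slow_home(request):
    time.sleep(0.2)  # stand-in for the wide-area round trip
    return request.upper()


def nearby_surrogate(request):
    return request.upper()


delivered = []
diverged = broadcast("hello",
                     {"home": slow_home, "surrogate": nearby_surrogate},
                     delivered.append)
```

A replica named in the returned list would be terminated. Because the first reply may itself come from a faulty surrogate, a caller of this sketch would also need the upcall described above to tell the application that an already-delivered reply was wrong.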

3.5 Instantiating new replicas

In Figure 3, we show how Slingshot responds to a user moving between hotspots, assuming that a surrogate is located at each hotspot. The client proxy first asks the nearby surrogate to instantiate a replica. The surrogate requests the service state from the home server; the home server checkpoints the first-class replica and ships the compressed chunk table and volatile state to the surrogate. Note that the old surrogate can process events for the client during checkpointing. This hides almost all delay associated with suspending and resuming the VM on the home server. The client proxy queues events for the first-class replica while it is being checkpointed and sends them after the replica resumes execution.


       Figure 3. Instantiating a new replica

The new surrogate uses the checkpoint to resume the service within a new virtual machine. The client proxy brings the new replica up-to-date by replaying all events in the event log that were sent by the application after the checkpoint was created. Once the new replica is up-to-date, it improves interactive response time for the application by responding more swiftly to new events sent by the local component. At this point, the client proxy terminates the replica at the previous hotspot.
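
The checkpoint-then-replay step can be illustrated with a toy deterministic service. The `CounterService` class and the integer events are hypothetical stand-ins; a real Slingshot checkpoint is a suspended virtual machine, not a Python object.

```python
class CounterService:
    """Toy deterministic service: its state is a running total."""

    def __init__(self, total=0):
        self.total = total

    def handle(self, event):
        self.total += event
        return self.total


first_class = CounterService()
event_log = []


def client_send(event):
    # The client proxy logs every request before forwarding it.
    event_log.append(event)
    return first_class.handle(event)


for event in (5, 7):
    client_send(event)

# Checkpoint: capture the state and note how much of the log it already reflects.
checkpoint_state, checkpoint_index = first_class.total, len(event_log)

for event in (1, 2):  # requests issued while the checkpoint is in transit
    client_send(event)

# New replica: resume from the checkpoint, then replay the later events.
new_replica = CounterService(checkpoint_state)
for event in event_log[checkpoint_index:]:
    new_replica.handle(event)
```

Because the service is deterministic, the new replica ends in the same state as the first-class replica once the replay finishes.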

The benefit of replication is that the user sees little foreground performance impact due to the use of a new surrogate. After checkpointing, the first-class replica on the home server services requests while the new second-class replica is being instantiated and brought up-to-date. In contrast, a naive migration approach would leave the service unavailable while state is being shipped - as we show in Section 5, application state can take several minutes to ship over limited backhaul connections. Although the first-class replica is unavailable while it is being checkpointed, that operation is relatively short (i.e. approximately 10 seconds). Even that cost can be masked if another second-class replica exists.

Slingshot performs two optimizations if a service is marked as stateless. It skips checkpointing the service on the home server (since its state is static). It also does not replay events (since the replica is up-to-date).

3.6 Leveraging mobile storage

Migration can be time consuming due to the need to ship state from the home server (step 4 in Figure 3). For a typical service, the size of the compressed volatile state and chunk table is 30 - 40 MB. If the home server is connected to the Internet via a DSL link with 256 Kb/s uplink bandwidth, it takes over 20 minutes to ship the state.
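
The arithmetic behind this estimate can be checked with a few lines of code. The sketch assumes 1 MB = 10^6 bytes and 1 Kb/s = 1000 bits per second, and it ignores protocol overhead; the function name `transfer_minutes` is introduced here for illustration.

```python
def transfer_minutes(size_mb, link_kbps):
    """Minutes to move size_mb megabytes over a link of link_kbps kilobits/s."""
    bits = size_mb * 8_000_000  # assumes 1 MB = 10**6 bytes
    return bits / (link_kbps * 1_000) / 60


for size in (30, 40):
    print(f"{size} MB over 256 Kb/s: {transfer_minutes(size, 256):.1f} min")
```

Under these assumptions the transfer takes roughly 16 minutes for 30 MB and roughly 21 minutes for 40 MB, so the figure of over 20 minutes corresponds to the upper end of the stated size range.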

Given sufficient storage capacity, Slingshot reduces the time to ship state by storing checkpoints on the mobile computer. We observed that the service-specific event log can be used to roll forward replica state from any prior checkpoint, not just one that is created at the beginning of replica instantiation. Thus, by storing a checkpoint on a mobile computer and logging all events that occur after that checkpoint, Slingshot can instantiate a replica without shipping state from the home server. Instead, it ships the state from the mobile computer over the high-bandwidth wireless network at the hotspot. Most of this wireless bandwidth should otherwise be unused: the capacity of the wireless network is typically much greater than that of the backhaul connection, yet most communication from computers at the hotspot is with endpoints outside the hotspot and is thus limited by the backhaul bandwidth.

When a user returns to her home server, she can tell Slingshot to create new checkpoints of her applications on a high-capacity storage device such as a microdrive. Each checkpoint contains the volatile and persistent state. The volatile state and chunk table are stored in separate files; the chunks that make up the persistent state are stored in a content-addressable cache on the mobile storage device. The event log is empty when the user creates a new checkpoint. As the application is used on the road, Slingshot appends each request to the log. This enables Slingshot to instantiate a new replica of a stateful service by first restoring the checkpoint represented by the volatile state, and then deterministically replaying the event log. For stateless services, Slingshot neither records nor replays an event log.

When a new replica is instantiated on a nearby surrogate, the mobile computer tries to find a checkpoint on its local storage device. If a checkpoint is found, the mobile computer ships the volatile state, chunk table, and hash table for its local chunk cache to the surrogate. One reason that we transmit the chunk table and the hash table to the surrogate is that the surrogate can usually maintain this information in memory, whereas a resource-constrained mobile computer cannot. When a disk I/O request misses in the service cache, the surrogate fetches the chunk from the mobile computer if it is available there; otherwise, the chunk is fetched from the home server.
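
The lookup order on a service-cache miss can be sketched as follows. Plain dictionaries stand in for the surrogate's service cache and for the chunk cache on the mobile computer, and `fetch_from_home` is a hypothetical callable representing a request to the service database; none of these names come from the Slingshot implementation.

```python
def fetch_chunk(digest, service_cache, mobile_chunks, fetch_from_home):
    """Return (data, source) for a chunk, filling the surrogate cache on a miss."""
    if digest in service_cache:
        return service_cache[digest], "surrogate"
    if digest in mobile_chunks:  # listed in the hash table shipped by the client
        data, source = mobile_chunks[digest], "mobile"
    else:
        data, source = fetch_from_home(digest), "home"
    service_cache[digest] = data  # later reads are served locally
    return data, source


service_cache = {}
mobile_chunks = {"h1": b"from-mobile"}
home_chunks = {"h1": b"from-mobile", "h2": b"from-home"}

first = fetch_chunk("h1", service_cache, mobile_chunks, home_chunks.__getitem__)
second = fetch_chunk("h2", service_cache, mobile_chunks, home_chunks.__getitem__)
third = fetch_chunk("h1", service_cache, mobile_chunks, home_chunks.__getitem__)
```

In this sketch the slow backhaul link is used only for chunks that neither the surrogate nor the mobile computer holds.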

As operations accumulate, so does the time to bring a new second-class replica up-to-date. This means that there exists a break-even point where it takes less time to create a new checkpoint from the first-class replica on the home server and download it over the Internet than it takes to instantiate a replica from client storage.
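
A rough model of this break-even point is sketched below. It is a deliberate simplification and not the paper's analysis: it assumes a fixed replay cost per logged event, ignores on-demand fetches of persistent state, and ignores the few events replayed after a fresh checkpoint from the home server. The function name and every parameter value in the example are assumptions, apart from the 10-second checkpoint time and the link speeds quoted in the text.

```python
def break_even_events(state_bits, wireless_bps, backhaul_bps,
                      checkpoint_s, replay_s_per_event):
    """Log length at which both instantiation paths take equal time (simplified)."""
    from_home = checkpoint_s + state_bits / backhaul_bps
    ship_from_client = state_bits / wireless_bps
    return (from_home - ship_from_client) / replay_s_per_event


# 40 MB of state, nominal 54 Mb/s wireless, 1.5 Mb/s backhaul, 10 s checkpoint,
# and a hypothetical 0.5 s to replay each logged event.
n = break_even_events(40 * 8_000_000, 54_000_000, 1_500_000, 10, 0.5)
```

With these illustrative numbers the break-even point falls at roughly 435 logged events; beyond that, fetching a fresh checkpoint from the home server would be faster under this model.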

4 Slingshot applications

We have adapted the IBM ViaVoice speech recognizer and the VNC remote desktop to use Slingshot. Due to Slingshot's use of virtual machine encapsulation, we did not need to modify the source code of either application. All Slingshot-specific functionality is performed within proxies that intercept and redirect network traffic.

4.1 Speech recognition

We chose speech recognition as our first service because of its natural application to handheld computers. We used IBM ViaVoice in our work. We created a server-side proxy that accepts audio input from a remote client and passes it to ViaVoice through that application's API. ViaVoice returns a text string which the proxy sends to the client. ViaVoice and our server run on a Windows XP guest OS executing within a VMware virtual machine. The local component of this application displays the speech recognition output.

We chose to implement speech recognition as a stateless service. One can certainly make a reasonable argument that speech recognition should be a stateful service in order to allow a user to train the recognizer. However, we wanted to explore the optimizations that Slingshot could provide for stateless services.

4.2 Virtual desktop

VNC allows users to view and interact with another computer from a mobile device. In the case of Slingshot, the remote desktop is a Windows XP guest OS executing within a VMware virtual machine. This allows users to remotely execute any Windows application from their handhelds. This is clearly a stateful service; i.e., a user who edits a Word document expects the document to exist when the service is next instantiated.

Adapting VNC to Slingshot presented interesting challenges. First, the VNC server sends display updates to the client in a non-deterministic fashion. When pixels on the screen change, it reports the new values to the client in a series of updates. Two identical replicas may communicate the same change with a different sequence of updates. The resulting screen image at the end of the updates is identical but the intermediary states may not be equivalent. A second challenge is that some applications are inherently non-deterministic. One annoying example is the Windows system clock; two surrogates can send different updates because their clocks differ.

We noted that some non-determinism is unlikely to be relevant to the user (e.g. a slightly different clock value). Unfortunately, other non-determinism affects correct execution. For example, a key stroke or mouse click is often dependent upon the window state. If a user opens a text editor and enters some text, the key strokes must be sent to each replica only after the editor has opened on that replica. If this is not done, the key strokes will be sent to another application. To solve this problem, we associate a precondition with each input event. When the user executes the event, we log the state of the window on the client to which that event was delivered. When replaying the event on a server, we require that the window be in an identical state before the event is delivered. Since each event is associated with a screen coordinate, we check state equality by comparing the surrounding pixel values of the original execution and the background execution. In the above example, this strategy causes Slingshot to wait until the editor is displayed before it delivers the text entry events.
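The precondition check can be sketched in Python as follows. The frame representation, the `radius` parameter, and all names here are assumptions made for illustration; this section does not specify how many surrounding pixels are compared.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

Frame = List[List[int]]  # framebuffer as rows of pixel values


def region(frame: Frame, x: int, y: int, radius: int) -> Frame:
    """Return the pixels within `radius` of (x, y), clipped to the frame."""
    top, bottom = max(0, y - radius), min(len(frame), y + radius + 1)
    left, right = max(0, x - radius), min(len(frame[0]), x + radius + 1)
    return [row[left:right] for row in frame[top:bottom]]


@dataclass
class InputEvent:
    x: int
    y: int
    payload: str
    precondition: Frame  # pixels around (x, y) on the client when the input occurred


def log_event(client_frame: Frame, x: int, y: int,
              payload: str, radius: int = 1) -> InputEvent:
    """Record an input event together with the window state it was delivered to."""
    return InputEvent(x, y, payload, region(client_frame, x, y, radius))


def deliver_when_ready(event: InputEvent, replica_frames: Iterable[Frame],
                       radius: int = 1) -> Optional[int]:
    """Return the index of the first replica frame that satisfies the precondition."""
    for i, frame in enumerate(replica_frames):
        if region(frame, event.x, event.y, radius) == event.precondition:
            return i
    return None   # the replica never reached the required window state
```

In the text-editor example, the replica's first frames still show the desktop, so the key stroke is withheld until a frame arrives whose pixels around the event coordinate match those logged on the client.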

A second issue with VNC is that its non-determinism prevents us from mixing updates from different replicas. We designate the best-performing replica as the foreground replica and the remainder as background replicas. Only events from the foreground replica are delivered to the client. If performance changes, we quiesce the replicas before choosing a new foreground replica. Two replicas are quiesced by ensuring that the same events have been delivered to each, and by requesting a full-screen update from the new foreground replica to eliminate transition artifacts. New events are logged while quiescing replicas. Note that the foreground replica is rarely the first-class replica since nearby surrogates provide better performance in the common case.
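One way to picture the switch to a new foreground replica is the sketch below. It is a simplified model under stated assumptions: replicas are ranked by a single recent response time, the `deliver` and `request_full_update` callbacks are hypothetical hooks, and events that arrive during quiescing are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Replica:
    name: str
    response_time_s: float   # recent response time, used to rank replicas
    delivered: int = 0       # number of logged events delivered so far


def quiesce(log: List[str], old: Replica, new: Replica,
            deliver: Callable[[Replica, str], None]) -> None:
    """Bring both replicas to the same point in the event log."""
    target = max(old.delivered, new.delivered)
    for replica in (old, new):
        for event in log[replica.delivered:target]:
            deliver(replica, event)
        replica.delivered = target


def switch_foreground(log: List[str], current: Replica, replicas: List[Replica],
                      deliver: Callable[[Replica, str], None],
                      request_full_update: Callable[[Replica], None]) -> Replica:
    """Pick the best-performing replica; quiesce and refresh the screen on a change."""
    best = min(replicas, key=lambda r: r.response_time_s)
    if best is current:
        return current
    quiesce(log, current, best, deliver)
    request_full_update(best)    # a full-screen update hides transition artifacts
    return best
```

The full-screen update is requested only after both replicas have received the same events, so the client never mixes partial updates from two different replicas.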

We were encouraged that VNC can fit within the Slingshot model, since its behavior is relatively non-deterministic. Based on this result, we suspect that application-specific wrappers can be used to enforce determinism for many applications. For those applications where this approach proves infeasible, we could use a VMM that enforces determinism at the ISA level as is done in Hypervisor [6] and ReVirt [11].

5 Evaluation

Our evaluation answers the following questions:

  • How much do surrogates improve interactive response time?
  • What is the perceived performance impact of instantiating a new replica?
  • How much does the use of mobile storage reduce replica instantiation time?

5.1 Methodology

The client platform in our evaluation is an iPAQ 3970 handheld running the Linux 2.4.19-rmk6 kernel. The handheld has an XScale-PXA250 processor, 64 MB of DRAM, and 48 MB of flash. It uses an 11 Mb/s Cisco 350 802.11b PCMCIA card for network communication and a 4 GB Hitachi microdrive for bulk storage. Unless otherwise noted, the home server and surrogates are Dell Precision 350 desktops with a 3.06 GHz Pentium 4 processor running RedHat Enterprise Linux version 3.

We use a Cisco 350 802.11b wireless access point. We emulate the topology in Figure 4 by connecting all computers and the access point to a Dell desktop running the NISTNet [8] network emulator. This topology emulates a scenario where the handheld client is located at a wireless hotspot equipped with a surrogate computer. Hotspots are connected to the Internet through T1 connections with 1.5 Mb/s uplink and downlink bandwidth. A distant surrogate at another hotspot is accessible with latency of 15 ms. The home server is connected through an emulated DSL connection - the latency between the handheld's hotspot and the home server is 30 ms.


       Figure 4. Network topology used in evaluation

We execute the IBM ViaVoice speech recognizer as a stateless service, and VNC as a stateful service. For repeatability, the local component of each application executes a fixed, periodic workload. For speech, each iteration of the workload recognizes a phrase and pauses three seconds before beginning the next iteration. For VNC, each iteration opens Microsoft Word, inserts text at the beginning of a document, saves the document, closes Word, and pauses ten seconds before the next iteration begins. The client uses the same heuristics described in Section 4.2 to wait until Word opens before inserting text, and to wait until the window is fully closed before beginning the 10 second pause between iterations. Each service runs within a separate VM configured with 128 MB of memory and 4 GB of local storage. We created each service from a vanilla Windows XP installation. We installed the ballooning script described in Section 3.2.2 and the application comprising the remote service. We start the application so that it is ready to receive incoming connections, then suspend the VM. We repeat each experiment three times and report mean results over all iterations during the three trials. Figures 12 and 13 summarize all results described in this section.

5.2 Benefit of Slingshot

We first measured the benefit of using Slingshot for our two applications. The left bar in each data set in Figure 5 shows the average time to perform an iteration of the workload when the service is remotely executed on the home server. The right bar shows the average time using Slingshot when a second-class replica executes on the nearby surrogate. We let each application run for several iterations before measuring performance; this eliminates startup transients and shows steady-state performance.

This graph compares the average time to execute the speech and VNC workloads when using remote execution and when using Slingshot. Each bar shows mean response time - the error bars are 90% confidence intervals.

       Figure 5. Benefit of using Slingshot

Both the stateless speech service and the stateful VNC service execute 2.6 times faster with Slingshot than with remote execution. The shadings within each bar show the time consumed by server processing, client processing, and communication. For speech, Slingshot increases client processing time since it manages multiple network connections and aggregates responses. For VNC, it also logs requests and responses to local storage. Slingshot's performance benefit comes from reducing the time the application blocks on network communication.

Remote execution performance is affected by both high latency and limited bandwidth. For speech, a back-of-the-envelope calculation shows that 229 ms are required to transfer the 44 KB utterance through the bottleneck 1.5 Mb/s T1 link at the hotspot. Further, since communication is intermittent, TCP slow start causes several 60 ms round-trip delays during transmission. Thus, the remote execution results include 511 ms of network communication time. In contrast, Slingshot uses only 77 ms for network communication.
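The back-of-the-envelope figure can be reproduced with a few lines of Python. The sketch assumes binary prefixes (1 KB = 1024 bytes, 1 Mb = 2^20 bits), which is the interpretation that yields 229 ms; with decimal prefixes the same calculation gives roughly 235 ms. Attributing all of the remaining communication time to round trips is a simplification made only for this illustration.

```python
KIB = 1024          # bytes per KB, assuming binary prefixes
MIB = 1024 * 1024   # bits per Mb, assuming binary prefixes


def transfer_time_ms(size_bytes: float, link_bits_per_s: float) -> float:
    """Time to push size_bytes through a link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_s * 1000


utterance_bytes = 44 * KIB
t1_bits_per_s = 1.5 * MIB
serialization_ms = transfer_time_ms(utterance_bytes, t1_bits_per_s)

round_trip_ms = 2 * 30        # 30 ms each way between the hotspot and the home server
measured_network_ms = 511     # remote execution communication time reported above
residual_round_trips = (measured_network_ms - serialization_ms) / round_trip_ms

print(f"serialization: {serialization_ms:.0f} ms")           # serialization: 229 ms
print(f"residual: {residual_round_trips:.1f} round trips")   # residual: 4.7 round trips
```

Under these assumptions, the time not explained by serialization corresponds to between four and five 60 ms round trips, which is in line with the several slow-start delays mentioned above.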

Latency impacts VNC performance more than bandwidth. Because the client waits for remote actions such as button clicks and key presses to complete before initiating the next action, there are many round-trip delays during the VNC workload. In addition, client polling in VNC leads to more round-trip delays than are strictly necessary. For this workload, remote execution on the home server requires 15.6 seconds for network communication, while Slingshot requires only 3.2 seconds.

5.3 Stateless service replication

We next examined the impact of instantiating stateless second-class replicas. In this experiment, a user with a first-class replica running on the home server arrives at the hotspot on the left in Figure 4 and decides to instantiate a replica on the surrogate there. For repeatability, we do not measure the latency of UPnP service discovery. We examine behavior when the surrogate cache is cold (i.e. no chunks are initially cached) and warm (i.e. all chunks are initially cached). The warm cache scenario is most likely if the user has recently visited the hotspot; the cold cache scenario is the worst cache state possible.

We first ran three trials without a microdrive attached to the iPAQ. Since the handheld has limited storage, the service state must be loaded from the home server as described in Section 3.5. We then ran three trials with the microdrive; in this case, the state of the speech service is loaded from the iPAQ as described in Section 3.6. Figures 6 and 7 show results for representative trials with a warm and cold cache, respectively.


This graph shows how response time changes during the instantiation of a speech replica on the nearby surrogate. All chunks are in the surrogate cache prior to each experiment.
       Figure 6. Speech replication with warm cache


This graph shows how response time changes during the instantiation of a speech replica on the nearby surrogate. No chunks are in the surrogate cache prior to each experiment.
       Figure 7. Speech replication with cold cache

In Figure 6, the sharp drop in response time for both lines is a result of the completion of replica instantiation. Before the replica is instantiated, speech requests must be serviced by the distant home server; after instantiation, the new second-class replica provides quicker response time. Without the microdrive, it takes 28:06 minutes to ship the service state from the home server. However, replica instantiation exhibits only a minimal impact on application performance - average response time during replication is only 2% greater than response time with remote execution on the home server.

When the replica is instantiated from state stored on the client's microdrive, the new second-class replica is instantiated in only 3:35 minutes (7.8 times faster). However, the performance impact of replica instantiation is more substantial: average response time increases by 20% compared to remote execution. Shipping a large amount of data over the wireless network causes queuing delays at the access point and on the handheld that adversely affect application performance. Currently, we are investigating whether traffic prioritization can minimize the impact of replication on foreground traffic.

The cold cache scenario in Figure 7 exhibits a less clear difference in performance before and after replication completes. After the new replica is instantiated, it fetches chunks of its persistent state on demand from the home server or iPAQ; this occasionally delays its responses. Note that the first-class replica on the home server mitigates the performance impact - if the second-class replica is substantially delayed by fetching state, the first-class replica responds faster.
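The mitigation described here follows from the client proxy sending each request to all replicas and returning the first response. A minimal sketch of that first-response-wins behavior, using Python threads, is shown below; `make_replica` simulates a replica with a fixed delay and is purely hypothetical, as are the delays in the usage note.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, Tuple


def make_replica(name: str, delay_s: float) -> Callable[[str], str]:
    """Simulated replica: answers after delay_s seconds (hypothetical stand-in)."""
    def handle(request: str) -> str:
        time.sleep(delay_s)
        return f"{name}:{request}"
    return handle


def first_response(request: str,
                   replicas: Dict[str, Callable[[str], str]]) -> Tuple[str, str]:
    """Send the request to every replica and return (replica name, first reply)."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = {pool.submit(handler, request): name
               for name, handler in replicas.items()}
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)     # slower replicas finish in the background
    winner = next(iter(done))
    return futures[winner], winner.result()
```

If the surrogate replica stalls for 0.5 s while fetching a chunk and the home server answers in 0.05 s, `first_response` returns the home server's reply, so the user does not observe the surrogate's delay.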

5.4 Instantiation of another stateless replica

We next examined a scenario in which the user of the speech service moves from one wireless hotspot to another. This experiment begins with the user located at the middle hotspot in Figure 4. A second-class replica is running on the surrogate at that hotspot and a first-class replica is running on the home server. At the beginning of the experiment, the user moves to the left hotspot and decides to instantiate another replica on the surrogate at that hotspot. While this new replica is being created, both the second-class replica on the old surrogate and the first-class replica on the home server service application requests. As soon as the new replica is instantiated, Slingshot terminates the replica at the distant surrogate. Since we did not have three identical servers with which to run this experiment, the home server is a slightly less powerful Dell Optiplex 370 with a 2.8 GHz Pentium 4 processor running RedHat 9.

Figure 8 shows how the average time to perform an iteration of the speech recognition workload varies during this experiment - we show only warm cache results here. Compared to the previous experiment, the time to instantiate a new replica is relatively unchanged. However, response time during replication is improved because the existing second-class replica responds faster to requests than the replica on the home server. Without the microdrive, application response time is reduced by 23% compared to remote execution; with the microdrive, application response time is reduced by 2%. These results show that a surrogate can still provide significant benefit even when not located at the user's current hotspot.


This graph shows how response time changes during the instantiation of a speech replica on the nearby surrogate while another replica executes on the distant surrogate. All chunks are in the surrogate caches prior to each experiment.
       Figure 8. Speech: Moving to a new hotspot

5.5 Stateful service replication

We next repeated the experiment in Section 5.3 for the stateful VNC service. Prior to the experiment, we perform 30 iterations of the VNC workload. We then begin the experiment by instantiating a replica on the nearby surrogate. Figures 9 and 10 show results from the warm and cold cache scenarios, respectively.


This graph shows how response time changes during the instantiation of a VNC replica on the nearby surrogate. All chunks are in the surrogate cache prior to each experiment.
       Figure 9. VNC replication with warm cache


This graph shows how response time changes during the instantiation of a VNC replica on the nearby surrogate. No chunks are in the surrogate cache prior to each experiment.
       Figure 10. VNC replication with cold cache

Without the microdrive and with a warm surrogate cache, Slingshot takes 4 seconds to checkpoint the VNC service - this is reflected in the higher response time for the first iteration. Slingshot takes 22:42 minutes to ship the checkpoint to the surrogate and 5:02 to replay the logged operations. During replication, average response time is 20% higher than when using remote execution on the home server. This increase is due to the background traffic associated with shipping state from the home server interfering with the latency-sensitive foreground traffic of the VNC application.

In contrast to the prior results for the speech service, the VNC results show little difference between the warm and cold cache scenarios. In particular, VNC performance markedly improves in the cold cache scenario as soon as the client starts using the second-class replica. Most of the chunks needed by the service are read from the service database during the replay of logged operations.

When the handheld stores a VNC service checkpoint on its microdrive, Slingshot takes 3:19 minutes to ship the state from the client, and 3:18 minutes to replay the log. These two phases are clearly visible in the "with microdrive" line in Figure 9, where response time degrades by 52% compared to remote execution while state is being shipped, and by 9% while the log is replayed. Note that the log replay with the microdrive includes the 30 iterations that occurred prior to the experiment. Since the microdrive checkpoint is taken when the user leaves home, all logged operations after that point must be replayed. However, since shipping state takes less time with the microdrive, the user generates fewer logged operations during migration. Overall, Slingshot instantiates the replica over 4 times faster when a checkpoint exists on the microdrive.
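The phase times quoted for the warm cache VNC experiment can be combined with simple arithmetic. The helper names below are illustrative; the inputs are the checkpoint, shipping, and replay times stated in the text.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def fmt(seconds: int) -> str:
    """Format seconds as 'minutes:seconds'."""
    return f"{seconds // 60}:{seconds % 60:02d}"


# Without the microdrive: 4 s checkpoint + ship from home server + log replay.
without_microdrive = 4 + to_seconds("22:42") + to_seconds("5:02")
# With the microdrive: ship from the client + log replay.
with_microdrive = to_seconds("3:19") + to_seconds("3:18")

speedup = without_microdrive / with_microdrive

print(fmt(without_microdrive))   # 27:48
print(fmt(with_microdrive))      # 6:37
print(f"{speedup:.1f}x")         # 4.2x
```

The two totals agree with the warm cache VNC entries in Figure 13, and their ratio is consistent with the statement that instantiation is over 4 times faster with the microdrive.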

5.6 Instantiation of another stateful replica

We also repeated the experiment in Section 5.4 for VNC. Prior to the experiment, we create a second-class replica of the VNC service on the distant surrogate and a first-class replica on the home server. We then execute 30 iterations of the VNC workload. The experiment begins when we start to instantiate another second-class replica on the nearby surrogate.

As shown in Figure 11, the presence of another second-class replica on the distant surrogate substantially improves performance during replication. Compared to remote execution, Slingshot provides VNC response times almost twice as fast without the microdrive, and 70% faster when state is fetched from client storage.


This graph shows how response time changes during the instantiation of a VNC replica on the nearby surrogate while another replica executes on the distant surrogate. All chunks are in the surrogate caches prior to each experiment.
       Figure 11. VNC: Moving to a new hotspot

Service                 Remote Execution  Slingshot
                                          Steady-State      Creating 1st Replica                Creating 2nd Replica
                                                            Warm Cache        Cold Cache        Warm Cache        Cold Cache
Speech w/o microdrive   0.67 (0.67-0.67)  0.24 (0.24-0.24)  0.69 (0.68-0.70)  0.69 (0.67-0.73)  0.52 (0.51-0.52)  0.52 (0.51-0.52)
Speech with microdrive  0.67 (0.67-0.67)  0.24 (0.24-0.24)  0.80 (0.79-0.81)  0.80 (0.79-0.81)  0.65 (0.65-0.65)  0.65 (0.63-0.68)
VNC w/o microdrive      18.9 (18.9-19.0)  7.4 (7.2-7.5)     22.8 (21.5-23.6)  22.2 (19.9-23.6)  9.8 (9.7-9.9)     10.0 (9.9-10.0)
VNC with microdrive     18.9 (18.9-19.0)  7.4 (7.2-7.5)     24.1 (23.9-24.3)  23.1 (21.6-24.4)  13.8 (13.0-14.4)  14.6 (14.4-14.8)

This figure summarizes the average response time (in seconds) for all experiments. Each entry shows the mean of three trials, with the low and high trials given in parentheses. The second column shows response time for remote execution on the home server. The third column shows steady-state performance for Slingshot with a replica on the nearby surrogate. The remaining columns show response time while instantiating a replica on the nearby surrogate with and without a replica running on the distant surrogate.
       Figure 12. Summary of response time results

                        Creating 1st Replica                        Creating 2nd Replica
Service                 Warm Cache            Cold Cache            Warm Cache            Cold Cache
Speech w/o microdrive   28:06 (27:50-28:27)   27:55 (27:50-28:04)   28:10 (28:05-28:11)   27:57 (27:57-27:58)
Speech with microdrive  3:35 (3:32-3:40)      3:27 (3:26-3:28)      3:39 (3:34-3:45)      3:33 (3:32-3:34)
VNC w/o microdrive      27:48 (27:07-22:28)   27:58 (27:12-28:45)   31:16 (30:57-31:25)   31:08 (31:00-31:25)
VNC with microdrive     6:37 (6:20-7:13)      7:29 (6:59-8:27)      8:59 (8:01-10:00)     8:20 (6:47-9:00)

This figure summarizes the time (in minutes) to create a new replica on the nearby surrogate for all experiments. Each entry shows the mean of three trials, with the low and high trials given in parentheses. The second and third columns show the time to instantiate a replica when no replica runs on the distant surrogate. The last two columns show results with a replica on the distant surrogate.
       Figure 13. Summary of replication time results

6 Related work

To the best of our knowledge, Slingshot is the first system to dynamically instantiate replicas of stateful applications in order to improve the performance of small, resource-poor mobile computers. Our work draws upon several areas of prior work, including virtual machine and process migration, cyber foraging, fault-tolerant computing, and remote execution.

After Chen and Noble [9] first suggested that virtual machine migration could be an effective mechanism for process migration, several research groups have built working prototypes. Our research focus is not on the migration mechanism itself, but rather on how it can be used to service the needs of small, mobile clients. We use the Fauxide and Vulpes components from Intel's Internet Suspend/Resume [20] to intercept disk I/O requests made by virtual machines. The difference between ISR and Slingshot is that ISR executes a user's computing environment on a single terminal at a time. In contrast, Slingshot decomposes a user's environment into distinct services and replicates services on multiple computers. Slingshot hides the perceived latency of migration and surrogate failures, while letting a user execute applications anywhere a wireless connection exists.

Sapuntzakis [25] uses virtual machine migration, but focuses on users who compute at fixed locations, rather than the mobile users that Slingshot targets. Slingshot uses several of the optimizations suggested by Sapuntzakis, including ballooning and content-addressable storage. These optimizations have also been suggested by Waldspurger [28] and Tolia [27], respectively.

Baratto's MobiDesk [3] is similar to our VNC application in that it virtualizes a remote desktop for mobile clients. However, MobiDesk migrates its desktop service between well-connected machines in a cluster in order to minimize downtime during server maintenance or upgrades. Slingshot uses replication rather than migration, and utilizes the computational resources of surrogates located at wireless hotspots. A MobiDesk-like cluster could serve as the ideal home server for Slingshot applications. Conversely, although Baratto shows considerable improvement over VNC in remote display performance, his results indicate that network latency still degrades interactive performance. Thus, surrogates could improve MobiDesk performance for mobile clients.

Slingshot's replication strategy is a form of primary-backup fault tolerance in that the replica on the home server allows the system to tolerate a fail-stop failure of any number of second-class replicas. Our approach is most reminiscent of Hypervisor [6], which used deterministic replay to provide fault tolerance between virtual machines. In contrast to systems such as Hypervisor and ReVirt [11] which enforce determinism at the ISA level, Slingshot enforces determinism at the application level. This choice was driven by our desire to use a robust commercial virtual machine (VMware) without modification. Our approach to enforcing determinism was inspired by Rodrigues' BASE [24], which provides Byzantine fault tolerance by wrapping non-deterministic software with a layer that enforces deterministic behavior. A similar approach was used in Brown's operator undo [7].

Slingshot is an instance of cyber foraging [1], the opportunistic use of surrogates to augment the capabilities of mobile computers. Previous work in Spectra [12] examined how a cyber foraging system could locate the best server and application partitioning to use given dynamic resource constraints. In contrast, Slingshot takes this selection as a given and provides a mechanism for utilizing surrogate resources. More recently, Balan [2] and Goyal [16] have also proposed cyber foraging infrastructure. Compared to these systems, the major capability added by Slingshot is the ability to execute stateful services on surrogate computers. Data staging [13] and Fluid replication [19] use surrogates to improve the performance of distributed file systems. They share common goals with Slingshot such as minimization of latency and ease of management - however, Slingshot applies these principles to computation rather than storage.

The applications we have investigated so far have been easy to partition because they were designed for client-server computing. Potentially, Slingshot could use one of several methods that automatically partition applications. For instance, Coign [18] partitions DCOM applications into client and server components. Globus [14], Condor [5], and Legion [17] dynamically place functionality, but target grid rather than mobile computing.

7 Conclusions and future work

Handhelds can improve interactive response time by leveraging surrogate computers located at wireless hotspots. Slingshot's use of replication offers several improvements over a strategy that simply migrates remote services between computers. Replication provides good response time for mobile users who move between wireless hotspots; while a new replica is being instantiated, other replicas continue to service user requests. Replication also lets Slingshot recover gracefully from surrogate failure, even when running stateful services.

Slingshot minimizes the cost of operating surrogates. For these computers to be of maximum benefit, they must be located at wireless hotspots, rather than in machine rooms that are under the supervision of trained operators. Slingshot uses off-the-shelf virtual machine software to eliminate the need to install custom operating systems, libraries, or applications to service mobile users. All application-specific state associated with each service is encapsulated within its virtual machine. Further, Slingshot's replication strategy means that surrogates need not provide 24/7 availability. If a surrogate fails or is rebooted, no state is lost.

Harnessing surrogate computation is a complex problem. Slingshot currently provides several pieces of the puzzle, including the use of replication to improve response time and the elimination of hard surrogate state to improve ease of management. Other pieces of the puzzle remain. Slingshot does not yet address the privacy issues inherent to running computation on third-party hardware. Trusted computing efforts [15] provide promise in this area. Slingshot does not provide a mechanism for securely controlling replica instantiation and termination. Other areas of potential investigation are load management and policies for creating and destroying replicas. We believe that Slingshot will be an extremely useful platform on which to conduct such investigations.

Acknowledgments

We are grateful to Mike Kozuch and the ISR team for providing us with the Fauxide and Vulpes components used in this work. We also thank Manish Anand, Edmund B. Nightingale, Daniel Peek, the anonymous reviewers, and our shepherd, Doug Terry, for comments that helped improve this paper. This research was supported by CAREER grant CNS-0346686 from the National Science Foundation and by an equipment grant from Intel Corporation. The views and conclusions in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of NSF, Intel, the University of Michigan, or the U.S. government.

References

[1] BALAN, R., FLINN, J., SATYANARAYANAN, M., SINNAMOHIDEEN, S., AND YANG, H.-I. The Case For Cyber Foraging. In the 10th ACM SIGOPS European Workshop (Saint-Emilion, France, September 2002).

[2] BALAN, R. K., SATYANARAYANAN, M., PARK, S., AND OKOSHI, T. Tactics-Based Remote Execution for Mobile Computing. In Proceedings of the 1st Annual Conference on Mobile Computing Systems, Applications and Services (San Francisco, CA, May 2003), pp. 273-286.

[3] BARATTO, R. A., POTTER, S., SU, G., AND NIEH, J. MobiDesk: Mobile virtual desktop computing. In Proceedings of the 10th Annual Conference on Mobile Computing and Networking (Philadelphia, PA, Sept/Oct 2004), pp. 1-16.

[4] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., AND HARRIS, T. Xen and the Art of Virtualization. In Proceedings of the 19th ACM Symp. on Operating Systems Principles (Bolton Landing, NY, October 2003), pp. 164-177.

[5] BASNEY, J., AND LIVNY, M. Improving goodput by co-scheduling CPU and network capacity. International Journal of High Performance Computing Applications 13, 3 (Fall 1999), 220-230.

[6] BRESSOUD, T. C., AND SCHNEIDER, F. B. Hypervisor-Based Fault-Tolerance. In Proceedings of the 15th ACM Symposium on Operating System Principles (SOSP) (Copper Mountain, CO, December 1995), pp. 1-11.

[7] BROWN, A. B., AND PATTERSON, D. A. Rewind, repair, replay: Three R's to dependability. In the 10th ACM SIGOPS European Workshop (St. Emilion, France, September 2002).

[8] CARSON, M. Adaptation and Protocol Testing through Network Emulation. NIST, http://snad.ncsl.nist.gov/itg/nistnet/slides/index.htm.

[9] CHEN, P., AND NOBLE, B. When Virtual is Better Than Real. In Proceedings of the 8th IEEE Workshop on Hot Topics in Operating Systems (Schloss Elmau, Germany, May 2001).

[10] COX, L. P., MURRAY, C. D., AND NOBLE, B. D. Pastiche: Making backup cheap and easy. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (December 2002), pp. 285-298.

[11] DUNLAP, G. W., KING, S. T., CINAR, S., BASRAI, M. A., AND CHEN, P. M. Revirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (December 2002), pp. 211-224.

[12] FLINN, J., NARAYANAN, D., AND SATYANARAYANAN, M. Self-tuned remote execution for pervasive computing. In Proceedings of the 8th Workshop on Hot Topics in Operating Systems (HotOS-VIII) (Schloss Elmau, Germany, May 2001), pp. 61-66.

[13] FLINN, J., SINNAMOHIDEEN, S., TOLIA, N., AND SATYANARAYANAN, M. Data staging for untrusted surrogates. In Proceedings of the 2nd USENIX Conference on File and Storage Technology (San Francisco, CA, March/April 2003), pp. 15-28.

[14] FOSTER, I., KESSELMAN, C., NICK, J., AND TUECKE, S. Grid services for distributed system integration. Computer 35, 6 (2002).

[15] GARFINKEL, T., PFAFF, B., CHOW, J., ROSENBLUM, M., AND BONEH, D. Terra: A virtual machine-based platform for trusted computing. In Proceedings of the 19th ACM Symp. on Operating Systems Principles (Bolton Landing, NY, October 2003), pp. 193-206.

[16] GOYAL, S., AND CARTER, J. A lightweight secure cyber foraging infrastructure for resource-constrained devices. In Proceedings of the 6th IEEE Workshop on Mobile Computing Systems and Applications (Lake Windermere, England, December 2004), pp. 186-195.

[17] GRIMSHAW, A. S., AND WULF, W. A. Legion: Flexible support for wide-area computing. In Proceedings of the 7th ACM SIGOPS European Workshop (Connemara, Ireland, 1996), pp. 205-212.

[18] HUNT, G. C., AND SCOTT, M. L. The Coign automatic distributed partitioning system. In Proceedings of the 3rd Symposium on Operating System Design and Implementation (OSDI) (New Orleans, LA, February 1999), pp. 187-200.

[19] KIM, M., COX, L. P., AND NOBLE, B. D. Safety, visibility, and performance in a wide-area file system. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (Monterey, CA, January 2002), pp. 131-144.

[20] KOZUCH, M., AND SATYANARAYANAN, M. Internet Suspend/Resume. In Proceedings of the 4th IEEE Workshop on Mobile Computing Systems and Applications (Callicoon, NY, June 2002), pp. 40-48.

[21] LI, J., KROHN, M., MAZIÈRES, D., AND SHASHA, D. Secure untrusted data repository (sundr). In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (San Francisco, CA, December 2004), pp. 121-136.

[22] MICROSOFT CORPORATION. Universal Plug and Play Forum, June 1999. http://www.upnp.org.

[23] OSMAN, S., SUBHRAVETI, D., SU, G., AND NIEH, J. The design and implementation of Zap: A system for migrating computing environments. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (December 2002), pp. 361-376.

[24] RODRIGUES, R., CASTRO, M., AND LISKOV, B. BASE: Using abstraction to improve fault tolerance. In Proceedings of the 18th Symposium on Operating Systems Principles (SOSP) (Banff, Canada, October 2001), pp. 15-28.

[25] SAPUNTZAKIS, C. P., CHANDRA, R., PFAFF, B., CHOW, J., LAM, M. S., AND ROSENBLUM, M. Optimizing the migration of virtual computers. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (Boston, MA, December 2002), pp. 377-390.

[26] TOLIA, N., HARKES, J., KOZUCH, M., AND SATYANARAYANAN, M. Integrating portable and distributed storage. In Proceedings of the 3rd Annual USENIX Conference on File and Storage Technologies (San Francisco, CA, March/April 2004), pp. 227-238.

[27] TOLIA, N., KOZUCH, M., SATYANARAYANAN, M., KARP, B., BRESSOUD, T., AND PERRIG, A. Opportunistic use of content addressable storage for distributed file systems. In Proceedings of the 2003 USENIX Annual Technical Conference (May 2003), pp. 127-140.

[28] WALDSPURGER, C. A. Memory resource management in VMware ESX Server. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (Boston, MA, December 2002), pp. 181-194.

[29] WHITAKER, A., SHAW, M., AND GRIBBLE, S. D. Scale and performance in the Denali isolation kernel. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (Boston, MA, December 2002), pp. 195-209.


This paper was originally published in the Proceedings of the 3rd International Conference on Mobile Systems, Applications, and Services,
June 6–8, 2005
Seattle, WA

Last changed: 20 May 2005 aw