USENIX ATC '11

CONFERENCE PROGRAM ABSTRACTS

Wednesday, June 15, 2011
10:30 a.m.–noon

A Case for NUMA-aware Contention Management on Multicore Systems
On multicore systems, contention for shared resources occurs when memory-intensive threads are co-scheduled on cores that share parts of the memory hierarchy, such as last-level caches and memory controllers. Previous work investigated how contention could be addressed via scheduling. A contention-aware scheduler separates competing threads onto separate memory hierarchy domains to eliminate resource sharing and, as a consequence, to mitigate contention. However, all previous work on contention-aware scheduling assumed that the underlying system is UMA (uniform memory access latencies, single memory controller). Modern multicore systems, however, are NUMA, which means that they feature non-uniform memory access latencies and multiple memory controllers. We discovered that state-of-the-art contention management algorithms fail to be effective on NUMA systems and may even hurt performance relative to a default OS scheduler. In this paper we investigate the causes for this behavior and design the first contention-aware algorithm for NUMA systems.
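
The core separation heuristic can be pictured with a short sketch. The following Python is a hypothetical illustration only, not the paper's algorithm: it ranks threads by last-level-cache miss intensity and spreads the most intensive ones across memory-hierarchy domains (on NUMA, the paper's algorithm must additionally keep a thread's memory on the node it runs on).

    # Hypothetical sketch: spread memory-intensive threads over domains.
    def assign_domains(threads, num_domains):
        # Rank by LLC misses per 1K instructions, most intensive first.
        ranked = sorted(threads, key=lambda t: t["llc_misses_per_kinstr"],
                        reverse=True)
        # Round-robin the ranked threads so no single domain concentrates
        # the heaviest cache/memory-controller contenders.
        return {t["tid"]: i % num_domains for i, t in enumerate(ranked)}

    threads = [{"tid": 1, "llc_misses_per_kinstr": 42.0},
               {"tid": 2, "llc_misses_per_kinstr": 3.5},
               {"tid": 3, "llc_misses_per_kinstr": 30.1},
               {"tid": 4, "llc_misses_per_kinstr": 0.8}]
    print(assign_domains(threads, num_domains=2))  # {1: 0, 3: 1, 2: 0, 4: 1}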

TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments
The Graphics Processing Unit (GPU) is now commonly used for graphics and data-parallel computing. As more and more applications tend to accelerate on the GPU in multi-tasking environments where multiple tasks access the GPU concurrently, operating systems must provide prioritization and isolation capabilities in GPU resource management, particularly in real-time setups. We present TimeGraph, a real-time GPU scheduler at the device-driver level for protecting important GPU workloads from performance interference. TimeGraph adopts a new event-driven model that synchronizes the GPU with the CPU to monitor GPU commands issued from the user space and control GPU resource usage in a responsive manner. TimeGraph supports two priority-based scheduling policies in order to address the trade-off between response times and throughput introduced by the asynchronous and non-preemptive nature of GPU processing. Resource reservation mechanisms are also employed to account for and enforce GPU resource usage, preventing misbehaving tasks from exhausting GPU resources. Prediction of GPU command execution costs is further provided to enhance isolation. Our experiments using OpenGL graphics benchmarks demonstrate that TimeGraph maintains the frame-rates of primary GPU tasks at the desired level even in the face of extreme GPU workloads, whereas these tasks become nearly unresponsive without TimeGraph support. Our findings also include that the performance overhead imposed by TimeGraph can be limited to 4-10%, and its event-driven scheduler improves throughput by about 30 times over the existing tick-driven scheduler.
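
As a rough illustration of priority-based, event-driven GPU command scheduling, here is a hypothetical Python sketch (all names invented; the real TimeGraph operates inside the device driver): a command group is dispatched only when the GPU is idle, and the completion interrupt, rather than a periodic tick, drives the next dispatch.

    import heapq

    class TimeGraphLikeScheduler:
        def __init__(self):
            self.queue = []        # (priority, seq, command_group); low = urgent
            self.seq = 0
            self.gpu_busy = False

        def submit(self, priority, command_group):
            heapq.heappush(self.queue, (priority, self.seq, command_group))
            self.seq += 1
            self._dispatch()

        def on_gpu_interrupt(self):
            # Event-driven: the GPU raises an interrupt when a command
            # group completes, which triggers the next dispatch.
            self.gpu_busy = False
            self._dispatch()

        def _dispatch(self):
            # GPU processing is non-preemptive, so the unit of scheduling
            # is a whole command group.
            if not self.gpu_busy and self.queue:
                _, _, group = heapq.heappop(self.queue)
                self.gpu_busy = True   # hand `group` to the hardware here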

Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems
Heterogeneous multi-cores—platforms comprised of both general purpose and accelerator cores—are becoming increasingly common. While applications wish to freely utilize all cores present on such platforms, operating systems continue to view accelerators as specialized devices. The Pegasus system described in this paper uses an alternative approach that offers a uniform resource usage model for all cores on heterogeneous chip multiprocessors. Operating at the hypervisor level, its novel scheduling methods fairly and efficiently share accelerators across multiple virtual machines, thereby making accelerators into first class schedulable entities of choice for many-core applications. Using NVIDIA GPGPUs coupled with x86-based general purpose host cores, a Xen-based implementation of Pegasus demonstrates improved performance for applications by better managing combined platform resources. With moderate virtualization penalties, performance improvements range from 18% to 140% over base GPU driver scheduling when the GPUs are shared.

1:00 p.m.–2:30 p.m.

vIC: Interrupt Coalescing for Virtual Machine Storage Device IO
Interrupt coalescing is a well known and proven technique for reducing CPU utilization when processing high IO rates in network and storage controllers. Virtualization introduces a layer of virtual hardware for the guest operating system, whose interrupt rate can be controlled by the hypervisor. Unfortunately, existing techniques based on high-resolution timers are not practical for virtual devices, due to their large overhead. In this paper, we present the design and implementation of a virtual interrupt coalescing (vIC) scheme for virtual SCSI hardware controllers in a hypervisor. We use the number of commands in flight from the guest as well as the current IO rate to dynamically set the degree of interrupt coalescing. Compared to existing techniques in hardware, our work does not rely on high-resolution interrupt-delay timers and thus leads to a very efficient implementation in a hypervisor. Furthermore, our technique is generic and therefore applicable to all types of hardware storage IO controllers which, unlike networking, don't receive anonymous traffic. We also propose an optimization to reduce inter-processor interrupts (IPIs) resulting in better application performance during periods of high IO activity. Our implementation of virtual interrupt coalescing has been shipping with VMware ESX since 2009. We present our evaluation showing performance improvements in micro benchmarks of up to 18% and in TPC-C of up to 5%.
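
A toy model of the idea, under invented thresholds (the shipped vIC heuristic is more elaborate): hold back interrupt delivery by a ratio derived from the current commands in flight (CIF) and IO rate, with no high-resolution timer involved.

    # Hypothetical sketch of command-based virtual interrupt coalescing.
    class VirtualSCSIController:
        def __init__(self):
            self.cif = 0        # commands in flight
            self.iops = 0       # current IO rate, updated elsewhere
            self.pending = 0    # completions not yet signalled

        def coalesce_ratio(self, min_iops=2000):
            # Low rates or a lone command: deliver every interrupt, or
            # the guest would stall waiting for its only completion.
            if self.iops < min_iops or self.cif <= 1:
                return 1
            return min(self.cif // 2, 8)   # cap the holdback (invented)

        def issue_io(self):
            self.cif += 1

        def complete_io(self):
            self.cif -= 1
            self.pending += 1
            if self.pending >= self.coalesce_ratio():
                self.pending = 0
                return "deliver-virtual-interrupt"
            return "coalesced"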

Power Budgeting for Virtualized Data Centers
Power costs are very significant for data centers. To maximally utilize the provisioned power capacity, data centers often employ over-subscription, that is, the sum of peak consumptions of individual servers may be greater than the provisioned capacity. Power budgeting methods are employed to ensure that actual consumption never exceeds capacity. However, current power budgeting methods enforce capacity limits in hardware and are not well suited for virtualized servers because the hardware is shared among multiple applications. We present a power budgeting system for virtualized infrastructures that enforces power limits on individual distributed applications. Our system enables multiple applications to share the same servers but operate with their individual quality of service guarantees. It responds to workload and power availability changes by dynamically allocating an appropriate amount of power to different applications and tiers within applications. The design is mindful of practical constraints such as the data center's limited visibility into hosted application performance. We evaluate the system using workloads derived from real-world data center traces.
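
One simple way to picture application-level power budgeting is a weighted water-filling allocation; the sketch below is illustrative only and is not the paper's controller. Weights and demands are invented inputs: each application gets a weight-proportional share of the budget, capped at its demand, and slack is recycled to the still-hungry applications.

    # Hypothetical sketch: weighted budget split with slack recycling.
    def allocate_power(budget, apps):
        """apps: dict name -> {"weight": w, "demand": watts}."""
        alloc = {name: 0.0 for name in apps}
        active = set(apps)
        while budget > 1e-9 and active:
            total_w = sum(apps[a]["weight"] for a in active)
            leftover = 0.0
            for a in list(active):
                share = budget * apps[a]["weight"] / total_w
                grant = min(share, apps[a]["demand"] - alloc[a])
                alloc[a] += grant
                leftover += share - grant        # unused share is recycled
                if alloc[a] >= apps[a]["demand"] - 1e-9:
                    active.discard(a)            # fully satisfied
            budget = leftover
        return alloc

    print(allocate_power(1000, {"web":   {"weight": 2, "demand": 700},
                                "batch": {"weight": 1, "demand": 600}}))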

vIOMMU: Efficient IOMMU Emulation
Direct device assignment, where a guest virtual machine directly interacts with an I/O device without host intervention, is appealing, because it allows an unmodified (non-hypervisor-aware) guest to achieve near-native performance. But device assignment for unmodified guests suffers from two serious deficiencies: (1) it requires pinning all of the guest's pages, thereby disallowing memory overcommitment, and (2) it exposes the guest's memory to buggy device drivers. We solve these problems by designing, implementing, and exposing an emulated IOMMU (vIOMMU) to the unmodified guest. We employ two novel optimizations to make vIOMMU perform well: (1) waiting a few milliseconds before tearing down an IOMMU mapping in the hope it will be immediately reused ("optimistic tear-down"), and (2) running the vIOMMU on a sidecore, and thereby enabling for the first time the use of a sidecore by unmodified guests. Both optimizations are highly effective in isolation. The former allows bare-metal to achieve 100% of a 10Gbps line rate. The combination of the two allows an unmodified guest to do the same.
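
The "optimistic tear-down" optimization can be pictured as a small deferred-unmap cache. The sketch below is hypothetical (names and the 5 ms grace period are invented): a just-unmapped IOVA is parked briefly so that an immediate re-map of the same page avoids the expensive mapping work.

    # Hypothetical sketch of optimistic IOMMU tear-down.
    import time

    class OptimisticIOMMU:
        GRACE_SECONDS = 0.005   # illustrative grace period

        def __init__(self):
            self.live = {}      # iova -> phys
            self.deferred = {}  # iova -> (phys, expiry)

        def map(self, iova, phys):
            entry = self.deferred.pop(iova, None)
            if entry and entry[0] == phys and time.monotonic() < entry[1]:
                self.live[iova] = phys   # reuse: skip the costly remap
                return "reused"
            self.live[iova] = phys       # slow path: program the real IOMMU
            return "mapped"

        def unmap(self, iova):
            phys = self.live.pop(iova)
            self.deferred[iova] = (phys, time.monotonic() + self.GRACE_SECONDS)
            # A background sweep would invalidate entries past their expiry.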

3:00 p.m.–4:30 p.m.

HiTune: Dataflow-Based Performance Analysis for Big Data Cloud
Although Big Data Cloud (e.g., MapReduce, Hadoop, and Dryad) makes it easy to develop and run highly scalable applications, efficient provisioning and fine-tuning of these massively distributed systems remain a major challenge. In this paper, we describe a general approach to help address this challenge, based on distributed instrumentation and dataflow-driven performance analysis. Based on this approach, we have implemented HiTune, a scalable, lightweight and extensible performance analyzer for Hadoop. We report our experience on how HiTune helps users to efficiently conduct Hadoop performance analysis and tuning, demonstrating the benefits of dataflow-based analysis and the limitations of existing approaches (e.g., system statistics, Hadoop logs and metrics, and traditional profiling).

Taming the Flying Cable Monster: A Topology Design and Optimization Framework for Data-Center Networks
Data-center network designers now have many choices for high-bandwidth, multi-path network topologies. Some of these topologies also allow the designer considerable freedom to set parameters (for example, the number of ports on a switch, link bandwidths, or switch-to-switch wiring patterns) at design time. This freedom of choice, however, requires the designer to balance among bandwidth, latency, reliability, parts cost, and other real-world details. Designers need help in exploring this design space, and especially in finding optimal network designs. We describe the specific challenges that designers face, and we present Perseus, a framework for quickly guiding a network designer to a small set of candidate designs. We identify several optimization problems that can be accommodated within this framework, and present solution algorithms for many of these problems.

In-situ MapReduce for Log Processing
Log analytics are a bedrock component of running many of today's Internet sites. Application and click logs form the basis for tracking and analyzing customer behaviors and preferences, and they form the basic inputs to ad-targeting algorithms. Logs are also critical for performance and security monitoring, debugging, and optimizing the large compute infrastructures that make up the compute "cloud", thousands of machines spanning multiple data centers. With current log generation rates on the order of 1–10 MB/s per machine, a single data center can create tens of TBs of log data a day. While bulk data processing has proven to be an essential tool for log processing, current practice transfers all logs to a centralized compute cluster. This not only consumes large amounts of network and disk bandwidth, but also delays the completion of time-sensitive analytics. We present an in-situ MapReduce architecture that mines data "on location", bypassing the cost and wait time of this store-first-query-later approach. Unlike current approaches, our architecture explicitly supports reduced data fidelity, allowing users to annotate queries with latency and fidelity requirements. This approach fills an important gap in current bulk processing systems, allowing users to trade potential decreases in data fidelity for improved response times or reduced load on end systems. We report on the design and implementation of our in-situ MapReduce architecture, and illustrate how it improves our ability to accommodate increasing log generation rates.

Thursday, June 16, 2011
10:30 a.m.–noon

Exception-Less System Calls for Event-Driven Servers
Event-driven architectures are currently a popular design choice for scalable, high-performance server applications. For this reason, operating systems have invested in efficiently supporting non-blocking and asynchronous I/O, as well as scalable event-based notification systems. We propose the use of exception-less system calls as the main operating system mechanism to construct high-performance event-driven server applications. Exception-less system calls have four main advantages over traditional operating system support for event-driven programs: (1) any system call can be invoked asynchronously, even system calls that are not file descriptor based, (2) support in the operating system kernel is non-intrusive as code changes are not required for each system call, (3) processor efficiency is increased since mode switches are mostly avoided when issuing or executing asynchronous operations, and (4) enabling multi-core execution for event-driven programs is easier, given that a single user-mode execution context can generate enough requests to keep multiple processors/cores busy with kernel execution. We present libflexsc, an asynchronous system call and notification library suitable for building event-driven applications. Libflexsc makes use of exception-less system calls through our Linux kernel implementation, FlexSC. We describe the port of two popular event-driven servers, memcached and nginx, to libflexsc. We show that exception-less system calls increase the throughput of memcached by up to 35% and nginx by up to 120% as a result of improved processor efficiency.
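
In rough outline, an exception-less call is a request record written to a shared "syscall page" and consumed by a kernel-side thread, with no mode switch at submission. The Python below is a single-process stand-in for that shared-memory protocol, purely illustrative:

    # Hypothetical sketch of exception-less system call submission.
    from collections import deque

    class Entry:
        def __init__(self, number, args):
            self.number, self.args = number, args
            self.status, self.ret = "submitted", None

    syscall_page = deque()          # stands in for shared memory

    def async_syscall(number, *args):
        # Post the request without trapping into the kernel.
        entry = Entry(number, args)
        syscall_page.append(entry)
        return entry                # caller polls entry.status later

    def kernel_syscall_thread():
        # Stand-in for the in-kernel thread that drains posted calls,
        # keeping other cores busy with kernel execution.
        while syscall_page:
            e = syscall_page.popleft()
            e.ret = ("result", e.number, e.args)   # real work happens here
            e.status = "done"

    h = async_syscall(64, "write", b"hello")
    kernel_syscall_thread()
    assert h.status == "done"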

Resizable, Scalable, Concurrent Hash Tables via Relativistic Programming
We present algorithms for shrinking and expanding a hash table while allowing concurrent, wait-free, linearly scalable lookups. These resize algorithms allow Read-Copy Update (RCU) hash tables to maintain constant-time performance as the number of entries grows, and reclaim memory as the number of entries decreases, without delaying or disrupting readers. We call the resulting data structure a relativistic hash table. Benchmarks of relativistic hash tables in the Linux kernel show that lookup scalability during resize improves 125x over reader-writer locking, and 56% over Linux's current state of the art. Relativistic hash lookups experience no performance degradation during a resize. Applying this algorithm to memcached removes a scalability limit for get requests, allowing memcached to scale linearly and service up to 46% more requests per second. Relativistic hash tables demonstrate the promise of a new concurrent programming methodology known as relativistic programming. Relativistic programming makes novel use of existing RCU synchronization primitives, namely the wait-for-readers operation that waits for unfinished readers to complete. This operation, conventionally used to handle reclamation, here allows ordering of updates without read-side synchronization or memory barriers.

Evaluating the Effectiveness of Model-Based Power Characterization
Accurate power characterization is important in computing platforms for several reasons ranging from power-aware adaptation to power provisioning. Power characterization is typically obtained through either direct measurements enabled by physical instrumentation or modeling based on hardware performance counters. We show, however, that linear-regression based modeling techniques commonly used in the literature work well only in restricted settings. These techniques frequently exhibit high prediction error in modern computing platforms due to inherent complexities such as multiple cores, hidden device states, and large dynamic power components. Using a comprehensive measurement framework and an extensive set of benchmarks, we consider several more advanced modeling techniques and observe limited improvement. Our quantitative demonstration of the limitations of a variety of modeling techniques highlights the challenges posed by rising hardware complexity and variability and, thus, motivates the need for increased direct measurement of power consumption.

1:00 p.m.–2:30 p.m.

Victim Disk First: An Asymmetric Cache to Boost the Performance of Disk Arrays under Faulty Conditions
The buffer cache plays an essential role in smoothing the gap between the upper-level computational components and the lower-level storage devices. A good buffer cache management scheme should benefit not only the computational components, but also the storage components, by reducing disk I/Os. Existing cache replacement algorithms are well optimized for disks in normal mode, but inefficient under faulty scenarios, such as a parity-based disk array with faulty disk(s). To address this issue, we propose a novel asymmetric buffer cache replacement strategy, named Victim (or faulty) Disk(s) First (VDF) cache, to improve the reliability and performance of a storage system consisting of a buffer cache and disk arrays. The basic idea is to give higher priority to caching the blocks on the faulty disks when the disk array fails, thus reducing the I/Os directed to the faulty disks. To verify the effectiveness of the VDF cache, we have integrated VDF into two popular cache algorithms, LFU and LRU, yielding VDF-LFU and VDF-LRU, respectively. We have conducted extensive simulations as well as a prototype implementation. The simulation results show that VDF-LFU can reduce disk I/Os to surviving disks by up to 42.3% and VDF-LRU can reduce them by up to 36.2%. Our measurement results also show that VDF-LFU can speed up online recovery by up to 46.3% under a spare-rebuilding mode with online reconstruction, or improve the maximum system service rate by up to 47.7% under a degraded mode without a reconstruction workload. Similarly, VDF-LRU can speed up online recovery by up to 34.6%, or improve the system service rate by up to 28.4%.
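
The asymmetry is easy to see in a sketch of VDF-LRU (hypothetical code, invented data layout): eviction scans from the LRU end but skips blocks that map to a faulty disk, so requests that would otherwise hit the degraded array's expensive reconstruction path stay cached.

    # Hypothetical sketch of VDF-LRU: prefer evicting surviving-disk blocks.
    from collections import OrderedDict

    class VDFLRUCache:
        def __init__(self, capacity, faulty_disks):
            self.capacity = capacity
            self.faulty = set(faulty_disks)
            self.blocks = OrderedDict()   # (disk, lba) -> data, LRU order

        def access(self, disk, lba, data):
            key = (disk, lba)
            if key in self.blocks:
                self.blocks.move_to_end(key)      # plain LRU promotion
            else:
                if len(self.blocks) >= self.capacity:
                    self._evict()
                self.blocks[key] = data

        def _evict(self):
            # Scan from the LRU end, skipping faulty-disk blocks if possible.
            for key in self.blocks:
                if key[0] not in self.faulty:
                    del self.blocks[key]
                    return
            self.blocks.popitem(last=False)       # all faulty: plain LRU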

The Design and Evolution of Live Storage Migration in VMware ESX
Live migration enables a running virtual machine to move between two physical hosts with no perceptible interruption in service. This allows customers to avoid costly downtimes associated with hardware maintenance and upgrades, and facilitates automated load-balancing. Consequently, it has become a critical feature of enterprise class virtual infrastructure. In the past, live migration only moved the memory and device state of a VM, limiting migration to hosts with identical shared storage. Live storage migration overcomes this limitation by enabling the movement of virtual disks across storage elements, thus enabling greater VM mobility, zero downtime maintenance and upgrades of storage elements, and automatic storage load-balancing. We describe the evolution of live storage migration in VMware ESX through three separate architectures, and explore the performance, complexity and functionality trade-offs of each.

Online Migration for Geo-distributed Storage Systems
We consider the problem of migrating user data between data centers. We introduce distributed storage overlays, a simple abstraction that represents data as stacked layers in different places. Overlays can be readily used to cache data objects, migrate these caches, and migrate the home of data objects. We implement overlays as part of a key-value object store called Nomad, designed to span many data centers. Using Nomad, we compare overlays against common migration approaches and show that overlays are more flexible and impose less overhead. To drive migration decisions, we propose policies for predicting the location of future accesses, focusing on a web mail application. We evaluate the migration policies using real traces of user activity from Hotmail.

3:00 p.m.–4:30 p.m.

Slow Down or Sleep, That Is the Question
Energy consumption has become a major concern for all computing systems, from servers in data-centres to mobile phones. Processor manufacturers have reacted to this by implementing power-management mechanisms in the hardware and researchers have investigated how operating systems can make use of those mechanisms to minimise energy consumption. Much of this research has focused on a single class of systems and compute-intensive workloads. Missing is an examination of how much energy can actually be saved when running realistic workloads on different classes of systems. This paper compares the effects of using dynamic voltage and frequency scaling (DVFS) and sleep states on platforms using server, desktop and embedded processors. It also analyses workloads that represent real-world uses of those systems. In these circumstances, we find that usage of power-management mechanisms is not clear-cut, and that it is critical to analyse the system as a whole, including the workload, to determine whether using mechanisms such as DVFS will be effective at reducing energy consumption.

Low Cost Working Set Size Tracking
Efficient memory resource management requires knowledge of the memory demands of applications or systems at runtime. A widely proposed approach is to construct an LRU-based miss ratio curve (MRC), which provides not only the current working set size (WSS) but also the relationship between performance and target memory allocation size. Unfortunately, the cost of LRU MRC monitoring is nontrivial. Although optimized with AVL-tree based LRU structure and dynamic hot set sizing, the overhead is still as high as 16% on average. Based on a key insight that for most programs the working set sizes are stable most of the time, we design an intermittent tracking scheme, which can temporarily turn off memory tracking when memory demands are predicted to be stable. With the assistance of hardware performance counters, memory tracking can be turned on again if a significant change in memory demands is expected. Experimental results show that, by using this intermittent tracking design, memory tracking can be turned off for 82% of the execution time while the accuracy loss is no more than 4%. More importantly, this design is orthogonal to existing optimizing techniques, such as AVL-tree based LRU structure and dynamic hot set sizing. By combining the three approaches, the mean overhead is lowered to only 2%. We show that when applied to memory balancing for virtual machines, our scheme brings a speedup of 1.85.
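
A hypothetical sketch of the intermittent scheme (all thresholds invented): sample the working set with the expensive tracker until the estimate has been stable for a few epochs, then switch to watching a cheap hardware counter and resume tracking only when it moves significantly.

    # Hypothetical sketch of intermittent working-set-size tracking.
    def track_wss(sample_wss, read_miss_rate, epochs,
                  stable_epochs=3, wss_tol=0.05, miss_tol=0.20):
        """sample_wss: expensive LRU-based sampler; read_miss_rate: cheap
        hardware-counter reading. Both are caller-supplied callables."""
        tracking, stable, last_wss, last_miss = True, 0, None, None
        estimates = []
        for _ in range(epochs):
            miss = read_miss_rate()
            if tracking:
                wss = sample_wss()                     # expensive path
                if last_wss and abs(wss - last_wss) <= wss_tol * last_wss:
                    stable += 1
                    if stable >= stable_epochs:
                        tracking = False               # demand looks stable
                else:
                    stable = 0
                last_wss, last_miss = wss, miss
            elif last_miss and abs(miss - last_miss) > miss_tol * last_miss:
                tracking, stable = True, 0             # counters moved: resume
            estimates.append(last_wss)
        return estimates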

FVD: A High-Performance Virtual Machine Image Format for Cloud
Fast Virtual Disk (FVD) is a new virtual machine (VM) image format and the corresponding block device driver developed for QEMU. QEMU does I/O emulation for multiple hypervisors, including KVM, Xen-HVM, and VirtualBox. FVD is a holistic solution for both Cloud and non-Cloud environments. Its feature set includes flexible configurability, storage thin provisioning without a host file system, compact image, internal snapshot, encryption, copy-on-write, copy-on-read, and adaptive prefetching. The last two features enable instant VM creation and instant VM migration, even if the VM image is stored on direct-attached storage. As its name indicates, FVD is fast. Experiments show that the throughput of FVD is 249% higher than that of QCOW2 when using the PostMark benchmark to create files.

Okeanos: Wasteless Journaling for Fast and Reliable Multistream Storage
Synchronous small writes play a critical role in the reliability and availability of file systems and applications that use them to safely log recent state modifications and quickly recover from failures. However, storage stacks usually enforce page-sized granularity in their data transfers from memory to disk. We experimentally show that subpage writes may lead to storage bandwidth waste and high disk latencies. To address the issue in a journaled file system, we propose wasteless journaling as a mount mode that coalesces synchronous concurrent small writes of data into full page-sized blocks before transferring them to the journal. Additionally, we propose selective journaling that automatically applies wasteless journaling on data writes whose size lies below a fixed preconfigured threshold. In the Okeanos prototype implementation that we developed, we use microbenchmarks and application-level workloads to show substantial improvements in write latency, transaction throughput and storage bandwidth requirements.
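
A minimal sketch of the two modes, with invented names and a 4 KB page: wasteless journaling packs sub-page synchronous writes into full journal pages, while selective journaling routes writes at or above a size threshold straight through.

    # Hypothetical sketch of wasteless + selective journaling.
    PAGE = 4096

    class WastelessJournal:
        def __init__(self):
            self.buffer = bytearray()   # partially filled journal page
            self.flushed = []           # page-sized blocks sent to disk

        def sync_write(self, data, selective_threshold=PAGE):
            if len(data) >= selective_threshold:
                self.flushed.append(bytes(data))   # large write: bypass
                return
            self.buffer += data                    # coalesce small writes
            while len(self.buffer) >= PAGE:        # full page: to journal
                self.flushed.append(bytes(self.buffer[:PAGE]))
                del self.buffer[:PAGE]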

Toward Online Testing of Federated and Heterogeneous Distributed Systems
Making distributed systems reliable is notoriously difficult. It is even more difficult to achieve high reliability for federated and heterogeneous systems, i.e., those that are operated by multiple administrative entities and have numerous inter-operable implementations. A prime example of such a system is the Internet's inter-domain routing, today based on BGP. We argue that system reliability should be improved by proactively identifying potential faults using an online testing functionality. We propose DiCE, an approach that continuously and automatically explores the system behavior, to check whether the system deviates from its desired behavior. DiCE orchestrates the exploration of relevant system behaviors by subjecting system nodes to many possible inputs that exercise node actions. DiCE starts exploring from current, live system state, and operates in isolation from the deployed system. We describe our experience in integrating DiCE with an open-source BGP router. We evaluate the prototype's ability to quickly detect origin misconfiguration, a recurring operator mistake that causes Internet-wide outages. We also quantify DiCE's overhead and find it to have marginal impact on system performance.

CDE: Using System Call Interposition to Automatically Create Portable Software Packages
It can be painfully hard to take software that runs on one person's machine and get it to run on another machine. Online forums and mailing lists are filled with discussions of users' troubles with compiling, installing, and configuring software and their myriad dependencies. To eliminate this dependency problem, we created a system called CDE that uses system call interposition to monitor the execution of x86-Linux programs and package up the Code, Data, and Environment required to run them on other x86-Linux machines. Creating a CDE package is completely automatic, and running programs within a package requires no installation, configuration, or root permissions. Hundreds of people in both academia and industry have used CDE to distribute software, demo prototypes, make their scientific experiments reproducible, run software natively on older Linux distributions, and deploy experiments to compute clusters.
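
For flavor, here is a hypothetical sketch of the packaging idea in Python; note that CDE itself interposes on system calls (via ptrace) rather than patching a language runtime as done here. The sketch records every file a run opens, then copies each into a package directory that mirrors the original paths.

    # Hypothetical sketch: observe file accesses, then package them.
    import builtins, os, shutil

    accessed = set()
    _real_open = builtins.open

    def tracing_open(path, *args, **kwargs):
        accessed.add(os.path.abspath(path))
        return _real_open(path, *args, **kwargs)

    def package(dest="cde-package"):
        # Mirror each accessed file under dest/, preserving its path.
        for path in accessed:
            if os.path.isfile(path):
                target = os.path.join(dest, path.lstrip(os.sep))
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copy2(path, target)

    builtins.open = tracing_open
    # ... run the program under observation here ...
    builtins.open = _real_open
    package()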

Vsys: A Programmable sudo
We present Vsys, a mechanism for restricting access to privileged operations, much like the popular sudo tool on UNIX. Unlike sudo, Vsys allows privileges to be constrained using general-purpose programming languages and facilitates composing multiple system services into powerful abstractions for isolation. In use for over three years on PlanetLab, Vsys has enabled over 100 researchers to create private overlay networks, user-level file systems, virtual switches, and TCP-variants that function safely and without interference. Vsys has also been used by applications such as whole-system monitoring in a VM. We describe the design of Vsys and discuss our experiences and lessons learned.

Internet-scale Visualization and Detection of Performance Events
Operators typically monitor the performance of network server farms using rule-based scripts to automatically flag "events of interest" on an array of active and passive performance measurement feeds. However, such automatic detection is typically limited to events with known properties. A different challenge involves detecting the "unknown unknowns"—the events of interest whose properties are unknown and therefore cannot be defined beforehand. Visualization can significantly aid the rapid discovery of such unknown patterns, as network operators, with domain expertise, may quickly notice unexpected shifts in traffic patterns when represented visually. However, the volume of Internet-wide raw performance data can easily overwhelm human comprehension, and therefore an effective visualization needs to be sparse in representation, yet discriminating between good and poor performance. This paper presents a tool that can be used to visualize performance metrics at Internet scale. At its core, the tool builds decision trees over the IP address space using performance measurements, so that IP addresses with similar performance characteristics are clustered together, and those with significant performance differences are separated. These decision trees are dynamic—they are learnt online and adapt to changes in the underlying network, and therefore they can reflect significant changes in performance. Our tool visualizes these adaptive decision trees, distinguishing parts of the network with good performance from those with poor performance. We show that the differences in the visualized decision trees help us quickly discover new patterns of usage and novel anomalies in latency measurements at a large server farm.

Polygraph: System for Dynamic Reduction of False Alerts in Large-Scale IT Service Delivery Environments
In order to avoid critical SLA violations, service providers use monitoring technology to automate the identification of relevant events in the performance of managed components and forward them as incident tickets to be resolved by system administrators (SAs) before a critical failure occurs. For optimal cost and performance, monitoring policies must be finely tuned to the behavior of the managed components, such that SAs are not engaged for investigation of false alerts. Existing approaches to tuning monitoring policy rely heavily on highly skilled SA work, with high costs and long completion times. Polygraph is a novel architecture for automated tuning of monitoring policies towards reducing false alerts. Polygraph integrates multiple types of service management data into an active-learning approach to automated generation of new monitoring policies. SAs need only be involved in verifying policies with low projected scores. Experiments with a trace of 60K monitoring events from a large IT service delivery infrastructure compare methods for threshold adjustment in alert policy predicates with respect to their potential for false alert reduction. Select methods reduce false alerts by up to 50% compared to baseline methods.

Friday, June 17, 2011
8:30 a.m.–9:30 a.m.

Building a High-performance Deduplication System
Modern deduplication has become quite effective at eliminating duplicates in data, thus multiplying the effective capacity of disk-based backup systems and enabling them as realistic tape replacements. Despite these improvements, single-node raw capacity is still mostly limited to tens or a few hundreds of terabytes, forcing users to resort to complex and costly multi-node systems, which usually only allow them to scale to single-digit petabytes. As the opportunities for deduplication efficiency optimizations become scarce, we are challenged with the task of designing deduplication systems that will effectively address the capacity, throughput, management, and energy requirements of the petascale age. In this paper we present our high-performance deduplication prototype, designed from the ground up to optimize overall single-node performance by making the best possible use of a node's resources, and to achieve three important goals: scaling to large capacity, providing good deduplication efficiency, and delivering near-raw-disk throughput. Instead of trying to improve duplicate detection algorithms, we focus on system design aspects and introduce novel mechanisms that we combine with careful implementations of known system engineering techniques. In particular, we improve single-node scalability by introducing progressive sampled indexing and grouped mark-and-sweep, and also optimize throughput by utilizing an event-driven, multi-threaded client-server interaction model. Our prototype implementation is able to scale to billions of stored objects, with high throughput, and very little or no degradation of deduplication efficiency.

SiLo: A Similarity-Locality based Near-Exact Deduplication Scheme with Low RAM Overhead and High Throughput
Data deduplication is becoming increasingly popular in storage systems as a space-efficient approach to data backup and archiving. Most existing state-of-the-art deduplication methods are either locality based or similarity based, which, according to our analysis, do not work adequately in many situations. While the former produces poor deduplication throughput when there is little or no locality in datasets, the latter can fail to identify and thus remove significant amounts of redundant data when there is a lack of similarity among files. In this paper, we present SiLo, a near-exact deduplication system that effectively and complementarily exploits similarity and locality to achieve high duplicate elimination and throughput at extremely low RAM overheads. The main idea behind SiLo is to expose and exploit more similarity by grouping strongly correlated small files into a segment and segmenting large files, and to leverage locality in the backup stream by grouping contiguous segments into blocks to capture similar and duplicate data missed by the probabilistic similarity detection. By judiciously enhancing similarity through the exploitation of locality and vice versa, the SiLo approach is able to significantly reduce RAM usage for index-lookup and maintain a very high deduplication throughput. Our experimental evaluation of SiLo based on real-world datasets shows that the SiLo system consistently and significantly outperforms two existing state-of-the-art systems, one based on similarity and the other based on locality, under various workload conditions.

10:00 a.m.–noon

G2: A Graph Processing System for Diagnosing Distributed Systems
G2 is a graph processing system for diagnosing distributed systems. It works on execution graphs that model runtime events and their correlations in distributed systems. In G2, a diagnosis process involves a series of queries, expressed in a high-level declarative language that supports both relational and graph-based operators. Each query is compiled into a distributed execution. G2's execution engine supports both parallel relational data processing and iterative graph traversal. Execution graphs in G2 tend to have long paths and are structurally quite different from other large-scale graphs, such as social or web graphs. Tailored for execution graphs and graph traversal operations on those graphs, G2's graph engine distinguishes itself by embracing batched asynchronous iterations that allow for better parallelism without barriers, and by enabling partition-level states and aggregation. We have applied G2 to the diagnosis of distributed systems such as Berkeley DB, SCOPE/Dryad, and G2 itself to validate its effectiveness. When co-deployed on a 60-machine cluster, G2's execution engine can handle execution graphs with millions of vertices and edges; for instance, using a query in G2, we traverse, filter, and summarize a 130 million-vertex graph into a 12 thousand-vertex graph within 268 seconds on 60 machines. The use of an asynchronous model and a partition-level interface delivered a 66% reduction in response time when applied to queries in our diagnosis tasks.

Context-based Online Configuration-Error Detection
Software failures due to configuration errors are commonplace as computer systems continue to grow larger and more complex. Troubleshooting these configuration errors is a major administration cost, especially in server clusters where problems often go undetected without user intervention. This paper presents CODE—a tool that automatically detects software configuration errors. Our approach is based on identifying invariant configuration access rules that predict what access events follow what contexts. It requires no source code, application-specific semantics, or heavyweight program analysis. Using these rules, CODE can sift through a voluminous number of events and detect deviant program executions. This is in contrast to previous approaches that focus only on diagnosis. In our experiments, CODE successfully detected a real configuration error in one of our deployment machines, in addition to 20 user-reported errors that we reproduced in our test environment. When analyzing month-long event logs from both user desktops and production servers, CODE yielded a low false positive rate. The efficiency of CODE makes it feasible to deploy as a practical management tool with low overhead.
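
A hypothetical sketch of context-based rules (parameters invented): learn, from training traces, which configuration access event invariably follows each k-event context, then flag runtime accesses that break a well-supported rule.

    # Hypothetical sketch: invariant "context -> next access" rules.
    from collections import defaultdict, Counter

    K = 2  # context length; illustrative

    def learn_rules(trace, min_support=5):
        stats = defaultdict(Counter)
        for i in range(K, len(trace)):
            stats[tuple(trace[i-K:i])][trace[i]] += 1
        rules = {}
        for ctx, nexts in stats.items():
            event, count = nexts.most_common(1)[0]
            # Keep only invariants: ctx was always followed by event.
            if count >= min_support and count == sum(nexts.values()):
                rules[ctx] = event
        return rules

    def detect(trace, rules):
        violations = []
        for i in range(K, len(trace)):
            ctx = tuple(trace[i-K:i])
            if ctx in rules and trace[i] != rules[ctx]:
                violations.append((i, ctx, trace[i]))  # deviant execution
        return violations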

OFRewind: Enabling Record and Replay Troubleshooting for Networks
Debugging operational networks can be a daunting task, due to their size, distributed state, and the presence of black box components such as commercial routers and switches, which are poorly instrumentable and only coarsely configurable. The debugging tool set available to administrators is limited, and provides only aggregated statistics (SNMP), sampled data (NetFlow/sFlow), or local measurements on single hosts (tcpdump). In this paper, we leverage split forwarding architectures such as OpenFlow to add record and replay debugging capabilities to networks—a powerful, yet currently lacking approach. We present the design of OFRewind, which enables scalable, multi-granularity, temporally consistent recording and coordinated replay in a network, with fine-grained, dynamic, centrally orchestrated control over both record and replay. Thus, OFRewind helps operators to reproduce software errors, identify data-path limitations, or locate configuration errors.

ORDER: Object centRic DEterministic Replay for Java
Deterministic replay systems, which record and replay non-deterministic events during program execution, have many applications such as bug diagnosis, intrusion analysis, and fault tolerance. It is well understood how to replay native (e.g., C) programs on multi-processors, but there has been little work on concurrent Java applications on multicore. State-of-the-art work for Java either assumes data-race-free execution, or relies on static instrumentation, which leads to missing some necessary non-deterministic events. This paper proposes the ORDER framework to record and reproduce non-deterministic events inside the Java virtual machine (JVM). Based on observations of good locality at the object level for threads and frequent object movements due to garbage collection, ORDER records and replays non-deterministic data accesses by logging and enforcing the order in which threads access objects. This essentially eliminates unnecessary dependencies introduced by address changes of objects during garbage collection and enjoys good locality as well as less contention, which may result in scalable performance on multicore. Further, by dynamically instrumenting Java code in the JVM compilation pipeline, ORDER naturally covers non-determinism in dynamically loaded classes. We have implemented ORDER based on Apache Harmony. Evaluation on SPECjvm2008, PseudoJBB2005, and JRuby shows that ORDER incurs only 108% performance overhead on average and scales well on a 16-core Xeon testbed. Evaluation with a real-world application, JRuby, shows that several real-world concurrency bugs can be successfully reproduced.
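
The object-centric log can be sketched as follows (hypothetical Python; the real system records inside the JVM and must cope with garbage collection moving objects): per object, record runs of (thread, access count), which is exactly the order replay must later enforce.

    # Hypothetical sketch of object-centric access-order recording.
    import threading

    class RecordedObject:
        def __init__(self):
            self.lock = threading.Lock()
            self.log = []            # list of [thread_id, run_length]

        def record_access(self, thread_id):
            with self.lock:
                if self.log and self.log[-1][0] == thread_id:
                    self.log[-1][1] += 1   # extend the current run
                else:
                    self.log.append([thread_id, 1])

    obj = RecordedObject()
    for tid in (1, 1, 2, 1):
        obj.record_access(tid)
    print(obj.log)   # [[1, 2], [2, 1], [1, 1]]: the order replay enforces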

1:00 p.m.–2:00 p.m.

Enabling Security in Cloud Storage SLAs with CloudProof
Several cloud storage systems exist today, but none of them provide security guarantees in their Service Level Agreements (SLAs). This lack of security support has been a major hurdle for the adoption of cloud services, especially for enterprises and cautious consumers. To fix this issue, we present CloudProof, a secure storage system specifically designed for the cloud. In CloudProof, customers can not only detect violations of integrity, write-serializability, and freshness, they can also prove the occurrence of these violations to a third party. This proof-based system is critical to enabling security guarantees in SLAs, wherein clients pay for a desired level of security and are assured they will receive a certain compensation in the event of cloud misbehavior. Furthermore, since CloudProof aims to scale to the size of large enterprises, we delegate as much work as possible to the cloud and use cryptographic tools to allow customers to detect and prove cloud misbehavior. Our evaluation of CloudProof indicates that its security mechanisms have a reasonable cost: they incur a latency overhead of only ∼15% on reads and writes, and reduce throughput by around 10%. We also achieve highly scalable access control, with membership management (addition and removal of members' permissions) for a large piece of proprietary software with more than 5000 developers taking only a few seconds per month.

jVPFS: Adding Robustness to a Secure Stacked File System with Untrusted Local Storage Components
The Virtual Private File System (VPFS) was built to protect confidentiality and integrity of application data against strong attacks. To minimize the trusted computing base (i.e., the attack surface) it was built as a stacked file system, where a small isolated component in a microkernel-based system reuses a potentially large and complex untrusted file system; for example, as provided by a more vulnerable guest OS in a separate virtual machine. However, its design ignores robustness issues that come with sudden power loss or crashes of the untrusted file system. This paper addresses these issues. To minimize damage caused by an unclean shutdown, jVPFS carefully splits a journaling mechanism between a trusted core and the untrusted file system. The journaling approach minimizes the number of writes needed to maintain consistent information in a Merkle hash tree, which is stored in the untrusted file system to detect attacks on integrity. The commonly very complex and error-prone recovery functionality of legacy file systems (on the order of thousands of lines of code) can be reused with little increase of complexity in the trusted core: less than 350 lines of code deal with the security-critical aspects of crash recovery. jVPFS shows acceptable performance, better than that of its predecessor VPFS, while providing much better protection against data loss.

2:30 p.m.–3:30 p.m.

Semantics of Caching with SPOCA: A Stateless, Proportional, Optimally-Consistent Addressing Algorithm
A key measure of the success of a Content Delivery Network is controlling the cost of the infrastructure required to serve content to its end users. In this paper, we take a closer look at how Yahoo! efficiently serves millions of videos from its video library. A significant portion of this video library consists of a large number of relatively unpopular user-generated content and a small set of popular videos that changes over time. Yahoo!'s initial architecture for distributing videos to Internet clients used shared storage to hold the videos and a hardware load balancer to handle failures and balance the load across the front-end servers that did the actual transfers to the clients. The front-end servers used both their memory and hard drives as caches for the content they served. We found that this simple architecture did not use the front-end server caches effectively. We were able to improve our front-end caching while still being able to tolerate faults, gracefully handle the addition and removal of servers, and take advantage of geographic locality when serving content. We describe our solution, called SPOCA (Stateless, Proportional, Optimally-Consistent Addressing), which reduces disk cache misses from 5% to less than 1% and increases memory cache hits from 45% to 80%, raising overall cache hits from 95% to 99.6%. Unlike other consistent addressing mechanisms, SPOCA facilitates nearly-optimal load balancing.
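
The addressing scheme can be sketched as follows (hypothetical parameters; details the deployed system handles, such as geographic locality and popular-content handling, are ignored here): each server owns a slice of a sparse hash space proportional to its capacity, and a name is re-hashed until it lands inside some slice, so a server's arrival or departure only remaps that server's own share of names.

    # Hypothetical sketch of SPOCA-style stateless, proportional addressing.
    import hashlib

    SPACE = 2 ** 32

    def h(value):
        digest = hashlib.sha1(str(value).encode()).digest()
        return int.from_bytes(digest[:4], "big")

    def build_slices(servers, sparseness=4):
        """servers: dict name -> capacity. Slices fill 1/sparseness of
        the space, so the re-hash loop below terminates quickly."""
        total = sum(servers.values()) * sparseness
        slices, start = [], 0
        for name, cap in servers.items():
            width = SPACE * cap // total   # proportional to capacity
            slices.append((start, start + width, name))
            start += width
        return slices

    def locate(name, slices):
        v = h(name)
        while True:
            for lo, hi, server in slices:
                if lo <= v < hi:
                    return server
            v = h(v)   # fell in unassigned space: re-hash and retry

    servers = {"a": 2, "b": 1, "c": 1}
    print(locate("video-12345.mp4", build_slices(servers)))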

TidyFS: A Simple and Small Distributed File System
This paper describes TidyFS, a simple and small distributed file system that provides the abstractions necessary for data parallel computations on clusters. In recent years there has been an explosion of interest in computing using clusters of commodity, shared nothing computers. Frequently the primary I/O workload for such clusters is generated by a distributed execution engine such as MapReduce, Hadoop or Dryad, and is high-throughput, sequential, and read-mostly. Other large-scale distributed file systems have emerged to meet these workloads, notably the Google File System (GFS) and the Hadoop Distributed File System (HDFS). TidyFS differs from these earlier systems mostly by being simpler. The system avoids complex replication protocols and read/write code paths by exploiting properties of the workload such as the absence of concurrent writes to a file by multiple clients, and the existence of end-to-end fault tolerance in the execution engine. We describe the design of TidyFS and report some of our experiences operating the system over the past year for a community of a few dozen users. We note some advantages that stem from the system's simplicity and also enumerate lessons learned from our design choices that point out areas for future development.

4:00 p.m.–5:00 p.m.

Eyo: Device-Transparent Personal Storage
Users increasingly store data collections such as digital photographs on multiple personal devices, each of which typically offers a storage management interface oblivious to the contents of the user's other devices. As a result, collections become disorganized and drift out of sync. This paper presents Eyo, a novel personal storage system that provides device transparency: a user can think in terms of "file X", rather than "file X on device Y", and will see the same set of files on all personal devices. Eyo allows a user to view and manage the entire collection of objects from any of their devices, even from disconnected devices and devices with too little storage to hold all the object content. Eyo synchronizes these collections across any network topology, including direct peer-to-peer links. Eyo provides applications with a storage API with first-class access to object version history in order to resolve update conflicts automatically. Experiments with several applications using Eyo—media players, a photo editor, a podcast manager, and an email interface—show that device transparency requires only minor application changes, and matches the storage and bandwidth capabilities of typical portable devices.

Autonomous Storage Management for Personal Devices with PodBase
People use an increasing number of personal electronic devices like notebook computers, MP3 players and smart phones in their daily lives. Making sure that data on these devices is available where needed and backed up regularly is a time-consuming and error-prone burden on users. In this paper, we describe and evaluate PodBase, a system that automates storage management on personal devices. The system takes advantage of unused storage and incidental connectivity to propagate the system state and replicate files. PodBase ensures the durability of data despite device loss or failure; at the same time, it aims to make data available on devices where it is useful. PodBase seeks to exploit available storage and pairwise device connections with little or no user attention. Towards this goal, it relies on a declarative specification of its replication goals and uses linear optimization to compute a replication plan that considers the current distribution of files, availability of storage, and history of device connections. Results from a user study in ten real households show that, under a wide range of conditions, PodBase transparently manages the durability and availability of data on personal devices.
