Introduction
Federated, geographically distributed computing platforms such as
PlanetLab [3] and the
Grid [7,8] have recently become popular
for evaluating and deploying network services and scientific
computations.
As the size, reach, and user population of such infrastructures
grow, resource discovery and resource selection become increasingly
important. Although a number of resource discovery and allocation
services have been built [1,11,15,22,28,33], there
is little data on the utilization of the distributed computing
platforms they target. Yet the design and efficacy of such services depend on
the characteristics of the target platform. For example, if resources are
typically plentiful, then there is less
need for sophisticated allocation mechanisms. Similarly,
if resource availability and demands are predictable and stable, there
is little need for aggressive monitoring.
To inform the design and implementation of emerging resource discovery
and allocation systems,
we examine the usage characteristics of PlanetLab, a federated,
time-shared platform for ``developing, deploying, and
accessing'' wide-area distributed applications [3].
In particular, we investigate the variability of available host resources
across nodes and over time, how that variability interacts with
the resource demands of several popular long-running services, and how
careful application placement and migration might reduce the
impact of this variability. We also investigate the feasibility of
using stale or predicted measurements to reduce overhead in a system
that automates service placement and migration.
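To make concrete what such informed placement might look like, consider
the following minimal sketch in Python. It is our own illustration, not
an interface PlanetLab provides: the NodeStats record, its fields, and
the ranking criterion are assumptions chosen for exposition. Informed
placement ranks candidate nodes by measured available resources and
deploys to the best k nodes, whereas the random baseline draws k nodes
uniformly.

    import random
    from dataclasses import dataclass

    @dataclass
    class NodeStats:
        name: str
        free_cpu: float     # fraction of CPU idle, 0.0-1.0 (illustrative metric)
        free_mem_mb: float  # free physical memory in MB

    def random_placement(nodes, k):
        """Baseline: deploy an application instance to k nodes chosen
        uniformly at random, ignoring measured load."""
        return random.sample(nodes, k)

    def informed_placement(nodes, k):
        """Informed: deploy to the k nodes with the most available CPU,
        breaking ties by free memory."""
        ranked = sorted(nodes,
                        key=lambda n: (n.free_cpu, n.free_mem_mb),
                        reverse=True)
        return ranked[:k]

Migration, in this framing, amounts to re-running the ranking
periodically and moving instances whose current nodes have fallen out
of the top ranks.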
Our study analyzes a six-month trace of node, network, and application-level
measurements. In addition to presenting a detailed characterization of
application resource demand and free and committed node
resources over this time period, we analyze this trace to address
the following questions: (i) Could
informed service placement--that is, using live platform utilization
data to choose where to deploy an application--outperform a random placement? (ii) Could
migration--that is, moving deployed application instances to
different nodes in response to changes in resource
availability--potentially benefit some applications? (iii) Could we reduce the overhead of a service
placement system by using stale or predicted data to make placement and migration
decisions? We find:
- CPU and network resource usage are heavy and highly
variable, suggesting that shared
platforms such as PlanetLab would benefit from a resource
allocation system. Moreover,
available resources across nodes and resource demands across
instances of an application both vary widely. This suggests that
even in the absence of a
resource allocation system, some applications could benefit from
intelligently mapping application instances to available nodes.
- Node placement decisions can become ill-suited after about 30 minutes,
suggesting that a resource discovery system should not only be able to
deploy
applications intelligently, but should also support
migrating performance-sensitive applications whose
migration cost is acceptable.
- Stale data, and certain types of predicted data, can be used
effectively to reduce measurement overhead. For example,
using resource availability and utilization data up to 30 minutes old
to make migration decisions still enables our studied applications' resource
needs to be met more often than not migrating at all; this
suggests that a migration service for this workload need not support a high measurement
update rate. In addition, we find that inter-node latency is
both stable and a good predictor of available bandwidth; this
observation argues for collecting latency data at relatively coarse
timescales and bandwidth data at even coarser timescales, using the
former to predict the latter between measurements. (Both ideas are
sketched in code after this list.)
- Significant variability in usage patterns across
applications, combined with heavy sharing of nodes, precludes strong correlation between
the availability of different resources on the same node or at the
same site. For example, CPU availability does not correspond to memory or
network availability on a particular node, or to CPU availability on
other nodes at the same site. Hence, it is not
possible to make accurate predictions based on correlations within a
node or a site. Furthermore, because PlanetLab's
user base is globally distributed and applications are deployed
across a globally distributed set of nodes, we note an absence
of the cyclic usage pattern typical of Internet services with
geographically colocated user populations. As a result, it is not
possible to make accurate resource availability or utilization
predictions for this platform based on time-series forecasts that assume a daily or
weekly regression to the mean.
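As referenced in the third finding above, the following sketch shows
how stale measurements and latency-derived bandwidth estimates might
enter a migration decision. This is again hypothetical Python: the
30-minute threshold reflects our results, but the function signatures
and the log-log form of the latency-to-bandwidth model are illustrative
assumptions, not artifacts of our measurement system.

    import math
    import time

    STALENESS_LIMIT_S = 30 * 60  # measurements up to 30 minutes old remain usable

    def should_migrate(cpu_demand, free_cpu, measured_at, now=None):
        """Decide whether to migrate an instance using a possibly stale
        CPU-availability measurement. If the measurement is older than
        the staleness limit, conservatively leave the instance in place."""
        now = time.time() if now is None else now
        if now - measured_at > STALENESS_LIMIT_S:
            return False  # too stale to act on
        return free_cpu < cpu_demand

    def predict_bandwidth(latency_ms, slope, intercept):
        """Estimate available bandwidth from a fresh latency sample using
        a previously fitted log-log linear model (bandwidth falls as
        latency rises, so slope is typically negative). The model form
        is an illustrative assumption."""
        return math.exp(slope * math.log(latency_ms) + intercept)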
The remainder of this paper is organized as
follows. Section 2 describes our data sources and methodology.
Section 3 surveys platform, node, and network resource
utilization behavior; addresses the usefulness of informed
service placement; and describes resource demand models for three
long-running PlanetLab services--CoDeeN [26],
Coral [10], and OpenDHT [20]--that we use there and in subsequent
sections. Section 4 investigates
the potential benefits of periodically migrating service instances.
Section 5 analyzes the feasibility of making
placement and migration decisions using stale or predicted values.
Section 6 discusses related work, and Section 7 concludes.