Introduction
Federated, geographically distributed computing platforms such as
PlanetLab [3] and the
Grid [7,8] have recently become popular
for evaluating and deploying network services and scientific
computations.
As the size, reach, and user population of such infrastructures
grow, resource discovery and resource selection become increasingly
important. Although a number of resource discovery and allocation
services have been built [1,11,15,22,28,33], there
is little data on the utilization of the distributed computing
platforms they target. Yet the design and efficacy of such services depend on
the characteristics of the target platform. For example, if resources are
typically plentiful, then there is less
need for sophisticated allocation mechanisms. Similarly,
if resource availability and demands are predictable and stable, there
is little need for aggressive monitoring.
To inform the design and implementation of emerging resource discovery
and allocation systems,
we examine the usage characteristics of PlanetLab, a federated,
time-shared platform for ``developing, deploying, and
accessing'' wide-area distributed applications [3].
In particular, we investigate the variability of available host resources
across nodes and over time, how that variability interacts with
the resource demands of several popular long-running services, and how
careful application placement and migration might reduce the
impact of this variability. We also investigate the feasibility of
using stale or predicted measurements to reduce overhead in a system
that automates service placement and migration.
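To make concrete what such informed placement might look like, consider
the following minimal sketch in Python. It is our own illustration, not
an interface PlanetLab provides: the NodeStats record, its fields, and
the ranking criterion are assumptions chosen for exposition. Informed
placement ranks candidate nodes by measured available resources and
deploys to the best k nodes, whereas the random baseline draws k nodes
uniformly.

    import random
    from dataclasses import dataclass

    @dataclass
    class NodeStats:
        name: str
        free_cpu: float     # fraction of CPU idle, 0.0-1.0 (illustrative metric)
        free_mem_mb: float  # free physical memory in MB

    def random_placement(nodes, k):
        """Baseline: deploy an application instance to k nodes chosen
        uniformly at random, ignoring measured load."""
        return random.sample(nodes, k)

    def informed_placement(nodes, k):
        """Informed: deploy to the k nodes with the most available CPU,
        breaking ties by free memory."""
        ranked = sorted(nodes,
                        key=lambda n: (n.free_cpu, n.free_mem_mb),
                        reverse=True)
        return ranked[:k]

Migration, in this framing, amounts to re-running the ranking
periodically and moving instances whose current nodes have fallen out
of the top ranks.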
Our study analyzes a six-month trace of node, network, and application-level
measurements. In addition to presenting a detailed characterization of
application resource demand and free and committed node
resources over this time period, we analyze this trace to address
the following questions: (i) Could
informed service placement--that is, using live platform utilization
data to choose where to deploy an application--outperform a random placement? (ii) Could
migration--that is, moving deployed application instances to
different nodes in response to changes in resource
availability--potentially benefit some applications? (iii) Could we reduce the overhead of a service
placement system by using stale or predicted data to make placement and migration
decisions? We find:
- CPU and network resource usage are heavy and highly
variable, suggesting that shared
platforms such as PlanetLab would benefit from a resource
allocation system. Moreover,
available resources across nodes and resource demands across
instances of an application both vary widely. This suggests that
even in the absence of a
resource allocation system, some applications could benefit from
intelligently mapping application instances to available nodes.
- Node placement decisions can become ill-suited after about 30 minutes,
suggesting that a resource discovery system should not only be able to
deploy
applications intelligently, but should also support
migrating performance-sensitive applications whose
migration cost is acceptable.
- Stale data, and certain types of predicted data, can be used
effectively to reduce measurement overhead. For example,
using resource availability and utilization data up to 30 minutes old
to make migration decisions still enables our studied applications' resource
needs to be met more often than not migrating at all; this
suggests that a migration service for this workload need not support a high measurement
update rate. In addition, we find that inter-node latency is
both stable and a good predictor of available bandwidth; this
observation argues for collecting latency data at relatively coarse
timescales and bandwidth data at even coarser timescales, using the
former to predict the latter between measurements. (Both ideas are
sketched in code after this list.)
- Significant variability in usage patterns across
applications, combined with heavy sharing of nodes, precludes strong correlation between
the availability of different resources on the same node or at the
same site. For example, CPU availability does not correspond to memory or
network availability on a particular node, or to CPU availability on
other nodes at the same site. Hence, it is not
possible to make accurate predictions based on correlations within a
node or a site. Furthermore, because PlanetLab's
user base is globally distributed and applications are deployed
across a globally distributed set of nodes, we note an absence
of the cyclic usage pattern typical of Internet services with
geographically colocated user populations. As a result, it is not
possible to make accurate resource availability or utilization
predictions for this platform based on time-series forecasts that assume a daily or
weekly regression to the mean.
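As referenced in the third finding above, the following sketch shows
how stale measurements and latency-derived bandwidth estimates might
enter a migration decision. This is again hypothetical Python: the
30-minute threshold reflects our results, but the function signatures
and the log-log form of the latency-to-bandwidth model are illustrative
assumptions, not artifacts of our measurement system.

    import math
    import time

    STALENESS_LIMIT_S = 30 * 60  # measurements up to 30 minutes old remain usable

    def should_migrate(cpu_demand, free_cpu, measured_at, now=None):
        """Decide whether to migrate an instance using a possibly stale
        CPU-availability measurement. If the measurement is older than
        the staleness limit, conservatively leave the instance in place."""
        now = time.time() if now is None else now
        if now - measured_at > STALENESS_LIMIT_S:
            return False  # too stale to act on
        return free_cpu < cpu_demand

    def predict_bandwidth(latency_ms, slope, intercept):
        """Estimate available bandwidth from a fresh latency sample using
        a previously fitted log-log linear model (bandwidth falls as
        latency rises, so slope is typically negative). The model form
        is an illustrative assumption."""
        return math.exp(slope * math.log(latency_ms) + intercept)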
The remainder of this paper is organized as
follows. Section 2 describes our data sources and methodology.
Section 3 surveys platform, node, and network resource
utilization behavior; addresses the usefulness of informed
service placement; and describes resource demand models for three
long-running PlanetLab services--CoDeeN [26],
Coral [10], and OpenDHT [20]--that we use there and in subsequent
sections. Section 4 investigates
the potential benefits of periodically migrating service instances.
Section 5 analyzes the feasibility of making
placement and migration decisions using stale or predicted values.
Section 6 discusses related work, and Section 7 concludes.