Related Work

The measurement aspects of this paper add to a growing literature on measuring resource utilization in Internet-scale systems. Most of this work has focused on network-level measurements, a small subset of which we mention here. Balakrishnan examines throughput stability to many hosts from the vantage point of the 1996 Olympic Games web server [2], while Zhang collects data from 31 pairs of hosts [34]. Chen describes how to monitor a subset of paths to estimate loss rate and latency on all other paths in a network [6]. Wolski [29] and Vazhkudai [25] focus on predicting wide-area network resources.

Growing interest in shared computational Grids has led to several recent studies of node-level resource utilization in such multi-user systems. Foster describes resource utilization on Grid3 [8], while Yang describes techniques to predict available host resources to improve resource scheduling [31,32]. Harchol-Balter investigates process migration for dynamic load-balancing in networks of Unix workstations [12], and cluster load balancing is an area of study with a rich literature. Compared to these earlier studies, our paper represents the first study of resource utilization and service placement issues for a federated platform as heavily shared and utilized as PlanetLab.

Several recent papers have used measurement data from PlanetLab. Yalagandula investigates correlated node failure; correlations between MTTF, MTTR, and availability; and predictability of TTF, TTR, MTTF, and MTTR [30]. Rhea measures substantial variability over time and across nodes in the amount of time to complete CPU, disk, and network microbenchmarks [19]; these findings corroborate our observations in Section 3.1. Rhea advocates application-internal mechanisms, as opposed to intelligent application placement, to counter node heterogeneity. Lastly, Spring uses measurements of CPU and node availability to dispel various "myths" about PlanetLab [23].

Resource discovery tools are a prerequisite for automated service placement and migration. CoMon [18], CoTop [17], and Ganglia [14] collect node-level resource utilization data on a centralized server, while MDS [33], SWORD [15], and XenoSearch [22] provide facilities to query such data to make placement and migration decisions.

Shared wide-area platforms themselves are growing in number and variety. PlanetLab focuses on network services [3], Grid3 focuses on large-scale scientific computation [8], FutureGrid aims to support both "eScience" and network service applications [7], and BOINC allows home computer users to multiplex their machines' spare resources among multiple public-resource computing projects [4]. Ripeanu compares resource management strategies on PlanetLab to those used in the Globus Grid toolkit [21].

David Oppenheimer 2006-04-14