Collecting up-to-date latency measurements between nodes in an overlay network is important for many classes of applications. Proximity-aware distributed hash tables use latency measurements to reduce the delay stretch of lookups [15], content distribution systems construct network-aware trees to minimize dissemination times [1], and decentralized web caches need latency information to map clients to cache locations. Especially in a wide-area network, communication latencies have a significant impact on the overall execution time of operations.
To exploit network locality, today's overlay networks are left with the burden of performing their
own network measurements. Developers must continually reinvent the wheel duplicating measurements
when multiple network-aware overlays are sharing a single distributed testbed, such as
PlanetLab [19]. Implementations that gather all-pairs latency measurements
are only scalable for relatively small overlay deployments. For example, the all-pairs ping service
managed by Stribling [18] has recently ceased operation because it became
infeasible to obtain up-to-date measurements for over PlanetLab nodes. In addition, measuring
techniques that lead to good latency samples without suffering from high variance caused by
measurement anomalies are non-trivial.
To address these issues, a latency service can provide applications with up-to-date estimates
of network latencies between nodes. We describe our experience of maintaining such a service on
PlanetLab based on network coordinates. Here, each overlay node maintains a coordinate
obtained through an embedding of latency measurements in a metric space. The Euclidean distance
between two coordinates is an estimate of the communication latency between the nodes. This enables
nodes to infer latencies to remote nodes without the overhead of a direct latency measurement. The
metric space interpolates non-existing measurements, which reduces the measurement overhead from
to linear in the number of nodes.
We discuss trade-offs between two different solutions for a network coordinate service: a dedicated, stand-alone service, which is shared among applications, and a per-application library, which exploits application-specific traffic for network coordinate updates. Our experience deploying network coordinates on PlanetLab reveals that coordinate stability and convergence is a challenge. We have developed two techniques to address this: statistical filtering of latency samples and decoupling low-level coordinate updates from the coordinates used by applications. We have found that our implementation of a latency service now provides network coordinates that are sufficiently stable and accurate to support our application needs, while keeping the measurement overhead small.
After a survey of existing work in Section 2, we present the APIs and trade-offs of our network coordinate service and library in Section 3. In Section 4, we show how statistical filtering and our delayed update technique greatly improves accuracy and stability and how a small number of measurement neighbors can lead to accurate coordinates. We conclude in Section 5.
Jonathan Ledlie 2005-10-18