During our initial deployment of NCs on PlanetLab, we observed latency samples as much as three orders of magnitude greater than the normal latency for a given link. When used for NC calculation, these samples induce a large coordinate change in a high-confidence node, which, in turn, causes large shifts in the NCs of its neighbors. Such changes keep propagating through the coordinate space, causing high instability, slow convergence, and decreased accuracy, because these coordinate shifts do not reflect future measurements. Occasionally, but not always, we could attribute large values in our application-level measurements to high CPU load on one of the nodes.
To illustrate the extent of the anomalies, we show a distribution of
application-level UDP latency samples between nodes collected
over three days on PlanetLab in Figure 2. Over
of the samples are greater than one second, larger than even a
slow intercontinental link, and frequent enough to periodically distort
the coordinate space.
Not only is a significant fraction of samples large,
but individual links also have extended tails: most samples on a link
fall within a tight range around a consistent latency, but a tail
of the samples can extend into the tens of seconds. Both the range and
tail depend on the link. We found that feeding these raw samples
directly into Vivaldi leads to poor accuracy and stability.
We also tried using kernel-level ping measurements and found that they
suffered from a similar baseline and extended tail.
Figure 3 shows the results of a three-hour set of measurements
taken with the ping program between two PlanetLab nodes
(berkeley to uvic.ca). The data shows that of the
samples fall within
of the median, but that the largest
are
-
times the median.
Even though the measurements are being time-stamped by the kernel,
there are many large measurements that would jolt a stable coordinate
system. In addition, as the subgraph shows, the deviations from the
baseline measurement are not clustered all at one time, but occur
throughout the trace; they do not signal shifts, but aberrations.
Because kernel-level measurements would also need to be filtered and
do not have the benefit of piggybacking on application-level traffic,
we decided to find a way to incorporate high-variance samples into the
NC computation.
Although we estimate that approximately of links are of the type
shown in Figure 3, a small percentage do exhibit
multi-modal behavior. If multi-modal behavior occurred on a short time
scale, it would be unclear what value would be appropriate to feed
into the NC update algorithm; it might require a more
complicated link description (e.g., a PDF). However, we have not seen
this behavior in practice. Instead, we have found cases of long-term,
periodic bi-modal link latencies, as shown in
Figure 4 (ntu.edu.tw to
6planetlab.edu.cn). That this behavior is long-term is
important for two reasons: (1) it appears reasonable to summarize each
link with a single current baseline latency and (2) NCs need
continuous maintenance because this baseline latency changes over
time.
Statistical Filtering. We explored three obvious but ineffective approaches before arriving at our final solution for latency signal extraction. First, we tried using simple thresholds: if an observation is larger than a constant, it is ignored. This did not work because one link's normal latency was well into the range of the tail of another link. Second, we applied an exponentially-weighted moving average (EWMA) to each link's sample history. We found that this performed worse than no filter at all because it weighted outliers too strongly, even with unusually low weights. Finally, we tried a more Vivaldi-specific solution: lowering confidence in response to high load. However, because sample variance can only partially be attributed to load, this solution was also not effective.
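To make the EWMA failure mode concrete, the following sketch (with hypothetical sample values and a deliberately low weight) shows how a single multi-second outlier can still pull the estimate hundreds of milliseconds away from a stable 50 ms baseline:

```python
def ewma(samples, alpha=0.05):
    """Exponentially-weighted moving average over a latency trace (ms).

    alpha is the weight given to each new sample; even an unusually
    low alpha lets a single large outlier dominate the estimate.
    """
    est = samples[0]
    for s in samples[1:]:
        est = alpha * s + (1 - alpha) * est
    return est

# A steady 50 ms link with one 10-second outlier near the end:
trace = [50.0] * 20 + [10000.0] + [50.0] * 5
print(ewma(trace))  # still hundreds of ms above the 50 ms baseline
```

Because the outlier's magnitude dwarfs the baseline, lowering the weight only slows, rather than prevents, the distortion, which matches the behavior described above.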
Instead, we found that a non-linear moving percentile (MP) filter greatly improved accuracy and stability. The MP filter takes two parameters: a window size of samples and the percentile of these samples to output. It removes noise and, based on the window size, responds to changes in the underlying signal. Before presenting our experimental results, we introduce a technique layered on top of the filtered raw coordinate.
Application-Level NCs. Our latency service makes a distinction between system-level and application-level NCs. The former are raw Vivaldi coordinates, which are updated with each observation. The latter are the application's idea of the local NC, updated only when a statistically significant change in the system-level NC has occurred. While some applications may want to access the raw value, many others prefer updates when the system-level NC exhibits sustained change compared to its past.
We found two successful heuristics for setting application-level NCs, both based on a change detection algorithm that uses sliding windows [7]: RELATIVE and ENERGY. Both compare a current window of coordinates to a window starting at the most recent application-level coordinate update. RELATIVE compares the two windows based on the amount of change relative to the nearest known neighbor. ENERGY compares them based on a statistical test that measures the Euclidean distance between two multidimensional distributions. Both update the application-level coordinate to the centroid of a recent window of coordinates and both heuristics allow the raw coordinate to ``float'' in a given region. As long as the coordinate does not leave the region and other major changes in the network do not occur, application updates will be suppressed. Application-level NCs increase stability without decreasing accuracy.
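A simplified sketch of the application-level update mechanism follows. The fixed-distance threshold here is a stand-in for the RELATIVE and ENERGY tests described above (whose exact statistics are not reproduced); the class name and parameters are hypothetical:

```python
import math

def centroid(coords):
    """Component-wise mean of a list of equal-dimension coordinates."""
    dims = len(coords[0])
    return tuple(sum(c[d] for c in coords) / len(coords) for d in range(dims))

def dist(a, b):
    """Euclidean distance between two coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class AppCoordinate:
    """Application-level NC: exposes a stable coordinate and updates it
    only when recent system-level coordinates drift beyond a threshold
    (a stand-in for the RELATIVE/ENERGY change-detection tests)."""

    def __init__(self, window=8, threshold_ms=10.0):
        self.window = window
        self.threshold = threshold_ms
        self.recent = []
        self.app_coord = None

    def observe(self, sys_coord):
        self.recent.append(sys_coord)
        self.recent = self.recent[-self.window:]
        c = centroid(self.recent)
        if self.app_coord is None or dist(c, self.app_coord) > self.threshold:
            self.app_coord = c   # sustained change: publish a new coordinate
        return self.app_coord    # otherwise the raw coordinate "floats"
```

As in the text, the application-level coordinate is set to the centroid of a recent window, and updates are suppressed while the raw coordinate stays within its region.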
Accuracy Results. Experimentally, we determined that a low percentile (e.g., )
and a window size of
-
samples or larger gives good results
with the latency samples seen on PlanetLab. Links are moderately
consistent: most follow the pattern seen in
Figure 3, but about
that in
Figure 4. Windows that are too large
suppress network changes that should be reflected in the NCs: short
windows are more effective than long ones, keeping the required state
low.
Figure 5 shows results from using the MP filter
and the ENERGY heuristic with
PlanetLab nodes. It shows
that with the MP filter only
of the nodes experience a
percentile relative error greater than one, while
of
those without the filter do. The enhancements combine to reduce the
median of the
percentile relative error by
.
In this experiment we measure accuracy as the coordinate's ability to
predict the next sample along that link. For each observation, we
compute the relative error, that is, the difference between the
predicted and actual latency, divided by the actual latency. Each node
then has a collection of relative errors from its samples; the figure
shows the percentile out of this distribution, collected for
the second half of a
hour run.
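The per-node accuracy computation described above can be sketched as follows; the percentile reported in the figure is left as a parameter here, and the sample values are hypothetical:

```python
def relative_errors(predictions, actuals):
    """Relative error of each latency prediction: |predicted - actual| / actual."""
    return [abs(p - a) / a for p, a in zip(predictions, actuals)]

def percentile(values, pct):
    """Value at the given percentile of a distribution of relative errors."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]

# hypothetical predictions against three observed 50 ms samples
errs = relative_errors([60.0, 55.0, 40.0], [50.0, 50.0, 50.0])
# errs == [0.2, 0.1, 0.2]
```

Each node accumulates such a list of relative errors over its links and reports a single high percentile of that distribution.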
Defining accuracy as relative error produces a low-level metric that
may not sufficiently capture application impact. Recently, Lua et al.
proposed relative rank loss (rrl) to calculate how
well coordinates capture the relative ordering of (all) pairs of
neighbors [10]. Thus, for each node and each pair of its neighbors,
rrl captures whether the distances between coordinates have led to an
incorrect prediction of the relative latencies, presumably inducing an
application-level penalty due to the wrong preference of a farther
node. While rrl quantifies the probability of incorrect
rankings, we wanted a metric that captures the magnitude of
each rank mis-ordering as well.
For some applications, choosing the absolute nearest neighbor is
important; however, often the extent of the error should be penalized:
an error of
is less severe than one of
. Weighted rrl (wrrl) captures this by
taking the sum of the latency penalties
of pairs ranked
incorrectly, normalized over all possible latency penalties.
However, wrrl does not express the percentage of latency lost
that an application will notice when using NCs. To approximate this
quantity, we sum the relative latency penalty
for all
pairs that are incorrectly ranked; we call this third metric the
relative application latency penalty (ralp).
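A sketch of all three ranking metrics for a single node's neighbor set follows. The exact normalizations are assumptions reconstructed from the definitions above (in particular, the relative penalty in ralp is taken against the nearer node's latency), so this should be read as an illustration rather than the paper's implementation:

```python
import itertools

def rank_metrics(coord_dist, true_lat, nodes):
    """Compute rrl, wrrl, and ralp for one node's neighbor set.

    coord_dist[n]: predicted latency to neighbor n (coordinate distance)
    true_lat[n]:   "true" latency to neighbor n (e.g., median of samples)
    """
    wrong = 0          # pairs whose relative order is predicted incorrectly
    total = 0          # all pairs of neighbors
    penalty = 0.0      # summed latency penalty of incorrectly ranked pairs
    max_penalty = 0.0  # summed penalty over all pairs (wrrl normalizer)
    ralp = 0.0         # summed relative latency penalty
    for a, b in itertools.combinations(nodes, 2):
        total += 1
        gap = abs(true_lat[a] - true_lat[b])
        max_penalty += gap
        # incorrect ranking: the coordinates prefer the truly farther node
        if (coord_dist[a] - coord_dist[b]) * (true_lat[a] - true_lat[b]) < 0:
            wrong += 1
            penalty += gap
            ralp += gap / min(true_lat[a], true_lat[b])
    return wrong / total, penalty / max_penalty, ralp
```

rrl is thus a probability, wrrl weights each mis-ordering by its latency gap, and ralp expresses the extra latency as a fraction of what the correct choice would have cost.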
We illustrate how the MP filter affects these three metrics in Figure
6. The top graph portrays that while the
probability of incorrect rankings (rrl) can range up to almost
for the worst case node, the latency penalty due to incorrectly
ranked neighbors (wrrl) is
of the maximum in the worst
case. The median ralp metric is
when using raw latency
inputs, improving to
with the filter. We computed the ``true''
latency between nodes as the median for that link. In summary, our
results indicate that the MP filter improves NC accuracy on PlanetLab
for application-oriented operations, such as node ranking.
Stability Results. Unstable coordinates are problematic. Consider the situation where a node's coordinate is moving in a circle compared to using the centroid of that circle. If one is using the coordinate for a one-time decision (e.g., finding the nearest node to initialize a Pastry routing table or finding a nearby web cache), unstable coordinates make a good decision less likely because it depends on the particular time the coordinates are compared. When coordinates are used for periodic decisions (e.g., a proximity-based routing table update or re-positioning processing operators in a stream-based overlay), changing them may involve application-level work; unstable coordinates will induce updates based simply on their instability, not any fundamental change in relative node positions.
We measure stability as the amount of change in coordinates per unit
time in ms/sec. This captures the amount of oscillation around
a particular coordinate. Both the MP filter and application-level coordinates
serve to suppress insignificant change.
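The instability metric just described can be computed directly from a time-stamped coordinate trace; the sketch below assumes coordinates are tuples in ms and timestamps in seconds:

```python
import math

def instability(coord_trace):
    """Coordinate change per unit time (ms/sec): total Euclidean
    displacement divided by the elapsed time.

    coord_trace: time-ordered list of (timestamp_sec, coordinate) pairs.
    """
    moved = 0.0
    for (t0, c0), (t1, c1) in zip(coord_trace, coord_trace[1:]):
        moved += math.sqrt(sum((x - y) ** 2 for x, y in zip(c0, c1)))
    elapsed = coord_trace[-1][0] - coord_trace[0][0]
    return moved / elapsed
```

A node whose coordinate oscillates accumulates displacement without settling, so this metric directly captures the oscillation that the MP filter and application-level coordinates are meant to suppress.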
As shown in Figure 7, ENERGY dampens the
filter's updates: of the time it falls below even the minimum
instability of the raw filter. Combined, the median instability is
reduced by
.
More detail on the MP filter and on the application-update heuristics
can be found in our technical report [9].