
Cross-site Correlation of DNS Failures


The "insurance" model depends on failures being relatively uncorrelated: the system must always have a sufficient pool of working participants to help those having trouble. If failures across sites are correlated, this assumption is violated, and a cooperative lookup scheme is less feasible. To test our assumption, we study the correlation of DNS lookup failures across PlanetLab. Every minute, we record how many nodes have "healthy" DNS performance. We define a node as healthy if it shows no failures for one minute in the local domain name lookup test. Using the per-minute data for March 2004, we show the minimum, average and maximum number of nodes available per hour. The percentage of healthy nodes (as a fraction of live nodes) is shown in Figure 9.

Figure 9: Hourly % of nodes with working nameservers
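
The hourly aggregation described above can be sketched as follows. This is a minimal illustration, not the measurement code used for the paper: the function name `hourly_health` and the `(minute_index, healthy_nodes, live_nodes)` record layout are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def hourly_health(samples):
    """Summarise per-minute samples into per-hour healthy percentages.

    `samples` is an iterable of (minute_index, healthy_nodes, live_nodes)
    tuples (a hypothetical layout), where minute_index counts minutes from
    the start of the trace. Returns {hour: (min_pct, avg_pct, max_pct)}.
    """
    by_hour = defaultdict(list)
    for minute, healthy, live in samples:
        if live <= 0:
            # No live nodes in this minute: the percentage is undefined.
            continue
        # Healthy nodes as a percentage of live nodes, as in Figure 9.
        by_hour[minute // 60].append(100.0 * healthy / live)
    return {
        hour: (min(pcts), mean(pcts), max(pcts))
        for hour, pcts in sorted(by_hour.items())
    }
```

A downward spike in the per-hour minimum that is not matched by the average would indicate a short-lived failure affecting many nodes at once.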

From this graph, we can see some minor correlation in failures, shown as downward spikes in the percentage of available nodes, but the variation in availability otherwise appears largely uncorrelated. An investigation into the spikes reveals that many PlanetLab nodes are configured to use the same set of nameservers, especially those colocated at Internet2 backbone facilities (not to be confused with Internet2-connected university sites). When these shared nameservers experience problems, the correlation appears large because of the number of nodes affected.

More important, however, is the observation that the fraction of healthy nameservers is always high, generally above 90%. This observation provides the key insight for CoDNS: with enough healthy nameservers, we can mask locally-observed delays via cooperation.
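
As a rough illustration of why a high healthy fraction matters, the sketch below computes the chance that every one of several peers is unhealthy at the same moment. It assumes that peers fail independently, which is an idealization (the spikes in Figure 9 show that it does not always hold). The function name and the example values are illustrative and are not taken from CoDNS itself.

```python
def prob_all_peers_unhealthy(healthy_fraction, peers):
    """Probability that all `peers` nodes are unhealthy at once.

    Assumes each peer is healthy with probability `healthy_fraction`,
    independently of the others (an assumption of this sketch).
    """
    if not 0.0 <= healthy_fraction <= 1.0:
        raise ValueError("healthy_fraction must be between 0 and 1")
    if peers < 0:
        raise ValueError("peers must be non-negative")
    return (1.0 - healthy_fraction) ** peers
```

Under these assumptions, with a healthy fraction of 0.9, the chance that three peers are all unhealthy together is about one in a thousand, so even a small pool of peers leaves a working nameserver available most of the time.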

To ensure that these failures are not tied to any specific nameserver software, we survey the software running on the local nameservers used by the PlanetLab nodes (135 unique nameservers) with "chaos" class queries [14]. We find that they are mostly running a variety of BIND versions. We observe 11 different BIND 9 version strings, 13 different BIND 8 version strings and a number of humorous strings (included in "other") apparently set by the nameserver administrators. These measurements, shown in Table 2, are in line with two recent nameserver surveys conducted by Brad Knowles in 2002 [11] and by packetfactory in 2003 [19]. From this, we conclude that the failures are not likely to be specific to PlanetLab's choices of nameserver software.

Table 2: Comparison of nameserver software used by PlanetLab, packetfactory survey and the TLD survey

Software         PlanetLab   Packetfactory   TLD
BIND-4.9.3+/8    31.1%       36.4%           55.9%
BIND 9           48.9%       25.1%           34.0%
Other            20.0%       38.5%           10.1%
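
A version survey of this kind can be sketched with the Python standard library alone. The sketch assumes the conventional `version.bind` TXT query in the CHAOS class; the text above says only that "chaos" class queries were used, so the query name, the helper names, and the prefix-based `classify` heuristic are all assumptions made for illustration.

```python
import socket
import struct

TYPE_TXT = 16  # DNS TXT record type
CLASS_CH = 3   # DNS CHAOS class


def encode_name(name):
    """Encode a dotted name as length-prefixed DNS labels."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_version_query(query_id=0x1234, name="version.bind"):
    """Build a one-question DNS query: TXT record, CHAOS class."""
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", TYPE_TXT, CLASS_CH)


def skip_name(packet, offset):
    """Return the offset just past a (possibly compressed) name."""
    while True:
        length = packet[offset]
        if (length & 0xC0) == 0xC0:  # two-byte compression pointer
            return offset + 2
        offset += 1
        if length == 0:              # root label ends the name
            return offset
        offset += length


def parse_version_reply(packet):
    """Return the first CHAOS TXT string in a reply, or None."""
    _id, _flags, qdcount, ancount, _ns, _ar = struct.unpack("!HHHHHH", packet[:12])
    offset = 12
    for _ in range(qdcount):
        offset = skip_name(packet, offset) + 4  # skip QTYPE and QCLASS
    for _ in range(ancount):
        offset = skip_name(packet, offset)
        rtype, rclass, _ttl, rdlength = struct.unpack(
            "!HHIH", packet[offset:offset + 10])
        offset += 10
        if rtype == TYPE_TXT and rclass == CLASS_CH:
            txt_len = packet[offset]  # first character-string only
            return packet[offset + 1:offset + 1 + txt_len].decode("ascii", "replace")
        offset += rdlength
    return None


def query_version(server, timeout=2.0):
    """Send the query to `server` over UDP port 53 and parse the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_version_query(), (server, 53))
        reply, _addr = sock.recvfrom(4096)
    return parse_version_reply(reply)


def classify(version):
    """Map a version string to a Table 2 row (illustrative heuristic)."""
    text = (version or "").strip().upper()
    if text.startswith("BIND"):
        text = text[4:].lstrip(" -")
    if text.startswith("9."):
        return "BIND 9"
    if text.startswith("8.") or text.startswith("4.9"):
        return "BIND-4.9.3+/8"
    return "Other"
```

Calling `classify(query_version(address))` for each nameserver address and tallying the results would yield a breakdown in the shape of Table 2. Servers that refuse the query or time out would need separate handling, which this sketch omits.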




KyoungSoo Park 2004-10-02