
CoDNS


The main idea behind CoDNS is to forward name lookup queries to peer nodes when the local name service is experiencing a problem. Essentially, this strategy applies a CDN approach to DNS: spreading the load among peers improves the size and performance of the "global cache". Many of the considerations in CDN systems apply in this environment. We need to consider the proximity and availability of a node as well as the locality of the queries. An additional consideration is deciding when it is desirable to send remote queries. Because most name lookups are fast at the local nameserver, simply spreading the requests to peers might generate unnecessary traffic with no gain in latency. Worse, the extra load may cause marginal DNS nameservers to become overloaded. We investigate considerations for deciding when to send remote queries, how many peers to involve, and what sorts of gains to expect.

Precisely determining the effects of locality, load, and proximity is difficult, since we have no control over the nameservers and have little information about their workloads, configurations, etc. The proximity of a peer server is important because the response time of a remote lookup includes the round-trip latency to that peer. Since DNS requests and responses are not large, we are more interested in picking nearby peers with low round-trip latency than nodes with particularly high bandwidth. We have observed coast-to-coast round-trip ping times of 80ms in CoDeeN, with regional times in the 20ms range. Both of these ranges are much lower than the DNS timeout value of five seconds, so, in theory, any node would be an acceptable peer. In practice, choosing closer peers will reduce the difference between cache hit times and remote peer times, making CoDNS failure masking more transparent. For request locality, we would like to increase the chances of remote queries being cache hits in the remote nameservers. Using any scheme that consistently partitions this workload will help reduce cache pollution and increase the likelihood of cache hits.
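As an illustration of these two criteria, the following minimal sketch first filters peers by round-trip latency and then assigns each hostname to peers with a consistent hash. Highest-random-weight (rendezvous) hashing is used here only as one example of a scheme that consistently partitions the workload; the function names, the 100ms cutoff, and the peer names are hypothetical and are not taken from the measurements above.

```python
import hashlib


def hrw_rank(hostname, peers):
    """Order peers by a highest-random-weight (rendezvous) hash of the hostname.

    The same hostname always yields the same ordering, so repeated remote
    queries for one name land on the same peer's nameserver cache.
    """
    def weight(peer):
        digest = hashlib.sha1(f"{peer}|{hostname}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(peers, key=weight, reverse=True)


def pick_peers(hostname, peer_rtts_ms, max_rtt_ms=100.0, count=1):
    """Pick up to `count` nearby peers for `hostname`.

    peer_rtts_ms maps a peer name to its measured round-trip time in ms.
    max_rtt_ms is an assumed proximity cutoff, not a value from the text.
    """
    nearby = [peer for peer, rtt in peer_rtts_ms.items() if rtt <= max_rtt_ms]
    return hrw_rank(hostname, nearby)[:count]


if __name__ == "__main__":
    # Hypothetical peers: two regional, one distant.
    rtts = {"peer-a.example": 20.0, "peer-b.example": 25.0, "peer-c.example": 300.0}
    print(pick_peers("www.example.org", rtts))
```

A property of this kind of hashing is that removing one peer only remaps the hostnames that were assigned to it, which limits cache pollution when the peer set changes.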

To understand the relationship between CoDNS response times, the number of peers involved, and the policies for determining when requests should be sent remotely, we collected 44,486 unique hostnames from one day's HTTP traffic on CoDeeN and simulated various policies and their effects. We replayed DNS lookups of those names at 77 PlanetLab nodes with different nameservers, starting requests at the same time of day as in the original logs. The replay took place one month after the data collection to avoid hits in local nameserver caches, which could skew the data. During the replay, we also used application-level heartbeat measurements between all pairs of nodes to determine their round-trip latencies. Since all of the nodes perform the DNS lookups at about the same time, adding the response time at peerY to the heartbeat time from peerX to peerY gives the response time peerX would see if it asked peerY to perform a remote DNS lookup for the same hostname.
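The estimate described above can be sketched as follows. This is a minimal illustration, assuming the replay results are held in two dictionaries; the data layout, function name, and sample values are hypothetical.

```python
def remote_estimates(origin, hostname, rtt_ms, lookup_ms):
    """Estimate the time `origin` would see if it asked each peer for `hostname`.

    rtt_ms maps (origin, peer) to the heartbeat round-trip time in ms.
    lookup_ms maps peer -> {hostname: response time at that peer in ms}.
    The estimate is the heartbeat time plus the peer's own lookup time.
    """
    estimates = {}
    for peer, times in lookup_ms.items():
        if peer == origin:
            continue  # the local lookup is not a remote estimate
        if hostname in times and (origin, peer) in rtt_ms:
            estimates[peer] = rtt_ms[(origin, peer)] + times[hostname]
    return estimates


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    rtt = {("peerX", "peerY"): 20, ("peerX", "peerZ"): 80}
    lookups = {
        "peerX": {"www.example.org": 5000},
        "peerY": {"www.example.org": 30},
        "peerZ": {"www.example.org": 10},
    }
    print(remote_estimates("peerX", "www.example.org", rtt, lookups))
```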

An interesting question is how many simultaneous lookups are needed to achieve a given average response time and to reduce the total time spent on slow lookups (defined as those taking more than 1 second). As shown in the previous section, it is desirable to reduce the number of slow responses to reduce the total lookup time. Figures 10 and 11 show two graphs answering this question. The lookup scheme here is to contact the local nameserver first for a name lookup, wait for an initial timeout, and then issue x-1 simultaneous lookups using x-1 randomly selected peer nodes. Figure 10 shows that even with only one extra lookup, we can reduce the average response time by more than half. Also, beyond about five peers, adding more simultaneous lookups produces diminishing returns. Different initial timeout values do not produce much difference in response times, because the benefit largely stems from reducing the number of slow lookups. The slow response portion graph confirms this, showing a similar reduction in the slow response percentage at any initial timeout less than 700ms.
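The lookup scheme above can be replayed over recorded times with a short simulation. This is a sketch under stated assumptions, not the simulator used for the figures: it assumes the local answer is still accepted if it arrives before any peer answers, and it reads the "slow response time portion" as the share of total lookup time spent in slow lookups. All names and sample values are hypothetical.

```python
import random

SLOW_MS = 1000  # the text defines slow lookups as taking more than 1 second


def codns_response_ms(local_ms, remote_ms, initial_timeout_ms):
    """Response time for one name under the local-first policy.

    remote_ms holds the estimated times through the chosen peers,
    measured from the moment the remote queries are issued.
    """
    if local_ms <= initial_timeout_ms or not remote_ms:
        return local_ms  # local answer arrived in time, or no peers were used
    # Remote queries start only after the initial timeout has elapsed.
    return min(local_ms, initial_timeout_ms + min(remote_ms))


def simulate(local_times, remote_times, extra_peers, initial_timeout_ms, seed=0):
    """Return (average response time, share of time spent in slow lookups)."""
    rng = random.Random(seed)
    results = []
    for local_ms, candidates in zip(local_times, remote_times):
        chosen = rng.sample(candidates, min(extra_peers, len(candidates)))
        results.append(codns_response_ms(local_ms, chosen, initial_timeout_ms))
    total = sum(results)
    slow_total = sum(r for r in results if r > SLOW_MS)
    return total / len(results), (slow_total / total if total else 0.0)


if __name__ == "__main__":
    local = [10, 5000]        # one fast and one slow local lookup (made up)
    remote = [[50], [50]]     # estimated time through the one available peer
    print(simulate(local, remote, extra_peers=0, initial_timeout_ms=200))
    print(simulate(local, remote, extra_peers=1, initial_timeout_ms=200))
```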

Figure 10: Average Response Time


Figure 11: Slow Response Time Portion


Figure 12: Extra DNS Lookups

We must also consider the extra overhead of the simultaneous lookups, since shorter initial timeouts and more simultaneous lookups cause more DNS traffic at all peers. Figure 12 shows the overhead in terms of extra lookups needed for various scenarios. Most curves start to flatten at a 500ms initial timeout, so larger timeouts provide only diminishing returns. Notably, even with one peer and a 200ms initial timeout, we can still cut the average response time by more than half, with only 38% extra DNS lookups.
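The overhead metric can be sketched as a simple count. This minimal example assumes that every local lookup exceeding the initial timeout triggers exactly one remote lookup per extra peer, and it expresses the result as a percentage of the original lookups; the function name and the sample times are hypothetical.

```python
def extra_lookup_percent(local_times_ms, initial_timeout_ms, extra_peers):
    """Extra remote lookups as a percentage of the original local lookups.

    Assumes each local lookup slower than the initial timeout sends one
    query to each of `extra_peers` peers.
    """
    if not local_times_ms:
        return 0.0
    triggered = sum(1 for t in local_times_ms if t > initial_timeout_ms)
    return 100.0 * extra_peers * triggered / len(local_times_ms)


if __name__ == "__main__":
    times = [50, 150, 300, 900]  # made-up local response times in ms
    for timeout in (200, 500):
        print(timeout, extra_lookup_percent(times, timeout, extra_peers=1))
```

Under this model, a longer initial timeout lowers the overhead because fewer local lookups trigger remote queries, while each added peer scales the overhead linearly.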

These results are very encouraging, demonstrating that CoDNS can be effective even at very small scale: even a single extra site provides significant benefits, and it achieves most of its benefits with fewer than 10 sites. This small scale is important for two reasons: only small commitments are required to try a CoDNS deployment, and DNS's limitations with respect to trust and verification (discussed in the next section) are unlikely to be an issue at these scales.



KyoungSoo Park 2004-10-02