The route control mechanisms we presented and analyzed are a first attempt at understanding how to extract good performance from multiple provider connections in practice. There are clearly a number of ways in which they can be improved, however. Also, we do not address several important issues, such as ISP costs and the interplay of performance and reliability optimization. Below, we briefly discuss some of these potential improvements and issues.
Handling lost probes. In our implementation of the active probing schemes, we send just one probe when collecting a performance sample for an (ISP link, destination) pair. As a result, lost probes, e.g., due to transient congestion or timeouts, may be misinterpreted as poor performance of the provider path to the destination. This can in turn cause unwanted changes in the ISP choice for the destination. We can mitigate this by sending a short burst of, say, three probes per (ISP link, destination) pair. The performance reported across all three probes can then be used to estimate the quality of the ISP link, perhaps with a weighting to account for any observed losses.
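For concreteness, one way to combine such a burst into a single quality score is sketched below. The loss-penalty weight and the representation of lost probes are illustrative assumptions, not part of our implementation.

```python
# Sketch: estimate (ISP link, destination) quality from a short probe burst.
# RTT samples are in milliseconds; a lost probe is recorded as None.
# LOSS_PENALTY_MS is an illustrative parameter that inflates the score so
# that links dropping probes rank worse than loss-free ones, rather than
# having losses mistaken for outright path failure.

LOSS_PENALTY_MS = 500.0  # assumed cost assigned to each lost probe

def link_quality(rtt_samples):
    """Return a single score (lower is better) for one probe burst."""
    received = [r for r in rtt_samples if r is not None]
    lost = len(rtt_samples) - len(received)
    if not received:                      # all probes lost: worst score
        return float("inf")
    avg_rtt = sum(received) / len(received)
    # Weight in observed losses instead of ignoring them.
    return avg_rtt + lost * LOSS_PENALTY_MS / len(rtt_samples)
```

With this scoring, a single lost probe out of three raises the link's score moderately instead of triggering an immediate provider switch.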
Hybrid passive and active measurements. The accuracy of passive measurement can be improved by sending an active probe immediately after a passively observed failure, for example when a monitored connection ends unexpectedly. This increases confidence that the failed connection is due to a problem with the provider link, as opposed to a transient effect.
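The hybrid check can be sketched as follows; `send_probe` is a hypothetical helper, assumed to send one active probe along the same (ISP link, destination) path and return whether a reply arrived.

```python
# Sketch: confirm a passively observed connection failure with an
# immediate active probe before blaming the provider link.
# send_probe is a hypothetical callable returning True if a reply arrived.

def classify_failure(conn_failed, send_probe):
    """Distinguish a provider-link problem from a transient effect."""
    if not conn_failed:
        return "ok"
    # The connection ended unexpectedly; probe the same path right away.
    if send_probe():
        return "transient"        # probe succeeded: likely a one-off glitch
    return "link-problem"         # probe also failed: suspect the provider
```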
In our implementation, paths to less popular destinations are not explicitly monitored (in both the active and passive schemes). As a result, we may have to rely on passive observations of transfers to unpopular destinations to ensure quick fail-over. For example, whenever the proxy observes a number of failures on connections to an unpopular destination, it can immediately switch the destination's default provider to one of the remaining two providers for future transfers.
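Such a fail-over rule might look like the sketch below. The failure threshold and provider names are illustrative assumptions; the policy of switching to the first remaining provider is one simple choice among several.

```python
# Sketch: passive fail-over for unmonitored (unpopular) destinations.
# After FAILURE_THRESHOLD consecutive connection failures through the
# current default provider, switch the destination to another provider.
# The threshold value is illustrative, not taken from our evaluation.

FAILURE_THRESHOLD = 3

class FailoverTable:
    def __init__(self, providers):
        self.providers = providers        # e.g. ["isp1", "isp2", "isp3"]
        self.default = {}                 # destination -> current provider
        self.failures = {}                # destination -> failure count

    def provider_for(self, dest):
        return self.default.setdefault(dest, self.providers[0])

    def report(self, dest, success):
        if success:
            self.failures[dest] = 0
            return
        self.failures[dest] = self.failures.get(dest, 0) + 1
        if self.failures[dest] >= FAILURE_THRESHOLD:
            # Switch to one of the remaining providers for future transfers.
            current = self.provider_for(dest)
            others = [p for p in self.providers if p != current]
            self.default[dest] = others[0]
            self.failures[dest] = 0
```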
Balancing performance and resilience. The goal of most current multihoming deployments is to provide resilient connectivity in the face of network failures. Hence, one of the main functions of a route control product is to respond quickly to ISP failures. One of our findings is that even with a relatively long sampling interval, the performance advantages of multihoming can be realized. A long interval can also slow the end-network's reaction to path failures, however. This can be addressed by sampling each destination with a sufficiently high frequency, while still keeping the probing overhead low. For example, a sampling interval of 60s with active measurement works well in such cases, providing reasonably low overhead and good performance (Figure 11(b)), while ensuring a failover time of about one minute.
ISP pricing structures. In our study, we ignore issues relating to the cost of the provider links. Different ISP connections may have very different pricing policies. One may charge a flat rate up to some committed rate, while another may use purely usage-based pricing or charge differently depending on whether the destination is ``on-net'' or ``off-net.'' Though we do not consider how to optimize overall bandwidth costs, our evaluation of active and passive monitoring, and of the utility of history, is central to more general schemes that optimize for both cost and performance.
Long-lived TCP flows. In our route control schemes, an update to a NAT entry for a destination in the midst of an ongoing transfer involving that destination could cause the transfer to fail (due to the change in source IP address). We did not observe many failed connections in our experiments, and most of the flows were very short. Nevertheless, this effect is likely to have a pronounced impact on the performance of long-lived flows. It is possible to address this problem by delaying updates to the NAT table until after ongoing large transfers complete. However, this increases the complexity of the implementation, since it involves identifying flow lengths and checking for the existence of other long-lived flows at the time of the update. It may also force short flows to the same destination to traverse sub-optimal ISPs while the NAT update is delayed.
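The deferred-update idea can be sketched as follows. The byte cutoff for a ``large'' transfer and the per-destination flow tracking are simplifying assumptions made for illustration.

```python
# Sketch: defer a NAT-table update for a destination while a long-lived
# flow to it is still active, so that the transfer's source IP address
# is not changed mid-stream. Flow-size tracking here is a simplification.

LONG_FLOW_BYTES = 1_000_000   # illustrative cutoff for a "large transfer"

class NatTable:
    def __init__(self):
        self.mapping = {}      # destination -> provider (source IP choice)
        self.pending = {}      # deferred updates awaiting flow completion
        self.active_bytes = {} # destination -> size of largest active flow

    def flow_progress(self, dest, nbytes):
        # Track the largest active flow's size to this destination.
        self.active_bytes[dest] = max(self.active_bytes.get(dest, 0), nbytes)

    def request_update(self, dest, provider):
        if self.active_bytes.get(dest, 0) >= LONG_FLOW_BYTES:
            self.pending[dest] = provider   # delay: long flow in progress
        else:
            self.mapping[dest] = provider   # safe to switch immediately

    def flow_finished(self, dest):
        self.active_bytes[dest] = 0
        if dest in self.pending:            # apply the deferred update now
            self.mapping[dest] = self.pending.pop(dest)
```

Note that, as discussed above, short flows arriving while an update is pending would still use the old (possibly sub-optimal) provider until the long transfer completes.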
Issues for further study. We do not address the impact that announcements of small address sub-blocks to different upstream ISPs (Section 2.3) have on the inflation of the routing table size in the core of the network. We also do not consider the potential interactions that arise when many enterprises deploy intelligent route control, each to optimize its own multihomed connectivity. This will likely have an effect on the marginal benefits of the route control solutions themselves, and on the network as a whole. We leave these issues for future consideration.
Our implementation primarily considered handling connections initiated from within the enterprise, as these are common for current enterprise applications (e.g., to contact content providers). A route control product must also handle connections from outside clients, however, to enable optimized access to servers hosted in the enterprise network. Next, we describe some preliminary measurements regarding the usefulness of DNS for externally-initiated connections.