Reliability and Monitoring


Unlike commercial CDNs, CoDeeN does not operate on dedicated nodes with reliable resources, nor does it employ a centralized Network Operations Center (NOC) to collect and distribute status information. CoDeeN runs on all academic PlanetLab sites in North America, and, as a result, shares resources with other experiments. Such sharing can lead to resource exhaustion (disk space, global file table entries, physical memory) as well as contention (network bandwidth, CPU cycles). In such cases, a CoDeeN instance may be unable to service requests, which would normally lead to overall service degradation or failure. Therefore, to maintain reliable and smooth operation, each CoDeeN instance monitors system health and provides this data to its local request redirector.
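
The sketch below illustrates what such a local health probe might look like in outline. The specific resources checked and the thresholds used (spool directory, free-space floor, load-average cutoff) are illustrative assumptions, not CoDeeN's actual checks.

    # Illustrative sketch only: a periodic local health probe whose results
    # a redirector could consult. Thresholds and probe choices are
    # assumptions, not CoDeeN's implementation.
    import os
    import shutil
    import socket

    def local_health_report(spool_dir="/tmp", min_free_bytes=100 * 1024 * 1024):
        report = {}

        # Disk space: a nearly full cache/spool partition means cached
        # objects and logs can no longer be written.
        usage = shutil.disk_usage(spool_dir)
        report["disk_ok"] = usage.free >= min_free_bytes

        # File descriptors: if a new socket cannot be opened, the proxy
        # cannot accept or forward requests at all.
        try:
            probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            probe.close()
            report["fd_ok"] = True
        except OSError:
            report["fd_ok"] = False

        # Load average as a rough stand-in for CPU contention with
        # co-located experiments on the same node.
        load1, _, _ = os.getloadavg()
        report["load_ok"] = load1 < 10.0

        report["healthy"] = all(report.values())
        return report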

In a latency-sensitive environment such as CoDeeN, avoiding problematic nodes, even if they would (eventually) produce a correct result, is preferable to incurring reliability-induced delays. Even a seemingly harmless event such as a TCP SYN retransmission increases user-perceived latency, reducing the system's overall utility. For CoDeeN to operate smoothly, our distributed redirectors need to continually know the state of other proxies and decide which reverse proxies should be used for request redirection. In practice, this entails first finding a healthy subset of the proxies and then letting the redirection strategy choose the best among them. As a result, CoDeeN includes significant node health monitoring facilities, much of which is not specific to CoDeeN and can be used in other latency-sensitive peer-to-peer environments.
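
A minimal sketch of this two-step selection follows: first restrict attention to peers currently believed healthy, then let a redirection strategy pick among them. The hash-based tie-breaking shown here is only an illustrative placeholder for the redirection strategy, not CoDeeN's actual policy.

    # Sketch: health filtering followed by a placeholder redirection strategy.
    import hashlib

    def choose_reverse_proxy(url, peers, health):
        """peers: list of peer node names; health: dict mapping peer -> bool."""
        candidates = [p for p in peers if health.get(p, False)]
        if not candidates:
            return None  # no healthy peer known; serve the request locally

        # Deterministically map the URL to one healthy peer so that repeated
        # requests for the same object tend to reach the same cache.
        def score(peer):
            digest = hashlib.sha1((peer + url).encode()).hexdigest()
            return int(digest, 16)

        return max(candidates, key=score)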

Two alternatives to active monitoring and avoidance, retry/failover and multiple simultaneous requests, are not appropriate for this environment. Retrying failed requests requires that a failure has already occurred, which implies added latency before the retry. We have observed failures where the outbound connection from the reverse proxy makes no progress; in this situation, the forward proxy has no information on whether the request has been sent to the origin server. The same problem explains why multiple simultaneous requests are not used: the idempotency of an HTTP request cannot be determined a priori. Some requests, such as queries with a question mark in the URL, are generally assumed to be non-idempotent and uncacheable. However, the CGI mechanism also allows the query portion of the request to be concatenated to the URL like any other URL component. For example, the URL ``/directory/program/query'' may also be represented as ``/directory/program?query''. As a result, sending multiple parallel requests and waiting for the fastest answer can cause errors.
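
The small example below, using the URL forms from the paragraph above, shows why a syntax-only guess at idempotency is unreliable: a naive rule keyed on the presence of a query string classifies the two equivalent forms differently, so duplicating the second form in parallel could trigger a side effect twice. The helper function is purely hypothetical.

    # Illustrative only: a naive syntax-based idempotency guess and its blind spot.
    def looks_idempotent(url):
        # Naive rule: treat anything with an explicit query string as
        # non-idempotent, and everything else as safe to duplicate.
        return "?" not in url

    # Both forms may invoke the same CGI program with the same argument,
    # yet the naive rule classifies them differently.
    print(looks_idempotent("/directory/program?query"))   # False
    print(looks_idempotent("/directory/program/query"))   # True (misleading)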

The success of distributed monitoring in avoiding problems depends on how the duration of service failures compares with the monitoring interval. Our measurements indicate that most failures in CoDeeN last much longer than the monitoring interval, and that short failures, while numerous, can be avoided by maintaining a recent history of peer nodes. The research challenge is to devise effective distributed monitoring facilities that help avoid service disruption and improve system response latency. Our design uses heartbeat messages combined with other tests to estimate which other nodes are healthy and therefore worth using.
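
The following sketch shows one way heartbeat results and a short per-peer history might be combined into a health estimate, under the assumption stated above that most failures outlast the probing interval. The window size and success threshold are illustrative parameters, not CoDeeN's actual settings.

    # Sketch of heartbeat bookkeeping with a short history per peer.
    import time
    from collections import deque

    HISTORY = 8            # remember the last 8 probe results per peer
    REQUIRED_RECENT = 6    # require most recent probes to have succeeded

    class PeerHealth:
        def __init__(self):
            self.results = {}  # peer -> deque of (timestamp, succeeded) pairs

        def record_probe(self, peer, succeeded):
            history = self.results.setdefault(peer, deque(maxlen=HISTORY))
            history.append((time.time(), succeeded))

        def is_healthy(self, peer):
            history = self.results.get(peer)
            if not history:
                return False  # never heard from this peer; avoid it
            # A peer is considered usable only if it has answered most of
            # its recent heartbeats; keeping a brief history filters out
            # short-lived failures without waiting for them on the request
            # path.
            recent_successes = sum(1 for _, ok in history if ok)
            return recent_successes >= REQUIRED_RECENT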



