Scale and Performance in the CoBlitz Large-File Distribution Service
KyoungSoo Park and Vivek S. Pai
Department of Computer Science, Princeton University
Using peer-to-peer systems makes sense when the window of interest in the content is short, or when the content provider cannot afford enough bandwidth or CDN hosting costs. However, in other scenarios, a managed CDN service may be an attractive option, especially for businesses that want to offload their bandwidth but want more predictable performance. The problem arises from the fact that HTTP CDNs have not traditionally handled this kind of traffic, and are not optimized for this workload. In an environment where objects average 10KB, and where whole-file access is dominant, suddenly introducing objects in the range of hundreds of megabytes may have undesirable consequences. For example, CDN nodes commonly cache popular objects in main memory to reduce disk access, so serving several large files at once could evict thousands of small objects, increasing their latency as they are reloaded from disk.
To address this problem, we have developed the CoBlitz large file transfer service, which runs on top of the CoDeeN content distribution network, an HTTP-based CDN. This combination provides several benefits: (a) using CoBlitz to serve large files is as simple as changing its URL - no rehosting, extra copies, or additional protocol support is required; (b) CoBlitz can operate with unmodified clients, servers, and tools like curl or wget, providing greater ease-of-use for users and for developers of other services; (c) obtaining maximum per-client performance does not require multiple clients to be downloading simultaneously; and (d) even after an initial burst of activity, the file stays cached in the CDN, providing latecomers with the cached copy.
From an operational standpoint, this approach of running a large-file transfer service on top of an HTTP content distribution network also has several benefits: (a) given an existing CDN, the changes to support scalable large-file transfer are small; (b) no dedicated resources need to be devoted for the large-file service, allowing it to be practical even if utilization is low or bursty; (c) the algorithmic changes to efficiently support large files also benefit smaller objects.
Over the 18 months that CoBlitz and its partner service, CoDeploy, have been running on PlanetLab, we have had the opportunity to observe its algorithms in practice, and to evolve its design, both to reflect its actual use and to better handle real-world conditions. This utilitarian approach has given us a better understanding of the effects of scale, peering policies, replication behavior, and congestion, yielding new insights into how to improve performance and reliability. With these changes, CoBlitz is able to deliver in excess of 1 Gbps on PlanetLab, and to outperform a range of systems, including research systems as well as BitTorrent.
In this paper, we discuss what we have learned in the process, and how
the observations and feedback from long-term deployment have shaped
our system. We discuss how our algorithms have evolved, both to
improve performance and to cope with the scalability aspects of our
system. Some of these changes stem from observing the real behavior of
the system versus the abstract underpinnings of our original
algorithms, and others from observing how our system operates when
pushed to its limits. We believe that our observations will be useful
for three classes of researchers: (a) those who are considering
deploying scalable large-file transfer services; (b) those trying to
understand how to evaluate the performance of such systems; and (c)
those who are trying to capture salient features of real-world
behavior in order to improve the fidelity of simulators and emulators.
In this environment, serving large files can cause several problems. Loading a large file from disk can temporarily evict several thousand small files from the in-memory cache, reducing the proxy's effectiveness. Popular large files can stay in main memory for long periods, making the effects more pronounced. To get a sense of the performance loss that can occur, one can examine results from the Proxy Cacheoffs (25), which show that the same proxies, when operating as ``Web accelerators,'' can handle 3-6 times the request rate they sustain in ``forward mode,'' where working sets are much larger. So, if a CDN node suddenly starts serving a data set that exceeds its physical memory, its performance drops dramatically and latency rises sharply. Bruce Maggs, Akamai's VP of Research, states:
``Memory pressure is a concern for CDN developers, because for optimal latency, we want to ensure that the tens of thousands of popular objects served by each node stay in the main memory. Especially in environments where caches are deployed inside the ISP, any increase in latency caused by objects being fetched from disk would be a noticeable degradation. In these environments, whole-file caching of large files would be a concern (21).''
Akamai has a service called EdgeSuite Net Storage,
where large files reside in specialized replicated storage, and are
served to clients via overlay routing (1).
We believe that this service demonstrates
that large files are a qualitatively different problem for
CDNs.
We subdivide these systems based on their inter-client communication topology. We term those that rely on greedy selection or all-to-all communication as examples of the swarm approach, while those that use tree-like topologies are termed stream systems.
Swarm systems, such as BitTorrent (12) and FastReplica (9), preceded stream systems, and scaled despite relatively simple topologies. BitTorrent originally used a per-file centralized directory, called a tracker, that lists clients that are downloading or have recently downloaded the file. Clients use this directory to greedily find peers that can provide them with chunks. The newest BitTorrent can operate with tracker information shared by clients. In FastReplica, all clients are known at the start, and each client downloads a unique chunk from the origin. The clients then communicate in an all-to-all fashion to exchange chunks. These systems reduce link stress compared to direct downloads from the origin, but some chunks may traverse shared links repeatedly if multiple clients download them.
The stream systems, such as ESM (10),
SplitStream (8), Bullet (20),
Astrolabe (30), and FatNemo (4) address
the issues of load balancing and link stress by optimizing the
peer-selection process. The result generates a tree-like topology (or
a mesh or gossip-based network inside the tree), which tends to stay
relatively stable during the download process. The effort in
tree-building can produce higher aggregate bandwidths, suitable for
transmitting the content simultaneously to a large number of
receivers. The trade-off, however, is that the higher link utilization
is possible only with greater synchrony. If receivers are only
loosely synchronized and chunks are transmitted repeatedly on some
links, the transmission rate of any subtrees using those nodes also
decreases. As a result, these systems are best suited for synchronous
activity of a specified duration.
This paper discusses our experience running two large-file distribution systems, CoBlitz and CoDeploy, which operate on top of the CoDeeN content distribution network. CoDeeN is an HTTP CDN that runs on every available PlanetLab node, with access restrictions in place to prevent abuse and to comply with hosting site policies. It has been in operation for nearly three years, and currently handles over 25 million requests per day. To use CoDeeN, clients configure their browsers to use a CoDeeN node as a proxy, and all of their Web traffic is then handled by CoDeeN. Note that this proxy-based access is purely a CoDeeN policy decision - CoBlitz does not require changing any browser settings.
Both CoBlitz and CoDeploy use the same infrastructure, which we call CoBlitz in the rest of this paper for simplicity. The main difference between the two is the access mechanism - CoDeploy requires the client to be a PlanetLab machine, while CoBlitz is publicly accessible. CoDeploy was launched first, and allows PlanetLab researchers to use a local instance of CoDeeN to fetch experiment files. CoBlitz gives the public access to the same service through a simpler URL-based interface. To use CoBlitz, clients prepend the original URL with https://coblitz.codeen.org:3125/ and fetch it like any other URL. A customized DNS server maps the name coblitz.codeen.org to a nearby PlanetLab node.
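As a usage illustration, the following minimal sketch prepends the CoBlitz prefix to an original URL and fetches the result with Python's standard library. The example file URL is hypothetical, and the exact form of the rewritten URL (for instance, whether the original scheme is dropped before prepending) is an assumption.

```python
import urllib.request

COBLITZ_PREFIX = "https://coblitz.codeen.org:3125/"

def coblitz_url(original_url):
    """Build a CoBlitz URL by prepending the service prefix.

    We assume the scheme of the original URL is dropped before
    prepending; actual deployments may differ.
    """
    stripped = original_url.split("://", 1)[-1]
    return COBLITZ_PREFIX + stripped

def fetch(original_url, dest_path):
    """Fetch a file through CoBlitz, exactly as any other HTTP download."""
    with urllib.request.urlopen(coblitz_url(original_url)) as resp, \
            open(dest_path, "wb") as out:
        out.write(resp.read())

# Hypothetical example:
# fetch("http://example.org/pub/fedora/FC-5-i386-DVD.iso", "FC-5-i386-DVD.iso")
```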
In 18 months of operation, the system has undergone three sets of
changes: scaling from just North American PlanetLab nodes to all of
PlanetLab, changing the algorithms to reduce load at the origin
server, and changing the algorithms to reduce overall congestion and
increase performance. Our general mode of operation is shown in
Figure 1, and consists of four steps: (1)
deploy the system, (2) observe its behavior in actual operation, (3)
determine how the underlying algorithms, when exposed to the real
environment, cause the behaviors, and (4) adapt the algorithms to make
better decisions using the real-world data. We believe this approach
has been absolutely critical to our success in improving CoBlitz, as
we describe later in this paper.
Chunk naming - If chunks are named using the original URL, all of a file's chunks will share the same name, and will be routed similarly since CDNs hash URLs for routing (31,16). Since we want to spread chunks across the CDN, we must use a different chunk naming scheme.
Range caching - We know of no HTTP proxies that cache arbitrary ranges of Web objects, though some can serve ranges from cached objects, and even recreate a full object from all of its chunks. Since browsers are not likely to ask for arbitrary and disjoint pieces of an object, no proxies have developed the necessary support. Since we want to cache at the chunk level instead of the file level, we must address this limitation.
Congestion - During periods of bursty demand and heavy synchrony, consistent hashing may produce roving instantaneous congestion. If many clients at different locations suddenly ask for the same file, a lightly-loaded CDN node may see a burst of requests. If the clients all ask for another file as soon as the first download completes, another CDN node may become instantly congested. This bursty congestion prevents the aggregate CDN bandwidth from being used effectively over short time scales.
We address these problems as a whole, to avoid new problems from
piecemeal fixes. For example, adding range caching to the Squid proxy
has been discussed since 1998 (24), but would expand the
in-memory metadata structures, increasing memory pressure, and would
require changing the Internet Cache Protocol (ICP) used by caches to
query each other. Even if we added this support to CoDeeN's proxies,
it would still require extra support in the CDN, since the range
information would have to be hashed along with the URL.
We modify intra-CDN chunk handling and request redirection by treating each chunk as a real file with its own name, so the bulk of the CDN does not need to be modified. This name contains the start and end ranges of the file, so different chunks will have different hash values. Only the CDN ingress/egress points are affected, at the boundaries with the client and the origin server.
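To make the naming idea concrete, the sketch below shows one way an agent could expand a whole-file request into chunk requests whose names embed the byte range; the exact name format is an illustrative assumption, not the deployed encoding.

```python
CHUNK_SIZE = 60 * 1024  # 60KB chunks, as described later in the text

def chunk_requests(original_url, file_size):
    """Expand a whole-file request into per-chunk requests.

    Each chunk gets its own name embedding its byte range, so the CDN's
    URL hashing spreads different chunks across different nodes. The
    name format used here is an assumption for illustration only.
    """
    requests = []
    start = 0
    while start < file_size:
        end = min(start + CHUNK_SIZE, file_size) - 1
        chunk_url = "%s.coblitz.%d-%d" % (original_url, start, end)
        requests.append(chunk_url)
        start = end + 1
    return requests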
The agent takes the client's request, converts it into a series of requests for chunks, reassembles the responses, and sends the result to the client. The client is not aware that the request is handled in pieces, and no browser modifications are needed. This process is implemented in a small program on each CDN node, so communication between it and the CDN infrastructure is cheap. The requests sent into the CDN, shown in Figure 2, contain extended filenames that specify the actual file and the desired byte range, as well as a special header so that the CDN modifies these requests on egress. Otherwise, these requests look like ordinary requests with slightly longer filenames. The full set of steps is shown in Figure 3, where each solid rectangle is a separate machine connected via the Internet.
All byte-range interactions take place between the proxy and the origin server - on egress, the request's name is reverted, and range headers are added. The server's response is changed from an HTTP 206 code (partial content) to 200 (full file). The underlying proxy never sees the byte-range transformations, so no range-caching support is required. Figure 4 shows this process with additional temporary headers. These headers contain the file length, allowing the agent to provide the content length for the complete download.
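The following sketch illustrates the kind of egress rewriting described here: restoring the original URL, adding a Range header, and relabeling the origin's 206 response as a 200 so the proxy treats it as an ordinary cacheable object. The chunk-name parsing matches the hypothetical format in the previous sketch, and the temporary header name is an assumption.

```python
import re

CHUNK_NAME = re.compile(r"^(?P<url>.*)\.coblitz\.(?P<start>\d+)-(?P<end>\d+)$")

def rewrite_on_egress(chunk_url, headers):
    """Revert an extended chunk name to the original URL plus a Range header."""
    m = CHUNK_NAME.match(chunk_url)
    if m is None:
        return chunk_url, headers                 # not a CoBlitz chunk request
    headers = dict(headers)
    headers["Range"] = "bytes=%s-%s" % (m.group("start"), m.group("end"))
    return m.group("url"), headers

def relabel_response(status, headers):
    """Present the origin's 206 partial-content reply as a 200 full object."""
    headers = dict(headers)
    if status == 206 and "Content-Range" in headers:
        # Content-Range looks like "bytes 0-61439/52428800"; keep the total
        # file length in a temporary header (name is an assumption) so the
        # agent can report the complete content length to the client.
        total = headers.pop("Content-Range").rsplit("/", 1)[-1]
        headers["X-CoBlitz-File-Length"] = total
        status = 200
    return status, headers
```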
Having the agent use the local proxy avoids having to reimplement CDN code (such as node liveness, or connection management) in the agent, but can cause cache pollution if the proxy caches all of the agent's requests. The ingress adds a cache-control header that disallows local caching, which is removed on egress when the proxy routes the request to the next CDN node. As a result, chunks are cached at the next-hop CDN nodes instead of the local node.
Since the CDN sees a large number of small file requests, it can use
its normal routing, replication, and caching policies. These cached
pieces can then be used to serve future requests. If a node
experiences cache pressure, it can evict as many pieces as needed,
instead of evicting one large file. Similarly, the addition/departure
of nodes will only cause missing pieces to be re-fetched, instead of
the whole file. The only external difference is that the server sees
byte-range requests from many proxies instead of one large file
request from one proxy.
To determine when to re-issue chunk fetches, the agent maintains overall and per-chunk statistics during the download. Several factors may slow chunk fetching, including congestion between the proxy and its peers, operational problems at the peers, and congestion between the peers and the origin. After downloading the first chunk, the agent has the header containing the overall file size, and knows the total number of chunks to download. It issues parallel requests up to its limit, and uses non-blocking operations to read data from the sockets as it becomes available.
Using an approach inspired by LoCI (3), the agent addresses slow transfers by issuing multiple requests - whenever a chunk exceeds its download deadline, the agent opens a new connection and re-issues the chunk request. The most recent request for the same chunk is allowed to continue downloading, and any earlier requests for the chunk are terminated. In this way, each chunk can have at most two requests for it in flight from the agent, a departure from LoCI, where even more connections are made as the deadline approaches. For retries beyond the first for each chunk, the agent modifies a non-critical field of the URL. This field is stripped from the URL on egress, and exists solely to allow the agent to randomize the peer serving the chunk. In this way, the agent can exert some control over which peer serves the request, to reduce the chance of multiple failures within the CDN. Keeping the same URL on the first retry attempts to reduce cache pollution - in a load-balanced, replicated CDN, the retry is unlikely to be assigned to the same peer that is handling the original request.
The first retry timeout for each chunk is set using a combination of the standard deviation and exponentially-weighted moving average for recent chunks. Subsequent retries use exponential backoff to adjust the deadline, up to a limit of 10 backoffs per chunk. To bound the backoff time, we also have a hard limit of 10 seconds for the chunk timeout. The initial timeout is set to 3 seconds for the first chunk - while most nodes finish faster, using a generous starting point avoids overloading slow origin servers. In practice, 10-20% of chunks are retried, but the original fetch usually completes before the retry. We could reduce retry aggressiveness, but this approach is unlikely to cause much extra traffic to the origin since the first retry uses a different replica with the same URL.
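A sketch of this timeout policy follows. The 3-second initial value, the 10-second cap, and the 10-retry limit come from the text; the smoothing constants and the standard-deviation multiplier are assumptions (borrowed from TCP's RTO estimator), not CoBlitz's actual values.

```python
class RetryTimer:
    """Per-download retry deadlines: EWMA plus deviation for the first
    retry, exponential backoff afterwards, with the caps described above."""

    INITIAL_TIMEOUT = 3.0   # seconds, generous value for the first chunk
    MAX_TIMEOUT = 10.0      # hard cap on any chunk timeout
    MAX_BACKOFFS = 10       # per-chunk limit on retries

    def __init__(self, alpha=0.125, beta=0.25, k=4.0):
        # alpha/beta/k are illustrative assumptions.
        self.alpha, self.beta, self.k = alpha, beta, k
        self.avg = None
        self.dev = 0.0

    def record(self, download_time):
        """Fold a completed chunk's download time into the estimators."""
        if self.avg is None:
            self.avg, self.dev = download_time, download_time / 2.0
        else:
            self.dev += self.beta * (abs(download_time - self.avg) - self.dev)
            self.avg += self.alpha * (download_time - self.avg)

    def deadline(self, n_retries):
        """Timeout for a chunk that has already been retried n_retries times."""
        if self.avg is None:
            base = self.INITIAL_TIMEOUT
        else:
            base = self.avg + self.k * self.dev
        backoff = min(n_retries, self.MAX_BACKOFFS)
        return min(base * (2 ** backoff), self.MAX_TIMEOUT)
```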
By default, the agent sends completed chunks to the client as soon as they finish downloading, as long as all preceding chunks have also been sent. If the chunk at the head of the line has not completed downloading, no new data is sent to the client until the chunk completes. By using enough parallel chunk fetches, delays in downloading chunks can generally be overlapped with others in the pipeline. If clients that can use chunked transfer encoding provide a header in the request indicating they are capable of handling chunks in any order, the agent sends chunks as they complete, with no head-of-line blocking. Chunk position information is returned in a trailer following each chunk, which the client software can use to assemble the file in the correct order.
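A minimal sketch of the default in-order delivery path, where an incomplete chunk at the head of the line blocks later chunks from being sent; the function and parameter names are ours.

```python
def flush_in_order(completed, next_to_send, send):
    """Send contiguous completed chunks to the client, stopping at the
    first gap (head-of-line blocking in the default, in-order mode).

    completed: dict mapping chunk index -> bytes for finished chunks
    next_to_send: index of the first chunk not yet sent to the client
    send: callable that writes bytes to the client connection
    """
    while next_to_send in completed:
        send(completed.pop(next_to_send))
        next_to_send += 1
    return next_to_send
```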
The choice of chunk size is a trade-off between efficiency and latency
- small chunks will result in faster chunk downloads, so slower
clients will have less impact. However, the small chunks require more
processing at all stages - the agent, the CDN infrastructure, and
possibly the origin server. Larger chunks, while more efficient, can
also cause more delay if head-of-line blocking arises. After some
testing, we chose a chunk size of 60KB, which is large enough to be
efficient, but small enough to be manageable. In particular, this
chunk size can easily fit into Linux's default outbound kernel socket
buffers, allowing the entire chunk to be written to the socket with a
single system call that returns without blocking.
In CoDeeN, we have intentionally avoided any synchronized communication for group maintenance, which means avoiding quorum protocols, two-phase behavior, and group membership protocols. The motivation behind this decision was simplicity and robustness - by making every decision unilaterally and independently at each node, we avoid any situation where forward progress stalls because some handshaking protocol fails. As a result, CoDeeN has remained operational even in some very extreme circumstances, such as in February 2005, when a kernel bug caused sudden, near-simultaneous failures that froze more than half of all PlanetLab nodes.
One side-effect of asynchronous communication is that all peering is unilateral - nodes independently pick their peers, using periodic heartbeats and acknowledgments to judge peer health. Pairwise heartbeats are simple, robust, and particularly useful for testing reachability. More sophisticated techniques, such as aggregating node health information using trees, can reduce the number of heartbeats, but can lead to worse information, since the tree may miss or use different links than those used for pairwise communication.
Unilateral and unidirectional peering improves CoBlitz's scalability,
since it allows nodes with heterogeneous connectivity or policy issues
to participate to the extent possible. These scenarios are shown in
Figures 5, 6,
and 7. For example, research networks like
Internet2 or CANARIE (the Canadian high-speed network) do not peer
with the commercial Internet, but are reachable from a number of
research sites including universities and corporate labs. These nodes
advertise that they do not want any nodes (including each other) using
them as peers, since they cannot fetch content from the commercial
Internet. These sites can unidirectionally peer with any CoDeeN nodes
they can reach - regular CoDeeN nodes do not reciprocate, since the
restricted nodes cannot fetch arbitrary Web content. Also, in certain
PlanetLab locations, both corporate and regional, political/policy
considerations make the transit of arbitrary content an unwise idea,
but the area may have a sizable number of nodes. These nodes
advertise that only other nodes from the same organization can use
them as peers. These nodes will peer both with each other and with
unrestricted nodes, giving them more peers for CoBlitz transfers than
they would have available otherwise. Policy restrictions are not
PlanetLab-specific - ISPs host commercial CDN nodes in their network
with the restriction that the CDN nodes only serve their own
customers.
To get some indication of application health, CoDeeN uses application-level pings, rather than network pings, to determine round-trip times (RTTs). Originally, CoDeeN kept the average of the four most recent RTT values, and selected the 60 closest peers within a 100ms RTT cutoff. The 100ms cutoff was chosen to reduce noticeable lag in interactive settings, such as Web browsing. In parts of the world where nodes could not find 20 peers within 100ms, the cutoff was raised to 200ms and the 20 best peers were selected.
This approach exhibited two problems - a high rate of change in the peer sets, and low overlap among the peer sets of nearby nodes. The high change rate potentially impacts chunk caching in CoBlitz - if the peer that previously fetched a chunk is no longer in the peer set, the new peer that replaces it may not yet have fetched the chunk. To address this issue, hysteresis was added to the peer set selection process. A peer not in the set could only replace a peer in the set if it was closer in two-thirds of the last 32 heartbeats. Even under worst-case conditions, the two-thirds threshold would keep a peer in the set for 10 minutes at a time. While hysteresis reduced peer set churn, it also reinforced the low overlap between neighboring peer sets. Further investigation indicated that CoDeeN's application-level heartbeats had more than an order of magnitude more variance than network pings. This variance led to instability in the average RTT calculations, so once nodes were added to the peer set, they rarely got displaced.
Switching from an average application-level RTT to the minimum observed RTT (an approach also used in other systems (6,13,22)) and increasing the number of samples yielded significant improvement, with application-level RTTs correlating well with ping times on all functioning nodes. Misbehaving nodes still showed large application-level minimum RTTs, despite having low ping times. The overlap of peer lists for nodes at the same site increased from roughly half to almost 90%. At the same time, we discovered that many intra-PlanetLab paths had very low latency, and that restricting the peer set to 60 nodes was needlessly constraining. We increased this limit to 120 nodes, and issued 2 heartbeats per second. Of the nodes regularly running CoDeeN, two-thirds now tend to have 100+ peers. More details of the redesign process and its corresponding performance improvement can be found in our previous study (5).
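The peer-ranking idea described here (minimum observed application-level RTT, plus the two-thirds-of-32-heartbeats hysteresis rule) can be sketched as follows; the data structures and the pairwise comparison of recent heartbeats are our assumptions.

```python
from collections import deque

HYSTERESIS_WINDOW = 32      # number of recent heartbeats considered
HYSTERESIS_FRACTION = 2.0 / 3.0

class PeerStats:
    """Track the minimum observed RTT and recent heartbeat RTTs for a peer."""
    def __init__(self):
        self.min_rtt = float("inf")
        self.recent = deque(maxlen=HYSTERESIS_WINDOW)

    def record_heartbeat(self, rtt):
        self.min_rtt = min(self.min_rtt, rtt)
        self.recent.append(rtt)

def should_replace(candidate, member):
    """A candidate displaces a current peer-set member only if it was
    closer in at least two-thirds of the recent heartbeats (a simplified
    pairwise comparison of the two histories)."""
    pairs = list(zip(candidate.recent, member.recent))
    if not pairs:
        return False
    closer = sum(1 for c, m in pairs if c < m)
    return closer >= HYSTERESIS_FRACTION * len(pairs)
```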
For each URL, CoDeeN generates an array of values by hashing the URL
with the name of each node in the peer set. It then prunes this list
based on the replication factor specified, and then prunes it again so
only the nodes with the lowest load values remain. The final candidate
is chosen randomly from this set. Using replication and load balancing
reduces hot spots in the CDN - raising the replication factor reduces
the chance any node gets a large number of requests, but also
increases the node's working set, possibly degrading performance.
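A minimal highest-random-weight (HRW) sketch of the selection described above; the hash construction, load metric, and the number of low-load candidates kept are placeholders, not CoDeeN's exact parameters.

```python
import hashlib
import random

def hrw_candidates(url, peers, num_replicas):
    """Rank peers by hashing the URL together with each peer's name
    (highest-random-weight) and keep the top num_replicas as the
    replica set for this URL."""
    ranked = sorted(peers,
                    key=lambda peer: hashlib.sha1((url + peer).encode()).digest(),
                    reverse=True)
    return ranked[:num_replicas]

def pick_next_hop(url, peers, num_replicas, load, keep_lowest=2):
    """Prune the HRW replica set to its least-loaded members, then pick
    one of them at random, as described above."""
    replicas = hrw_candidates(url, peers, num_replicas)
    replicas.sort(key=lambda peer: load.get(peer, 0))
    return random.choice(replicas[:keep_lowest])
```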
When examining origin server load in CoBlitz, we found that nodes with fewer than five peers generate almost one-third of the traffic. Some poorly-connected sites have such high latency that even with an expanded RTT criterion, they find few peers. At the same time, few sites use them as peers, leading to them being an isolated cluster. For regular Web CDN traffic, these small clusters are not much of an issue, but for large-file traffic, the extra load these clusters cause on the origin server slows the rest of the CDN significantly. Increasing the minimum number of peers per node to 60 reduces traffic to the origin. Because of unilateral peering, this change does not harm nearby nodes - other nodes still avoid these poorly-connected nodes.
Reducing the number of replicas per URL reduces origin server load, since fewer nodes fetch copies from the origin, but it also causes more bursty traffic at those replicas if downloading is synchronized. For CoBlitz, synchronized downloads occur when developers push software updates to all nodes, or when cron-initiated tasks simultaneously fetch the same file. In these cases, demand at any node experiences high burstiness over short time scales, which leads to congestion in the CDN.
To fix this problem, we observe that when a node receives a forwarded request, it can independently check to see whether it should be the node responsible for serving that request. On every forwarded request that is not satisfied from the cache, the receiving node performs its own HRW calculation. If it finds itself as one of the top candidates, it considers the forwarded request reasonable and fetches it from the origin server. If the receiver finds that it is not one of the top candidates, it forwards the request again. We find that 3-7% of chunks get re-forwarded this way in CoBlitz, but it can get as high as 10-15% in some cases. When all PlanetLab nodes act as clients, this technique cuts origin server load almost in half.
Due to the deterministic order of HRW, this approach is guaranteed to
make forward progress and be loop-free. While the worst case is a
number of hops linear in the number of peer groups, this case is also
exponentially unlikely. Even so, we limit this approach to only one
additional hop in the redirection, to avoid forwarding requests across
the world and to limit any damage caused by bugs in the forwarding
logic. Given the relatively low rate of chunks forwarded in this
manner, restricting it to only one additional hop appears sufficient.
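A sketch of this re-forwarding check, reusing the hypothetical hrw_candidates() from the earlier sketch; the single-candidate calculation and the one-extra-hop limit come from the text, while the callback-style structure is our assumption.

```python
def handle_forwarded_request(self_name, url, peers, already_reforwarded,
                             fetch_from_origin, forward_to):
    """On a cache miss for a forwarded chunk request, re-run the HRW
    calculation. Fetch from the origin only if this node is the top
    candidate or the request has already taken its one extra hop;
    otherwise re-forward it once."""
    candidates = hrw_candidates(url, peers, num_replicas=1)
    if self_name in candidates or already_reforwarded:
        return fetch_from_origin(url)
    return forward_to(candidates[0], url, already_reforwarded=True)
```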
The simplest way of reducing the short time-scale node congestion is to increase the number of replicas for each chunk, but this would increase the number of fetches to the origin. Instead, we can improve on the purely mesh-based topology by taking some elements of the stream-oriented systems, which are excellent for reducing link stress. These systems all build communication trees, which eliminates the need to have the same data traverse a link multiple times. While trees are an unattractive option for standard Web CDNs because they add extra latency to every request fetched from the origin, a hybrid scheme can help the large-file case, if the extra hops can reduce congestion.
We take the re-forwarding support used to redirect misdirected chunks, and use it to create broader routing trees within the peer sets. We change the re-forwarding logic to use a different number of replicas when calculating the HRW set, leading to a broad replica set at the first hop and a smaller set of nodes that fetch from the origin. We set the NumCandidates value to 1 when evaluating the re-forwarding logic, while dynamically selecting the value at the first proxy. The larger replica set at the first hop reduces the burstiness at any node without increasing origin load.
To dynamically select the number of replicas, we observe that we can eliminate burstiness by spreading the requests equally across the peers at all times. With a target per-client memory consumption, we can determine how many chunks are issued in parallel. So, the replication factor is governed by the following equation:
NumReplicas = NumPeers / (BufferSpace / ChunkSize)    (1)
At 1 MB of buffer space per client, a 60KB chunk size, and 120 peers,
our replication factor will be 7. We can, of course, cap the number of
peers at some reasonable fraction of the maximum number of peers so
that memory pressure does not cause runaway replication for the sake
of load balancing. In practice, we limit the replication factor to
20% of the minimum target peer set, which yields a maximum factor of
12.
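A sketch of Equation 1 and the cap, with names of our choosing; the 1MB buffer, 60KB chunk size, 60-peer minimum target, and 20% cap come from the text.

```python
def replication_factor(num_peers, buffer_bytes=1 << 20, chunk_bytes=60 * 1024,
                       min_peer_target=60, cap_fraction=0.20):
    """NumReplicas = NumPeers / (BufferSpace / ChunkSize), capped at 20%
    of the minimum target peer set so that memory pressure cannot cause
    runaway replication for the sake of load balancing."""
    parallel_chunks = buffer_bytes / float(chunk_bytes)   # ~17 chunks in flight
    replicas = int(round(num_peers / parallel_chunks))
    cap = int(cap_fraction * min_peer_target)             # 12 with these defaults
    return max(1, min(replicas, cap))

# With 120 peers, 1MB of buffer, and 60KB chunks: 120 / 17.1 -> 7 replicas,
# matching the example in the text.
```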
In either of these scenarios, using a smaller number of simultaneous fetches would be beneficial, since the per-chunk download time would improve. We view finding the ``right'' number of parallel chunks as a congestion issue, and address it in a manner similar to how TCP performs congestion control. Note that changing the number of parallel chunks is not an attempt to perform low-level TCP congestion control - since the fetches are themselves using TCP, we have this benefit already. Moreover, since the underlying TCP transport is already using additive-increase multiplicative-decrease, we can choose whatever scheme we desire on top of it.
Drawing on TCP Vegas (6), we use the extra information available in the CoBlitz agent to make the chunk ``congestion window'' a little more stable than a simple sawtooth. We use three criteria: (1) if a chunk finishes in less than the average time, increase the window, (2) if the first fetch attempt is killed by retries, shrink the window, and (3) otherwise, leave the window size unmodified. We also decide that if more chunk fetches are in progress than the window size dictates, existing fetches are allowed to continue, but no new fetches (including retries) are allowed. Given that our condition for increasing the window is already conservative, we give ourselves some flexibility on exactly how much to add. Similarly, given that the reason for requiring a retry might simply be a single slow peer, we decide against using multiplicative decrease when a chunk misses the deadline.
While determining the decrease rate is fairly easy, choosing a reasonable increase rate required some experimentation. The decrease rate was chosen to be one full chunk for each failed chunk, which would have the effect of closing the congestion window very quickly if all of the chunks outstanding were to retry. This logic is less severe than multiplicative decrease if only a small number of chunks miss their deadlines, but can shrink the window to a single chunk within one ``RTT'' (in this case, average chunk download time) in the case of many failures.
Some experimentation with different increase rates is shown in Figure 9. The purely additive condition, adding 1/x on each fast chunk (where x is the current number of chunks allowed), fares poorly. Even worse is adding one-tenth of a chunk per fast chunk, which would be a slow multiplicative increase. The more promising approaches, adding 1/sqrt(x) and 1/log(x) (where we use log(1) = 1), produce much better results. The 1/sqrt(x) case is not surprising, since it will always be no more than additive, and the window grows only when performing well. In TCP, the ``slow start'' phase would open the window exponentially faster, so we choose to use 1/log(x) to achieve a similar effect - it grows relatively quickly at first, and more slowly with larger windows. The chunk congestion window is maintained as a floating-point value, with a lower bound of 1 chunk and an upper bound dictated by the available buffer size, which is normally 60 chunks. The final line in the graph, showing a fixed-size window of 60 chunks, appears to produce better performance, but comes at the cost of a higher node failure rate - 2.5 times as many nodes fail to complete with the fixed window size as with dynamic sizing.
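A minimal sketch of the resulting window logic, using the increase rule reconstructed above and the bounds given in the text; the class structure and method names are our assumptions.

```python
import math

class ChunkWindow:
    """Sketch of the chunk 'congestion window': grow by 1/log(x) on each
    fast chunk (treating log(1) as 1), shrink by one full chunk when a
    fetch is killed by its retry, and stay within the buffer-imposed bounds."""

    MIN_WINDOW = 1.0
    MAX_WINDOW = 60.0    # dictated by the per-client buffer (60 chunks)

    def __init__(self):
        self.window = self.MIN_WINDOW   # floating-point window size

    def on_fast_chunk(self):
        """Chunk finished faster than the running average: open the window."""
        increase = 1.0 / max(math.log(self.window), 1.0)
        self.window = min(self.window + increase, self.MAX_WINDOW)

    def on_retry_kill(self):
        """The original fetch was overtaken by its retry: close the window."""
        self.window = max(self.window - 1.0, self.MIN_WINDOW)

    def may_issue(self, in_flight):
        """Allow new fetches (including retries) only below the window."""
        return in_flight < self.window
```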
One unique aspect of our testing is the scale - we use every running PlanetLab node except those at Princeton, those designated as alpha testing nodes, and those behind firewalls that prevent CoDeeN traffic. The reason for excluding the Princeton nodes is because we place our origin server at Princeton, so the local PlanetLab nodes would exhibit unrealistically large throughputs and skew the means. During our testing in September and early October 2005, the number of available nodes that met the criteria above ranged from 360-380 at any given time, with a union size of 400 nodes.
Our test environment consists of a server with an AMD Sempron processor running at 1.5 GHz, with Linux 2.6.9 as its operating system and lighttpd 1.4.4 (18) as our web server. Our testing consists of downloading a 50MB file in various scenarios. The choice of this file size was to facilitate comparisons with other work (2,19), which uses file sizes of 40-50MB in their testing. Our testing using a 630MB ISO image for the Fedora Core 4 download yielded slightly higher performance, but would complicate comparisons with other systems. Given that some PlanetLab nodes are in parts of the world with limited bandwidth, our use of 50MB files also reduces contention problems for them. Each test is run three times, and the reported numbers are the average value across the tests for which the node was available. Due to the dynamics of PlanetLab, over any long period of time, the set of available nodes will change, and given the span of our testing, this churn is unavoidable.
We tune BitTorrent for performance - the clients and the server are configured to seed the peers indefinitely, and the maximum number of peers is increased to 60. While live BitTorrent usage will have lower performance due to fewer peers and peers departing after downloading, we want the maximum BitTorrent performance.
We test a number of scenarios: direct downloading from the origin server, BitTorrent, and CoBlitz with in-order and out-of-order delivery, with synchronized and staggered client start times, and with both cached and uncached content.
The throughputs and download times for all tests are shown in Figure 10 and Figure 11, with summaries presented in Table 1. For clarity, we trim the x axes of both graphs, and the CDFs shown are of all nodes completing the tests. The actual number of nodes finishing each test is shown in the table. In the throughput graph, lines to the right are more desirable, while in the download time graph, lines to the left are more desirable.
From the graphs, we can see several general trends: all schemes beat direct downloading, uncached CoBlitz generally beats BitTorrent, out-of-order CoBlitz beats in-order delivery, staggered downloading beats synchronized delivery, and cached delivery, even when synchronized, beats the others. Direct downloading at this scale is particularly problematic - we had to abruptly shut down this test because it was consuming most of Princeton's bandwidth and causing noticeable performance degradation.
The worst-case performance for CoBlitz occurs in the uncached case when all clients request the content at exactly the same time, placing the most load on the origin server at once. This case is also very unlikely for regular users, since even a few seconds of difference in start times avoids the problem.
The fairest comparison between BitTorrent and CoBlitz is BT-Total versus CoBlitz out-of-order with Staggering, in which case CoBlitz beats BitTorrent by 55-86% in throughput and by a factor of 1.7 to 4.94 in download time. Even the worst-case performance for CoBlitz, when all clients are synchronized on uncached content, generally beats BitTorrent by 27-48% in throughput and by a factor of 1.47 to 2.48 in download time.
In assessing how well CoBlitz compares against BitTorrent, it is interesting to examine the 90th percentile download times in Table 1 and compare them to the mean and median throughputs. This comparison has appeared in other papers comparing against BitTorrent (26,19). We see that the tail of BitTorrent's download times is much worse than its mean or median values would suggest. As a result, systems that compare themselves primarily against these worst-case times may be presenting a much more optimistic benefit than what the majority of users actually see.
It may be argued that worst-case times are important for systems that
need to know an update has been propagated to its members, but if this
is an issue, more important than delay is failure to complete. In
Table 1, we show the number of nodes that finish
each test, and these vary considerably despite the fact that the same
set of machines is being used. Of the approximately 400 machines
available across the union of all tests, only about 5-12 nodes fail to
complete using CoBlitz, while roughly 17-18 fail in direct testing,
and about 21-25 fail with BitTorrent. The 5-12 nodes where CoBlitz
eventually stops trying to download are at PlanetLab sites with
highly-congested links, poor bandwidth, and other problems - India,
Australia, and some Switzerland nodes.
Another metric of interest is how much traffic reaches the origin server in these different tests, and this information is provided in Table 2, shown as a multiple of the file size. We see that the CoBlitz scenarios fetch a total of 7 to 9 copies in the various tests, which yields a utility of 43-55 nodes served per fetch (or a cache hit rate of 97.6 - 98.2%). BitTorrent has comparable overall load on the origin, at 10 copies, but has a lower utility value, 35, since it has fewer nodes complete. For Shark, the authors observed it downloading 24 copies from the origin to serve 185 nodes, yielding a utility of 7.7. We believe that part of the difference may stem from peering policy - CoDeeN's unilateral peering approach allows poorly-connected nodes to benefit from existing clusters, while Coral's latency-oriented clustering may adversely impact the number of fetches needed.
A closer examination of fetches per chunk, shown in
Figure 12, shows that CoBlitz's average of 8
copies varies from 4-11 copies by chunk, and these copies appear to be
spread fairly evenly geographically. The chunks that receive only 4
fetches are particularly interesting, because they suggest it may be
possible to cut CoBlitz's origin load by another factor of 2. We are
investigating whether these chunks happen to be served by nodes that
overlap with many peer sets, which would further validate CoBlitz's
unilateral peering.
To get a sense of a typical month's CoBlitz usage, we present the breakdown for February 2006 traffic in Figures 14 (by number of requests) and 15 (by bytes served). Most of the requests for files less than 2MB come from the Stork service (28), which provides package management on PlanetLab, and the CiteSeer Digital Library (11), which provides document downloads via CoBlitz. The two spikes in bytes served are from the Fedora Core Linux distribution, available as either downloadable CD images or DVD images. Most of the remaining traffic comes from smaller sites, other PlanetLab users, and Fedora Core RPM downloads.
A more unusual usage pattern occurred on March 20, 2006, when the
Fedora Core 5 Linux distribution was released. Within minutes of the
official announcement on the Fedora mailing lists, the availability
was mentioned on the front page of Slashdot (27), on a
Monday morning for the US. The measurements from this day and the
previous day are shown in Figure 16. In less than an
hour, CoBlitz went from an average of 20Mbps of traffic to over 400
Mbps, and sustained 5-minute peaks exceeded 700Mbps. CoBlitz
functioned as expected, with one exception - many of the clients were
using ``download agents'' that fetch files using a ``no-cache'' HTTP
header. CoBlitz had been honoring these requests for PlanetLab
researchers who wanted to force refreshes, and we had not seen a
problem in other environments. However, for this download, these
headers were causing unnecessary fetches to the origin that were
impacting performance. We made a policy decision to disregard these
headers for the Fedora Mirror sites, at which point origin traffic
dropped dramatically. This flash crowd had a relatively long tail -
it averaged 200-250Mbps on the third day, and only dropped to less than
100Mbps on the fifth day, a weekend. The memory footprint of CoBlitz
was also low - even serving the CD and DVD images on several
platforms (PPC, i386, x86_64), the average memory consumption was
only 75MB per node.
The use of parallel downloads to fetch a file has been explored before, but in a more narrow context - Rodriguez et al. use HTTP byte-range queries to simultaneously download chunks in parallel from different mirror sites (23). Their primary goal was to improve single client downloading performance, and the full file is pre-populated on all of their mirrors. What distinguishes CoBlitz from this earlier work is that we make no assumptions about the existence of the file on peers, and we focus on maintaining stability of the system even when a large number of nodes are trying to download simultaneously. CoBlitz works if the chunks are fully cached, partially cached, or not at all cached, fetching any missing chunks from the origin as needed. In the event that many chunks need to be fetched from the origin, CoBlitz attempts to reduce origin server overload. Finally, from a performance standpoint, CoBlitz attempts to optimize the memory cache hit rate for chunks, something not considered in Rodriguez's system.
              | Shark |         CoBlitz         | Bullet'
              |       |  Uncached  |   Cached   |
              |       |  In  | Out |  In  | Out |
# Nodes       |  185  |  41  |  41 |  41  |  41 |   41
Median (Mbps) |  1.0  |  6.8 | 7.4 |  7.4 | 9.2 |
Mean (Mbps)   |       |  7.0 | 7.4 |  8.4 | 10.6|  7.0
While comparing with other work is difficult due to differences in test environments, we can make some informed conjectures based on our experiences. FastReplica's evaluation includes tests of 4-8 clients, and their per-client throughput drops from 5.5 Mbps with 4 clients to 3.6 Mbps with 8 clients (9). Given that their file is broken into a small number of equal-sized pieces, the slowest node in the system is the overall bottleneck. By using a large number of small, fixed-size pieces, CoBlitz can mitigate the effects of slow nodes, either by increasing the number of parallel fetches, or by retrying chunks that are too slow. Another system, Slurpie, limits the number of clients contacting the server at once by having each one randomly back off, so that only a small number contact the server regardless of how many nodes want service. Their local-area testing has clients contact the server at a rate of one every three seconds, which staggers arrivals far more than BitTorrent does. Slurpie's evaluation on PlanetLab provides no absolute performance numbers (26), making it difficult to draw comparisons; however, their performance appears to degrade beyond 16 nodes.
The scarcity of deployed systems for head-to-head
comparisons supports part of our motivation - by reusing CDN
infrastructure, we have been able to easily deploy CoBlitz and keep
it running.
Additionally, we show how we have taken the experience gained from 18 months of CoBlitz deployment, and used it to adapt our algorithms to be more aware of real-world conditions. We demonstrate the advantages provided by this approach by evaluating CoBlitz's performance across all of PlanetLab, where it exceeds the performance of BitTorrent as well as all other research efforts known to us.
In the process of making CoBlitz handle scale and reduce congestion both within the CDN and at the origin server, we identify a number of techniques and observations that we believe can be applied to other systems of this type. Among them are: (a) using unilateral peering, which simplifies communication as well as enabling the inclusion of policy-limited or poorly-connected nodes, (b) using request re-forwarding to reduce the origin server load when nodes send requests to an overly-broad replica set, (c) dynamically adjusting replica sets to reduce burstiness in short time scales, (d) congestion-controlled parallel chunk fetching, to reduce both origin server load as well as self-interference at slower CDN nodes.
We believe that the lessons we have learned from CoBlitz should help not only the designers of future systems, but also provide a better understanding of how to design these kinds of algorithms to reflect the unpredictable behavior we have seen in real deployment.