We first define the file download problem more formally, and then outline the goals and motivation for our work. Consider a content server with a network link of bandwidth Rn bytes/s and a population of N clients concurrently downloading shared files. We use the term file to mean a unique ordered set of data blocks; thus it could refer to any segment of a larger data set. Suppose that each client i can receive data at a sustained rate Rci bytes/s, with an average client rate of Rc bytes/s. The goal is to schedule block reads from disks and block transfers to clients in order to maximize X(N), the system throughput for N concurrent requests, i.e., the number of download requests completed per time unit. Unless it is saturated, the server should be fair, and it should complete each request in the minimum time that the network allows: a request from client i to download a file of length L bytes should complete after an approximate delay of L/Rci.
Since this paper focuses on the design of the server and its storage system, we will claim success when the server is network-limited, i.e., it consumes all of its available network bandwidth. Suppose the average file length is L bytes. For low client loads, the server may be limited by the aggregate network bandwidth to its clients: X(N) = NRc/L. If the client population is large, or if the clients have high-speed links, then the server may be limited by its own network link: X(N) = Rn/L (with unicast). We assume without loss of generality that the server's CPU and memory system are able to sustain the peak data rate Rn when serving large files.
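As a concrete illustration of these two regimes, here is a minimal sketch in Python; the file length, client rate, and population sizes are illustrative values, not taken from the paper.

```python
# Throughput model sketched above: the server is client-limited until the
# aggregate client demand N*Rc reaches the link rate Rn, then network-limited.

def throughput(N, Rc, Rn, L):
    """Download requests completed per second for N clients of rate Rc bytes/s,
    a server link of Rn bytes/s, and an average file length of L bytes."""
    return min(N * Rc, Rn) / L

# Example: 1 Gb/s link (125 MB/s), T1 clients (193 KB/s), 100 MB files.
for N in (100, 1000):
    print(N, "clients:", round(throughput(N, 193e3, 125e6, 100e6), 3), "requests/s")
# 100 clients: ~0.193 requests/s (client-limited)
# 1000 clients: 1.25 requests/s (network-limited)
```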
Our objective, then, is to minimize the memory and storage resources needed to
feed content to the network at the peak rate Rn for a sufficiently
large client population:
N >= Rn/Rc.
Suppose the content server has a memory of
size M and D disks delivering a peak bandwidth of Rd bytes/s
each. Our purpose is to explore how block reordering
can reduce the number of disks D needed for a given M, and/or
reduce the M needed for a given D and Rd. Equivalently, we
may view the objective as maximizing the network speed Rn or
client population size N that a given configuration (M,D) can
support. The problem is uninteresting
for small Rn, e.g., in serverless P2P systems that distribute
server functions across many slow clients. The problem is most
interesting for servers in data centers with high-capacity
network links, serving large client populations with slow transfer
rates Rc. N may be quite large for commercial content servers
with a large bandwidth disparity relative to their clients: IP network
bandwidth prices are dropping [13], while broadband
deployments for the ``last mile'' to clients continue to lag behind.
Although our approach does not require multicast support, it is compatible with
multicast, and N may be even larger where multicast is available.
Figure 1: The maximum ratio of network bandwidth to disk bandwidth, as a function of the portion of the data set size that fits in the cache. The system is heavily disk-limited unless the bulk of the data set fits in memory.
The server's memory of size M consists of a common pool of shared buffers. A server uses these buffers for some combination of disk buffering, network buffering, and data caching. For example, suppose the server fetches data from storage in blocks of size B bytes. A typical server would buffer each fetched block until all transmissions of the block to clients have completed; the surplus memory acts as a cache of blocks recently fetched for some client i that are deemed likely to be needed soon for another client j requesting the same file. Caching is effective for small files that can reside in the cache in their entirety until the next request [6]. However, if j's request arrives t time units after i's, then the server has already fetched up to tRci bytes of the file into memory and delivered them to i. The memory needed to cache this segment until j is ready to receive it grows with the inter-arrival time t. If Rci > Rcj, then the required cache space continues to grow as the transfers progress--when the system is constrained to deliver the data in order to both clients, as conventional file servers are.
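To make the growth concrete, the following sketch (Python; the model, rates, and timing values are my own illustrative assumptions, not figures from the paper) estimates the cache space tied up by a single late-arriving in-order request.

```python
# Rough model of the cache space needed to serve client j from memory when j
# requests the same file t seconds after client i (illustrative assumption).

def cache_gap_bytes(t, rate_i, rate_j, elapsed):
    """Bytes between j's current position and i's position after `elapsed`
    seconds of concurrent delivery; this whole span must stay cached.
    rate_i, rate_j: sustained client rates in bytes/s."""
    return max(0.0, t * rate_i + (rate_i - rate_j) * elapsed)

# Example: j arrives 60 s late; i is a T1 client (193 KB/s), j a 56 Kbps client (7 KB/s).
print(cache_gap_bytes(60, 193e3, 7e3, elapsed=300) / 1e6, "MB")  # ~67.4 MB, still growing
```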
Since each file request accesses each block of the file exactly once, block accesses are uniformly distributed over the entire file length. As the number of clients increases, and as client rates and arrival times vary, the block access request stream shows less temporal locality and becomes effectively random. Of course, there is spatial locality when the server delivers each block in sequence; the system may exploit this by using a larger block size B, as discussed below (or, equivalently, by prefetching more deeply). Suppose without loss of generality that the content consists of a single file of length L >> M, or any number of equally popular files with aggregate size L. Then the probability that any block access is a cache hit is Phit = M/L, and Pmiss = 1 - M/L is the probability that the access requires a disk fetch. Thus the server's storage system must be capable of sustaining block fetches at rate (Rn/B)(1 - M/L). Figure 1 shows the role of cache size in determining the number of disks required to sustain this rate.
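The required fetch rate follows directly from the miss probability; the short Python sketch below evaluates it for one illustrative configuration (the cache and data set sizes are assumptions, not figures from the paper).

```python
# Block fetch rate the storage system must sustain: misses occur with
# probability 1 - M/L, and each miss reads one block of B bytes from disk.

def required_fetch_rate(Rn, B, M, L):
    """Disk block fetches per second needed to keep a link of Rn bytes/s busy,
    given cache size M and aggregate data set size L (all in bytes)."""
    p_miss = 1.0 - M / L
    return (Rn / B) * p_miss

# Example: 1 Gb/s link (125 MB/s), 256 KB blocks, 1 GB cache, 100 GB data set.
print(required_fetch_rate(125e6, 256e3, 1e9, 100e9), "fetches/s")  # ~483 fetches/s
```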
Figure 2: Disk throughput with random block accesses in a disk with positioning time Tpos = 0.005 s, across different transfer block sizes and sequential disk transfer rates Rd. As Rd increases, the block size must exceed one megabyte to achieve peak disk throughput.
Figure 3: Minimum buffer space required to support clients at different rates in a server with a 1 Gbps network link. The minimum buffer space can reach several gigabytes for large populations of clients with bandwidth less than 1 Mbps.
The areal density of magnetic storage has been doubling every year, and disks today spin several times faster than ten years ago [13]. While these trends increase sequential disk bandwidth, throughput for random block accesses (IOPS) has not kept pace. Seek costs tend to dominate random accesses, and these costs are limited by mechanical movement and are not improving as rapidly. Unfortunately, the block miss stream for a content server degrades to random access as the client population N increases, for the same reasons outlined above.
One solution is to increase the block size B. This reduces the rate of disk operations required to sustain a given effective bandwidth, which is inversely proportional to B. This technique can significantly reduce the number of disks required. Suppose each disk has an average head positioning time per access of Tpos seconds (seek plus half-rotation). Every disk access moves a block of size B bytes and takes time Tpos + B/Rd. Thus each disk supports 1/(Tpos + B/Rd) random accesses per time unit, and the aggregate peak disk bandwidth for D disks becomes DB/(Tpos + B/Rd) bytes/s. The server needs D >= Rn(Tpos + B/Rd)/B disks to fully pipeline the network. For example, Figure 2 illustrates the disk random access throughput when Tpos = 0.005 s, a typical value. With block size B = 256 KB and disks of Rd = 50 MB/s, the per-disk throughput Xd is roughly 25 MB/s. We need about D = 5 such disks to feed a server network link of Rn = 1 Gb/s.
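A minimal sketch of this arithmetic in Python, using the example parameters from the text (decimal units for simplicity):

```python
import math

# Per-disk throughput under random block access, and the number of disks
# needed to keep the network link busy, following the model in the text.

def disk_throughput(B, Rd, Tpos=0.005):
    """Bytes/s one disk delivers when every access pays Tpos seconds of
    positioning time and then transfers B bytes at the sequential rate Rd."""
    return B / (Tpos + B / Rd)

def disks_needed(Rn, B, Rd, Tpos=0.005):
    return math.ceil(Rn / disk_throughput(B, Rd, Tpos))

Rn = 125e6   # 1 Gb/s network link, in bytes/s
Rd = 50e6    # sequential disk transfer rate, in bytes/s
for B in (256e3, 1e6):
    Xd = disk_throughput(B, Rd)
    print(f"B = {B/1e3:.0f} KB: Xd = {Xd/1e6:.1f} MB/s, D = {disks_needed(Rn, B, Rd)}")
# B = 256 KB:  Xd ~ 25 MB/s, D = 5
# B = 1000 KB: Xd ~ 40 MB/s, D = 4
```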
However, Figure 3 shows that this approach consumes large amounts of memory with large client populations. Depending on the buffering scheme, the minimum memory size M for N active clients is at least NB/2. The 1 Gb/s server can support N = 664 T1 (1.544 Mbps = 193 KB/s) client connections, consuming a minimum M = 87 MB just for device buffering with B = 256 KB. If we increase the block size to B = 1 MB, then disk throughput becomes Xd = 40 MB/s, the number of disks drops to D = 4, and the minimum memory grows to M = 325 MB. If instead we assume slower clients with Rc = 56 Kbps, the server needs a minimum M = 2.18 GB and M = 8.7 GB for block sizes B = 256 KB and B = 1 MB, respectively. In media servers using large block sizes it is common to eliminate the cache and partition memory into separate buffer regions for each client; each region is sufficient to hold any blocks in transit between the disk and the network for that client [4].
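The sketch below (Python) reproduces these figures approximately, under the assumption that each active client occupies half a block of buffer space on average; the exact constants of the paper's buffering scheme may differ.

```python
# Approximate minimum buffer memory when each active client holds, on average,
# half a block of buffer space (M ~ N*B/2). Assumption: half-block occupancy.

def min_memory(Rn_bits, Rc_bits, B_bytes):
    """Return (N, M): number of concurrent clients a link of Rn_bits/s can feed
    at Rc_bits/s each, and the buffer memory in bytes for block size B_bytes."""
    N = Rn_bits // Rc_bits
    return N, N * B_bytes / 2

for Rc, label in ((1.544e6, "T1"), (56e3, "56 Kbps")):
    for B in (256 * 1024, 1024 * 1024):
        N, M = min_memory(1e9, Rc, B)
        print(f"{label}, B = {B // 1024} KB: N = {N:.0f}, M = {M / 2**30:.2f} GiB")
# T1, B = 256 KB:       N ~ 647,   M ~ 0.08 GiB  (text quotes N = 664, M = 87 MB)
# T1, B = 1024 KB:      N ~ 647,   M ~ 0.32 GiB  (text: 325 MB)
# 56 Kbps, B = 256 KB:  N ~ 17857, M ~ 2.18 GiB
# 56 Kbps, B = 1024 KB: N ~ 17857, M ~ 8.72 GiB
```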
In this section we explained why accesses at the block level appear increasingly random as the client population N grows. This reduces the benefit of conventional caching and weakens the locality of the disk accesses in the cache miss stream. The combination of these effects increases server cost, since more disks and/or more memory are needed to fill any given network link. For example, the experiments in this paper use disks with an average sequential throughput of Rd = 33 MB/s, but a conventional FTP server under even modest load delivers a per-disk throughput of roughly 10 MB/s, including cache hits. Thus we need about a dozen such disks to feed a network link of 1 Gb/s.