Sawmill: A High-Bandwidth Logging File System

Ken Shirriff
John Ousterhout
University of California, Berkeley

Abstract

This paper describes the implementation of Sawmill, a network file system
using the RAID-II storage system. Sawmill takes advantage of the direct
data path in RAID-II between the disks and the network, which bypasses
the file server CPU. The key ideas in the implementation of Sawmill are
combining logging (LFS) with RAID to obtain fast small writes, using
new log layout techniques to improve bandwidth, and pipelining through
the controller memory to reduce latency. The file system can currently
read data at 21 MB/s and write data at 15 MB/s, close to the raw disk
array bandwidth, while running on a relatively slow Sun-4. Performance
measurements show that LFS improved performance of a stream of small
writes by over a order of magnitude compared to writing directly to the
RAID, and this improvement would be even larger with a faster CPU.
Sawmill demonstrates that by using a storage system with a direct data
path, a file system can provide data at bandwidths much higher than the
file server itself could handle. However, processor speed is still an
important factor, especially when handling many small requests in
parallel.

1. Introduction

An I/O gap has arisen between the data demands of processors and the
data rates individual disks can supply [PGK88]. This gap is worsening
as processor speeds continue to increase and as new applications such
as multimedia and scientific visualization demand ever higher data rates.

One common solution to the I/O bottleneck is the disk array, where
several disks provide data in parallel. This can provide much higher
performance than a single disk, while still providing relatively
inexpensive storage. By using a RAID (Redundant Array of Inexpensive
Disks), disk arrays can be made reliable despite disk failures.

Even though disk arrays have the potential of supplying high-bandwidth
data inexpensively, there are difficulties in making this data available
to client machines across a network. For example, the RAID group at
Berkeley built a prototype system called RAID-I, using a Sun-4/280
workstation connected to an array of 28 disks [CK91]. Unfortunately,
the bandwidth available through the system was very low, only 2.3 MB/s,
mainly because the memory system of the Sun 4 file server was a bottleneck.

To avoid the file server bottleneck, the Berkeley RAID group designed
a storage system called RAID-II [DSH+94]. RAID-II uses hardware support
to move data directly between the disks and the network at high rates,
avoiding the bottleneck of moving data through the file server's CPU
and memory.

Another problem with a RAID disk array is that small random writes are
very expensive due to parity computation, which is used for reliability.
One solution is a log-structured file system (LFS) [Ros92], which writes
everything to a sequential log so there are only large sequential writes.
Thus, a log-structured file system can greatly improve performance of
small writes.

This paper describes the Sawmill file system, which has been designed
to provide high bandwidths by taking advantage of the RAID-II architecture.
Sawmill is a log-structured file system that is able to read data at 21
MB/s and write at 15 MB/s, close to the raw disk bandwidth. Sawmill is
designed to operate as a file server on a high-bandwidth network,
providing file service to network clients.

Sawmill solves two problems: how to make use of a direct data path and
how to combine a log-structured file system with a disk array. The key
ideas in Sawmill are using streaming instead of caching, performing log
layout on the fly, minimizing per-block overheads, and moving metadata
over a separate control path.

The remainder of the paper is organized as follows. Section 2 gives some
background on RAID storage, the RAID-II storage system, and log-structured
file systems. Section 3 describes in more detail the implementation of
the Sawmill file system and the techniques it uses to obtain high
bandwidth. Section 4 contains performance measurements of Sawmill.
Section 5 discusses related work, Section 6 discusses future work, and
Section 7 concludes the paper.

2. Background

There are three main concepts combined in the Sawmill file system: the
RAID disk array, hardware support for a fast data path, and the
log-structured file system (LFS). This section gives background information
on these ideas.

2.1  RAID

A RAID (Redundant Array of Inexpensive Disks) combines two ideas:
parallelism and redundancy [PGK88]. A RAID uses multiple disks in parallel
to provide much higher bandwidth than a single disk. A RAID can also
perform multiple operations in parallel. Redundancy in the form of parity
is used to maintain reliability. By storing parity, data can be fully
recovered after a single disk failure.

In a RAID, data is striped across multiple disks. We use a RAID-5
architecture, which stripes data as blocks. With N disks, a group of
N-1 data blocks is striped across N-1 disks. A parity block is computed
by exclusive-oring these N-1 blocks and the parity block is stored on
the remaining disk. Parity is distributed; successive parity blocks are
stored on different disks to avoid the bottleneck of a dedicated parity
disk. Each individual block is called a stripe unit, and a collection
of N-1 data blocks and a parity block is called a parity stripe.

For peak efficiency, a full parity stripe should be written at once. In
this case, the data and the parity can be written out in parallel. If
only part of the parity stripe is modified, parity must be recomputed
from the data already on disk. This overhead is especially costly for
small writes. Figure 1 illustrates how parity is recomputed after a
small write: the old data and parity must be read and then the new data
and the recomputed parity are written, resulting in 4 disk operations.
Thus, small writes are relatively expensive with a RAID, and writes of
a parity stripe are most efficient.

2.2  The RAID-II storage architecture

The Berkeley RAID project found that a high-bandwidth disk array could
easily saturate the memory bandwidth of a typical workstation file server
[CK91]. The RAID-I prototype used a Sun-4/280 workstation connected to
an array of 28 disks. The bandwidth available through the system was
very low, only 2.3 MB/s, mainly due to the low bandwidth of the Sun 4
file server's memory system and backplane.

To avoid the file server bottleneck, the follow-on RAID-II project built
a storage system with a new hardware architecture designed to support
disk array bandwidths. This hardware provides a high-bandwidth data path
between the disks and the network, bypassing the file server. This
results in separation of the control and data paths: the file server
handles requests and provides low-bandwidth control commands, while the
controller board provides high-bandwidth data movement. The controller
also has fast memory that can be used for buffering and prefetching.

Figure 2 illustrates the hardware configuration of RAID-II. The controller
provides a fast path between the disks and the network. It is built
around a high-bandwidth crossbar switch that connects memory buffers to
various functional units. These units include a fast HIPPI network
interface, an XOR engine for computing parity, and the disk interfaces.
Although RAID-II was designed to support 40 MB/s, the maximum bandwidth
is currently about 24 MB/s. Performance is currently limited by the disk
controller and the VME link to the controller; each of the four disk
controllers can handle about 6 MB/s. We have used a HIPPI network and
an Ultranet to connect to clients. Further details of RAID-II are given
in [DSH+94].

2.3  Log-structured file systems

The third idea used by the Sawmill file system is the log-structured
file system (LFS). A log-structured file system [Ros92] writes data only
to a sequential log. (A file block does not have a fixed position on
disk, but instead its position changes every time the block is rewritten.)
The log is written to disk in large units, called segments, on the order
of 1 MB in length. As the log fills the disk, old segments are read in
to free up the stale data; this process is called cleaning.

Because it writes only to the log, an LFS does not perform any random
writes, but only large sequential log updates. This is an advantage with
any disk system because it avoids the seek time and rotational latency
of a random write. However, there is an even more significant advantage
with a RAID due to the high parity overhead of small writes to RAID. By
using an LFS, random writes can use the full sequential write bandwidth
of the disk array because they are grouped together into a log write.
The LFS segment size must be a multiple of the parity stripe size for
maximum performance.

3. Implementation

This section describes the implementation of the Sawmill file system.
Section 3.1 gives an overview of how Sawmill processes client requests.
Section 3.2 describes how Sawmill uses the controller memory. Section
3.3 explains how read performance is improved. Section 3.4 describes
the new log layout techniques. Section 3.5 describes the handling of
metadata.

3.1  Handling a typical request

Clients communicate with Sawmill over a high-speed network. In the
current implementation, client applications are linked with a small
library to use the Sawmill file system. This library converts file system
operations into socket operations. For instance, to open a file, the
client calls a library routine, which opens a socket connection to the
Sawmill file server. The client then sends the open parameters through
the socket and receives the status from the file server. It would also
be straightforward to provide access to data via NFS; however, performance
would be limited by NFS protocol overhead.

For a read, the client library routine sends the read parameters over
the socket and then receives the data. On the server side, the read
request is received into controller memory and copied through the RAID-II
controller into the file server. The Sawmill file system determines
where the data is stored on disk. The file server sends commands to the
RAID-II controller to read data from disk into controller memory. When
the first block of data has arrived from the disks, the file server
issues commands to send the data across the network and to read the next
blocks of data. Pipelining the transfers in this way reduces latency.
Note that the file server only processes requests and handles control
messages, while the RAID-II controller does the actual data movement.

Writes to Sawmill are handled in a similar fashion. The client sends
the write parameters over the socket, followed by the data. The file
server receives the write request, determines where in controller memory
to receive the data, and informs the RAID-II controller to start receiving
data. After a full segment of the log has been collected in controller
memory, the file server commands the controller to write the data to
disk. Again, the data is transferred from network to disk without passing
through the file server's memory.

Sawmill currently uses the same network to receive requests and to
transfer data. However, it would be straightforward to use separate
control and data networks. This would typically be used if the high-speed
network had high latency, for instance, to set up a connection.

3.2  Controller memory

The RAID-II controller board has a 32 MB memory buffer. This memory is
used by the RAID striping driver and the Sawmill file system. The RAID
striping driver uses memory for buffering, to compute parity, and to
reconstruct data after a disk failure. The Sawmill file system uses the
remaining memory for several purposes: buffering network requests and
replies, holding LFS segments before they are written to disk, buffering
data during reads, and holding LFS segments for cleaning. This section
describes the use of controller memory for reads. Section 3.4 will
describe the use of controller memory for layout of write data in the
log.

In the original Sawmill file system design, we planned to implement a
standard file system block cache in controller memory. However, we
decided against this for two reasons. First, a block cache requires a
significant amount of CPU overhead for each block, to check whether a
block is in the cache and to update the cache information with new
blocks. Second, with RAID-II, the memory available for a cache would be
about 20 MB; we expect this is too small to provide a high hit rate with
high-bandwidth applications. Note that with a data rate of about 20
MB/s, the lifetime of cached data would be only one second. In any case,
client caching could provide most of the potential benefits of a server
data cache.

Currently, the Sawmill file system uses controller memory for pipelining
read requests. That is, for large reads, disk operations are overlapped
with network operations to hide the latency of disk and network operations.
As data blocks are read from disk to memory, previous blocks are being
sent from memory to the network. We are currently adding prefetching to
Sawmill to obtain this benefit for small operations.

3.3  Batching of reads

For efficiency, data should be read with a single disk operation whenever
possible. Unfortunately, a sequential request might not be stored
sequentially in the log. Fortunately, there will often be locality of
writes, so the data may be stored in the same segment, even if the blocks
are not stored sequentially. In this case, it would be more efficient
to read the entire segment and then send the pieces from memory over
the network in the proper order, rather than perform multiple disk
operations to fetch the blocks in order.

To group reads, the file system loops over each block, checking its
position on disk and seeing whether it is part of the same segment. When
it collects as many blocks as possible, it can read them all at once.
More complex algorithms are possible to improve performance when reading
out-of-order data. For instance, batching may be combined with prefetching
to read blocks before they are requested.

3.4  On-the-fly layout

A log-structured file system must arrange new file data into the sequential
log that is written to disk; this process is called log layout. Sawmill
uses a new method of performing log layout, called on-the-fly layout,
to get more efficiency. In a standard LFS implementation [Ros92][SBMS93],
write data is stored in the file system block cache. A backend process
pulls blocks out of the cache and places the blocks in the log. In
contrast, Sawmill assigns a position in the log to each data block when
the write request comes in. The file system informs the controller of
this position, causing incoming data to go directly from the network
into the proper position in the log. At the same time, the file system
reserves space in the log for any needed metadata such as inodes. Figure
3 illustrates the two techniques.

There are several advantages of doing layout when writes are received
rather than as a backend process. First, layout for a large write can
be done as a single operation, instead of many per-block operations,
thereby minimizing the processing overhead. Second, blocks do not have
to be moved to and from the cache, but are immediately put in the proper
location. The RAID striping driver requires the segment written to disk
to be stored contiguously in memory. Thus, a copy operation would be
required, which would significantly reduces performance. (An alternative
would be a device driver that accepts a scatter-gather array so layout
could be done by moving pointers instead of blocks.) On-the-fly layout
had a large impact on performance of Sawmill; an earlier implementation
used a block cache for writes but due to copying and per-block overheads,
performance was limited to under two megabytes per second.

Because it doesn't use a cache, on-the-fly layout has a few potential
disadvantages. First, data cannot be reorganized before it is written.
For instance, if writes to a file are received in random order into a
cache, they could be taken out of the cache sequentially and written in
order. However, if blocks are placed in the log immediately, they will
be written in the random order in which they were received. This will
decrease performance if they are later read sequentially. The second
disadvantage occurs if data blocks are modified or deleted. With a cache,
only the live blocks will be written to disk. However, with on-the-fly
layout, the dead blocks will have a position in the log and will go to
disk, resulting in unnecessary disk traffic. We expect that these
disadvantages will not be important in practice. Because of the high
data rates, blocks are likely to go to disk before they would be modified
or deleted. Also, if there were a cache on the clients, the client cache
could do the processing and provide the benefits.

3.5  Metadata movement

Besides handling file data, the file system must handle various types
of metadata such as information on files (inodes), information about
where blocks are on disk, the directory structure of the file system,
and information about the contents of LFS segments. With a normal storage
system architecture, the file system just accesses metadata as it needs
it. However, with RAID-II any required metadata must be explicitly copied
between file server memory and the RAID-II controller. Since moving data
over this path is relatively slow, metadata is cached in file server
memory. As a result, the Sawmill file system must maintain consistency
among metadata in file server memory, metadata in controller memory,
and metadata stored on disk.

4. Performance measurements

This section describes performance measurements of the Sawmill file
system. Section 4.1 describes measurements of a single request stream
with requests that are handled sequentially. Section 4.2 describes
measurements of multiple requests processed in parallel: these numbers
indicate the additional performance that a disk array can obtain by
handling concurrent requests. Performance results are given for the raw
disk array, the disk array configured as a RAID, and the disk array
running Sawmill.

These performance measurements used RAID-II with 16 disks. The stripe
unit was 64 KB, yielding a parity stripe size of 960 KB. In other words,
writes of 960 KB transfer 64 KB data blocks to 15 disks and a 64 KB
parity block to 1 disk in parallel. The LFS segment size was also 960
KB. The measurements in this section are of random requests of various
fixed sizes. For these measurements, the Sawmill file server was a
Sun-4/280, which is relatively slow (about 9 SPEC89 integer SPECmarks).

Because we don't have a client that can handle the data rates of RAID-II
yet, the following performance tests did not transmit data across the
network, but only transferred the data between the disks and controller
memory. Our current client can only transfer 3 to 4 MB/s over the network.
However, the measured server CPU load to handle this network traffic
was about 2%. Thus, the server could transfer the full RAID-II bandwidth
over the network without a CPU bottleneck arising due to the network
traffic. Therefore, the performance numbers in this section should be
a reasonable indication of the actual network performance with a fast
client.

There are several key results from the performance measurements. First,
Sawmill is able to provide about 80% of the raw disk bandwidth for large
requests. Sawmill is also able to handle a single stream of small read
requests with performance close to that of the raw disk. Because of LFS,
Sawmill can handle small writes an order of magnitude faster than the
raw disk. However, the server CPU is a performance bottleneck for small
writes and for multiple request streams. Small write performance with
Sawmill would improve dramatically with a faster CPU. Also, Sawmill can
only support about 3 or 4 concurrent streams of small requests before
the CPU becomes a bottleneck.

4.1  Single request stream

This section describes the performance of the raw disk array, the disk
array treated as a RAID, and the Sawmill file system, when receiving a
single stream of requests. This indicates the performance available to
a single application. Figure 4 shows read requests and Figure 5 shows
write requests.

4.1.1  Raw disk array performance

To understand the performance, the first item to examine is the disk
array. The array contains IBM 0661 disks [IBM91]. These disks have
average rotational latency of 7 ms and average seek time of 12 ms. Each
disk can read and write at about 1.6 MB/s. The disk array was configured
with 16 disks on four controllers. A single SCSI string was able to read
at 3.14 MB/s with two disks. With two strings of two disks per controller,
each controller can provide 6.24 MB/s. For the measurements in this
section, the disk array was treated as independent disks, with no parity
computation. To make the measurements comparable to the RAID and Sawmill
measurements, requests were broken into 64 KB blocks and made to the
appropriate number of disks: requests smaller than 64 KB used a single
disk and larger requests used multiple disks.

The "Raw" lines in Figures 4 and 5 show the performance of the raw disk
array as a function of request size. Performance was highly dependent
on request size. Peak read performance was 24.5 MB/s and peak write
performance was 18.4 MB/s. For small requests, seek time dominated.

With the raw disk array, the processing overhead is minimal, so the
performance is almost entirely dependent on the disk speed. The time to
complete a request to a disk can be modeled approximately as:
20 ms + disk_request_size / peak_bandwidth
where the measured peak bandwidth is 1.57 MB/s for reads and 1.16 MB/s
for writes. By dividing the request size by this time, the bandwidth
for a given request size can be modeled. Writes are slower because of
missed revolutions; the disks have a track buffer that minimizes missed
revolution costs for reads.

4.1.2  RAID performance

In the previous section, the disk array was viewed as independent disks.
In this section, the disks are treated as a RAID, but without a file
system. The RAID striping driver receives requests and breaks them up
into stripe units spread across multiple disks. For writes, parity is
computed. These measurements are significant because Sawmill is built
on top of the RAID.

The "RAID" lines in Figures 4 and 5 show the read and write performance
of a single stream to the RAID device. Read performance is generally
the same as for the raw disk. For large accesses there is a small
performance loss, mainly from processing overhead and increased latency
due to the RAID striping driver. RAID writes take at least twice as long
as writes to the raw disk due to the read-modify-write parity update.
Note the spikes in write performance at multiples of 960 KB; these sizes
are multiples of the parity stripe size and thus don't require a parity
read-modify-write.

These measurements show that for good write performance, the file system
must perform operations in multiples of the parity stripe size.

4.1.3  Sawmill performance

The Sawmill file system, by using the techniques described earlier, was
able to obtain peak bandwidth of about 80% of the raw disk bandwidth.
For the measurements in this section, random reads and writes of various
sizes were made to a single 30 MB file.

For a single request stream, Sawmill read performance is close to the
raw disk performance. There is a performance loss for very large requests
due to the per-request file system overhead of Sawmill and due to reads
broken across multiple LFS segments. For large sequential reads, maximum
performance is 21 MB/s, slightly below the raw disk bandwidth. For
individual small random reads, as shown in Figure 4, performance is
limited by the file system overhead and the seek time of the disks,
resulting in a minimum latency of about 20 ms per operation.

Figure 4 shows the CPU and disk usage for Sawmill read requests. The
disk usage shows the percentage of time the disks are in use, averaged
across all 16 disks (e.g. 8 disks in use 50% of the time is 25%
utilization). Because the measurements in Figure 4 use only a single
stream of requests, small requests only use one disk at a time and
cannot exceed 6.25% utilization. Note that CPU load is fairly high in
Figure 4, but drops around 100 KB. This shape is due to two factors:
CPU overhead per byte and the total number of bytes. The CPU usage per
disk operation is roughly constant. As the request size increases to
the 64 KB stripe unit, each disk operation transfers more data, so the
fraction of time spent with CPU overhead decreases. However, as request
size continues to increase beyond 64 KB, more disks are used in parallel,
causing the CPU usage to climb again.

Figure 5 shows write performance. Large files can be written to disk at
about 15 MB/s. Note that small writes are totally CPU limited. This is
due to LFS: since small writes are batched together and written
sequentially, very many small writes take place before a disk operation
occurs. Thus, the file system overhead to perform this batching dominates
small write performance. As the request size increases, per-operation
CPU time becomes less important, causing disk usage to climb and CPU
usage to drop. These performance measurements omit the cost of writes
that modify less than a file block; in this case an additional read
would be required to fetch the unmodified data.

Figure 5 illustrates the benefits of a log-structured file system and
also shows the performance lost due to CPU load. Because LFS groups
small writes together, performance is about 20 times that of the RAID.
This performance would be much better with a faster CPU; if there were
no CPU load, the Sawmill line would be approximately flat around 15
MB/s, since the segment size written to disk is fixed. A more realistic
projection is to consider a CPU that is ten times faster, such as a DEC
Alpha. Small write performance would scale almost linearly, since small
writes are totally CPU bound. This would result in small write performance
of 2 MB/s, about 200 times the performance of writing directly to the
RAID.

Figure 5 can be used to estimate the write performance of a file system
based on the Unix FFS [MJLF84] running on RAID-II and compare it with
Sawmill. If writes go to disk individually, the Unix file system would
have performance similar to that of the RAID (assuming no CPU overhead
for the Unix file system and ignoring the seeks to write metadata in
the Unix file system), so Sawmill would be a factor of 20 faster for
small writes. A Unix file system with clustering [MK91] would shift
performance along the RAID line, since the writes would be in larger
units. However, Sawmill performance would still be much better because
Sawmill writes are grouped into full parity stripes. With a faster
processor, the benefits of Sawmill would be even more dramatic because
small writes are currently CPU limited. Sawmill would have additional
overhead due to cleaning (which will be discussed in Section 6.1), but
the performance benefit of logging is likely to greatly exceed this
cost.

4.2  Multiple stream performance

This section describes the performance of the system when multiple
requests are processed in parallel. This indicates the performance if,
for example, multiple applications or multiple clients were using the
system. Figure 6 shows read performance with concurrency and Figure 7
shows write performance.

A disk array can process multiple requests in parallel if they go to
separate disks. This improves the overall system throughput and the
total number of operations per second, but doesn't improve the performance
of any particular request stream. Concurrency is a significant benefit
for small requests, which only use a single disk. Potentially, 16 small
reads or 8 small RAID writes could take place in parallel, improving
performance by the same factor. Requests larger than the 64K stripe unit
will be striped across multiple disks and become inherently parallel.
Thus, multiple requests improve the performance much more for small
requests than large requests.

The limiting factor to concurrency in our system is the CPU load.
Unfortunately, our file server CPU is relatively slow and cannot handle
the full potential concurrency of the system. Only 3 or 4 Sawmill
operations can be run in parallel before the CPU becomes a bottleneck.
Since the disk array could support 16 operations in parallel, small
operation throughput would increase substantially with a better processor.

In Figures 6 and 7, the degree of concurrency used depended on how much
concurrency the system could handle before saturating. The raw disks
handled 16 independent request streams. The RAID handled 8 independent
streams for small requests, with the number decreasing as the request
size increased. Sawmill handled 4 concurrent small read requests. For
Sawmill, no write concurrency was required because LFS results in all
the disks being used in parallel.

For the raw disk array, the CPU load is almost entirely determined by
the number of disk operations. Starting a disk operation, handling the
interrupt, and concluding the disk operation take about 1 ms in total.
Since the SCSI controller can only handle requests up to 128KB, a larger
logical operation results in multiple disk operations. Due to these
factors, CPU load for 16 parallel raw disk operations is about 70% for
small operations, drops to 20% for 128 KB requests, and then climbs
slightly. This indicates that although the RAID-II architecture allows
data rates greatly exceeding what the file server could move, our current
file server CPU is just barely able to handle the high interrupt rates
of small disk operations.

Besides bandwidth, another important measurement is the total number of
I/O operations the disk array can support per second. Figure 8 shows
the number of I/O operations per second the system can support, and
compares a single request stream with multiple request streams. These
measurements are for 4 KB random operations.

In theory, by performing multiple operations in parallel, performance
should scale with the number of disks. However, the main limitation on
parallelism with RAID-II is the CPU load; handling multiple requests
through RAID or Sawmill saturates the CPU before all disks are in use.
The RAID striping driver has more processing load than the raw disk,
and the Sawmill file system has additional load. Thus, the number of
parallel requests possible before the CPU saturates is reduced. The RAID
measurements were saturated with 8 requests in parallel, while the
Sawmill measurements were saturated with 4 requests.

For a single raw disk, the number of operations per second is almost
entirely limited by the 7 ms rotational latency and the 12 ms average
seek time. With 16 raw disks, Figure 8 shows about a 5% performance
penalty; the CPU load is about 75%, so there is some delay in starting
requests.

When the disk array is treated as a RAID, the I/O rates change
significantly. A single stream of read requests to the RAID uses a single
disk, so performance is approximately the same as for a single disk.
(Performance in Figure 8 is slightly better for the RAID because seek
distance was shorter for the striped requests.) Performance of a single
small write stream is about half of the read performance due to the
read-modify-write parity update. With multiple parallel request streams,
the RAID uses disks in parallel. Performance levels off around 8 request
streams because the CPU becomes saturated. A faster CPU would significantly
increase the total number of I/Os per second that the RAID can support.

Figure 8 also shows performance of the Sawmill file system for small
requests. Comparing the read measurements to the RAID with a single
request stream shows that there is a slight performance loss due to
Sawmill; this is due to overhead in the file system. (The read latency
is simply the reciprocal of the I/Os per second, or about 22 ms.) Due
to the CPU load, reads obtained little benefit from concurrency. For
writes, the number of I/Os per second is very good compared to the RAID,
showing the benefit of LFS. With a faster CPU, the number of I/Os per
second would be even higher, since performance is CPU limited, not disk
limited.

4.3  Scalability

This section describes how the performance measurements are likely to
change with improvements in technology. To summarize, performance of
large requests, small writes, and concurrent operations is likely to
improve with faster technology. However, small reads are limited by the
performance of individual disk drives, which will not improve as quickly
over time.

Because small writes are limited by CPU speed, small write performance
will scale almost linearly with increases in processing power. Current
high-performance workstations already have ten times the CPU power of
our Sun-4 server, and even faster machines will become available over
the next few years. Thus, the performance of small writes in a Sawmill-like
file system should improve dramatically over what we measured.

Small individual random reads are limited by disk positioning time. Disk
positioning times are likely to improve relatively slowly, compared to
improvements in CPU speed. Faster CPU speeds will improve the potential
parallelism of small reads, which was limited by our CPU. Increasing
the number of disks will increase the total number of independent reads
that can be carried out simultaneously. This will increase the number
of I/O operations per second, but won't help the performance of an
individual request. Prefetching can be used to improve the performance
of individual requests; if the data is fetched before it is required,
the disk latency will not affect latency to handle the request. One
technique for this is Gibson's Transparent Informed Prefetching [GPS93];
by obtaining hints from the application, the file system can fetch data
before it is required.

Large reads and writes already achieve close to the raw system performance
with Sawmill. Thus, large request performance will only increase with
storage systems that have more or faster disks. The CPU speed will have
to increase proportionally or else the CPU will become a bottleneck.
However, note that the current server is a Sun-4 and much faster CPUs
are readily available.

5. Related work

There are several storage systems that provide high-performance I/O.
One fundamental problem these systems must solve is how to provide the
data without being limited by the file server's CPU, memory, and I/O
bandwidth. There are various methods that have been used to solve these
problems. For instance, mass storage systems usually use an expensive
mainframe, designed to support high bandwidth, as the file server. Other
systems use multiprocessors or multiple servers to provide sufficient
server power.

5.1  Mass storage systems

There are several high-performance mass storage systems in use at
supercomputing centers. One example is the LINCS storage system at
Lawrence Livermore National Laboratory [HCF+90]; this storage system
has 200 GB of disk connected to an Amdahl 5868. Another example is a
system at the Los Alamos National Laboratory [CMS90], which has an IBM
3090 running the Los Alamos Common File System.

These systems solve the problem of the file server being a performance
bottleneck by using a mainframe as the file server. The mainframe server
is designed to have the I/O bandwidth, memory bandwidth, and CPU power
necessary to provide very high data rates to clients. In addition, the
server is connected to a network sufficient to handle very high data
rates. The disadvantage of these mass storage systems is their cost.
Because they are custom-designed and use an expensive mainframe, they
are used at very few sites.

5.2  Multiprocessor file servers

Another approach to increasing I/O performance is to use a multiprocessor
as the file server. Such a system avoids the problem of the server being
a performance bottleneck by using multiple processors, with the associated
gain in server CPU power and memory bandwidth. One example of this is
the Auspex NFS server [Nel90]. The Auspex system uses asymmetric functional
multiprocessing, in which separate processors deal with the Ethernet,
files, disk, and management. The necessary disk bandwidth is provided
by parallel SCSI disks. However, the performance of the Auspex is limited
by its use of NFS, Ethernet, and a single 55 MB/s VME bus; measurements
show it can supply about 400 KB/s to a single client and can saturate
an Ethernet network with 1MB/s per Ethernet connection [Wil90].

The DataMesh project proposed a different approach to
multiprocessing[Wil91]. The proposed system would consist of a large
array of disk nodes, where each node had a fast CPU (20 MIPS) and 8 to
32 MB of memory. These nodes would be connected to a high-performance
interconnection network.

By providing multiple processors, these systems avoid bottlenecks from
limited server CPU bandwidths. However, this tends to be an expensive
solution, since it requires buying enough processors to provide the
necessary memory bandwidth. Even so, the Auspex server still has a memory
bandwidth bottleneck since it uses a single VME bus.

5.3  Striping across servers

High bandwidth access to very large data objects can also be provided
by striping files across multiple servers, so that overall bandwidth
isn't limited by the memory system of a single server. An example is
the Swift system [CL91]. In this system, data is striped across multiple
file servers and networks to provide more bandwidth than a single server
could provide.

A second system that stripes data across multiple servers is Zebra
[HO93]. In Zebra, each client writes its data to a sequential log. This
log is then striped across multiple servers, each with a disk. Zebra
and Sawmill both combine LFS and RAID. The key difference is that Zebra
uses multiple servers with single disks and Sawmill uses a single server
with multiple disks. Swift and Zebra avoid the file server bottleneck
by using multiple servers, while Sawmill avoids the bottleneck by using
a data path in hardware. There is a cost trade-off: Sawmill requires a
special controller, while Swift and Zebra require multiple fast servers.

5.4  RAID parity updates

There are several techniques to reduce the cost of updating parity after
a partial stripe write. One technique is Parity Logging [SGH93]. In this
technique, parity updates are written to a log. At regular intervals,
the log is scanned and the parity modifications are applied to the
standard RAID parity blocks. Floating Parity [MRK93] is a second technique
for minimizing parity cost. In this technique, multiple parity blocks
are reserved on disk. Updates can use the block closest to the current
rotational position of the disk. The Logical Disk [dJKH93] implements
a log-structured file system at the disk level rather than the file
system level by writing all blocks sequentially to a log and maintaining
the mapping between logical disk blocks and physical disk blocks. Since
it writes blocks to a sequential log, it avoids random parity updates.

6. Future Work

Probably the most important unanswered issue with Sawmill is the cost
of cleaning a log-structured file system. Another major issue is how to
improve the performance of small reads; this was discussed earlier. A
related issue is improving large read performance by data reorganization.

6.1  Cleaning cost

One of the key unanswered questions about Sawmill is the overhead due
to log cleaning. Cleaning is the process of garbage-collecting the log
to free up space. Cleaning is not yet operational in Sawmill, so
performance measurements are not available. Previous work [Ros92]
indicated that overall cleaning costs would be low. However, [SBMS93]
found cleaning costs to be high for some workloads, such as transaction
processing. Costs were high particularly in environments with largely
full disks, and cleaning could potentially cause service interruption.
The former work did not have dynamic performance measurements of the
cleaner, and the latter measurements were on a largely untuned
implementation.

The impact of cleaning on the performance of log-structured file systems
continues to be a topic requiring additional investigation, which is
beyond the scope of this paper. At this point, there is no evidence to
suggest that Sawmill will not perform well for the workloads typical
of office environments. Furthermore, even if cleaning costs are high,
Sawmill's ability to avoid the "small-write problem" of RAID devices
could well offset such overheads.

6.2  Data reorganization

Log-structured file systems are write-optimized; that is, they organize
data for fast writes, even though this may disadvantage later reads. In
particular, after random writes data may not be stored sequentially on
disk, resulting in lower performance for later sequential reads. However,
by reorganizing data on disk so the data is sequential, the reorganized
data can be more efficient for reads. Thus, reorganization on Sawmill
could improve sequential read performance. This reorganization could
either take place during idle times in the system, or it could be
integrated with cleaning so cleaned data is written back in a better
pattern for reads.

7. Conclusions

This paper has described Sawmill, a high bandwidth file system that uses
the RAID-II disk array. Sawmill uses a cost-effective file server to
provide high-bandwidth access to a disk array. By taking advantage of
hardware support, Sawmill provides data rates much higher than the memory
bandwidth of the file server.

Measurements show that for large requests Sawmill operates at close to
the raw bandwidth of the disk array, reading data at a peak rate of 21
MB/s and writing data at 15 MB/s, while running on a Sun-4. The
log-structured file system improved performance of a small write stream
by an order of magnitude over the RAID, and with a faster processor
small write performance would be improved even more. The high CPU load
in many of the performance measurements shows that even with hardware
support, there are still significant demands on the file system CPU,
and a fast processor is still required. Our current CPU was a bottleneck
for small writes and concurrent operations.

In conclusion, Sawmill shows that combining a direct data path, a disk
array, and a log-structured file system is a cost-effective method of
providing high-bandwidth storage.

8. Acknowledgments

The authors would like to thank John Bongiovanni, Keith Bostic, and Drew
Roselli for their comments on this paper. The assistance of the RAID
group, in particular Ann Drapeau and Srini Seshan was instrumental in
getting Sawmill running. This research was supported by the California
MICRO program, ARPA grant N00600-93-C-2481, and NSF grant IRI-9116860.

9. References

[CK91]  Ann L. Chervenak and Randy H. Katz. Performance of a disk array
prototype. In Proceedings of the 1991 ACM Sigmetrics Conference on
Measurement and Modeling of Computer Systems, pages 188-197, 1991.

[CL91]  L. Cabrera and D. Long. Exploiting multiple I/O streams to
provide high data-rates. In Proceedings of the 1991 USENIX Summer
Conference, June 1991.

[CMS90]         B. Collins, C. Mercier, and T. Stup. Mass-storage system
advances at Los Alamos. In Digest of Papers, Proc. Tenth IEEE Symposium
on Mass Storage Systems, pages 77-81, May 1990.

[dJKH93]        Wiebran de Jonge, M. Frans Kaashoek, and Wilson C. Hsieh.
The Logical Disk: A new approach to improving file systems. In Proceedings
of the Fourteenth SOSP, Operating Systems Review, volume 27, pages 15-28,
December 1993.

[DSH+94]        Ann L. Drapeau, Ken Shirriff, John H. Hartman, Ethan L.
Miller, Srinivasan Seshan, Randy H. Katz, Ken Lutz, David A. Patterson,
Edward K. Lee, Peter M. Chen, and Garth A. Gibson. RAID-II: A
high-bandwidth network file server. In Proceedings of the 21st
International Symposium on Computer Architecture, Chicago, IL, April
1994. To appear.

[GPS93]         G. A. Gibson, R. H. Patterson, and M. Satyanarayanan.
A status report on research in transparent informed prefetching. ACM
Operating System Review, 27(2):21-34, April 1993.

[HCF+90]        C. Hogan, L. Cassell, J. Foglesong, J. Kordas, M. Nemanic,
and G. Richmond. The Livermore Distributed Storage System: Requirements
and overview. In Digest of Papers, Proc. Tenth IEEE Symposium on Mass
Storage Systems, pages 6-17, May 1990.

[HO93]  John H. Hartman and John K. Ousterhout. The Zebra striped network
file system. In Proceedings of the 14th Symposium on Operating System
Principles, pages 29-43, Asheville, NC, December 1993. ACM.

[IBM91]         IBM. 0661 Functional Specification Model 371, Version
0.7, February 18 1991.

[MJLF84]        M. McKusick, W. Joy, S. Leffler, and R. Fabry. A fast
file system for UNIX. ACM Transactions on Computer Systems, 2(3):181-197,
August 1984.

[MK91]  L. McVoy and S. Kielman. Extent-like performance from a UNIX
file system. In Proceedings of the 1991 USENIX Winter Conference, 1991.

[MRK93]         J. Menon, J. Roche, and J. Kasson. Floating parity and
data disk arrays. Journal of Parallel and Distributed Computing,
17:129-139, 1993.

[Nel90]         B. Nelson. An overview of functional multiprocessing
for network servers. Technical report, Auspex Systems Inc., July 1990.

[PGK88]         David Patterson, Garth Gibson, and Randy Katz. A case
for redundant arrays of inexpensive disks (RAID). In ACM SIGMOD Conference,
pages 109-116, June 1988.

[Ros92]         Mendel Rosenblum. The Design and Implementation of a
Log-structured File System. PhD thesis, U.C. Berkeley, June 1992. Report
UCB/CSD 92/696.

[SBMS93]        M. Seltzer, K. Bostic, M. K. McKusick, and C. Staelin.
An implementation of a log-structured file system for UNIX. In 1993
Winter Usenix, pages 307-326, January 1993.

[SGH93]         D. Stodolsky, G. Gibson, and M. Holland. Parity logging
overcoming the small write problem in redundant disk arrays. Computer
Architecture News, 2(2):64-75, May 1993.

[Wil90]         D. Wilson. The Auspex NS5000 fileserver. Unix Review,
8(8):91-102, August 1990.

[Wil91]         J. Wilkes. DataMesh - parallel storage systems for the
1990s. In Digest of Papers, Proc. Eleventh IEEE Symposium on Mass Storage
Systems, pages 131-136, October 1991.

Author Information

Ken Shirriff is a graduate student in computer science at the University
of California, Berkeley. He is a member of the Sprite network operating
system project. His interests include distributed operating systems,
computer graphics, and fractals. He received a B.Math. degree from the
University of Waterloo in 1987, a M.S. in computer science from UC
Berkeley in 1990, and expects to receive his Ph.D. in 1994. He can be
reached at shirriff@cs.berkeley.edu.

John K. Ousterhout is a Professor in the Department of Electrical
Engineering and Computer Sciences at the University of California at
Berkeley. His interests include operating systems, distributed systems,
user interfaces, and computer-aided design. He is currently leading the
development of Sprite, a network operating system for high-performance
workstations, and Tcl/Tk, a programming system for graphical user
interfaces. Ousterhout is a recipient of the ACM Grace Murray Hopper
Award, the National Science Foundation Presidential Young Investigator
Award, the National Academy of Sciences Award for Initiatives in Research,
the IEEE Browder J. Thompson Award, and the UCB Distinguished Teaching
Award. He received a B.S. degree in Physics from Yale University in 1975
and a Ph.D. in Computer Science from Carnegie Mellon University in 1980.


Figure captions

Figure 1. Read, modify, write. When a small part of the stripe is written
to disk, the new parity is computed from the old data, the old parity,
and the new data. Thus, four disk operations are required.

Figure 2. The RAID-II storage architecture. The RAID-II storage system
(on the right) provides high-bandwidth file access to the network clients
(on the left). The RAID-II controller provides a fast data path between
the disks and the network. The controller is built around a fast crossbar
that connects the disks, memory, and the network. The file server,
running the Sawmill file system, controls the storage system but stays
out of the data path. The current configuration uses 16 disks on four
controllers. These disk are treated as a single RAID stripe group.

Figure 3. Comparison of log layout techniques. On the left, the layout
uses a cache as in traditional log-structured file systems. On the right,
the layout is done on the fly, directly into a write buffer.

Figure 4. Read performance for a single stream of random requests as a
function of the request size. The top graph shows bandwidth through the
raw disk array, the disk array treated as a RAID, and the Sawmill file
system. The lower graph shows CPU utilization and disk utilization
through Sawmill.

Figure 5. Write performance for a single request stream. As in Figure
4, the top graph shows performance of the raw disk array, the RAID, and
Sawmill. The lower graph shows CPU and disk utilization with Sawmill.
Note that because of LFS, Sawmill performance is much better than writing
directly to the disks. Also note that small requests are totally CPU
limited. This is due to LFS: for small writes, almost all the time is
spent grouping them together. CPU utilization remains high, even for
large writes.

Figure 6. Read performance with multiple concurrent requests. The lines
indicate the raw disk array as 16 independent disks, the disk array
treated as a RAID, and the Sawmill file system. Concurrency of RAID and
Sawmill are limited by the CPU load.

Figure 7. Write performance of the raw disk array, RAID, and Sawmill
with multiple concurrent requests. The spikes in RAID write performance
at 960 KB and 1.92 MB illustrate the write performance possible when a
full stripe is written at once; other writes require an expensive parity
update. Note that because of LFS, the performance of Sawmill exceeds
performance of the RAID.

Figure 8.  Random 4KB I/O operations. This table shows the number of
I/O operations RAID-II can handle per second under various configurations.
CPU load limits multiple RAID and Sawmill requests.