Related Work
This paper combines four elements: (1) eager-writing, (2) data
replication for performance improvement, (3) systematically configuring a
disk array to trade capacity for performance, and (4) intelligent
scheduling of queued disk requests. While each of these techniques
can individually improve I/O performance, it is the integration and
interaction of these techniques that allow one to economically achieve
scalable performance gains on TPC-C-like transaction processing
workloads. We briefly describe some related work in each of those
areas.
The eager-writing technique dates back to the IBM IMS Write Ahead Data
Set (WADS) system, which writes write-ahead log entries in an
eager-writing fashion on drums [7]. Hagmann employs
eager-writing to improve logging performance on
disks [10]. A similar technique is used in the Trail
system [14]. These systems require the log entries to be
rewritten to fixed locations. Mime [4], an extension of
Loge [6], integrates eager-writing into the disk
controller and eliminates the need to rewrite the data created by
eager-writing. The Virtual Logging Disk and the Virtual Logging File
Systems eliminate the reliance on NVRAM for maintaining the
logical-to-physical address mapping
and further explore the
relationship between eager-writing and log-structured file
systems [28]. All of these systems work on individual disks.
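As an illustration of the technique itself, the following sketch shows the
core of eager-writing on a single disk. The class and method names are
hypothetical, and the model is deliberately simplified (the head position is
treated as a block index); it is not the implementation of any of the systems
cited above.

    class EagerWritingDisk:
        def __init__(self, num_blocks):
            self.free = set(range(num_blocks))  # free physical blocks
            self.mapping = {}                   # logical block -> physical block
            self.store = {}                     # physical block -> data (stands in for the platter)
            self.head = 0                       # simplified head position: a block index

        def write(self, logical, data):
            # Write to the free block nearest the head instead of a fixed home location.
            target = min(self.free, key=lambda p: abs(p - self.head))
            self.store[target] = data
            old = self.mapping.get(logical)
            if old is not None:
                self.free.add(old)              # the stale copy becomes free space again
            self.free.remove(target)
            self.mapping[logical] = target      # must be updated atomically with the write
            self.head = target

        def read(self, logical):
            return self.store[self.mapping[logical]]

Keeping the logical-to-physical mapping persistent is precisely what
distinguishes the systems above: some rewrite the data to fixed locations,
while Mime and the virtual logging work make the mapping itself recoverable.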
A more common approach to improving small write performance is to
buffer data in NVRAM and periodically
flush the full buffer to disk. The NVRAM data buffer
provides two benefits:
efficient scheduling of the buffered writes, and
potential overwriting of data in the buffer before it reaches disk.
For many
transaction processing applications, poor locality tends to result
in few overwrites in the buffer, and lack of idle time makes it
difficult to mask the time consumed by buffer flushing. It is also
difficult to build a large, reliable and inexpensive NVRAM data
buffer. On the other hand, a small, inexpensive, and reliable NVRAM
buffer, such as the one employed for the mapping information in the
EW-Array, is quite feasible.
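For concreteness, a minimal sketch of such an NVRAM data buffer follows; the
interface is assumed for illustration and is not taken from the cited
systems. Overwrites are absorbed in the buffer, and a full buffer is flushed
in sorted order so the writes can be scheduled efficiently.

    class NVRAMWriteBuffer:
        def __init__(self, capacity, disk_write):
            self.capacity = capacity
            self.disk_write = disk_write      # callback: (block address, data) -> None
            self.buffer = {}                  # block address -> latest data

        def write(self, addr, data):
            self.buffer[addr] = data          # an overwrite replaces the buffered copy for free
            if len(self.buffer) >= self.capacity:
                self.flush()

        def flush(self):
            # Flushing in address order allows elevator-style scheduling, but with
            # the poor locality of TPC-C-like workloads nearly every buffered block
            # still costs a separate head movement.
            for addr in sorted(self.buffer):
                self.disk_write(addr, self.buffer[addr])
            self.buffer.clear()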
Systems that use an NVRAM data buffer differ in how they flush
the buffer to disk. The conventional approach is to flush data to an
update-in-place disk. In the steady state, the throughput of such a
system is limited by the average head movement distance between
consecutive disk writes. An alternative to flushing to an
update-in-place disk is to flush the buffered data in a log-structured
manner [1,21]. The disadvantage of such a system
is its high cost of disk garbage collection, which is again
exacerbated by the poor locality and the lack of idle time in
transaction processing workloads [22,24]. Some
systems combine the use of a log-structured ``caching disk'' with an
update-in-place data disk [12,13], but such hybrids are not immune to
the problems associated with each of these techniques, especially when
faced with I/O access patterns such as those seen in TPC-C.
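The two flush policies can be contrasted with a small sketch; the helper
names are assumptions. In-place flushing pays one small random write per
dirty block, while log-structured flushing turns the flush into a single
large sequential write at the price of leaving dead copies behind that must
later be garbage-collected.

    def flush_in_place(buffer, disk_write):
        for addr in sorted(buffer):                    # one head movement per block
            disk_write(addr, buffer[addr])
        buffer.clear()

    def flush_log_structured(buffer, log_segments, segment_map):
        segment = []                                   # one large sequential write
        for addr, data in buffer.items():
            segment_map[addr] = (len(log_segments), len(segment))  # old copy becomes garbage
            segment.append((addr, data))
        log_segments.append(segment)
        buffer.clear()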
Several systems are designed to address the small write
performance problem in disk arrays. The Doubly Distorted Mirror
(DDM) [19] is closest in spirit to the EW-Array. The two
studies differ in emphasis, though. First, the emphasis of the EW-Array
study is to explore how to balance the excess capacity devoted to
eager-writing, mirroring, and striping, and how to perform disk
request scheduling in the presence of eager-writes. Second, the
EW-Array employs pure eager-writing without the background movement of
data to fixed locations. While this is more suitable for TPC-C-like
workloads, other applications may benefit from a data reorganizer.
Third, the EW-Array study provides a real implementation.
While the DDM and the EW-Array are based on mirrored organizations,
the techniques that may speed up small writes on individual disks may
be applied to parity updates in a RAID-5. Floating parity
employs eager-writing to speed up parity updates [18].
Parity Logging employs an NVRAM and a logging disk to accumulate a
large number of parity updates that can be used to recompute the
parity using more efficient large I/Os [25]. The amount
of performance improvement experienced by read requests in a RAID-5 is
similar to that on a striped system, and as we have seen in the
experimental results, a striped system may not be as effective as a
mirrored system, particularly if the replica propagation cost of a
mirrored system is reduced by eager-writing.
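A rough sketch of the parity-logging idea, with assumed rather than published
data structures, looks as follows: each small write records only the XOR
delta of the old and new data, and the accumulated records are later replayed
against the parity with fewer, larger I/Os.

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    class ParityLog:
        def __init__(self):
            self.records = []                     # buffered in NVRAM, spilled to a logging disk

        def small_write(self, stripe, old_data, new_data):
            # Defer the parity read-modify-write; only the XOR delta is remembered.
            self.records.append((stripe, xor(old_data, new_data)))

        def apply(self, parity):                  # parity: stripe -> parity block
            # Replaying the whole log at once amortizes the parity I/O over
            # many accumulated small writes.
            for stripe, delta in self.records:
                parity[stripe] = xor(parity[stripe], delta)
            self.records.clear()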
Instead of forcing one to choose between a RAID-10 and a RAID-5, the
AutoRAID combines both so that the former acts as a cache of the
latter [29]. The RAID-5 lower level is log-structured and
it employs a hole-plugging technique for efficiently
garbage-collecting a nearly full disk: live data is ``plugged'' into
free space of other segments. This is similar to eager-writing,
except that eager-writing does not require garbage collection.
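A minimal sketch of hole-plugging, with segments modeled simply as lists of
slots holding either a (logical address, data) pair or None, illustrates the
idea; the structures are assumptions, not AutoRAID's implementation.

    def hole_plug(victim, segments, mapping):
        # Collect free slots ("holes") in every other segment.
        holes = [(seg_id, slot)
                 for seg_id, seg in enumerate(segments) if seg is not victim
                 for slot, entry in enumerate(seg) if entry is None]
        for slot, entry in enumerate(victim):
            if entry is None:
                continue                           # slot already free
            logical, data = entry
            seg_id, hole = holes.pop()             # assumes enough holes exist elsewhere
            segments[seg_id][hole] = (logical, data)
            mapping[logical] = (seg_id, hole)      # redirect the logical address
            victim[slot] = None                    # eventually the whole victim is free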
An SR-Array combines striping with rotational data replication to
reduce both seek and rotational delay [30]. A mirrored system
may enjoy some similar benefits [2,5].
Both approaches allow one to trade capacity for better performance.
The
difficulty in both cases is the replica propagation cost. The
relative sizing of the two levels of storage in AutoRAID is a
different way of trading capacity for performance.
In fact, for locality-rich workloads, it is possible to
employ an SR-Array or an EW-Array as an upper-level disk cache of an
AutoRAID-like system.
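The benefit of rotational replicas on reads can be sketched in a few lines;
the function below is an illustration, not the SR-Array code. With k copies
spaced evenly around a track, the expected rotational wait drops to roughly
1/k of a rotation, while every write must eventually propagate to all k
copies.

    def pick_replica(replica_angles, head_angle):
        # Choose the copy the head reaches soonest in the direction of rotation.
        return min(replica_angles, key=lambda a: (a - head_angle) % 360)

    # Example: copies at 0, 120, and 240 degrees with the head at 100 degrees;
    # the 120-degree copy is chosen, a 20-degree wait instead of up to a full turn.
    print(pick_replica([0, 120, 240], 100))    # -> 120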
Seltzer and Jacobson have independently examined disk scheduling
algorithms that take rotational delay into
consideration [16,23]. Yu et al. have extended
these algorithms to account for rotational replicas [30].
Polyzois et al. have proposed a delayed write scheduling technique for a
mirrored system to maximize throughput [20]. Lumb et al. have exploited the
use of ``free bandwidth'' for background I/O activities [17].
The EW-Array scheduling algorithms have incorporated elements of these
previous approaches.
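As a sketch of what rotation-aware scheduling looks like, in the spirit of
shortest-access-time-first (the cost model and names below are assumptions):
among the queued requests, the scheduler picks the one whose seek time plus
the rotational delay remaining after the seek is smallest.

    def pick_next(queue, head_cyl, head_angle, seek_ms, rotation_ms=6.0):
        def access_time(req):
            cyl, angle = req                         # request = (cylinder, target angle)
            seek = seek_ms(abs(cyl - head_cyl))      # seek_ms: cylinder distance -> milliseconds
            # Angle the platter has rotated past by the time the seek finishes.
            arrival = (head_angle + 360.0 * seek / rotation_ms) % 360.0
            rot = ((angle - arrival) % 360.0) / 360.0 * rotation_ms
            return seek + rot
        return min(queue, key=access_time)

    # Example with a toy seek curve of 0.2 ms per cylinder:
    best = pick_next([(10, 300), (80, 30)], head_cyl=0, head_angle=0,
                     seek_ms=lambda d: 0.2 * d)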
Finally, the goal of the MimdRAID project is to study how to configure
a disk array system given certain cost/performance specifications.
The ``attribute-managed storage'' project at HP [8]
shares this goal.