Related Work
This paper combines four elements: (1) eager-writing, (2) data
replication for performance improvement, (3) systematically configuring a
disk array to trade capacity for performance, and (4) intelligent
scheduling of queued disk requests. While each of these techniques
can individually improve I/O performance, it is the integration and
interaction of these techniques that allow one to economically achieve
scalable performance gains on TPC-C-like transaction processing
workloads. We briefly describe some related work in each of those
areas.
The eager-writing technique dates back to the IBM IMS Write Ahead Data
Set (WADS) system, which writes write-ahead log entries in an
eager-writing fashion on drums [7]. Hagmann employs
eager-writing to improve logging performance on
disks [10]. A similar technique is used in the Trail
system [14]. These systems require the log entries to be
rewritten to fixed locations. Mime [4], an extension of
Loge [6], integrates eager-writing into the disk
controller and eliminates the need to rewrite the data created by
eager-writing. The Virtual Logging Disk and the Virtual Logging File
Systems eliminate the reliance on NVRAM for maintaining the
logical-to-physical address mapping
and further explore the
relationship between eager-writing and log-structured file
systems [28]. All of these systems work on individual disks.
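As an illustration of the technique itself, the following sketch shows the
core of eager-writing on a single disk. The class and method names are
hypothetical, and the model is deliberately simplified (the head position is
treated as a block index); it is not the implementation of any of the systems
cited above.

    class EagerWritingDisk:
        def __init__(self, num_blocks):
            self.free = set(range(num_blocks))  # free physical blocks
            self.mapping = {}                   # logical block -> physical block
            self.store = {}                     # physical block -> data (stands in for the platter)
            self.head = 0                       # simplified head position: a block index

        def write(self, logical, data):
            # Write to the free block nearest the head instead of a fixed home location.
            target = min(self.free, key=lambda p: abs(p - self.head))
            self.store[target] = data
            old = self.mapping.get(logical)
            if old is not None:
                self.free.add(old)              # the stale copy becomes free space again
            self.free.remove(target)
            self.mapping[logical] = target      # must be updated atomically with the write
            self.head = target

        def read(self, logical):
            return self.store[self.mapping[logical]]

Keeping the logical-to-physical mapping persistent is precisely what
distinguishes the systems above: some rewrite the data to fixed locations,
while Mime and the virtual logging work make the mapping itself recoverable.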
A more common approach to improving small write performance is to
buffer data in NVRAM and periodically
flush the full buffer to disk. The NVRAM data buffer
provides two benefits:
efficient scheduling of the buffered writes, and
potential overwriting of data in the buffer before it reaches disk.
For many
transaction processing applications, poor locality tends to result
in few overwrites in the buffer, and lack of idle time makes it
difficult to mask the time consumed by buffer flushing. It is also
difficult to build a large, reliable and inexpensive NVRAM data
buffer. On the other hand, a small, inexpensive, and reliable NVRAM
buffer, such as the one employed for the mapping information in the
EW-Array, is quite feasible.
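For concreteness, a minimal sketch of such an NVRAM data buffer follows; the
interface is assumed for illustration and is not taken from the cited
systems. Overwrites are absorbed in the buffer, and a full buffer is flushed
in sorted order so the writes can be scheduled efficiently.

    class NVRAMWriteBuffer:
        def __init__(self, capacity, disk_write):
            self.capacity = capacity
            self.disk_write = disk_write      # callback: (block address, data) -> None
            self.buffer = {}                  # block address -> latest data

        def write(self, addr, data):
            self.buffer[addr] = data          # an overwrite replaces the buffered copy for free
            if len(self.buffer) >= self.capacity:
                self.flush()

        def flush(self):
            # Flushing in address order allows elevator-style scheduling, but with
            # the poor locality of TPC-C-like workloads nearly every buffered block
            # still costs a separate head movement.
            for addr in sorted(self.buffer):
                self.disk_write(addr, self.buffer[addr])
            self.buffer.clear()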
Systems that use an NVRAM data buffer differ in how they flush
the buffer to disk. The conventional approach is to flush data to an
update-in-place disk. In the steady state, the throughput of such a
system is limited by the average head movement distance between
consecutive disk writes. An alternative to flushing to an
update-in-place disk is to flush the buffered data in a log-structured
manner [1,21]. The disadvantage of such a system
is its high cost of disk garbage collection, which is again
exacerbated by the poor locality and the lack of idle time in
transaction processing workloads [22,24]. Some
systems combine the use of a log-structured ``caching disk'' with an
update-in-place data disk [12,13], but such hybrids are not immune to
the problems associated with each of these techniques, especially when
faced with I/O access patterns such as those seen in TPC-C.
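The two flush policies can be contrasted with a small sketch; the helper
names are assumptions. In-place flushing pays one small random write per
dirty block, while log-structured flushing turns the flush into a single
large sequential write at the price of leaving dead copies behind that must
later be garbage-collected.

    def flush_in_place(buffer, disk_write):
        for addr in sorted(buffer):                    # one head movement per block
            disk_write(addr, buffer[addr])
        buffer.clear()

    def flush_log_structured(buffer, log_segments, segment_map):
        segment = []                                   # one large sequential write
        for addr, data in buffer.items():
            segment_map[addr] = (len(log_segments), len(segment))  # old copy becomes garbage
            segment.append((addr, data))
        log_segments.append(segment)
        buffer.clear()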
Several systems are designed to address the small write
performance problem in disk arrays. The Doubly Distorted Mirror
(DDM) [19] is closest in spirit to the EW-Array. The two
studies differ in emphasis, though. First, the emphasis of the EW-Array
study is to explore how to balance the excess capacity devoted to
eager-writing, mirroring, and striping, and how to perform disk
request scheduling in the presence of eager-writes. Second, the
EW-Array employs pure eager-writing without the background movement of
data to fixed locations. While this is more suitable for TPC-C-like
workloads, other applications may benefit from a data reorganizer.
Third, the EW-Array study provides a real implementation.
While the DDM and the EW-Array are based on mirrored organizations,
the techniques that may speed up small writes on individual disks may
be applied to parity updates in a RAID-5. Floating parity
employs eager-writing to speed up parity updates [18].
Parity Logging employs an NVRAM and a logging disk to accumulate a
large number of parity updates that can be used to recompute the
parity using more efficient large I/Os [25]. The amount
of performance improvement experienced by read requests in a RAID-5 is
similar to that on a striped system, and as we have seen in the
experimental results, a striped system may not be as effective as a
mirrored system, particularly if the replica propagation cost of a
mirrored system is reduced by eager-writing.
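A rough sketch of the parity-logging idea, with assumed rather than published
data structures, looks as follows: each small write records only the XOR
delta of the old and new data, and the accumulated records are later replayed
against the parity with fewer, larger I/Os.

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    class ParityLog:
        def __init__(self):
            self.records = []                     # buffered in NVRAM, spilled to a logging disk

        def small_write(self, stripe, old_data, new_data):
            # Defer the parity read-modify-write; only the XOR delta is remembered.
            self.records.append((stripe, xor(old_data, new_data)))

        def apply(self, parity):                  # parity: stripe -> parity block
            # Replaying the whole log at once amortizes the parity I/O over
            # many accumulated small writes.
            for stripe, delta in self.records:
                parity[stripe] = xor(parity[stripe], delta)
            self.records.clear()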
Instead of forcing one to choose between a RAID-10 and a RAID-5, the
AutoRAID combines both so that the former acts as a cache of the
latter [29]. The RAID-5 lower level is log-structured and
it employs a hole-plugging technique for efficiently
garbage-collecting a nearly full disk: live data is ``plugged'' into
free space of other segments. This is similar to eager-writing,
except that eager-writing does not require garbage collection.
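A minimal sketch of hole-plugging, with segments modeled simply as lists of
slots holding either a (logical address, data) pair or None, illustrates the
idea; the structures are assumptions, not AutoRAID's implementation.

    def hole_plug(victim, segments, mapping):
        # Collect free slots ("holes") in every other segment.
        holes = [(seg_id, slot)
                 for seg_id, seg in enumerate(segments) if seg is not victim
                 for slot, entry in enumerate(seg) if entry is None]
        for slot, entry in enumerate(victim):
            if entry is None:
                continue                           # slot already free
            logical, data = entry
            seg_id, hole = holes.pop()             # assumes enough holes exist elsewhere
            segments[seg_id][hole] = (logical, data)
            mapping[logical] = (seg_id, hole)      # redirect the logical address
            victim[slot] = None                    # eventually the whole victim is free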
An SR-Array combines striping with rotational data replication to
reduce both seek and rotational delay [30]. A mirrored system
may enjoy some similar benefits [2,5].
Both approaches allow one to trade capacity for better performance.
The
difficulty in both cases is the replica propagation cost. The
relative sizing of the two levels of storage in AutoRAID is a
different way of trading capacity for performance.
In fact, for locality-rich workloads, it is possible to
employ an SR-Array or an EW-Array as an upper-level disk cache of an
AutoRAID-like system.
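The benefit of rotational replicas on reads can be sketched in a few lines;
the function below is an illustration, not the SR-Array code. With k copies
spaced evenly around a track, the expected rotational wait drops to roughly
1/k of a rotation, while every write must eventually propagate to all k
copies.

    def pick_replica(replica_angles, head_angle):
        # Choose the copy the head reaches soonest in the direction of rotation.
        return min(replica_angles, key=lambda a: (a - head_angle) % 360)

    # Example: copies at 0, 120, and 240 degrees with the head at 100 degrees;
    # the 120-degree copy is chosen, a 20-degree wait instead of up to a full turn.
    print(pick_replica([0, 120, 240], 100))    # -> 120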
Seltzer and Jacobson have independently examined disk scheduling
algorithms that take rotational delay into
consideration [16,23]. Yu et al. have extended
these algorithms to account for rotational replicas [30].
Polyzois et al. have proposed a delayed write scheduling technique for a
mirrored system to maximize throughput [20]. Lumb et al. have exploited the
use of ``free bandwidth'' for background I/O activities [17].
The EW-Array scheduling algorithms have incorporated elements of these
previous approaches.
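As a sketch of what rotation-aware scheduling looks like, in the spirit of
shortest-access-time-first (the cost model and names below are assumptions):
among the queued requests, the scheduler picks the one whose seek time plus
the rotational delay remaining after the seek is smallest.

    def pick_next(queue, head_cyl, head_angle, seek_ms, rotation_ms=6.0):
        def access_time(req):
            cyl, angle = req                         # request = (cylinder, target angle)
            seek = seek_ms(abs(cyl - head_cyl))      # seek_ms: cylinder distance -> milliseconds
            # Angle the platter has rotated past by the time the seek finishes.
            arrival = (head_angle + 360.0 * seek / rotation_ms) % 360.0
            rot = ((angle - arrival) % 360.0) / 360.0 * rotation_ms
            return seek + rot
        return min(queue, key=access_time)

    # Example with a toy seek curve of 0.2 ms per cylinder:
    best = pick_next([(10, 300), (80, 30)], head_cyl=0, head_angle=0,
                     seek_ms=lambda d: 0.2 * d)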
Finally, the goal of the MimdRAID project is to study how to configure
a disk array system given certain cost/performance specifications.
The ``attribute-managed storage'' project at HP [8]
shares this goal.