Next: Data Layout Up: Design of the Storage Previous: Design of the Storage

Design Principles

1. Use only commodity hardware that is available in typical systems.
2. Avoid storage-media dependencies, such as use of only SCSI or only IDE disks.
3. Keep the data structures and mechanisms simple.
4. Support reliable persistence semantics.
5. Separate mechanisms and policies, with the former inside the driver and the latter in user-level applications.

Our driver uses the host processor to perform operations such as I/O processing and parity computation. Because the driver runs in kernel mode with forced context switching turned off, increased load on the host processor can slow the system and degrade its responsiveness. This forced us to keep the data structures and mechanisms simple, a tradeoff we believe is advantageous overall.

Our design does not assume special hardware, such as NVRAM or a dedicated processor, that is generally not available in typical workstation systems. If NVRAM is present, however, it can still be used (a ramdisk-type driver is all that is required). We also want to use existing storage media: the driver works with any device that has a block device driver. This keeps the design flexible but prevents device-specific optimizations.

To guarantee reliable persistence semantics, we have investigated two ways of making changes to the state of a stripe: ordered updates and a separate logging device. (State changes occur due to migrations.) The logging approach is especially suitable when cRAID5 is present in the system, as many writes then become read/modify/write operations; it also speeds up RAID5 partial writes as well as mirrored (declustered) writes for RAID1. Our first implementation, on Solaris 2.5.1, used only ordered writes, as that system had only two tiers. Our next Solaris prototype used both approaches, and the Linux prototype will also use both.

Ordered updates guarantee correct operation if the system fails while state changes are in progress, but some physical storage may go unaccounted for during these changes. Such storage can be reclaimed by a UNIX fsck-like program that we call device-check. Unlike fsck, this program need not be run every time the system crashes, and it can be run on a live system. Ordered updates are explained further in the section on migrations.

We have provided enough hooks to implement policies outside the kernel. The driver provides an interface through which applications can access its data and services. The device-check program needed with the ordered-write approach, for instance, uses this interface.

The multi-tier storage is also referred to as temperature-sensitive storage (TSS), since migrations enable frequently accessed (``hot'') data to reside in RAID1 and less frequently accessed (``warm'' and ``cold'') data in RAID5/cRAID5.


  
Figure 1: Storage Organization


Dr K Gopinath
2000-04-25