Traditional file system on-disk formats are designed primarily to improve the performance of the file system in normal use. We argue that the performance and reliability of file system repair should also be a major goal when designing the on-disk format. Trading some steady-state performance for advantages in repair and reliability makes particular sense for file systems, for which the ultimate benchmark is how reliably they can store and retrieve data. We describe this philosophy as repair-driven design.
Repair-driven design encourages redundancy and checksums in the on-disk format. Checksums, magic numbers, UUIDs, back pointers, and outright duplication of data are all good examples of repair-driven design. Optimizations intended to compress on-disk data and reduce redundancy are antithetical to repair-driven design, because they increase both the damage possible from each incident of file system corruption and the difficulty of repairing that corruption. Similarly, the on-disk format should avoid "fragile" data structures: those that are complex to update, highly interdependent with other data structures, and difficult to interpret and repair when partially corrupted. B-trees of all sorts, dynamic metadata allocation, complex on-disk lookup structures, and the like improve on-line performance but tend to perform badly in repair and recovery. A simple linear directory layout, by contrast, performs poorly with some workloads but is trivial to repair.
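As a rough illustration of the redundancy described above, the following Python sketch wraps a data block in a header carrying a magic number, a file system UUID, and a back pointer to the owning inode, followed by a trailing CRC32 checksum. The layout, the field sizes, the magic value, and the names pack_block and check_block are all assumptions made for this example; they are not taken from any real file system format.

```python
import struct
import uuid
import zlib

# Hypothetical on-disk layout (illustration only):
#   magic (4 bytes) | fs UUID (16) | owner inode back pointer (8) | payload length (4)
#   | payload | CRC32 of everything before it (4)
MAGIC = 0x52444446  # arbitrary value chosen for this sketch
HEADER = struct.Struct("<I16sQI")
TRAILER = struct.Struct("<I")


def pack_block(fs_uuid: uuid.UUID, owner_inode: int, payload: bytes) -> bytes:
    """Serialize a payload with its redundant, self-describing metadata."""
    header = HEADER.pack(MAGIC, fs_uuid.bytes, owner_inode, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + TRAILER.pack(crc)


def check_block(raw: bytes, fs_uuid: uuid.UUID):
    """Return (owner_inode, payload) if the block verifies, else None.

    A repair tool could call this on any block in isolation: the magic
    number identifies the block type, the UUID ties it to this file
    system, the back pointer names its owner, and the checksum detects
    corruption.
    """
    if len(raw) < HEADER.size + TRAILER.size:
        return None
    magic, raw_uuid, owner_inode, length = HEADER.unpack_from(raw)
    if magic != MAGIC or raw_uuid != fs_uuid.bytes:
        return None
    end = HEADER.size + length
    if end + TRAILER.size > len(raw):
        return None  # length field points past the end of the block
    (stored_crc,) = TRAILER.unpack_from(raw, end)
    if zlib.crc32(raw[:end]) != stored_crc:
        return None
    return owner_inode, raw[HEADER.size:end]
```

Because each block can be validated and attributed to its owner without consulting any other structure, a repair pass under these assumptions could rebuild ownership information from the blocks themselves even when the forward pointers are damaged. The cost is 36 bytes of overhead per block in this sketch, which is the kind of steady-state trade-off the text argues for.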
A further aspect of repair-driven design is that the on-disk format should be laid out so that traversing the file system in the order needed for repair is fast.
Valerie Henson 2006-10-18