Next: Traditional applications of hashing Up: An Analysis of Compare-by-hash Previous: Abstract

Introduction

Compare-by-hash is a technique that trades on the insight that applications frequently read or write data that is identical to already existing data. Rather than read or write the data a second time to the disk, network, or memory, we should use the instance of the data that we already have. Using a collision-resistant hash, we can quickly determine with a high degree of accuracy whether two blocks are identical by comparing only their hashes and not their contents. After making a few assumptions, we can estimate that the chance of a hash collision is much lower than the chance of a hardware error, and so many feel comfortable neglecting the possibility of a hash collision.

Compare-by-hash is accepted by some computer scientists and has been implemented in several different projects: rsync[14], a utility for synchronizing files, LBFS[4], a distributed file system, Stanford's virtual computer migration project[9], Venti[6], a block archival system, Pastiche[3], an automated backup system, and OpenCM[11], a configuration management system. However, I believe some publications overstate the acceptance of compare-by-hash, claiming that it is ``customary''[3] or a ``widely-accepted practice''[4] to assume hashes never collide in this context. An informal survey of my colleagues reveals that many computer scientists are still either unaware of compare-by-hash or disagree with the technique strongly. Since adoption of compare-by-hash has the potential to change the face of operating systems design and implementation, it should be the subject of more criticism and peer review before being accepted as a general purpose computing technique for critical applications.

In this position paper, I hope to begin an in-depth discussion of compare-by-hash. Section 2 reviews the traditional uses of hashing, followed by a more detailed description of compare-by-hash in Section 3. Section 4 will raise some questions about the use of compare-by-hash as a general-purpose technique. Section 5 will propose some alternatives to compare-by-hash, and Section 6 will summarize my findings and make recommendations.

Next: Traditional applications of hashing Up: An Analysis of Compare-by-hash Previous: Abstract

2003-06-16