USENIX 2004 Annual Technical Conference, General Track Abstract
Pp. 5972 of the Proceedings
Redundancy Elimination Within Large Collections of Files
Purushottam Kulkarni, University of Massachusetts; Fred Douglis, Jason LaVoie, and John M. Tracey, IBM T.J. Watson
Abstract
Ongoing advancements in technology lead to ever-increasing storage capacities.
In spite of this, optimizing storage usage can still provide rich dividends.
Several techniques based on delta-encoding
and duplicate block suppression have been shown to reduce storage
overheads, with varying requirements for resources such
as computation and memory.
We propose a new scheme for storage reduction that
reduces data sizes with an effectiveness comparable to the more
expensive techniques, but at a cost comparable to the faster but less
effective ones.
The scheme, called Redundancy Elimination at the Block Level (REBL),
leverages the benefits of compression,
duplicate block suppression, and delta-encoding to eliminate a broad
spectrum of redundant data in a scalable and efficient manner.
REBL generally encodes more compactly than compression (up to a
factor of 14) and
a combination of compression and duplicate suppression (up to a factor
of 6.7).
REBL also encodes similarly to a technique based on delta-encoding,
reducing overall space significantly in one case.
Furthermore, REBL uses super-fingerprints, a technique that reduces
the data needed to identify similar blocks while dramatically reducing
the
computational requirements of matching the blocks: it turns O(n2)
comparisons into hash table lookups.
As a result, using super-fingerprints to avoid
enumerating matching data objects decreases computation in the
resemblance detection phase of REBL by up to a couple orders of magnitude.
- View the full text of this paper in HTML and PDF.
The Proceedings are published as a collective work, © 2004 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.
- If you need the latest Adobe Acrobat Reader, you can download it from Adobe's site.
|