Over the last several years, we have witnessed unprecedented growth in the volume of stored digital data. In 1999, a study estimated the amount of original digital data generated annually to be in excess of 1,700 petabytes [15]. It is estimated that this number has been nearly doubling annually since then [22]. This explosive growth is reflected in the ever-increasing complexity and cost of storage management. One instance of this problem occurs in file stores. The traditional hierarchical file system is no longer adequate for systems that need to store billions of files and capture the different types of semantic information required to access, share, and manage those files efficiently.
Consider, for example, the case of a digital movie production studio. Digital movies consist of hundreds of scenes. Each scene is composed of thousands of different data objects, including character models, backgrounds, and lighting models. These objects are typically implemented as files that are shared by tens of artists. There is a range of semantic information that needs to be captured and used in this environment. When a new version of the hair of a character is created, it has to be annotated with the changes made. Further, the new version is compatible with only certain versions of the head. Such information about versions and dependencies among files is important when rendering a scene, because rendering must combine objects that are compatible with each other and make sense in some context. When composing a scene, an artist uses material that other people have edited and stored in the system. Content-based searching (e.g., a search for ``green lush grass''), as opposed to searching by file name, can greatly simplify collaboration and improve productivity. The view of what data are stored in the system may differ depending on the application and the user. For example, an artist wants to see only objects that are compatible with the version of the character she is working on; a backup system sees only files that are marked as ``persistent'' by the artists. Further, tracking context information, such as previously accessed files, and other statistical information may enable intelligent resource provisioning, data caching, and prefetching, and may improve search efficiency and accuracy.
Examples of common types of semantic information that need to be captured include: (i) file versioning, (ii) application-based dependencies, (iii) attribute-based semantics, (iv) content-based semantics, and (v) context-based information.
Considered individually, some of these types of semantic information are captured and used by existing applications and tools, such as version control systems or software configuration tools. However, different types of semantic information often depend on each other and are related to other functions of a storage system. For example, application-based dependencies are defined on versions of files. Also, dependencies need to be considered during archiving, to save a consistent snapshot of the application state. We argue that it is easier and more efficient to manage all the above types of semantic information in a single, general-purpose system that many applications can use.
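To make the interaction between versioning, dependencies, and archiving concrete, the following is a minimal Python sketch. It assumes a hypothetical in-memory model in which dependencies are recorded between specific file versions; the names `FileVersion`, `MetadataStore`, `add_dependency`, and `snapshot_closure` are illustrative assumptions and do not describe pStore's actual interface. Under these assumptions, a consistent snapshot is the set of all file versions reachable from the versions being archived.

```python
from dataclasses import dataclass, field


# Hypothetical model for illustration only; not pStore's actual data model.
@dataclass(frozen=True)
class FileVersion:
    name: str      # logical file name, e.g. "hair"
    version: int   # version number of that file


@dataclass
class MetadataStore:
    # Maps a file version to the set of file versions it depends on.
    depends_on: dict = field(default_factory=dict)

    def add_dependency(self, source, target):
        """Record that `source` requires `target` (both are FileVersion)."""
        self.depends_on.setdefault(source, set()).add(target)

    def snapshot_closure(self, roots):
        """Return every file version reachable from `roots` via dependencies."""
        seen = set()
        stack = list(roots)
        while stack:
            current = stack.pop()
            if current in seen:
                continue  # already visited; also guards against cycles
            seen.add(current)
            stack.extend(self.depends_on.get(current, ()))
        return seen


# Example: version 2 of the hair depends on version 1 of the head,
# which in turn depends on version 3 of the skin.
store = MetadataStore()
hair_v2 = FileVersion("hair", 2)
head_v1 = FileVersion("head", 1)
skin_v3 = FileVersion("skin", 3)
store.add_dependency(hair_v2, head_v1)
store.add_dependency(head_v1, skin_v3)

# Archiving hair_v2 consistently requires all three file versions.
snapshot = store.snapshot_closure([hair_v2])
```

In this sketch, archiving a file version without its closure could yield a snapshot that refers to versions that were never saved, which is why dependency information and archiving are hard to manage in separate tools.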
Along these lines, we propose a semantic-aware file store, named pStore, that extends file systems (a storage abstraction assumed by many applications) to support semantic metadata. The paper makes the following contributions.