pStore proposes using a generic data model to capture different types of semantic information in file stores. The data model should meet the following requirements.
Database systems do not fulfill the above requirements, for two main reasons. First, DBs typically require a predefined schema and impose strict integrity constraints, so they cannot effectively deal with incremental and dynamic schema evolution, which is common in managing unstructured data. Second, not all applications require the heavyweight ACID properties and the complete feature set of a full-fledged DB. For example, Unix file systems do not guarantee the ACID properties in the face of system failures.
Based on these requirements, we propose using a data model that is based on the Resource Description Framework (RDF) [23]. RDF has been proposed to encode, exchange and reuse metadata on the Web (a fundamental tool for realizing the Semantic Web vision [21]). RDF has two main advantages. First, it provides the means to capture schemata for metadata that are both human-readable and machine-processable (RDF notations are typically defined in XML). Second, it is designed to allow reuse and extension of existing schemata for an ever-evolving set of semantic metadata.
RDF is a model that describes resources. Relations, in RDF, are expressed as tuples of the form:
subject property object
In our case, the subject is a file in the file store. The properties (one or more) that are associated with the subject capture some type of semantic property of the corresponding file. The object of the relation corresponds to the value of the property for the subject, which may be another file or some metadata structure (a literal or composite). Thus, files and metadata structures are both considered resources. In fact, relations themselves can be used as resources for constructing more complex metadata relations.
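As an illustration of the relation form above, the following minimal Python sketch encodes files, literals and relations as resources. It is not the pStore implementation; the class names (Resource, Literal, Triple), the property names (author, derivedFrom, assertedBy), the file URIs and the helper objects_of are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """A named resource, e.g. a file in the file store (URIs are hypothetical)."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A literal metadata value."""
    value: str


@dataclass(frozen=True)
class Triple:
    """One relation of the form: subject property object.

    A Triple may itself appear as the subject or object of another Triple,
    which is how relations are reused as resources.
    """
    subject: Union[Resource, "Triple"]
    prop: str
    obj: Union[Resource, Literal, "Triple"]


report = Resource("file:/docs/report.pdf")
data = Resource("file:/data/results.csv")

triples = [
    # The object is a literal value.
    Triple(report, "author", Literal("alice")),
    # The object is another file.
    Triple(report, "derivedFrom", data),
]
# A relation about a relation: the second triple is used as a subject.
triples.append(Triple(triples[1], "assertedBy", Literal("indexer")))


def objects_of(store, subject, prop):
    """Return every object related to `subject` through property `prop`."""
    return [t.obj for t in store if t.subject == subject and t.prop == prop]
```

Here objects_of(triples, report, "derivedFrom") returns the list containing the resource for results.csv, and the same lookup works unchanged when the subject is itself a relation.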
RDF provides no vocabulary that assumes or refers to application-specific semantic information, e.g., certain properties for media files or relations of files that are accessed by the same user. Instead, such classes of resources and properties are defined in the form of an RDF schema. The same RDF notation is used to specify RDF schemata [24]. This is achieved by providing a set of predefined resources, namely Classes and Properties. For example, in our case, a Class may refer to files with a certain type of content or files that are used by a certain application. For the model, the specific files are resources that are instances of a certain Class. A Property is defined in the schema to have a domain and a range. Each of them can be defined to refer to resources of one or more classes. Classes and Properties can be defined in a hierarchical fashion resulting in schemata that capture complex semantic information.
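The schema concepts described above (Classes, a class hierarchy, and Properties with a domain and a range) can be sketched in a few lines of Python. This is an assumption-laden illustration rather than RDF Schema itself: the class names (File, MediaFile, AudioFile, User), the property playedBy, the resource URIs and the helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    """A schema-level property with a domain and a range (sets of class names)."""
    name: str
    domain: frozenset
    range: frozenset


# Hypothetical class hierarchy: subclass -> direct superclass.
SUBCLASS_OF = {"AudioFile": "MediaFile", "MediaFile": "File"}

# Hypothetical typing of resources: resource -> the class it is an instance of.
INSTANCE_OF = {
    "file:/music/song.mp3": "AudioFile",
    "file:/docs/notes.txt": "File",
    "user:alice": "User",
}

# playedBy relates media files (domain) to users (range).
PLAYED_BY = Property("playedBy", frozenset({"MediaFile"}), frozenset({"User"}))


def ancestors(cls):
    """Return cls followed by all of its superclasses, nearest first."""
    chain = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def is_instance(resource, classes):
    """True if the resource's class, or one of its superclasses, is in `classes`."""
    cls = INSTANCE_OF.get(resource)
    return cls is not None and any(c in classes for c in ancestors(cls))


def conforms(subject, prop, obj):
    """Check a relation against the domain and range declared for `prop`."""
    return is_instance(subject, prop.domain) and is_instance(obj, prop.range)
```

Because AudioFile is declared a subclass of MediaFile, a relation whose subject is song.mp3 satisfies the domain of playedBy, whereas one whose subject is notes.txt does not.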
The principles of RDF resemble those of graph-based data models that have been proposed to handle structural irregularity and incompleteness of schemata and rapid schema evolution [1]. In such systems, the schema is non-mandatory, i.e., it provides some information about the current type of the data, but it does not constrain the format of the data. We have chosen RDF, as it is simple and standardized.
A remaining issue is how to implement a repository of RDF relations in a system. We intend to use a lightweight, RISC-style database system, like the one proposed by Chaudhuri and Weikum [4].