Semantic data model



pStore proposes using a generic data model to capture different types of semantic information in file stores. The data model should meet the following requirements.

Allow the specification of well-defined schemata (through a schema definition language).
Support dynamic schema evolution to capture new or evolving types of semantic information.
Be simple to use and lightweight, and make no assumptions about the semantics of the metadata.
Be platform independent and provide interoperability between applications that manage and exchange metadata.
Facilitate integration with resources outside the file store and support exporting metadata to the web.
Leverage existing standards and corresponding tools, such as query languages.

Database systems (DBs) do not fulfill the above requirements, for two main reasons. First, DBs typically require a predefined schema and impose strict integrity constraints. They cannot deal effectively with incremental and dynamic schema evolution, which is common when managing unstructured data. Second, not all applications require the heavyweight ACID properties and the full feature set of a full-fledged DB. For example, Unix file systems do not guarantee the ACID properties in the face of system failures.

Based on these requirements, we propose using a data model that is based on the Resource Description Framework (RDF) [23]. RDF has been proposed to encode, exchange and reuse metadata on the Web, and is a fundamental tool for realizing the Semantic Web vision [21]. RDF has two main advantages. First, it provides the means to capture schemata for metadata that are both human-readable and machine-processable (RDF notations are typically defined in XML). Second, it is designed to allow reuse and extension of existing schemata for an ever-evolving set of semantic metadata.

RDF is a model that describes resources. Relations in RDF are expressed as triples of the form:

subject property object

In our case, the subject is a file in the file store. The properties (one or more) that are associated with the subject each capture some type of semantic information about the corresponding file. The object of the relation is the value of the property for the subject, which may be another file or some metadata structure (a literal or a composite). Thus, files and metadata structures are both considered resources. In fact, relations themselves can be used as resources for constructing more complex metadata relations.
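The subject-property-object form can be illustrated with a minimal in-memory sketch. This is not the pStore implementation; the names Triple and TripleStore, the file: paths and the ex: property names are all hypothetical and chosen only for illustration. The sketch shows a file as subject, a literal and another file as objects, and a relation that is itself used as the subject of a further relation.

```python
from typing import Any, List, NamedTuple


class Triple(NamedTuple):
    """One RDF-style relation: subject, property, object."""
    subject: Any   # a file, a metadata structure, or another Triple
    prop: str      # the property name
    obj: Any       # a literal, a file, or another Triple


class TripleStore:
    """Hypothetical in-memory set of relations (illustration only)."""

    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: Any, prop: str, obj: Any) -> Triple:
        triple = Triple(subject, prop, obj)
        self._triples.add(triple)
        return triple

    def match(self, subject: Any = None, prop: Any = None,
              obj: Any = None) -> List[Triple]:
        """Return all relations that agree with the given fields.

        A field left as None acts as a wildcard.
        """
        return [
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (prop is None or t.prop == prop)
            and (obj is None or t.obj == obj)
        ]


store = TripleStore()
tex = "file:/home/alice/report.tex"   # hypothetical file names
pdf = "file:/home/alice/report.pdf"

# The object is a literal.
store.add(tex, "ex:author", "Alice")
# The object is another file.
derived = store.add(pdf, "ex:derivedFrom", tex)
# A relation used as a resource: the subject is itself a relation.
store.add(derived, "ex:recordedBy", "build-tool")
```

Because a relation is an ordinary hashable value in this sketch, it can appear as the subject or object of another relation without any special handling, which mirrors the observation that relations can themselves serve as resources.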

RDF provides no vocabulary that assumes or refers to application-specific semantic information, e.g., certain properties for media files or relations between files that are accessed by the same user. Instead, such classes of resources and properties are defined in the form of an RDF schema. The same RDF notation is used to specify RDF schemata [24]. This is achieved by providing a set of predefined resources, namely Classes and Properties. For example, in our case, a Class may refer to files with a certain type of content or files that are used by a certain application. In the model, the specific files are resources that are instances of a certain Class. A Property is defined in the schema to have a domain and a range, each of which can be defined to refer to resources of one or more classes. Classes and Properties can be defined in a hierarchical fashion, resulting in schemata that capture complex semantic information.
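The role of Classes, Properties, domains and ranges can be sketched as follows. This is a simplified illustration and not RDF Schema itself: the Schema class, its methods and the ex: names are hypothetical, each class has at most one parent, the hierarchy is assumed to be acyclic, and multiple domain or range classes are treated as alternatives, which is an assumption of this sketch.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


class Schema:
    """Hypothetical schema: a class hierarchy plus property declarations."""

    def __init__(self) -> None:
        # class name -> parent class name (None for a root class)
        self.superclass: Dict[str, Optional[str]] = {}
        # property name -> (domain classes, range classes)
        self.properties: Dict[str, Tuple[Set[str], Set[str]]] = {}

    def add_class(self, name: str, parent: Optional[str] = None) -> None:
        self.superclass[name] = parent

    def add_property(self, name: str, domain: Iterable[str],
                     range_: Iterable[str]) -> None:
        self.properties[name] = (set(domain), set(range_))

    def is_subclass(self, cls: Optional[str], ancestor: str) -> bool:
        """Walk up the (assumed acyclic) hierarchy from cls to a root."""
        while cls is not None:
            if cls == ancestor:
                return True
            cls = self.superclass.get(cls)
        return False

    def conforms(self, prop: str, subject_cls: str,
                 object_cls: str) -> Optional[bool]:
        """Check a relation against the declared domain and range.

        Returns None when the schema does not declare the property.
        """
        if prop not in self.properties:
            return None
        domain, range_ = self.properties[prop]
        return (any(self.is_subclass(subject_cls, d) for d in domain)
                and any(self.is_subclass(object_cls, r) for r in range_))


schema = Schema()
schema.add_class("ex:File")
schema.add_class("ex:MediaFile", parent="ex:File")
schema.add_class("ex:AudioFile", parent="ex:MediaFile")
schema.add_class("ex:Person")
schema.add_property("ex:performedBy",
                    domain=["ex:MediaFile"], range_=["ex:Person"])

# An audio file is a media file, so it falls inside the domain.
audio_ok = schema.conforms("ex:performedBy", "ex:AudioFile", "ex:Person")
# A plain file is not necessarily a media file, so it does not.
file_ok = schema.conforms("ex:performedBy", "ex:File", "ex:Person")
```

Returning None for an undeclared property, rather than rejecting the relation, reflects the idea that a schema describes the data without having to constrain it.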

The principles of RDF resemble those of graph-based data models that have been proposed to handle structural irregularity and incompleteness of schemata, as well as rapid schema evolution [1]. In such systems, the schema is non-mandatory, i.e., it provides some information about the current type of the data, but it does not constrain the format of the data. We have chosen RDF because it is simple and standardized.

A remaining issue is how to implement a repository of RDF relations in a system. We intend to use a lightweight, RISC-style database system, such as the one proposed by Chaudhuri and Weikum [4].


Magnus Karlsson
Tue 17 Jun 2003 14:32:10