2.2 File formats
Office 2000 supports two native file formats: the traditional
OLE-based binary format (hereafter, "OLE archive") and a new
XML-based format. The OLE archives [13,14,15] rely
on the OLE Structured Storage Interface (SSI) to provide a unified
view of the compound document in a single file. SSI implements an
abstraction similar to a file system within a single file. It supports
two types of objects: storages and streams. Storages are
analogous to directories and contain streams or more storages. Streams
are analogous to files and contain the components' data. Office
applications vary in the way they use the OLE SSI to store embedded
objects. Word and Excel, for instance, use a separate storage for every
embedded component, making the component structure of the document
visible to the OLE SSI. Figure 1, for example, shows the structure of
a Word archive with two embedded components, each kept by Word in its
own SSI storage. In
contrast, PowerPoint compresses embedded object native data and stores
it in the main application stream. Although this strategy improves
document compression, it limits the ability of third-party
applications to manipulate components within a PowerPoint document.
Figure 1: Word archive. The figure shows two embedded objects, an Excel chart and a PowerPoint slide, each stored in a separate SSI storage.
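The storage and stream abstraction described above can be illustrated with a small in-memory model. This is only a sketch of the idea, not the OLE SSI API: the `Storage` and `Stream` classes, the `walk` helper, and every storage and stream name in the example are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple, Union


@dataclass
class Stream:
    """Analogous to a file: holds a component's raw bytes."""
    data: bytes = b""


@dataclass
class Storage:
    """Analogous to a directory: holds streams and nested storages."""
    children: Dict[str, Union["Storage", Stream]] = field(default_factory=dict)

    def walk(self, prefix: str = "") -> Iterator[Tuple[str, Stream]]:
        """Yield (path, stream) for every stream below this storage."""
        for name, child in self.children.items():
            path = f"{prefix}/{name}"
            if isinstance(child, Storage):
                yield from child.walk(path)
            else:
                yield path, child


# Hypothetical layout loosely modeled on the Word archive in Figure 1:
# one storage per embedded component. All names here are invented.
word_archive = Storage({
    "MainDocument": Stream(b"text and formatting"),
    "Embedded1": Storage({"NativeData": Stream(b"excel chart bytes")}),
    "Embedded2": Storage({"NativeData": Stream(b"powerpoint slide bytes")}),
})

stream_paths = [path for path, _ in word_archive.walk()]
```

Because each embedded component sits in its own storage, a tool that understands only the storage hierarchy can locate a component without parsing the main document stream. Under the PowerPoint strategy the same data would be hidden inside a single stream.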
The new XML format [17] provides a more browser-friendly
option for storing Office documents. While an OLE archive appears as
a single file, an XML document appears as an entire directory of XML
files, approximately one per component, image, or slide. The current
implementation of Office supports two forms of XML output: a compact
low-fidelity representation that can be read by browsers but cannot be
edited by Office tools, and a larger high-quality representation that
supports editing. In this study we focus on the latter XML
representation because it is semantically comparable to the OLE
archive.
Aside from the number of files that they use, the two file formats
differ mostly in their representation of text and formatting
information. Images and embedded component native data have similar
representation in both formats, with the caveat that component data in
the XML-based format is stored in a compressed OLE archive. Moreover,
both formats persist two representations of each OLE component they
embed. The first consists of the embedded
component's native data, which is used to initialize the
component. This data is created and managed by the component
itself. The second representation is a cached image of the state of
the component the last time it was instantiated. This image, although
created by the component, is managed by the container
application. This image serves two purposes. First, it allows the
document to be rendered quickly, since the code that understands the
component's specific type need not be executed until the user wishes
to modify the component. Second, the cached image allows the document
to be rendered even on systems where some component types are not
installed.
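The role of the two persisted representations can be sketched as follows. This is a minimal illustration under stated assumptions, not Office's actual logic: the `EmbeddedComponent` record, the `INSTALLED` registry, and the `render` function are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class EmbeddedComponent:
    type_name: str       # hypothetical label, e.g. "ExcelChart"
    native_data: bytes   # created and managed by the component itself
    cached_image: bytes  # snapshot of the last instantiation, managed by the container


# Hypothetical registry of component code installed on this system:
# type name -> function that renders a live view from native data.
INSTALLED: Dict[str, Callable[[bytes], bytes]] = {
    "ExcelChart": lambda native: b"live:" + native,
}


def render(component: EmbeddedComponent, editing: bool = False) -> bytes:
    """Return the bytes to display for one embedded component."""
    if not editing:
        # Fast path: no component-specific code runs until the user edits.
        return component.cached_image
    handler = INSTALLED.get(component.type_name)
    if handler is None:
        # Component type not installed: the cached image is all we can show.
        return component.cached_image
    return handler(component.native_data)
```

In this sketch, a chart opened for viewing returns its cached image immediately, while a component whose type is missing from `INSTALLED` still renders, although only from the cached image.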
There is a significant difference in the way Office supports these two
file formats. Office is able to load OLE archives incrementally
over a random-access file system. In contrast, XML documents must be read in
their entirety before control is returned to the user, leading to higher
latencies for opening and storing XML-based documents.
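The difference between incremental and whole-document loading can be sketched with a toy container. The layout below (concatenated streams plus an offset index) is an assumption chosen for brevity and does not reflect the real OLE archive structure; `pack`, `read_stream`, and `read_all` are hypothetical helpers.

```python
import io
from typing import BinaryIO, Dict, Tuple

Index = Dict[str, Tuple[int, int]]  # stream name -> (offset, length)


def pack(streams: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate streams and record where each one starts."""
    blob, index, offset = b"", {}, 0
    for name, data in streams.items():
        index[name] = (offset, len(data))
        blob += data
        offset += len(data)
    return blob, index


def read_stream(f: BinaryIO, index: Index, name: str) -> bytes:
    """Incremental access: seek to one stream and read only its bytes."""
    offset, length = index[name]
    f.seek(offset)
    return f.read(length)


def read_all(f: BinaryIO, index: Index) -> Dict[str, bytes]:
    """Whole-document access: every byte is read before anything is returned."""
    f.seek(0)
    blob = f.read()
    return {name: blob[off:off + n] for name, (off, n) in index.items()}


# Toy document with two streams, held in memory as a seekable file.
blob, index = pack({"text": b"hello", "chart": b"12345678"})
doc = io.BytesIO(blob)
chart_bytes = read_stream(doc, index, "chart")
```

With random access, the cost of `read_stream` depends on the size of the requested stream, whereas `read_all` always pays for the whole document before returning control.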
Eyal DeLara
2000-05-16