Next: Background
Up: Opportunities for Bandwidth Adaptation
Previous: Abstract
1 Introduction
Microsoft Office is the most popular productivity suite for creating
documents. Its popularity derives, to some extent, from its ability
to create compound documents that include data from more than
one application. The potentially large size of these documents
results in long download and upload latencies for mobile clients
accessing the documents through bandwidth-limited
links [3,11,22]. To reduce latency and
improve the user's experience, compound documents and the applications
that operate on them need to adapt to the available bandwidth.
To identify opportunities for adapting compound documents, we need to
understand their main characteristics. However, most studies of
content types, especially those done on the
Web [4,21,23,24], have consistently ignored
compound documents or treated them as opaque data streams, ignoring
the rich internal structure that can be used to enhance bandwidth
adaptation. In this paper we present an analysis of Office compound
documents downloaded from the Web. We focus on those characteristics
of Office documents that have implications for bandwidth-limited
clients, and identify opportunities for adaptation. Although we
report our findings with an emphasis on bandwidth-limited clients, we
believe that these results will be useful for office suite designers
and people interested in working with compound documents in general.
We undertook this study as part of our Puppeteer project, which uses
component-based technology to adapt applications for different
operating environments. Puppeteer is well suited for adapting
compound documents that include data generated by several software
components. By exposing the hierarchy of component data in the
compound document, and making calls to the run-time APIs that the
components expose, Puppeteer adapts applications without changing
their source code. In contrast, traditional adaptation approaches have not
been successful for applications that operate on compound documents,
mainly because the complex and proprietary nature of these
applications thwarts source-code
modifications [9,10], and the inclusion of several
complex data types, usually embedded in a single file, makes
system-based adaptation hard [12,18,20].
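To make the idea of exposing a document's component hierarchy concrete, the following is a minimal sketch and not Puppeteer's actual interface. The `Component` class, the `select_within_budget` function, the greedy text-first policy, and the sample document with its sizes are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Component:
    """One node in a (hypothetical) compound-document hierarchy."""
    name: str
    kind: str          # e.g. "text", "image", "embedded"
    size_bytes: int    # size of this component's own data only
    children: List["Component"] = field(default_factory=list)

    def walk(self) -> Iterator["Component"]:
        """Yield this component and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def select_within_budget(root: Component, budget_bytes: int) -> List[str]:
    """Illustrative policy: take text first, then the smallest
    remaining components, skipping anything that exceeds the budget."""
    ordered = sorted(root.walk(), key=lambda c: (c.kind != "text", c.size_bytes))
    chosen, used = [], 0
    for comp in ordered:
        if used + comp.size_bytes <= budget_bytes:
            chosen.append(comp.name)
            used += comp.size_bytes
    return chosen


# Hypothetical document: names and sizes are made up for the example.
doc = Component("report.doc", "text", 20_000, [
    Component("chart", "embedded", 60_000),
    Component("photo", "image", 150_000),
])
```

With a budget of 100,000 bytes, this policy would transfer the text and the embedded chart and defer the photo. A real adaptation system would also need to fetch deferred components later and drive the application through its run-time API, which this sketch does not attempt.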
For this paper, we studied compound documents generated by three
popular applications of the Office suite: Word, PowerPoint,
and Excel. We chose to focus on Office applications
based on four factors. First, Office is the most widely used
productivity suite. Moreover, a significant number of Microsoft Office
documents are available on the Web, enabling us to gather the data for
our experiments. Second, the Office file formats, although
proprietary, are reasonably well documented. Third, the Office
applications are highly integrated with each other and have published
run-time APIs that can be used by Puppeteer to adapt the
applications. Fourth, Office 2000 supports two native
file formats: the proprietary OLE-based binary format and a new XML
format. By using Office 2000 to convert old files to the new XML
format, we can compare the tradeoffs of using a proprietary
binary-based file format against a modern standards-based text format,
both as intermediate formats suitable for document editing, and as
publishing formats, suitable only for reading.
Although we concentrate exclusively on Office documents, we believe
that our results apply to compound documents generated by other
productivity suites. Since most of these suites support roughly the
same features (embedding, images, etc.), and document content is driven
largely by user needs, it is likely that the main characteristics of
documents produced by various productivity suites
(e.g., distribution of document size, percentage that have images,
number of pages, slides, etc.) would be similar.
We downloaded over 12,500 documents, comprising over
4 GB of data, from 935 different sites.
Our main results are:
1. Office documents are large, with average sizes of 196 KB,
   891 KB, and 115 KB for Word, PowerPoint and Excel respectively. Their
   large sizes suggest a need for adaptation in low bandwidth situations.
2. Office documents are component rich. 18.19% of Word documents and 46.38%
   of PowerPoint documents have at least one embedded component. Images
   were the most common component type.
3. In large documents, images and components account for the
   majority of the data, suggesting that they should be the main
   target of the adaptation effort.
4. For small documents, the XML format produces much larger documents
   than OLE. For large documents, there is little difference.
5. Compression considerably reduces the size of documents in both
   formats. Moreover, once compressed there is no significant difference
   in the sizes of the two file formats.
6. XML formats are easier to parse and manipulate than the OLE-binary formats.
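To give a rough sense of what the average sizes above mean for a bandwidth-limited client, the sketch below estimates ideal transfer times. The link speeds (56 kbps and 1000 kbps) are assumptions chosen for illustration and do not come from the study. The estimate ignores latency, protocol overhead, and compression, and it treats 1 KB as 8 kilobits for simplicity.

```python
# Average document sizes reported in the results list, in kilobytes.
AVERAGE_SIZE_KB = {"Word": 196, "PowerPoint": 891, "Excel": 115}

# Hypothetical link speeds in kilobits per second (assumptions, not study data).
LINK_KBPS = {"modem": 56, "broadband": 1000}


def transfer_seconds(size_kb: float, link_kbps: float) -> float:
    """Ideal transfer time in seconds, ignoring latency and overhead."""
    size_kilobits = size_kb * 8  # simplification: 1 KB treated as 8 kilobits
    return size_kilobits / link_kbps


def estimate_all() -> dict:
    """Map (application, link) pairs to estimated transfer time in seconds."""
    return {
        (app, link): transfer_seconds(size, kbps)
        for app, size in AVERAGE_SIZE_KB.items()
        for link, kbps in LINK_KBPS.items()
    }


if __name__ == "__main__":
    for (app, link), seconds in estimate_all().items():
        print(f"{app:<10} over {link:<9}: {seconds:6.1f} s")
```

Under these assumptions, an average Word document takes about 28 seconds over the 56 kbps link, and an average PowerPoint document takes roughly two minutes, which is consistent with the case for adaptation made in the first result.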
The rest of this paper is organized as follows.
Section 2 provides some background on compound
documents and their enabling technology. We also discuss relevant
characteristics of the three Office applications that we use
in this study. Section 3 describes the documents we
used in our experiments. Section 4 presents our
experimental results. Section 5 discusses the
relevance of our findings to other productivity suites. Finally,
Section 6 presents our conclusions.
Eyal DeLara
2000-05-16