We collected Word, PowerPoint, and Excel documents from the Web.
First, we used the AltaVista search engine [1] to obtain
an initial set of URLs. In the first two weeks of October 1999, we
searched for pages having links to files with suffixes we were
interested in (doc, ppt, and xls). For example, we
used the query link:ppt domain:edu to search for HTML pages in
the edu domain that have links to PowerPoint documents. Then, we used
GNU Wget [19] to recursively retrieve documents from our
initial search results.
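As an illustration of the query scheme described above, the following minimal sketch builds one `link:<suffix> domain:<domain>` query per combination and filters candidate URLs by suffix. The function names and the list of domains are hypothetical; the text names only the `edu` domain and the three suffixes, and this is not the tooling actually used for the study.

```python
from itertools import product
from urllib.parse import urlparse

# Suffixes named in the text.
SUFFIXES = ("doc", "ppt", "xls")


def build_queries(domains):
    """Return one 'link:<suffix> domain:<domain>' query per combination.

    `domains` is an assumption: the text gives only 'edu' as an example.
    """
    return [f"link:{s} domain:{d}" for s, d in product(SUFFIXES, domains)]


def has_office_suffix(url):
    """Return True if the URL path ends in one of the suffixes of interest."""
    path = urlparse(url).path.lower()
    return any(path.endswith("." + s) for s in SUFFIXES)


if __name__ == "__main__":
    for query in build_queries(["edu"]):
        print(query)
```

A suffix filter of this kind only narrows the candidate list; as noted later in this section, a matching suffix does not guarantee that the file is an Office document.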
Relying on a search engine to obtain the documents raises the
question of how representative the resulting set is. On one hand, a search
engine is likely to produce results that depend on the
popularity of certain pages and documents, skewing the distribution
towards these particular document types and producing a non-random set
of documents. On the other hand, we observe that our documents are
fairly well distributed among domains, covering a wide range of user
types. Moreover, the shape of the document size plots of
section 4.1 and their close fit to the
power-law distribution are similar to the results obtained by Cunha
et al. [7] in a study of client-based traces covering
over half a million user requests for WWW documents.
All downloaded documents were in the binary OLE archive format.
Because Office file formats vary from one version of Office to another,
we first converted all our data to the Office 2000 formats. We
removed documents that appeared to be corrupt or were not actually
Office documents. The doc suffix, in particular, tends to be
used by many applications other than Microsoft Word. We also
eliminated duplicates, removing approximately 5% of our data set.
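The text does not say how non-Office files or duplicates were detected. One plausible approach, sketched below under that assumption, is to check each file for the 8-byte OLE compound file signature and to treat files with identical SHA-256 digests as duplicates. The function names are hypothetical.

```python
import hashlib

# Header signature of the OLE compound file (binary Office) container.
OLE_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if the data starts with the OLE compound file signature."""
    return data[:8] == OLE_MAGIC


def filter_and_dedupe(files):
    """Keep the first copy of each distinct OLE file.

    `files` is an iterable of (name, content_bytes) pairs. Exact-content
    hashing is an assumption; the study may have used another criterion.
    """
    seen = set()
    kept = []
    for name, data in files:
        if not looks_like_ole(data):
            continue  # not an OLE container, e.g. a plain-text .doc file
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # byte-identical to a file already kept
        seen.add(digest)
        kept.append(name)
    return kept
```

The signature check only establishes that a file is an OLE container. Other applications also use this container, and a file with a valid header may still be corrupt, so opening each file in the corresponding Office application remains the stronger test.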
We converted all the data to Office 2000 formats and obtained the
XML-based representation using Office's OLE Automation
interfaces [16]. For this purpose, we wrote a simple Java application that
uses OLE Automation to remotely control the Office applications and perform
the data conversions.
Table 1 summarizes the documents. For each
application, it presents the number of documents and the number of
Web sites from which they originated.
Table 1: Data set. This table presents, for each application, the number of documents and the number of Web sites from which they originated.