Images are the most common type of non-text data found in Office
documents. As table 4 shows, 34.62% of Word and
77.01% of PowerPoint documents have at least one image. We
do not present results for Excel documents as very few of them
have any images at all.
Figures 9 and 10 show the
average number of distinct images and the average size of images for
PowerPoint documents. We plot the number of distinct images instead
of the total number of images because Office applications cache a
single copy of every image regardless of the actual number of times
the image appears in the document.
Both plots show similar trends, with increases in the number and size
of images as documents get bigger. These results are consistent with
the findings of section 4.2, where images become the dominant
contributor to document size as document size increases. The results for Word are similar,
and are omitted for brevity.
We compared the average size of images in Office documents to the
findings of previous Web studies [2,23]. In general,
these studies report average image sizes between 5 KB and
22 KB. In comparison, Office documents, especially PowerPoint
documents, tend to have larger images. These results suggest that
image distillation and other adaptation techniques are at least as
important for compound documents as they are currently for Web
documents.
We measured the reuse of images across our PowerPoint documents by
calculating the Adler-32 checksum [8] of each image's data and
counting the number of documents that have images with the same
signature. We found that of the 16,189 images embedded in
PowerPoint documents, only 14,016 are distinct, and 1,241 of these distinct images, or
8.85%, appear in more than one document. We calculated the
potential bandwidth savings of a perfect cache for a PowerPoint client
reading all the documents in our dataset that came from the same Web
site. We found that 26% of the Web sites get some bandwidth savings
from the perfect cache, while 11% of the sites see reductions in
required bandwidth that are greater than 20%.
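The reuse measurement and the perfect-cache estimate described above can be illustrated with a minimal sketch. It is not the tooling used for the study: the `Doc` record, the function names, and the choice to express savings as a fraction of total document bytes per site are all assumptions made for illustration. Images are keyed by Adler-32 checksum alone, as in the text, so two different images that happen to share a checksum would be counted as one.

```python
import zlib
from collections import defaultdict
from typing import NamedTuple


class Doc(NamedTuple):
    """Hypothetical record for one PowerPoint document (layout is an assumption)."""
    site: str      # Web site the document was fetched from
    doc_id: str    # unique document identifier
    size: int      # total document size in bytes
    images: list   # raw bytes of each distinct image stored in the document


def signature(data: bytes) -> int:
    """Adler-32 checksum of the image data, as an unsigned 32-bit value."""
    return zlib.adler32(data) & 0xFFFFFFFF


def reuse_counts(docs):
    """Return (distinct images, images that appear in more than one document)."""
    docs_by_sig = defaultdict(set)
    for doc in docs:
        for data in doc.images:
            docs_by_sig[signature(data)].add(doc.doc_id)
    distinct = len(docs_by_sig)
    shared = sum(1 for ids in docs_by_sig.values() if len(ids) > 1)
    return distinct, shared


def cache_savings_by_site(docs):
    """Fraction of bytes a perfect per-site cache would save.

    Assumption: a client reads every document from a site, and an image
    already seen on that site costs no bandwidth the next time it appears.
    """
    total = defaultdict(int)   # bytes transferred without a cache
    saved = defaultdict(int)   # image bytes served from the cache
    cached = defaultdict(set)  # signatures already seen, per site
    for doc in docs:
        total[doc.site] += doc.size
        for data in doc.images:
            sig = signature(data)
            if sig in cached[doc.site]:
                saved[doc.site] += len(data)
            else:
                cached[doc.site].add(sig)
    return {site: saved[site] / total[site] for site in total if total[site] > 0}
```

Under these assumptions, the share of sites with some savings is the fraction of entries in the returned dictionary that are greater than zero, and the share with large savings is the fraction greater than 0.2.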
Figure 9: Average number of images in PowerPoint documents.
Figure 10: Average image size in PowerPoint documents.
Table 4: Image statistics for Word and PowerPoint documents. The table shows the percentage of documents that have at least one image, the average number of images in documents with images, and the average image size.