
4.3 Comparing OLE archives and XML

The results of our comparison are shown in Table 3 and Figures 5, 6, and 7. The data reveal that the XML representation can be significantly larger than the OLE archive, requiring up to five times more space. XML is particularly inefficient for small files, which according to our data are the most prevalent. However, its efficiency improves dramatically as documents get larger.

To understand this, we must consider what happens when a document is converted from an OLE archive to XML. Text and formatting take more space in XML than in Microsoft's internal representation, which explains the inefficiency of XML for small files. However, the XML conversion also compresses embedded component data. PowerPoint already compresses its embedded component data, but Word and Excel do not. Because larger documents tend to consist mostly of images and components (see Figure 4), the XML representation becomes more efficient as documents grow, and for Word documents larger than 1 MB it is even more efficient than the OLE archive. Excel documents are primarily text and are most efficiently represented as OLE archives.
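The trade-off described above can be sketched as a toy size model for a format whose OLE archive stores components uncompressed (Word or Excel). The expansion factor for text and the compression ratio for components used here are hypothetical values chosen only for illustration; they are not measurements from this study, and the function names are likewise assumptions.

```python
# Toy model of the OLE-to-XML size trade-off.
# ASSUMPTION: both default factors below are hypothetical, not measured values.

def ole_size(text_kb, component_kb):
    """OLE archive: text and components stored as they are."""
    return text_kb + component_kb


def xml_size(text_kb, component_kb, text_expansion=3.0, component_ratio=0.5):
    """XML: text grows by text_expansion, components shrink to component_ratio."""
    return text_expansion * text_kb + component_ratio * component_kb


def xml_to_ole_ratio(text_kb, component_kb):
    """Values above 1 mean XML is larger; values below 1 mean XML is smaller."""
    return xml_size(text_kb, component_kb) / ole_size(text_kb, component_kb)


if __name__ == "__main__":
    # A small, text-only document versus a large, component-heavy one.
    print(xml_to_ole_ratio(20, 0))
    print(xml_to_ole_ratio(100, 1900))
```

Under these assumed factors the text-only document grows, while the component-heavy one shrinks, which mirrors the qualitative behavior reported for Word documents. The break-even point depends entirely on the two factors, so the sketch says nothing about where the real crossover lies.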
 
Table 3: Size statistics for documents in raw OLE, OLE compressed with gzip, raw XML, and XML compressed with gzip. The statistics for OLE differ from those presented in Table 2 due to the conversion to Office 2000 formats.

                        Word                PowerPoint          Excel
Format  Statistic       raw       gzip      raw       gzip      raw       gzip
OLE     average (KB)    209.19    61.43     579.53    481.18    110.23    25.67
        stdev (KB)      534.59    248.89    1671.36   1597.20   401.83    97.88
XML     average (KB)    226.43    74.14     795.17    549.03    336.90    28.37
        stdev (KB)      583.79    297.21    1851.92   1713.56   1562.04   92.02
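As a quick reading aid for Table 3, the following sketch computes the ratio of average XML size to average OLE size for each application, both raw and gzip-compressed. The numbers are copied from the table; the dictionary and function names are illustrative assumptions.

```python
# Average sizes in KB, copied from Table 3 as (raw, gzip) pairs.
TABLE3_AVERAGE_KB = {
    "Word":       {"OLE": (209.19, 61.43),  "XML": (226.43, 74.14)},
    "PowerPoint": {"OLE": (579.53, 481.18), "XML": (795.17, 549.03)},
    "Excel":      {"OLE": (110.23, 25.67),  "XML": (336.90, 28.37)},
}


def xml_over_ole(app):
    """Return (raw_ratio, gzip_ratio) of average XML size to average OLE size."""
    ole_raw, ole_gzip = TABLE3_AVERAGE_KB[app]["OLE"]
    xml_raw, xml_gzip = TABLE3_AVERAGE_KB[app]["XML"]
    return xml_raw / ole_raw, xml_gzip / ole_gzip


if __name__ == "__main__":
    for app in TABLE3_AVERAGE_KB:
        raw_ratio, gzip_ratio = xml_over_ole(app)
        print(f"{app:<11} raw XML/OLE = {raw_ratio:.2f}, "
              f"gzip XML/OLE = {gzip_ratio:.2f}")
```

On these averages, raw XML is roughly 8% larger than OLE for Word, 37% larger for PowerPoint, and about three times larger for Excel; after gzip the gap narrows to between roughly 10% and 21%. Note that a ratio of averages is dominated by large documents and is not the same as the per-document normalized sizes plotted in the figures.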


  
Figure 5: Size distribution of Word documents, with and without compression, for OLE archive and XML formats. Sizes are normalized by the size of the uncompressed OLE archive.


  
Figure 6: Size distribution of PowerPoint documents, with and without compression, for OLE archive and XML formats. Sizes are normalized by the size of the uncompressed OLE archive.


  
Figure 7: Size distribution of Excel documents, with and without compression, for OLE archive and XML formats. Sizes are normalized by the size of the uncompressed OLE archive.
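The normalization used in Figures 5 through 7 can be sketched as follows: for one document, each of the four sizes is divided by the size of the uncompressed OLE archive. This is a minimal sketch, not the tooling used in the study; the function name is hypothetical, and the use of Python's gzip module at its default compression level is an assumption, since the gzip settings are not stated here.

```python
import gzip


def normalized_sizes(ole_bytes, xml_bytes):
    """Sizes of the four variants, divided by the uncompressed OLE size."""
    if not ole_bytes:
        raise ValueError("OLE archive must be non-empty")
    base = len(ole_bytes)
    return {
        "ole_raw": 1.0,  # the baseline by definition
        # ASSUMPTION: default gzip level; the original settings are unknown.
        "ole_gzip": len(gzip.compress(ole_bytes)) / base,
        "xml_raw": len(xml_bytes) / base,
        "xml_gzip": len(gzip.compress(xml_bytes)) / base,
    }
```

A value above 1 for xml_raw means the XML form of that document is larger than its OLE archive, and a value below 1 means it is smaller. Collecting these per-document values over a corpus yields the kind of size distribution the figures describe.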


Eyal DeLara
2000-05-16