Figure 2: Size distribution of Word, PowerPoint, and Excel documents. Shown are documents with sizes up to 180 KB.
Table 2: Document size statistics.

Statistic       Word      PowerPoint   Excel
average (KB)    196.24    891.48       115.02
stdev (KB)      528.44    2145.35      438.70
Table 2 shows general statistics for Word, PowerPoint,
and Excel documents (see footnote 1). The most
striking aspects of the data are the large average document sizes
and the large standard deviations of our sample.
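To make the spread concrete, the short Python sketch below divides each standard deviation in Table 2 by the corresponding average. The dictionary layout and the helper name `coefficient_of_variation` are illustrative assumptions; only the numeric values come from Table 2.

```python
# Values copied from Table 2 (all in KB).
TABLE_2 = {
    "Word": {"average": 196.24, "stdev": 528.44},
    "PowerPoint": {"average": 891.48, "stdev": 2145.35},
    "Excel": {"average": 115.02, "stdev": 438.70},
}


def coefficient_of_variation(average, stdev):
    """Ratio of standard deviation to mean (hypothetical helper name)."""
    return stdev / average


# One ratio per application.
cv = {
    app: coefficient_of_variation(stats["average"], stats["stdev"])
    for app, stats in TABLE_2.items()
}

for app, value in cv.items():
    print(f"{app}: stdev is {value:.2f}x the average")
```

For all three applications the standard deviation is several times the average (roughly 2.7 for Word, 2.4 for PowerPoint, and 3.8 for Excel), which is what one would expect from a distribution with a long tail.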
Figure 2 shows the size distribution of Word,
PowerPoint, and Excel documents. The histogram plots documents with
sizes up to 180 KB. We observe that the distributions have the same
general shape: a cluster around a common small value with a fairly
long tail.
Figure 3 characterizes the distributions' tails by
plotting document size frequencies for documents larger than 100 KB on
a log-log scale. The linear fit (see footnote 2) of the transformed data, with
R^2 = 0.8938,
suggests that the tail of the size distribution closely follows a
power-law distribution, which is consistent with the large standard deviations
of Table 2. The log-log scale histograms for the
individual Word, PowerPoint, and Excel documents are not shown here
since they are all similar to the cumulative distribution, with linear
fits yielding R^2 = 0.8612, R^2 = 0.8352, and R^2 = 0.8226, respectively.
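As an illustration of the kind of fit described above, the following minimal Python sketch bins document sizes into fixed-width bins, keeps only the tail above 100 KB, and fits a least-squares line to the log-transformed data. The function name `loglog_fit`, the use of bin midpoints as x values, and the choice of base-10 logarithms are assumptions made for this sketch; the text does not specify how the original fit was computed.

```python
import math
from collections import Counter

BIN_BYTES = 16384        # bin width, taken from the Figure 3 caption
MIN_BYTES = 100 * 1024   # tail threshold: documents larger than 100 KB


def loglog_fit(sizes, bin_bytes=BIN_BYTES, min_bytes=MIN_BYTES):
    """Fit a line to (log10 size, log10 frequency) for the tail.

    `sizes` is an iterable of document sizes in bytes.
    Returns (slope, intercept, r_squared).
    """
    # Count documents per fixed-width bin, keeping only the tail.
    counts = Counter(s // bin_bytes for s in sizes if s > min_bytes)

    # Use each bin's midpoint as x. Empty bins never appear in the
    # Counter, so the logarithm of zero is never taken.
    xs = [math.log10((b + 0.5) * bin_bytes) for b in counts]
    ys = [math.log10(c) for c in counts.values()]

    n = len(xs)
    if n < 2:
        raise ValueError("need at least two non-empty bins")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Squared Pearson correlation; equals R^2 for a simple linear fit.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared
```

A straight line with negative slope on these axes corresponds to frequencies that fall off as a power of the document size, and an R^2 close to 1 indicates that the line describes the binned data well.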
Interestingly, these results are similar to the findings of Cunha
et al. [7], where the size of HTML-based Web documents was
found to follow a power-law distribution. However, while Cunha
et al. found that most HTML documents are quite small (usually
between 256 and 512 bytes), Office documents tend to be
much larger. Common sizes of Word and Excel documents range from
12 KB to 24 KB, and common sizes of PowerPoint documents range from 48 KB to
80 KB.
Figure 3: Size distribution of larger Office documents on a log-log scale. Document size frequencies are measured with 16384-byte bins.