Although the above two-pass process reconstructs web page accesses accurately in most cases, some accesses may still be grouped incorrectly. To filter out such accesses, we must better approximate the actual content of a web page.
The accesses to a web page usually exhibit a set of different access patterns. For example, one access pattern can contain all the objects of a web page, while other patterns contain only a subset of them (e.g., because some objects were retrieved from browser or network caches). We assume that access patterns produced by incorrect grouping rarely repeat. Thus, we can use the following statistical analysis of access patterns to determine the actual content of web pages and exclude the incorrectly grouped accesses.
First, from the Client Access Table created in
Subsection 5.2, EtE monitor collects all possible
access patterns for a given web page and identifies the probable
content template of the web page as the union of all objects
that appear in any of the accesses for this page.
Table 1 shows an example of
a probable content template.
EtE monitor assigns an index to each object. The column URL lists
the URLs of the objects that appear in the access patterns for the web
page. The column Frequency shows the number of accesses to the web
page in which the object appears.
In Table 1, the indices
are sorted by the occurrence frequencies of the objects. The
column Ratio is the object's frequency expressed as a percentage of
the total number of accesses for the page.
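The construction of a probable content template can be sketched as follows. This is a minimal illustration, not EtE monitor's implementation: the function name `probable_content_template`, the representation of an access as a list of object URLs, and the sample URLs are all assumptions made for the example.

```python
from collections import Counter

def probable_content_template(accesses):
    """Build a probable content template from the accesses to one web page.

    Each access is a collection of object URLs grouped into one page access
    (a hypothetical representation). Returns rows of
    (index, url, frequency, ratio_percent), most frequent object first.
    """
    total = len(accesses)
    counts = Counter()
    for access in accesses:
        counts.update(set(access))  # count each object at most once per access
    # Sort by descending frequency; break ties by URL for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (index, url, freq, 100.0 * freq / total)
        for index, (url, freq) in enumerate(ranked, start=1)
    ]

# Illustrative data only; these are not the accesses behind Table 1.
accesses = [
    ["/index.html"],
    ["/index.html", "/img1.gif", "/img2.gif"],
    ["/index.html", "/img1.gif"],
    ["/index.html", "/other.gif"],
]
template = probable_content_template(accesses)
for row in template:
    print(row)
```

The template is the union of the objects over all accesses, so an object that was grouped into the page by mistake still appears here, only with a low frequency and ratio.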
Sometimes, several URLs point to the same web page. For example, https://www.hpl.hp.com and https://www.hpl.hp.com/index.html both point to the same page. Before computing the statistics of the access patterns, EtE monitor attempts to merge the accesses made to the same web page under different URL expressions. EtE monitor uses the probable content templates of these URLs to determine whether they indicate the same web page. If the probable content templates of two pages differ only in objects with a small percentage of accesses (less than 1%, which means these objects might have been grouped by mistake), then EtE monitor ignores this difference and merges the URLs.
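One way to express this merge test is sketched below. It assumes, hypothetically, that each probable content template is held as a mapping from embedded-object URL to its ratio in percent, and that the page's own URL is not part of the mapping; the function name `templates_match` and the sample ratios are illustrative.

```python
def templates_match(template_a, template_b, max_ratio=1.0):
    """Return True if two probable content templates differ only in rare objects.

    Each template maps an embedded-object URL to its ratio in percent
    (an assumed representation). Objects present in exactly one template
    are tolerated only if their ratio is below max_ratio.
    """
    differing = set(template_a) ^ set(template_b)  # objects in exactly one template
    for url in differing:
        ratio = template_a[url] if url in template_a else template_b[url]
        if ratio >= max_ratio:
            return False
    return True

# Illustrative ratios only.
page_a = {"/img1.gif": 20.0, "/img2.gif": 18.0, "/stray.gif": 0.4}
page_b = {"/img1.gif": 22.0, "/img2.gif": 19.0}
page_c = {"/img1.gif": 22.0, "/banner.gif": 30.0}

print(templates_match(page_a, page_b))  # differ only in a 0.4% object
print(templates_match(page_a, page_c))  # differ in frequently accessed objects
```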
Based on the probable content template of a web page, EtE
monitor uses the indices of objects in the table to describe the
access patterns for the web page. Table 2
demonstrates a set of different access patterns for the web page in
Table 1. Each row in the table is an
access pattern. The column Object Indices shows the indices of
the objects accessed in a pattern. The columns Frequency and
Ratio give the number of accesses that follow the pattern and their
share of the total number of accesses for the web page. For
example, pattern 1 is a pattern in which only the object index.html is accessed. It is the most popular access pattern for
this web page: 2271 accesses out of the total 3075 accesses represent
this pattern. In pattern 2, the objects index.html, img1.gif and img2.gif are accessed.
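The pattern statistics can be sketched in the same way. The snippet below is an illustration under assumed names (`access_patterns`, `index_of`) and made-up accesses; only the counts 2271 and 3075 come from the text, and they work out to a ratio of roughly 73.9% for pattern 1.

```python
from collections import Counter

def access_patterns(accesses, index_of):
    """Summarize accesses as patterns of object indices.

    index_of maps an object URL to its index in the probable content template.
    Returns rows of (object_indices, frequency, ratio_percent),
    most frequent pattern first.
    """
    total = len(accesses)
    counts = Counter(
        tuple(sorted(index_of[url] for url in set(access))) for access in accesses
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(pattern, freq, 100.0 * freq / total) for pattern, freq in ranked]

# Hypothetical indices and accesses, for illustration only.
index_of = {"/index.html": 1, "/img1.gif": 2, "/img2.gif": 3}
accesses = [["/index.html"]] * 3 + [["/index.html", "/img1.gif", "/img2.gif"]] * 2
patterns = access_patterns(accesses, index_of)
for row in patterns:
    print(row)

# Ratio of pattern 1 from the counts quoted in the text.
pattern1_ratio = 100.0 * 2271 / 3075
print(round(pattern1_ratio, 2))
```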
With the statistics of access patterns, EtE monitor further attempts
to estimate the true content template of web pages, which
excludes the mistakenly grouped access patterns. Intuitively, the
proportion of these invalid access patterns cannot be high. Thus, EtE
monitor uses a configurable ratio threshold to exclude the invalid
patterns
(in this paper, we use
as a configurable ratio
threshold).
If the ratio of a pattern is below the threshold, EtE monitor
does not consider it a valid pattern. In the above example,
patterns 8 and 9 are not considered valid access patterns. Only
the objects found in the valid access patterns are considered
embedded objects of a given web page. Objects 1, 2, and 3 define the
true content template of the web page, shown in
Table 3. Based on the true content
templates, EtE monitor filters out all the invalid accesses in a Client Access Table, and records the correctly constructed page
accesses in the Web Page Session Log, which can be used to
evaluate the end-to-end response performance.
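The thresholding step can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the pattern data, and the threshold value of 5.0 are hypothetical and chosen only for illustration (the threshold is left as a required parameter and is not the value used in the paper), and treating an access as valid when all of its objects belong to the true content template is one possible reading of the filtering rule.

```python
def true_content_template(patterns, threshold):
    """Derive the true content template from pattern statistics.

    patterns is a list of (object_indices, frequency) pairs; threshold is
    the configurable ratio threshold in percent. Patterns whose ratio is
    below the threshold are treated as invalid.
    Returns (template, valid_patterns).
    """
    total = sum(freq for _, freq in patterns)
    valid = [
        indices for indices, freq in patterns
        if 100.0 * freq / total >= threshold
    ]
    template = set().union(*valid)  # objects seen in at least one valid pattern
    return template, valid

def is_valid_access(object_indices, template):
    """Assumed rule: an access is valid if all its objects are in the template."""
    return set(object_indices) <= template

# Hypothetical pattern statistics and threshold, for illustration only.
patterns = [((1,), 70), ((1, 2, 3), 20), ((1, 2), 8), ((1, 4), 2)]
template, valid = true_content_template(patterns, threshold=5.0)
print(sorted(template))
print(is_valid_access((1, 4), template))
```

Keeping the threshold as an explicit argument mirrors the text's description of it as configurable.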