Check out the new USENIX Web site. next up previous
Next: 6. Metrics to Measure Up: 5. Page Reconstruction Module Previous: 5.2 Reconstruction of Web

   
5.3 Identifying Valid Accesses Using Statistical Analysis of Access Patterns

Although the two-pass process above reconstructs web page accesses accurately in most cases, some accesses may still be grouped incorrectly. To filter out such accesses, we must better approximate the actual content of a web page.

The accesses to a web page usually exhibit a set of different access patterns. For example, one access pattern can contain all the objects of a web page, while other patterns may contain only a subset of them (e.g., because some objects were retrieved from browser or network caches). We assume that the access patterns produced by incorrectly grouped accesses rarely repeat. Thus, we can use the following statistical analysis of access patterns to determine the actual content of web pages and exclude the incorrectly grouped accesses.

First, from the Client Access Table created in Subsection 5.2, EtE monitor collects all possible access patterns for a given web page and identifies the probable content template of the web page as the union of all objects that appear in any access for this page. Table 1 shows an example of a probable content template. EtE monitor assigns an index to each object. The column URL lists the URLs of the objects that appear in the access patterns for the web page. The column Frequency shows the number of page accesses in which the object appears. In Table 1, the indices are sorted by the occurrence frequencies of the objects. The column Ratio is the percentage of all accesses for the page that include the object.

 
Table 1: Web page probable content template. There are 3075 accesses for this page.
Index  URL          Frequency  Ratio (%)
1      /index.html  2937       95.51
2      /img1.gif    689        22.41
3      /img2.gif    641        20.85
4      /log1.gif    1          0.03
5      /log2.gif    1          0.03
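
The construction of a probable content template can be illustrated with a minimal sketch. It assumes each reconstructed page access is available as a set of object URLs; the function name probable_content_template, the tuple layout, and the tie-breaking rule (by URL) are hypothetical choices made for illustration, not part of EtE monitor. The per-pattern access counts are taken from Table 2.

```python
from collections import Counter

# Number of accesses per observed object set (counts taken from Table 2).
PATTERN_COUNTS = {
    ("/index.html",): 2271,
    ("/index.html", "/img1.gif", "/img2.gif"): 475,
    ("/index.html", "/img1.gif"): 113,
    ("/index.html", "/img2.gif"): 76,
    ("/img1.gif", "/img2.gif"): 51,
    ("/img1.gif",): 49,
    ("/img2.gif",): 38,
    ("/index.html", "/img1.gif", "/log1.gif"): 1,
    ("/index.html", "/img2.gif", "/log2.gif"): 1,
}


def probable_content_template(accesses):
    """Return rows (index, url, frequency, ratio_percent) for one web page.

    `accesses` is a list of sets of object URLs, one set per page access.
    """
    total = len(accesses)
    # Count in how many accesses each object appears.
    freq = Counter(url for access in accesses for url in access)
    # Sort by descending frequency; ties are broken by URL (an assumption).
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (index, url, count, round(100.0 * count / total, 2))
        for index, (url, count) in enumerate(ranked, start=1)
    ]


# Expand the counts into one set of URLs per access.
accesses = [frozenset(objs) for objs, n in PATTERN_COUNTS.items() for _ in range(n)]
template = probable_content_template(accesses)
for row in template:
    print(row)
```

Under these assumptions the sketch yields the same frequencies and ratios as Table 1, which also confirms that Tables 1 and 2 are mutually consistent.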
 

Sometimes, several URLs point to the same web page. For example, https://www.hpl.hp.com and https://www.hpl.hp.com/index.html both point to the same page. Before computing the statistics of the access patterns, EtE monitor attempts to merge the accesses for the same web page that were recorded under different URL expressions. EtE monitor uses the probable content templates of these URLs to determine whether they indicate the same web page. If the probable content templates of two pages differ only in objects with a small percentage of accesses (less than 1%, which means these objects might have been grouped by mistake), then EtE monitor ignores this difference and merges the URLs.
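
The merge test can be sketched as follows. This is only an illustration under stated assumptions: a template is represented as a mapping from object URL to its ratio in percent, the page's own URL is excluded from the comparison (the text does not say how the differing page URLs themselves are handled), and the names significant_objects and likely_same_page are hypothetical. The ratios for "/" and "/about.html" are invented sample values.

```python
def significant_objects(template, page_url, threshold_pct=1.0):
    """Objects of a template that are not below the ratio threshold.

    The page's own URL is skipped; this is an assumption of the sketch.
    """
    return {
        url
        for url, ratio in template.items()
        if url != page_url and ratio >= threshold_pct
    }


def likely_same_page(url_a, template_a, url_b, template_b, threshold_pct=1.0):
    """True if the two templates differ only in objects below the threshold."""
    return significant_objects(template_a, url_a, threshold_pct) == significant_objects(
        template_b, url_b, threshold_pct
    )


# Template for /index.html, with ratios from Table 1.
index_template = {
    "/index.html": 95.51,
    "/img1.gif": 22.41,
    "/img2.gif": 20.85,
    "/log1.gif": 0.03,
    "/log2.gif": 0.03,
}
# Hypothetical template recorded under the URL "/" (values invented).
root_template = {"/": 96.0, "/img1.gif": 23.1, "/img2.gif": 20.2, "/stray.gif": 0.4}
# Hypothetical template of an unrelated page (values invented).
other_template = {"/about.html": 97.0, "/photo.jpg": 40.0}

print(likely_same_page("/", root_template, "/index.html", index_template))
print(likely_same_page("/about.html", other_template, "/index.html", index_template))
```

One limitation of this sketch is that two unrelated pages without embedded objects would compare as equal, so a practical implementation would need an additional check.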

Based on the probable content template of a web page, EtE monitor uses the indices of objects in the table to describe the access patterns for the web page. Table 2 shows a set of different access patterns for the web page in Table 1. Each row in the table is an access pattern. The column Object Indices shows the indices of the objects accessed in a pattern. The columns Frequency and Ratio give the number of accesses that match the pattern and the pattern's share of all the accesses for the web page. For example, pattern 1 is a pattern in which only the object index.html is accessed. It is the most popular access pattern for this web page: 2271 accesses out of the total 3075 accesses represent this pattern. In pattern 2, the objects index.html, img1.gif and img2.gif are accessed.

 
Table 2: Web page access patterns.
Pattern  Object Indices  Frequency  Ratio (%)
1        1               2271       73.85
2        1,2,3           475        15.45
3        1,2             113        3.67
4        1,3             76         2.47
5        2,3             51         1.66
6        2               49         1.59
7        3               38         1.24
8        1,2,4           1          0.03
9        1,3,5           1          0.03
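
As an illustration of how such a pattern table could be derived, the sketch below maps each access to a sorted tuple of object indices and counts identical tuples. The function name access_patterns, the INDEX_OF mapping, and the eight sample accesses are hypothetical and chosen only to keep the example small.

```python
from collections import Counter

# Object indices as assigned in a probable content template (sample mapping).
INDEX_OF = {"/index.html": 1, "/img1.gif": 2, "/img2.gif": 3}


def access_patterns(accesses, index_of):
    """Return rows (object_indices, frequency, ratio_percent), most frequent first."""
    total = len(accesses)
    # An access pattern is the sorted tuple of object indices seen in one access.
    counts = Counter(tuple(sorted(index_of[url] for url in access)) for access in accesses)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(indices, n, round(100.0 * n / total, 2)) for indices, n in ranked]


# Eight invented page accesses, each given as a set of object URLs.
sample_accesses = [
    {"/index.html"},
    {"/index.html"},
    {"/index.html"},
    {"/index.html", "/img1.gif", "/img2.gif"},
    {"/img2.gif", "/index.html", "/img1.gif"},
    {"/img1.gif"},
    {"/index.html", "/img2.gif"},
    {"/index.html"},
]

patterns = access_patterns(sample_accesses, INDEX_OF)
for row in patterns:
    print(row)
```

Sorting the indices inside each tuple makes the pattern independent of the order in which the objects were retrieved.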
 

With the statistics of access patterns, EtE monitor further attempts to estimate the true content template of web pages, which excludes the mistakenly grouped access patterns. Intuitively, the proportion of these invalid access patterns cannot be high. Thus, EtE monitor uses a configurable ratio threshold to exclude the invalid patterns (in this paper, we set the threshold to 1%). If the ratio of a pattern is below the threshold, EtE monitor does not consider it a valid pattern. In the above example, patterns 8 and 9 are not considered valid access patterns. Only the objects found in the valid access patterns are considered embedded objects of a given web page. Objects 1, 2, and 3 define the true content template of the web page, shown in Table 3. Based on the true content templates, EtE monitor filters out all the invalid accesses in a Client Access Table, and records the correctly constructed page accesses in the Web Page Session Log, which can be used to evaluate the end-to-end response performance.

 
Table 3: Web page true content template.
Index  URL
1      /index.html
2      /img1.gif
3      /img2.gif
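
The threshold step can be sketched with the numbers from Table 2. The function name true_content_template and the data layout are hypothetical. The sketch covers only the selection of valid patterns and the resulting set of object indices, not how EtE monitor rewrites the Client Access Table.

```python
# Access patterns from Table 2: (object indices, frequency).
PATTERNS = [
    ((1,), 2271),
    ((1, 2, 3), 475),
    ((1, 2), 113),
    ((1, 3), 76),
    ((2, 3), 51),
    ((2,), 49),
    ((3,), 38),
    ((1, 2, 4), 1),
    ((1, 3, 5), 1),
]


def true_content_template(patterns, threshold_pct=1.0):
    """Return (object indices in valid patterns, list of valid patterns).

    A pattern is treated as valid if its share of all accesses is not
    below `threshold_pct`.
    """
    total = sum(n for _, n in patterns)
    valid = [(indices, n) for indices, n in patterns if 100.0 * n / total >= threshold_pct]
    # The true content template is the union of objects in the valid patterns.
    objects = sorted({i for indices, _ in valid for i in indices})
    return objects, valid


template_indices, valid_patterns = true_content_template(PATTERNS)
valid_accesses = sum(n for _, n in valid_patterns)
print(template_indices)
print(len(valid_patterns), valid_accesses)
```

With the 1% threshold, patterns 8 and 9 are dropped, objects 1, 2, and 3 remain as in Table 3, and the seven valid patterns account for 3073 of the 3075 accesses.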
 

