Next: 5.2 Reconstruction of Web Up: 5. Page Reconstruction Module Previous: 5. Page Reconstruction Module

   
5.1 Building a Knowledge Base of Web Page Objects

The goal of this step is to reconstruct a special subset of web page accesses, which we use to build a Knowledge Base about web pages and the objects composing them. Before grouping HTTP transactions into web pages, EtE monitor first sorts all transactions in the Transaction Log by the timestamp of the beginning of each request, in increasing order. After sorting, the requests for the embedded objects of a web page follow the request for the corresponding HTML file of the page. When grouping objects into web pages (here and in the next subsection), we consider only transactions with successful responses, i.e. those with status code 200 in the response.
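The sorting and filtering described above can be sketched in Python as follows. This is a minimal illustration, not EtE monitor's actual code: the `Transaction` record, its field names, and the `prepare_transactions` helper are assumptions, since the layout of the Transaction Log is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    # Hypothetical record layout; the field names are illustrative only.
    client_ip: str
    url: str
    referer: str          # empty string if the browser did not set it
    content_type: str
    status: int
    request_start: float  # timestamp of the beginning of the request


def prepare_transactions(log: List[Transaction]) -> List[Transaction]:
    """Keep only successful responses (status 200), ordered by request start."""
    successful = [t for t in log if t.status == 200]
    # sorted() is stable, so transactions with equal timestamps keep log order.
    return sorted(successful, key=lambda t: t.request_start)
```

With this ordering, a request for an embedded object never precedes the request for the HTML file that refers to it, provided both reached the server.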

The next step is to scan the sorted transaction log and group objects into web page accesses. Not all the transactions are useful for the Knowledge Base construction process. During this step, some of the Transaction Log entries are excluded from our current consideration:

To group the rest of the transactions into web page accesses, we use the following fields from the entries in the Transaction Log: the request URL, the request referer field, the response content type, and the client IP address. EtE monitor stores the web page access information in a hash table, the Client Access Table depicted in Figure 2, which maps a client's IP address to a Web Page Table containing the web pages accessed by the client. Each entry in the Web Page Table is a web page access and consists of the URL of an HTML file and the URLs of its embedded objects. Notice that EtE monitor makes no distinction between statically and dynamically generated HTML files. We consider embedded HTML pages, e.g. framed web pages, as separate web pages.
  
Figure 2: Client Access Table.

When processing an entry of the Transaction Log, EtE monitor first locates the Web Page Table for the client's IP in the Client Access Table. Then, EtE monitor handles the transaction according to its content type:

1. If the content type is text/html, EtE monitor treats it as the beginning of a web page and creates a new web page entry in the Web Page Table.

2. For other content types, EtE monitor attempts to insert the URL of the requested object into the web page that contains it according to its referer field. If the referred HTML file is already present in the Web Page Table, EtE monitor appends this object at the end of the entry. If the referred HTML file does not exist in the client's Web Page Table, it means that the client may have retrieved a cached copy of the object from somewhere else between the client and the web server. In this case, EtE monitor first creates a new web page entry in the Web Page Table for the referred HTML file. Then it appends the considered object to this page.
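The two cases above can be sketched as a single pass over the sorted transactions. This is a hedged illustration under several assumptions that the text does not state: transactions are given as `(client_ip, url, referer, content_type)` tuples, request URLs and referer values are already normalized so they can be compared directly, an object is attached to the most recent access of the referred HTML file, and non-HTML transactions with an empty referer are skipped. The function name `build_client_access_table` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical transaction tuple: (client_ip, url, referer, content_type)
Txn = Tuple[str, str, str, str]

# A web page access: the HTML file's URL followed by its embedded objects.
PageAccess = List[str]


def build_client_access_table(transactions: List[Txn]) -> Dict[str, List[PageAccess]]:
    """Group time-sorted transactions into per-client web page accesses."""
    table: Dict[str, List[PageAccess]] = defaultdict(list)
    for client_ip, url, referer, content_type in transactions:
        pages = table[client_ip]  # the client's Web Page Table
        # Ignore parameters such as "; charset=utf-8" when checking the type.
        if content_type.split(";")[0].strip() == "text/html":
            # Case 1: an HTML file starts a new web page access.
            pages.append([url])
            continue
        if not referer:
            # Assumption: objects without a referer cannot be placed here.
            continue
        # Case 2: find the most recent access of the referred HTML file.
        page = next((p for p in reversed(pages) if p[0] == referer), None)
        if page is None:
            # The HTML file itself was not seen by the server; create
            # an entry for it before appending the object.
            page = [referer]
            pages.append(page)
        page.append(url)
    return dict(table)
```

For example, an image whose referer names an HTML file absent from the client's Web Page Table yields a new entry for that HTML file with the image as its only embedded object.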

From the Client Access Table, EtE monitor determines the content template of any given web page as the combined set of all the objects that appear in all the access patterns for this web page. To do so, EtE monitor scans the Client Access Table and creates a new hash table, as shown in Figure 3, which is used as a Knowledge Base to group the accesses for the same web pages from other clients' browsers that do not set the referer fields.
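A minimal sketch of this step, assuming the Client Access Table is represented as a dictionary from client IP address to a list of page accesses, each access being a list whose first element is the HTML file's URL. This representation and the name `build_knowledge_base` are illustrative assumptions, not the paper's data layout.

```python
from typing import Dict, List, Set


def build_knowledge_base(
    client_access_table: Dict[str, List[List[str]]]
) -> Dict[str, Set[str]]:
    """Map each HTML URL to the union of objects seen in any access of it."""
    knowledge_base: Dict[str, Set[str]] = {}
    for pages in client_access_table.values():
        for page in pages:
            html_url, embedded = page[0], page[1:]
            # The content template is the combined set over all accesses.
            knowledge_base.setdefault(html_url, set()).update(embedded)
    return knowledge_base
```

Taking the union means that an object one client fetched from its cache, and so never requested from the server, still appears in the page's template as long as some other client requested it.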


  
Figure 3: Knowledge Base of web pages.

