The goal of this step is to reconstruct a special subset of web page accesses, which we use to build a Knowledge Base about web pages and the objects composing them. Before grouping HTTP transactions into web pages, EtE monitor first sorts all transactions in the Transaction Log in increasing order of the timestamps for the beginning of their requests. In the sorted log, the requests for the embedded objects of a web page therefore follow the request for the corresponding HTML file of the page. When grouping objects into web pages (here and in the next subsection), we consider only transactions with successful responses, i.e., those with status code 200 in the response.
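The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not EtE monitor's actual implementation: the `Transaction` record, its field names, and the `prepare_transactions` helper are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record layout; the field names are assumptions of this sketch.
    client_ip: str
    url: str
    content_type: str
    referer: Optional[str]
    status: int
    request_start: float  # timestamp of the beginning of the request


def prepare_transactions(log: List[Transaction]) -> List[Transaction]:
    """Keep successful (status 200) transactions, sorted by request start time."""
    successful = [t for t in log if t.status == 200]
    return sorted(successful, key=lambda t: t.request_start)


# Small made-up log: entries arrive out of order and one response failed.
log = [
    Transaction("10.0.0.1", "/logo.gif", "image/gif", "/index.html", 200, 2.0),
    Transaction("10.0.0.1", "/missing.gif", "image/gif", "/index.html", 404, 1.5),
    Transaction("10.0.0.1", "/index.html", "text/html", None, 200, 1.0),
]
ordered = prepare_transactions(log)
```

In this toy log, the HTML request ends up ahead of its embedded object, and the failed transaction is dropped before any grouping takes place.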
The next step is to scan the sorted Transaction Log and group objects into web page accesses. Not all transactions are useful for the Knowledge Base construction process. During this step, some of the Transaction Log entries are excluded from our current consideration:
When processing an entry of the Transaction Log, EtE monitor first locates the Web Page Table for the client's IP in the Client Access Table. Then, EtE monitor handles the transaction according to its content type:
1. If the content type is text/html, EtE monitor treats it as the beginning of a web page and creates a new web page entry in the Web Page Table.
2. For other content types, EtE monitor attempts to insert the URL of the requested object into the web page that contains it, as identified by the referer field. If the HTML file named in the referer field is already present in the Web Page Table, EtE monitor appends this object at the end of that entry. If the HTML file does not exist in the client's Web Page Table, the client may have retrieved a cached copy of the HTML file from somewhere between the client and the web server, so that the request for it never reached the server. In this case, EtE monitor first creates a new web page entry in the Web Page Table for the HTML file named in the referer field, and then appends the considered object to this page.
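The two cases above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than EtE monitor's code: the `Transaction` tuple, the dictionary-based tables, and `group_into_pages` are hypothetical names. The sketch also assumes that the referer value is directly comparable to the HTML file's URL, that an object is attached to the most recent entry for its referring page, and that objects without a referer are skipped.

```python
from collections import defaultdict, namedtuple

# Hypothetical record layout; field names are assumptions of this sketch.
Transaction = namedtuple("Transaction", "client_ip url content_type referer")


def group_into_pages(sorted_transactions):
    """Group time-ordered, successful transactions into web page accesses."""
    # Client Access Table: client IP -> Web Page Table (list of page entries).
    client_access_table = defaultdict(list)
    for t in sorted_transactions:
        web_page_table = client_access_table[t.client_ip]
        if t.content_type == "text/html":
            # Case 1: an HTML file starts a new web page entry.
            web_page_table.append({"html": t.url, "objects": []})
            continue
        if not t.referer:
            # Assumption: objects without a referer are not handled in this step.
            continue
        # Case 2: look for the most recent entry for the referring HTML file.
        page = next(
            (p for p in reversed(web_page_table) if p["html"] == t.referer),
            None,
        )
        if page is None:
            # The HTML file was never seen for this client: create its entry first.
            page = {"html": t.referer, "objects": []}
            web_page_table.append(page)
        page["objects"].append(t.url)
    return dict(client_access_table)


transactions = [
    Transaction("10.0.0.1", "/index.html", "text/html", None),
    Transaction("10.0.0.1", "/logo.gif", "image/gif", "/index.html"),
    # Client 10.0.0.2 never requested /index.html from the server.
    Transaction("10.0.0.2", "/style.css", "text/css", "/index.html"),
]
table = group_into_pages(transactions)
```

Here the second client gets a page entry for `/index.html` even though only the embedded object was observed, which mirrors the cached-HTML case in item 2.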
From the Client Access Table, EtE monitor determines the content template of any given web page as the combined set of all the objects that appear in all the access patterns for this web page. To do so, EtE monitor scans the Client Access Table and creates a new hash table, as shown in Figure 3, which is used as a Knowledge Base to group the accesses for the same web pages from other clients' browsers that do not set the referer fields.
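As a rough illustration of the content template construction, the sketch below takes the union of the objects seen across all accesses to each page. The table layout and the `build_knowledge_base` helper are assumptions of this example and may differ from the hash table shown in Figure 3.

```python
def build_knowledge_base(client_access_table):
    """Map each HTML URL to the union of objects seen in any access to it."""
    knowledge_base = {}
    for web_page_table in client_access_table.values():
        for page in web_page_table:
            # Merge this access pattern into the page's content template.
            template = knowledge_base.setdefault(page["html"], set())
            template.update(page["objects"])
    return knowledge_base


# Hypothetical Client Access Table: client IP -> list of page entries.
client_access_table = {
    "10.0.0.1": [
        {"html": "/index.html", "objects": ["/logo.gif", "/style.css"]},
    ],
    "10.0.0.2": [
        {"html": "/index.html", "objects": ["/logo.gif", "/banner.gif"]},
        {"html": "/about.html", "objects": []},
    ],
}
knowledge_base = build_knowledge_base(client_access_table)
```

Using a set per page means an object that appears in several access patterns is recorded once, which matches the description of the template as a combined set.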