We employ the MapReduce [9] framework to process billions of web pages in parallel. For each web page, we extract several features, some of which take advantage of the fact that many landing URLs are hijacked to include malicious payload(s) or to point to malicious payload(s) hosted on a distribution site. For example, we use ``out of place'' IFRAMEs, obfuscated JavaScript, or IFRAMEs pointing to known distribution sites as features. Using a specialized machine-learning framework [7], we translate these features into a likelihood score. We employ five-fold cross-validation to measure the quality of the machine-learning framework. The cross-validation splits the data set into five randomly chosen partitions, trains on four of them, and uses the remaining partition for validation; this process is repeated five times, once per held-out partition. For each trained model, we create an ROC curve and use the average ROC curve to estimate the overall accuracy. Using this ROC curve, we estimate the false positive rate and detection rate for different thresholds. Our infrastructure pre-processes roughly one billion pages daily. To fully utilize the capacity of the subsequent detailed verification phase, we choose a threshold score that results in an outcome false positive rate of about with a corresponding detection rate of approximately . This amounts to about one million URLs that we subject to the computationally more expensive verification phase.
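The two steps above can be illustrated with small sketches. The first is a minimal, streaming-style map function for the feature-extraction step; it is an assumption-laden stand-in, since the actual MapReduce job [9], the exact feature set, and the list of known distribution sites are internal. The regular expressions, feature names, and placeholder site list below are illustrative only.

```python
# Illustrative stand-in for the map phase of the pre-processing pipeline.
import json
import re
import sys

KNOWN_DISTRIBUTION_SITES = {"distribution-site.invalid"}   # placeholder list

IFRAME_TAG_RE = re.compile(r'<iframe\b[^>]*>', re.IGNORECASE)
SRC_RE = re.compile(r'src\s*=\s*["\']?([^"\'> ]+)', re.IGNORECASE)
ZERO_SIZE_RE = re.compile(r'(?:width|height)\s*=\s*["\']?0\b', re.IGNORECASE)
OBFUSCATION_RE = re.compile(r'\b(?:eval|unescape|fromCharCode)\s*\(', re.IGNORECASE)

def extract_features(url, html):
    """Heuristic features for one crawled landing page."""
    iframe_tags = IFRAME_TAG_RE.findall(html)
    iframe_srcs = [m.group(1) for tag in iframe_tags if (m := SRC_RE.search(tag))]
    return {
        "url": url,
        "num_iframes": len(iframe_tags),
        "num_zero_size_iframes": sum(1 for tag in iframe_tags if ZERO_SIZE_RE.search(tag)),
        "num_obfuscation_markers": len(OBFUSCATION_RE.findall(html)),
        "iframe_to_known_site": any(
            site in src for src in iframe_srcs for site in KNOWN_DISTRIBUTION_SITES
        ),
    }

if __name__ == "__main__":
    # One tab-separated "url<TAB>html" record per input line (streaming-style mapper).
    for line in sys.stdin:
        url, html = line.rstrip("\n").split("\t", 1)
        print(json.dumps(extract_features(url, html)))
```

The second sketch mirrors the evaluation procedure: five-fold cross-validation, per-fold ROC curves averaged on a common false-positive-rate grid, and a read-out of the detection rate at a target false positive rate. Using scikit-learn with logistic regression here is an assumption; the paper's specialized machine-learning framework [7] is not public.

```python
# Sketch of the evaluation: five-fold cross-validation with an averaged ROC
# curve, then the detection rate at a target false positive rate.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold

def averaged_roc(X, y, target_fpr, n_folds=5):
    grid = np.linspace(0.0, 1.0, 1001)                      # common FPR grid for averaging
    tprs = []
    for train_idx, val_idx in StratifiedKFold(n_splits=n_folds, shuffle=True).split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[val_idx])[:, 1]      # likelihood score per page
        fpr, tpr, _ = roc_curve(y[val_idx], scores)
        tprs.append(np.interp(grid, fpr, tpr))              # resample fold curve onto the grid
    mean_tpr = np.mean(tprs, axis=0)
    # Detection rate achieved at the desired outcome false positive rate.
    detection_rate = float(np.interp(target_fpr, grid, mean_tpr))
    return grid, mean_tpr, detection_rate
```

The score threshold corresponding to the chosen operating point on the averaged curve then determines which of the pre-processed pages are forwarded to the verification phase.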
In addition to analyzing web pages in the crawled web repository, we also regularly select several hundred thousand URLs for in-depth verification. These URLs are randomly sampled from popular URLs as well as from the global index. We also process URLs reported by users.
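A rough sketch of how such a verification queue might be assembled from these three sources, assuming they are available as simple URL collections; the function name and per-source sample sizes are illustrative and not taken from the text.

```python
# Illustrative only: combines user-reported URLs with random samples drawn from
# popular URLs and the global index. The quotas n_popular and n_index are made up.
import random

def select_for_verification(popular_urls, index_urls, user_reports,
                            n_popular=100_000, n_index=100_000):
    popular = list(popular_urls)
    index = list(index_urls)
    selected = set(user_reports)                            # user reports are always included
    selected.update(random.sample(popular, min(n_popular, len(popular))))
    selected.update(random.sample(index, min(n_index, len(index))))
    return selected
```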