Virtual machines have been used as honeypots for detecting unknown
attacks by several researchers
[4,16,17,25,26].
Although honeypots have traditionally been used mostly for detecting
attacks against servers, the same principles also apply to client
honeypots (e.g., an instrumented browser running on a virtual
machine). For example, Moshchuk et al. used client-side
techniques to study spyware on the web (by crawling 18 million URLs in
May 2005 [17]). Their primary focus was not on detecting
drive-by downloads, but on finding links to executables labeled as
spyware by an adware scanner. Additionally, they sampled
URLs for drive-by downloads and showed a decrease over time.
However, the fundamental limitation of analyzing the malicious nature
of URLs discovered by ``spidering'' is that a crawl can only follow
content links, whereas the malicious nature of a page is often
determined by the web hosting infrastructure. As such, while the
study of Moshchuk et al. provides valuable insights, a truly
comprehensive analysis of this problem requires a much more in-depth
crawl of the web. As we were able to analyze many billions of URLs,
we believe our findings are more representative of the state of the
overall problem.
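At a high level, a client honeypot of the kind described above reduces to a simple control loop: restore a guest virtual machine to a known-clean snapshot, load a candidate URL in an instrumented browser inside the guest, and flag the URL if the guest's state changes afterwards. The sketch below illustrates only this principle; the helper functions are hypothetical placeholders for VM control and guest monitoring, not the implementation of any of the systems cited above.

    def revert_vm(vm_name):
        # Placeholder: restore the guest to a known-clean snapshot.
        pass

    def visit_in_guest(vm_name, url, timeout=30):
        # Placeholder: drive the instrumented browser inside the guest.
        pass

    def guest_state_changed(vm_name):
        # Placeholder: report new processes, files, or registry entries.
        return False

    def scan(candidate_urls):
        # Visit each URL in a clean guest; report URLs that alter the guest.
        flagged = []
        for url in candidate_urls:
            revert_vm("honeypot")
            visit_in_guest("honeypot", url)
            if guest_state_changed("honeypot"):
                flagged.append(url)
        return flagged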
More closely related is the work of Provos et al. [20]
and Seifert et al. [24], which raised awareness of the
threat posed by drive-by downloads. These works are aimed at
explaining how different web page components are used to exploit web
browsers, and provide an overview of the different exploitation
techniques in use today. Wang et al. proposed an approach for
detecting exploits against Windows XP when visiting webpages in
Internet Explorer [26]. Their approach is capable
of detecting zero-day exploits against Windows and can determine which
vulnerability is being exploited by exposing Windows systems with
different patch levels to dangerous URLs. Their results showed that
some of the URLs they examined were dangerous to users.
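The patch-level triage used by Wang et al. can be viewed as a simple inference: if a guest patched up to a given level resists a URL that compromises a less-patched guest, the exploited vulnerability must be one fixed by the intervening updates. The following sketch is a hypothetical illustration of that reasoning, not their implementation; the per-level compromise outcomes are assumed to come from visits in instrumented guests.

    def first_resistant_patch(patch_levels, compromised):
        # patch_levels: labels ordered from least to most patched.
        # compromised:  maps each label to whether that guest was exploited.
        for level in patch_levels:
            if not compromised[level]:
                return level      # earliest level that stops the exploit
        return None               # exploit succeeds even on a fully patched
                                  # guest (a likely zero-day)

    # Example: the exploit works on the unpatched and SP1 guests but fails
    # once SP2 is applied, so the vulnerability is one fixed by SP2.
    levels = ["unpatched", "SP1", "SP2", "fully-patched"]
    results = {"unpatched": True, "SP1": True, "SP2": False, "fully-patched": False}
    print(first_resistant_patch(levels, results))   # -> SP2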
This paper differs from all of these works in that it offers a far more comprehensive analysis of web-based malware, including an examination of its prevalence, the structure of the distribution networks, and the major forces driving the problem.
Lastly, malware detection via dynamic taint analysis may provide deeper insight into the mechanisms by which malware installs itself and how it operates [10,15,27]. In this work, we are more interested in the structural properties of the distribution sites themselves and in how malware behaves once it has been implanted. We therefore do not employ tainting, which is computationally expensive; instead, we simply collect the changes malware makes to the system, which does not require tracing information flow in detail.
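Collecting the changes made by malware, rather than tracing information flow, amounts to diffing guest state before and after a visit. The snippet below shows one plausible form of such a diff for a file tree; it is a simplified illustration rather than our monitoring infrastructure, and the guest path is chosen purely for the example.

    import hashlib
    import os

    def snapshot(root):
        # Map every file under root to a hash of its contents.
        state = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        state[path] = hashlib.sha256(f.read()).hexdigest()
                except OSError:
                    continue          # unreadable files are skipped
        return state

    def diff_snapshots(before, after):
        # Report files created or modified between the two snapshots.
        created = [p for p in after if p not in before]
        modified = [p for p in after if p in before and after[p] != before[p]]
        return created, modified

    # Usage: snapshot the guest file system (an illustrative mount point),
    # let the page load and any payload execute, then snapshot again and diff.
    before = snapshot("/mnt/guest")
    # ... visit the URL inside the virtual machine ...
    after = snapshot("/mnt/guest")
    created, modified = diff_snapshots(before, after)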