Michalis Polychronakis
FORTH-ICS, Greece, email: mikepo@ics.forth.gr
Panayiotis Mavrommatis
Google Inc., email: panayiotis@google.com
Niels Provos
Google Inc., email: niels@google.com
We support our findings by providing a broad overview of the post-infection network behavior of web-based malware, as well as in-depth examinations of the botnets and leaked information we found during the course of our study.
A thriving Internet underground [6] has grown up in the past several years, harnessing hundreds of thousands of malware-infected commodity PCs as an infrastructure for conducting a wide range of criminal enterprises. Such activities range from carrying out electronic fraud through phishing to sending out billions of spam email messages.
In previous work, we examined how vulnerable computer systems become infected with malware simply by browsing the web [11,10]. Our analysis was based on detecting drive-by downloads on billions of web pages. In a drive-by download attack, a malicious web page exploits a vulnerability in a web browser, media player, or other client software to install and run malware on the unsuspecting visitor's computer. By loading suspicious web pages with a web browser inside a virtual machine, we found several million web pages capable of compromising vulnerable computer systems.
While prior research has tried to understand the behavior of individual malware binaries, little work has been done to understand the network-level behavior of a large population of computer systems infected with a diverse set of web-based malware. To better understand this issue, we instrumented our virtual machines with light-weight responders to capture and respond to any network payloads sent to the Internet by malware installed as a result of a successful drive-by download attack. Our responders are capable of emulating HTTP, IRC, SMTP and FTP sessions. For any protocols not directly emulated, we also built a generic responder that captures all other payloads.
Over the course of two months, we collected over 448,000 responder sessions. We subsequently analyzed these sessions, looking at overall trends and performing several in-depth case studies. Our analysis organizes malware's behavior into three categories: propagation, data exfiltration and remote control. We show how these aspects taken together provide a compelling perspective on the life cycle of web-based malware. The observed malware activities range from capturing email addresses from compromised machines, to joining infected systems into botnets responsible for operating large-scale spam campaigns.
Furthermore, the botnets created by web-based malware are not only controlled via traditional mechanisms such as IRC, but often employ other protocols such as HTTP. Our in-depth examinations turned up a variety of interesting trends, including rich data exfiltration activity and the use of custom protocols for command and control communication. We believe these results provide the first large scale empirical look at web-based malware.
The remainder of the paper is organized as follows. In the next section, we provide a brief overview of our malware analysis and collection architecture. In Section 3, we investigate the life-cycle of web-based malware based on broad trends in our data, and discuss in-depth case studies of botnets and data exfiltration activities. Finally, we discuss related work in Section 4 and conclude in Section 5.
We build upon a scalable system developed with the goal of detecting harmful URLs on the web. In this section, we describe our extensions to this system to collect and analyze malicious network activity occurring after infection. To provide context for our research, we first give a brief overview of the overall system described in depth in prior work [10]. This is followed by a detailed description of the light-weight responders which allow us to automatically capture the network-level activity of drive-by downloads.
Our system consists of an efficient first-pass filter followed by a verification component. Figure 1 provides an overview of the system components and their interaction. The first-pass filter is essentially a MapReduce [5] computation over billions of web documents. For each web page, we extract several features, including links to known malware "distribution" sites, suspicious HTML elements, and the presence of code obfuscation. The combination of these features is scored by a model trained on a specialized machine-learning system [3]. URLs with a high score are considered potentially malicious and are submitted to the verification component for further analysis. The URLs that are verified to be malicious are then exported to Google Web Search and, via the Google Safe Browsing API [1], to other clients. The verification results are also used to retrain the machine-learning model.
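To make the scoring step concrete, consider the following minimal sketch in Python. The features, weights, and threshold here are illustrative assumptions chosen for exposition; they are not the production feature set or the model of [3].

import re

# Hypothetical per-page features: pattern -> feature name.
FEATURES = {
    r"<iframe[^>]*\bwidth=[\"']?0": "zero_size_iframe",
    r"unescape\s*\(": "js_unescape",
    r"eval\s*\(": "js_eval",
    r"document\.write\s*\(": "js_document_write",
}

# Hypothetical weights, standing in for a trained model.
WEIGHTS = {
    "zero_size_iframe": 2.1,
    "js_unescape": 1.4,
    "js_eval": 0.9,
    "js_document_write": 0.6,
}

def extract_features(html):
    """Count how often each suspicious pattern occurs in the page."""
    return {name: len(re.findall(pat, html, re.I))
            for pat, name in FEATURES.items()}

def score(html, bias=-3.0):
    """Linear score; pages above a threshold go to the VM-based verifier."""
    counts = extract_features(html)
    return bias + sum(WEIGHTS[f] * n for f, n in counts.items())

page = '<iframe width="0" src="http://example.com/x"></iframe><script>eval(x)</script>'
print(score(page))  # pages scoring above a chosen threshold are verified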
To verify if a candidate URL is indeed malicious, we have deployed a network of client honeypots based on virtual machines running Windows and Internet Explorer. To start the verification of a URL, we load it with Internet Explorer, which is configured to use a web proxy running outside the VM. The proxy records all HTTP requests and scans all HTTP responses using several anti-virus engines. New processes, file system changes, and registry modifications are monitored from within the VM. Upon infection, we typically find abnormal activity in all of the monitored areas. A combination of these signals is used to decide whether the URL is malicious or not.
Since malware may communicate over non-standard ports or using custom protocols, all outgoing traffic towards any port other than the standard ports of the emulated services is redirected to a generic responder. The purpose of the generic responder is twofold: i) to hand off connections over non-standard ports that use one of the emulated protocols to the appropriate responder, and ii) to elicit further information about the malware from connections to non-emulated services, or connections that use unknown protocols.
Upon receiving a connection, the generic responder attempts to heuristically identify the application-level protocol used. For client-initiated protocols, in which the client first sends a message to the server (e.g., HTTP, IRC), the generic responder can determine which service it should emulate by looking in the client's message for known protocol keywords and structure.
For protocols in which the server initiates the dialog by sending a message to the client (e.g., SMTP, FTP), the responder acts as follows: after it accepts a connection, if no data is received within a few seconds, the responder assumes that the client is waiting for a message from the server. In that case, the responder blindly initiates the dialog by sending a generic welcome banner message. Depending on the client's response, the generic responder can identify the actual protocol being used and hand off the connection to the appropriate emulated service. As discussed in Section 3, the generic responder successfully identified many HTTP and IRC connections over non-standard ports and handled them appropriately.
The generic responder is useful even if the malware uses unsupported or unknown protocols. Without the responder, we would observe just a connection attempt, with no transmitted data. However, by accepting all TCP connections to any port, the generic responder can potentially elicit the first application-level message sent by the malware, which may provide useful insight about the intended activity. Indeed, as discussed in the following section, the generic responder captured several sessions of custom malware communication protocols over arbitrary ports.
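To make the dispatch logic concrete, the following is a minimal sketch of such a generic responder, written in Python with asyncio. The keyword heuristics, the five-second timeout, and the banner text are illustrative assumptions rather than the exact rules used by our system, and the hand-off to emulated services is stubbed out as a log message.

import asyncio

def classify(payload):
    """Guess the application-level protocol from the first client bytes."""
    head = payload[:512]
    up = head.upper()
    if up.startswith((b"GET ", b"POST ", b"HEAD ", b"CONNECT ")):
        return "http"
    if up.startswith((b"EHLO", b"HELO", b"MAIL FROM")):
        return "smtp"
    if up.startswith((b"NICK ", b"PASS ")) or b"JOIN #" in up:
        return "irc"
    if up.startswith(b"USER "):
        # USER is shared by FTP and IRC; IRC's USER carries extra fields.
        return "irc" if head.splitlines()[0].count(b" ") >= 3 else "ftp"
    return "unknown"

async def handle(reader, writer):
    try:
        data = await asyncio.wait_for(reader.read(4096), timeout=5.0)
    except asyncio.TimeoutError:
        # A silent client is probably waiting for a server greeting:
        # send a generic banner and classify the reply instead.
        writer.write(b"220 service ready\r\n")
        await writer.drain()
        data = await reader.read(4096)
    proto = classify(data)
    # The real system hands the connection off to the matching emulated
    # service; this sketch only records the captured payload.
    print(f"captured {len(data)} bytes, guessed protocol: {proto}")
    writer.close()

async def main():
    # In deployment, traffic to all non-standard ports is redirected here.
    server = await asyncio.start_server(handle, "0.0.0.0", 8888)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())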
Once web-based malware infects a computer, it often interacts with other hosts on the Internet to either report information about the compromised system, or to receive instructions for further actions. For example, a newly infected computer is likely to become part of a botnet. Web-based malware may also download other malware components, report stolen information and credentials, or attempt to propagate further.
In the following, we provide a large-scale analysis of post-infection malware activity as observed by our light-weight responders. We aim to provide insight into what happens after a vulnerable system gets infected as a result of visiting a malicious URL, and how it then interacts with the Internet. We do not study malware in isolation, but rather as it behaves after compromising a typical user's system, which in our case runs as a virtual machine. For example, in most drive-by downloads, multiple malware components are installed at the same time.
Our network-level analysis of malware, i.e., malware's interaction with other hosts and our responders, is organized into three categories: propagation, data exfiltration and remote control. As we explore the post-infection activity, we show how these behaviors taken together provide revealing insights into the life cycle of web-based malware.
Our analysis covers a two-month period, from January 17, 2008 to March 25, 2008. During this period, our virtual machines analyzed URLs from 5,756,000 unique hostnames--we report on unique hostnames instead of unique URLs, as URLs from the same host usually install the same set of malware. There were 307,000 hostnames serving at least one harmful URL, while 152,700 of these sites (49%) had URLs that resulted in HTTP requests initiated from processes other than the web browser. About 18,000 sites (5%) had URLs that triggered responder sessions.
Across all URLs, the total number of responder sessions with transmitted data exceeded 448,000. There were many more cases where malware made network connections without transmitting any data. For example, we observed connections to popular SMTP servers without actual data transmission. We speculate that these are attempts to test the victim's firewall configuration or Internet connectivity.
We begin our analysis by providing high-level statistics about the overall post-infection network activity of the analyzed URLs.
Figure 3 presents the destination port distribution of all outgoing connections from the virtual machine upon infection. For each port, we report the number of unique host names on which we found at least one URL installing malware that transmitted data to that port. This removes any bias from host names for which the system happened to analyze multiple URLs. A total of 416 different destination ports were contacted, an indication of the diverse and obscure nature of malware's post-infection network behavior.
Figure 4 shows the network protocol distribution of the observed outgoing connections. The classification was made using payload inspection, without considering the destination port number. Potential reasons for the diversity in destination ports include custom protocols, as well as standard protocols run over non-standard ports, probably to make the purpose of the traffic less obvious. For example, in addition to port 80, we witnessed HTTP connections to 63 other port numbers. Similarly, we found malware communicating with IRC servers on 44 different ports.
The exact distribution of HTTP and IRC connections according to the destination port number is shown in Figures 5 and 6, respectively. Most of the HTTP connections to non-standard ports are either GET or POST requests. Using technologies such as PHP and JSP, malware authors can implement flexible communication mechanisms for sending commands or updates and retrieving information from the infected hosts. We also witnessed some CONNECT requests which probably correspond to probes for open proxies that support the CONNECT method for constructing tunnels. About one third of the connections, corresponding to the other category in Figure 4, used less popular or unknown protocols to 317 different destination ports.
One of the most common network activities of malware upon infecting a host is to scan for other vulnerable systems, either in the same LAN or the Internet, to further propagate. As shown in Figure 4, we observed a significant number of network connections using common Windows protocols. About half of the connections were to ports 139 (NETBIOS) and 445 (SMB), which are often related to exploiting Windows vulnerabilities. Ports 135 (DCOM) and 1433 (MSSQL) are also commonly associated with exploits against Windows and Microsoft SQL servers, respectively.
As our responders do not emulate these protocols, we observed only the first protocol packet. The majority of probes were to IP addresses on the same network as the virtual machine, which implies that most malware first scans neighboring computers, either for vulnerable services or for network shares with unrestricted access.
Part of the malware life cycle consists of notifying its author upon successful installation. This activity accounted for the majority of the emails captured by our SMTP responders. Table 1 shows the most common email subjects we observed. Subjects such as "XP Hacked" or "Installation Report" signify this type of activity quite clearly. The email bodies usually contain further information about the victim's host, such as its IP address, access credentials, and the port numbers of any installed back doors.
Table 2 shows the SMTP domains most frequently used by malware to send installation reports. In most cases, the emails were sent to drop box accounts on popular free web mail services, as well as ISP mail services, usually employing several different SMTP servers for each service.
The HTTP protocol is also frequently used to inform malware authors about infections. The following example shows a GET request made by the DoDoLook trojan that uses the MAC address (here sanitized) of the infected computer as an identifier for retrieving targeted updates:
GET /geturl.php?version=1.1.2&fid=7493&mac=00-00-00-00-00-00&lversion=&wversion=&day=0&name=dodolook&recent=0 HTTP/1.1
Accept: */*
User-Agent: Mozilla/4.0 (compatible; )
Host: loader.51edm.net:1207
Cache-Control: no-cache
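Such beacons are straightforward to decompose once captured. As a small illustration, and assuming only Python's standard urllib, the query string of the request above can be parsed to recover the MAC-based host identifier; the semantics of the remaining fields are our reading of the capture, not documented behavior:

from urllib.parse import urlsplit, parse_qs

req = ("/geturl.php?version=1.1.2&fid=7493&mac=00-00-00-00-00-00"
       "&lversion=&wversion=&day=0&name=dodolook&recent=0")
fields = parse_qs(urlsplit(req).query, keep_blank_values=True)
host_id = fields["mac"][0]            # per-victim identifier
print(fields["name"][0], host_id)     # -> dodolook 00-00-00-00-00-00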
We also found malware utilizing custom communication protocols for reporting home. In some cases, the malware reported successful infections using a custom XML-like format. Specifically, the captured connections of this particular protocol were destined to 129 remote hosts using 29 different destination ports. Two examples of the transmitted data are shown below:
HGZ5.<FT>2008-01-28 12:55:30</FT><IM>80</IM><GR>_&</GR>
<SYS>Windows XP 5.1</SYS><NE>XP</NE><PID>488</PID>
<VER>Ver1.22-0624</VER><BZ></BZ><P>1</P><V>0</V><IP>0.0.0.0</IP>

000......<LC></LC><GR>-</GR><IM>25</IM><NA>XP</NA>
<CS>English (United States)</CS><OS>Windows XP</OS>
<MEM>1024MB</MEM><CPU>2200 MHz</CPU><NET>LAN</NET>
<video>0</video><BZ>-</BZ>
The leaked information includes the IP address, OS version, country, and other machine properties. We observed several different instances in the above format with only slight variations in the field types and order. Another example of custom infection notification, observed only on port 6346, looked as follows:
105|OnConnect|United States|SYSTEM|XP - SYSTEM|0.0.0.0|Not Detected|4.0.4 (BAZ)|United States|OnConnect|
In this case, different fields are separated using pipe characters. There were also cases in which most of the contents of the captured stream consisted of binary data, with some interspersed ASCII strings representing machine information.
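Both report formats are easy to decompose once captured. The sketch below shows one way to parse them; the tag and field semantics are inferred from the captured data rather than from any documented protocol:

import re

def parse_tagged(report):
    """Extract <TAG>value</TAG> pairs, treating tag names case-insensitively."""
    pairs = re.findall(r"<([A-Za-z]+)>(.*?)</\1>", report, re.S | re.I)
    return {tag.upper(): value for tag, value in pairs}

def parse_piped(report):
    """Split a pipe-delimited notification into its fields."""
    return [field.strip() for field in report.strip().strip("|").split("|")]

print(parse_tagged("<SYS>Windows XP 5.1</SYS><MEM>1024MB</MEM>"))
# {'SYS': 'Windows XP 5.1', 'MEM': '1024MB'}
print(parse_piped("105|OnConnect|United States|SYSTEM|XP - SYSTEM|0.0.0.0|"))
# ['105', 'OnConnect', 'United States', 'SYSTEM', 'XP - SYSTEM', '0.0.0.0']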
Moving from reporting successful installations to exfiltrating more sensitive information is the next logical step in the malware life cycle. Many responder sessions contained signs of data exfiltration, including browser history files and stored passwords, usually captured by keyboard loggers or browser hooks. As one of the email subjects -- Vip Passw0rds -- indicates, SMTP is one method of achieving this goal. We observed several emails sending back stored passwords from the compromised machine.
The large number of POST requests in Figure 5 suggests that HTTP is also employed for sending sensitive information back to data collection servers. Moreover, almost all of the observed FTP sessions corresponded to uploads of harvested data. The malware connected to our FTP responder, supplying a login and password, and started uploading data. We were able to analyze the stolen information by accessing some of the FTP servers that were still operational using the malware's credentials.
In the following, we give a few examples of the types of information we found. Some of the servers functioned as drop boxes for exfiltrated email addresses from users' address books, organized in separate files according to the name of the computer from which they were harvested. One server had 4,729 files containing more than 250,000 addresses, all dated within two days of our inspection. This indicates that the server's administrators collect and remove the information regularly. However, it also means that malware authors have a supply of valid email addresses and even their social context readily available.
More sensitive information was found in extensive logs periodically uploaded by malware, containing the victim's IP address, DNS server, gateway, MAC address, and username, as well as the URL and intercepted form and password fields of any HTTP request made by the user's machine. We analyzed over 250 MB of logs from a single server, extracting user names and passwords of 500 users for over 250 unique sites, including myspace.com, yahoo.com, live.com, and google.com, as well as many online banking sites. Banking credentials can be monetized easily, while even the recently introduced two-factor authentication that relies on secure cookies cannot provide complete protection, since the adversary may decide to steal the cookie file, too. On the other hand, credentials to web content management systems can be used to turn more web servers into malware infection vectors.
Self propagation, reporting, and data exfiltration are disconcerting, but a more troubling aspect of malware lies in its ability to connect a compromised system to a network of bots that can be collectively controlled by a single entity. Command and control channels can be implemented using either well-known application-level protocols, such as HTTP and IRC, or through custom communication mechanisms. In the following, we give an overview of the different botnets we encountered.
Internet Relay Chat provides the basis for the most commonly studied type of C&C communication. By joining a predefined channel, each bot can receive commands from its author and send back collected information. We observed IRC sessions to 90 servers, utilizing 1,587 different nicknames in 95 channels. Table 3 presents the most frequently contacted IRC servers. As we can see from Figure 6, most of the IRC sessions use servers bound to well-known IRC ports, although there is a considerable number of IRC connections to non-standard ports.
We observed that some malware families use seemingly regular nicknames and channels on public and sometimes popular IRC servers. This saves the burden of running a dedicated IRC server, while at the same time offers some degree of concealment among legitimate users. On the other hand, we found cases using artificial nicknames, e.g., [0]USA|XP[P]152102 or Inject-2l087876, that are usually unique to the infected host, and sometimes provide further information, such as the victim's IP address and OS.
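Such artificial nicknames follow recognizable templates, so simple pattern matching can separate them from ordinary users. The sketch below generalizes from the two examples above; it is an illustrative heuristic, not a classifier that covers every malware family:

import re

# Patterns generalized from the captured examples; purely illustrative.
BOT_NICK_PATTERNS = [
    re.compile(r"^\[\d\][A-Z]{2,3}\|XP\[[A-Z]\]\d+$"),  # [0]USA|XP[P]152102
    re.compile(r"^Inject-[0-9a-z]+$"),                  # Inject-2l087876
]

def looks_like_bot(nick):
    return any(p.match(nick) for p in BOT_NICK_PATTERNS)

for nick in ("[0]USA|XP[P]152102", "Inject-2l087876", "alice"):
    print(nick, looks_like_bot(nick))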
Among the HTTP-based botnets we observed, a case of particular interest involved a botnet that was used for orchestrating large-scale spam campaigns. Each bot periodically downloaded ZIP-archives with instructions on participating in spam campaigns using HTTP requests like the following:
GET /g/FA3521-9EE5C0-69ED87 HTTP/1.1
Host: 208.72.169.153
X-Flags: 0
X-TM: 34
[...]
Each response contained a ZIP archive with nine files giving detailed instructions on how to participate in the spam campaign:

000_data22 - a list of domains and their authoritative name servers, used to form the sender's email address
001_ncommall - a list of common first names, used as part of the sender's email address
002_otkogo_r - a list of possible "from" names related to the subject of the spam campaign
003_subj_rep - a list of possible email subjects
004_outlook - the template of the spam email
config - a configuration file that instructs the bot how to construct emails from the data files, how many emails to send in total, and how many connections are allowed at a given time
message - the message body of the spam campaign
mlist - a list of email addresses to which to send the spam
mxdata - a binary file containing information about the mail-exchange servers for the addresses in mlist
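To illustrate how these pieces fit together, the following sketch shows how a bot could plausibly assemble one message from such an archive. The file formats and the template placeholders are simplified assumptions based on the file roles described above; the actual formats are dictated by the config file, which we do not model here.

import random
from pathlib import Path

def lines(path):
    return [l.strip() for l in path.read_text(errors="replace").splitlines()
            if l.strip()]

def build_message(campaign):
    campaign = Path(campaign)
    # 000_data22 lists domains with their authoritative name servers;
    # we assume the domain is the first whitespace-separated field.
    domains    = [l.split()[0] for l in lines(campaign / "000_data22")]
    first      = lines(campaign / "001_ncommall")  # sender first names
    from_names = lines(campaign / "002_otkogo_r")  # display "from" names
    subjects   = lines(campaign / "003_subj_rep")  # candidate subjects
    targets    = lines(campaign / "mlist")         # recipient addresses
    template   = (campaign / "004_outlook").read_text(errors="replace")
    body       = (campaign / "message").read_text(errors="replace")

    sender = f"{random.choice(first).lower()}@{random.choice(domains)}"
    # Hypothetical placeholder names; the real template syntax is unknown.
    return (template
            .replace("%FROM_NAME%", random.choice(from_names))
            .replace("%FROM_ADDR%", sender)
            .replace("%SUBJECT%", random.choice(subjects))
            .replace("%TO%", random.choice(targets))
            .replace("%BODY%", body))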
We downloaded about 700 such ZIP files, amounting to approximately 700,000 different email addresses. Table 4 shows the most frequently occurring email domains. We reported our findings to another researcher, who provided us with a set of 250 million email addresses captured from the same botnet in only 24 hours. We noticed that the most frequent domains captured by us within an hour did not completely overlap with the larger data set. This indicates that the email addresses are not handed out at random.
As part of infecting the system, the malware attempted to install a malicious kernel driver named ntio922.sys but failed. In case of a failed driver installation, the installer attempted to upload a small memory dump file containing a stack trace, presumably to help the malware authors debug their software. To us, this indicates a high level of sophistication on the part of the malware authors.
Taking a step back, we outline how these individual pieces might fit into a much larger puzzle. Figure 7 shows what we deem to be the life cycle of web-based malware. The malware is seeded to millions of users from compromised web servers that infect new visitors. The infected PCs are transformed into a platform for conducting large-scale electronic fraud.
Stolen address books are used to create databases containing hundreds of millions of email addresses. This information, together with the computational resources provided by the compromised machines, is currently being used to send a significant fraction of the email spam encountered on the Internet. With access to login credentials, the malware can potentially gain access to web servers, which can then be turned into additional malware delivery vectors. From this point of view, the life cycle of web-based malware may be self-perpetuating.
Virtual machines, virtual honeypots, and lightweight responders have been used by several researchers to capture and better understand attacks [9,2,12,14]. Pang et al. [8] used active responders emulating protocols associated with commonly exploited services to elicit attack payloads from darkspace traffic. Our light-weight responders are similar in that they respond to network connections initiated by adversaries, in our case malware running inside a virtual machine. However, we use them to analyze the post-infection behavior of malware, rather than to capture new attacks.
Our previous work [11,10] analyzed the maliciousness of a large collection of web pages. Although we provided some details on the prevalence of malware, we did not give any insights into the network activities of malware once installed on a system. While there is already significant research on malware analysis [7,4,15], our analysis focuses on a large collection of web-based malware and provides insights into the activities of currently deployed malware. Our approach is much less sophisticated than analysis systems that employ whole-system emulation and information flow tracking; nonetheless, even a very simple approach based on light-weight responders provides interesting insights when applied to a large collection of malware. CWSandbox [13] uses a similar approach, but unlike our system it hooks the malware and emulates protocols inside the virtual machine.
Although malware analysis has developed into its own research area, resulting in increasingly sophisticated analysis techniques, we showed that simple approaches motivated by low-interaction honeypots can yield a surprising amount of information on malware's activities. We explored the life cycle of web-based malware by employing light-weight responders to capture the network profile of infected machines. Our responders are capable of emulating protocols such as FTP, HTTP, IRC, and SMTP, as well as capturing payloads from any protocols not directly emulated.
We supported our findings by analyzing more than 448,000 responder sessions collected over a period of two months. Our analysis divided malware's behavior into three categories: propagation, data exfiltration, and remote control. Our in-depth investigation of these areas allowed us to explore several different aspects of malware's life cycle.
Besides notifying adversaries about compromised systems and exfiltrating sensitive data, web-based malware often joins compromised hosts into botnets. These botnets make use of traditional C&C communication via IRC, but also use other protocols such as HTTP to establish communication with a C&C server. One of the botnets we analyzed is currently responsible for a significant fraction of spam on the Internet and demonstrated surprising sophistication, even going as far as providing malware developers with memory dumps of failed installations.
In future work, we plan on extending the protocol emulation to more services and hope to increase our understanding of currently unclassified network communication. Furthermore, the light-weight responders may provide additional signals for determining whether a URL is indeed malicious, especially for cases where the process activity and malware scanning provide insufficient information.