There is No Free Phish:
Phishing is a form of identity theft in which an attacker attempts to elicit confidential information from unsuspecting victims. While in the past there has been significant work on defending against phishing, much less is known about the tools and techniques used by attackers, i.e., phishers. Of particular importance to understanding the phishers' methods and motivations are phishing kits, packages that contain complete phishing web sites in an easy-to-deploy format. In this paper, we study in detail the kits distributed for free in underground circles and those obtained by crawling live phishing sites. We notice that phishing kits often contain backdoors that send the entered information to third parties. We conclude that phishing kits target two classes of victims: the gullible users from whom they extort valuable information and the inexperienced phishers who deploy them.
Phishing is a major threat on today's Internet. In its most basic form, phishers create replicas of target web sites, such as on-line banking, auction, or e-mail pages. These copies are then deployed on publicly-accessible locations, by either acquiring web hosting space or exploiting vulnerable web servers. Finally, the phishers lure victims to visit their replicas and provide confidential information, such as usernames and passwords. This information is stored for later use or resale to third parties.
Phishing activity has rapidly changed in recent years: it evolved from an artisanal, small-scale process into a largely automated operation, involving multiple actors with well-defined roles. Tools are available to streamline the operation of creating the initial copy of the target web site, to add the code that collects sensitive information, and to simplify the configuration of the phishing web site (for example, by specifying who will have access to the phished information). Furthermore, various features have been introduced to make the phishing sites more stealthy or more resilient to take-down actions by affected targets.
Concurrently to these technical advancements, a number of changes to the “business model” of phishing have emerged. In particular, miscreants started to create phishing kits and offer them for sale. These kits are complete phishing web sites contained in a ready-to-deploy package. They are easy to use: the recipients of the stolen information can be configured by changing one line in the kit's code, and, in addition, some phishing kits even contain detailed usage instructions.
The most recent step in the commoditization of phishing was the distribution of free phishing kits. These kits are actively advertised and distributed at no charge. However, as the economist Milton Friedman would have pointed out, there is no free lunch in the underground economy. Often, free phishing kits hide backdoors through which the phished information is sent to recipients (probably the original kits' authors) other than the intended ones. In other words, far from being a display of generosity on the part of the authors, free phishing kits respond to rational economic motivations. That is, kits' authors minimize the effort and risks associated with deploying the phishing site and attracting victims, and maximize their return on investment by harvesting the work of unwitting users.
The main contribution of this paper is the detailed analysis of the phishing kits distributed for free on underground sites as well as those left on live phishing web sites. We focus on the structure of these kits and the backdooring mechanisms used by phishers. We think that this analysis is interesting from two points of view. First, it examines in detail some of the techniques employed in phishing kits and analyzes their technical sophistication. Second, our study sheds some light on the dynamics of the phishing community. It gives additional evidence of the current transformation of underground circles into for-profit organizations, ruled by economic principles, in which more experienced practitioners resort to treachery against newcomers. This shows that miscreants not only target unsuspecting regular users but also do not hesitate to attack fellow (or competing) phishers.
The goals of our analysis are:
We use two different sources to locate and obtain phishing kits. First, we search for “distribution sites,” which are sites that collect a number of kits and offer them for download. Some of these sites are openly advertised in the underground community on web forums and IRC channels. In this case, we directly access the distribution sites. We also noticed that distribution sites generally have a similar structure, and, in particular, the page through which the kits are downloadable has common elements (for example, the heading “Official Scam Pages Site”). By searching for such common elements in search engines, it is possible to locate additional sites.
Second, it is common for phishers to deploy a phishing site by uploading a kit to a web server. In some cases, however, after unpacking the kit, they forget to remove it. If the server allows the listing of directory contents, it is possible to locate and download the kit. This has the advantage of retrieving kits actively in use, and, thus, possibly identifying the current recipients of the phished information. To locate active phishing sites, we use two sources: the PhishTank database and an infrastructure that we set up to collect email spam traffic (spam trap).
After obtaining a phishing kit, we analyze it to determine the email addresses used to exfiltrate the phished information and to identify any possible backdoor.
The analysis that identifies recipient email addresses is automated. Each phishing kit is uploaded to a virtualized environment, consisting of an Ubuntu system equipped with the Apache web server and the PHP module. The kit is uncompressed inside the document root of the web server. Then, a browser instance is directed at the index page of the kit and used to fill in the information collected by the kit. At the end of this process, the kit sends one or more emails with the entered information.
The navigation of the phishing web site is performed using a script that leverages the Selenium library to programmatically control an instance of the Firefox browser. The script requests a page, parses its content, and identifies forms and input fields. It then applies various heuristics to fill each input field with appropriate values. This is necessary since phishing kits often enforce type constraints on some inputs. For example, password values generally have a minimum length and must contain both letters and numbers; credit card numbers have a well-defined length and, at a minimum, must pass the Luhn test. The phishing kit checks these constraints and refuses to complete its process (and disclose its email addresses) if they are not satisfied. Note that some of the tests performed are also implemented on the original web site, while others (e.g., the Luhn test or whether a credit card number belongs to a known credit card company) are inserted by the kit's authors. We recognize each input field's type by looking at the name attribute of the corresponding HTML element. The names generally indicate the intended use of the field, such as ssn (social security number) or cvv2 (card verification value).
After obtaining information from a victim, phishing kits often attempt to verify its validity. For example, to check the correctness of a victim's username and password, a kit may try to log in to the legitimate web site. The kit checks that this operation is successful by searching the page returned by the legitimate server for a specific set of words, e.g., “Welcome” or “Hello,” followed by a name. Furthermore, a kit may validate an email address by verifying (e.g., via the PHP getmxrr() function) that the address's domain defines at least one mail exchange (MX) DNS record. If the checks are not successful, the kit displays an error message and asks the victim to retype the wrong piece of information.
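The input-filling heuristics described above can be illustrated with a minimal Python sketch. It is not the script used in the study: the function names, the FIELD_HINTS table, and the filler values are hypothetical, and only the Luhn test itself is a standard algorithm.

```python
def luhn_ok(number):
    """Return True if the digit string passes the Luhn test."""
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def make_luhn_number(prefix, length=16):
    """Pad prefix with zeros and append the check digit that makes it Luhn-valid."""
    body = prefix.ljust(length - 1, "0")
    for check in "0123456789":
        if luhn_ok(body + check):
            return body + check
    raise ValueError("no valid check digit found")


# Hypothetical hints: a substring of the field name mapped to a filler value.
FIELD_HINTS = {
    "cvv": "123",
    "ssn": "123456789",
    "card": make_luhn_number("4"),
    "pass": "Passw0rd123",  # letters and digits, reasonably long
    "mail": "user@example.com",
}


def value_for_field(name):
    """Pick a filler value based on the name attribute of an input element."""
    lowered = name.lower()
    for hint, value in FIELD_HINTS.items():
        if hint in lowered:
            return value
    return "test"  # fallback for fields with no recognizable hint
```

A real driver would need many more hints (dates, phone numbers, issuer-specific card prefixes); the point here is only that the name attribute is enough to choose a plausible value.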
To automate our analysis process, we need to bypass these checks. Therefore, we configured the system to use a DNS server installed locally, which defines appropriate MX records and resolves all names to the local address 127.0.0.1. Thus, the DNS server effectively redirects all the HTTP requests made by a phishing kit (using domain names rather than IP addresses) to the local web server. The web server responds to all requests for non-existent resources (such as the login page of a banking web site) with a static HTML page that contains words typically searched for by phishing kits to validate credentials.
Finally, to facilitate the automatic analysis of a phishing kit, we perform a number of preprocessing steps that remove unwanted features from the kit. First, we rewrite links to always use normal HTTP connections rather than HTTPS connections. This prevents the browser from detecting errors in digital certificates and stopping its analysis to request the user's intervention. Second, the Selenium library works by loading the target site inside a frame in the current page. Therefore, we eliminate statements that “de-frame” the site (for example, assignments of the value of self.location to the top.location property), since they would prevent Selenium from working correctly.
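As an illustration, the two preprocessing steps could be approximated with regular expressions, as in the Python sketch below. The patterns are assumptions made for this example; kits may express frame-busting in other ways that these patterns would miss.

```python
import re

# Downgrade HTTPS links to plain HTTP (case-insensitive match on the scheme).
HTTPS_RE = re.compile(r"https://", re.IGNORECASE)

# Hypothetical pattern for a common frame-busting assignment, e.g.
#   top.location = self.location;
#   top.location.href = self.location.href;
DEFRAME_RE = re.compile(
    r"top\.location(?:\.href)?\s*=\s*self\.location(?:\.href)?\s*;?",
    re.IGNORECASE,
)


def preprocess(source):
    """Return the kit source with HTTPS links rewritten and de-framing removed."""
    source = HTTPS_RE.sub("http://", source)
    source = DEFRAME_RE.sub("", source)
    return source
```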
The second component of the analysis consists of a logging mechanism that collects all the emails sent and saves the recipient addresses in a database. To collect all email addresses that receive phished information, we modified the default configuration of PHP so that emails sent through the mail() function are handled by our custom program instead of the standard mail transport agent (sendmail). This custom program simply logs all emails. We also modified the implementation of the mail() function so that it passes to our handler additional, useful information, such as the file name and line number of the script where the function was invoked.
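One way to approximate such a logging handler, without modifying PHP itself, is to point PHP's sendmail_path configuration directive at a small program that reads the message from standard input. The Python sketch below shows only the recipient-extraction and logging logic; the function names and the log file format are hypothetical, and it does not capture the file name and line number that the modified mail() implementation provides.

```python
from email import message_from_string
from email.utils import getaddresses


def extract_recipients(raw_message):
    """Return the sorted, lower-cased addresses in the To, Cc, and Bcc headers."""
    message = message_from_string(raw_message)
    fields = (
        message.get_all("To", [])
        + message.get_all("Cc", [])
        + message.get_all("Bcc", [])
    )
    return sorted({addr.lower() for _name, addr in getaddresses(fields) if addr})


def log_recipients(raw_message, log_path):
    """Append one recipient per line to the log and return the list."""
    recipients = extract_recipients(raw_message)
    with open(log_path, "a", encoding="utf-8") as log:
        for address in recipients:
            log.write(address + "\n")
    return recipients
```

A wrapper script would call log_recipients(sys.stdin.read(), path) and exit successfully without delivering anything, so that the kit believes the email was sent.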
The last step of the analysis consists of identifying the backdoors hidden in the kit. We first discard email addresses that appear in clear in the source code of the kit. Any remaining address must have been obfuscated to covertly receive the phished information. For each of these addresses, we identify the location in the code where the corresponding email was sent (this information is recorded by the logging component). We manually inspect this location, identify the variable holding the destination address, and keep note of the technique used to obfuscate its value. For each obfuscation technique, we develop a signature. A signature consists of a pattern that matches the obfuscation code and a set of commands that recover the hidden email address. We use these signatures to statically identify obfuscation locations in an automatic way. More precisely, we apply the signature to each file in a kit: if the pattern matches, the hidden email address is automatically recovered and saved in a database.
Finally, we compare the email addresses identified statically with those collected by our analysis environment. If there is a mismatch, that is, we cannot statically locate all email addresses that were recorded by navigating the phishing kit, we repeat the manual analysis, identify a new obfuscation technique, and extend the set of recognized obfuscation signatures.
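The signature-and-compare loop can be sketched in Python as follows. The single signature shown (a base64_decode() call with a literal argument) and all function names are illustrative assumptions, not the signatures developed in the study.

```python
import base64
import binascii
import re

# Hypothetical signature: base64_decode("...") with a literal string argument.
B64_CALL_RE = re.compile(r"""base64_decode\(\s*["']([A-Za-z0-9+/=]+)["']\s*\)""")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def recover_base64_addresses(php_source):
    """Apply the signature: match the pattern, then decode the hidden address."""
    found = set()
    for match in B64_CALL_RE.finditer(php_source):
        try:
            decoded = base64.b64decode(match.group(1), validate=True).decode("ascii")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64 or not text; ignore this match
        found.update(address.lower() for address in EMAIL_RE.findall(decoded))
    return found


def unexplained_addresses(logged, clear_text, recovered):
    """Addresses seen at run time that no clear-text scan or signature explains."""
    return set(logged) - set(clear_text) - set(recovered)
```

A non-empty result from unexplained_addresses() corresponds to the mismatch case: some obfuscation technique is not yet covered and requires another round of manual analysis.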
We collected phishing kits for two months, starting in April 2008. In total, we obtained 584 kits. All kits were written in the PHP language. We believe phishers use PHP since it is supported by most web servers and is typically enabled by hosting providers.
We manually identified 21 distribution sites from which we obtained a total of 414 kits, 379 of which were distinct, as determined by computing their MD5 digests. 26 kits were not working because of errors, such as a missing file or a syntax error.
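Deduplication by digest is straightforward; a minimal Python sketch, assuming each kit is stored as a single archive file, is shown below. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def distinct_archives(paths):
    """Map each digest to the first archive path that produced it."""
    seen = {}
    for path in paths:
        seen.setdefault(md5_of_file(path), path)
    return seen
```

Note that this only detects byte-identical archives: two kits that differ in a single configured email address count as distinct.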
The identification of kits on active phishing sites was completely automated. We downloaded 15,770 reports from the PhishTank database. Notice that this database contains noisy data: it has duplicated entries, misclassified sites, and incorrect URLs. Therefore, we performed various preprocessing steps to eliminate undesired data. We removed 8 entries that referred to incorrect URLs (e.g., with misspelled protocol schemes, such as htps), 192 entries referring to pages hosted on sites known to be legitimate (e.g., natwest.com), and 3,003 (19%) that use wildcard DNS entries to point at the same resource through different URLs. This left us with 12,567 reports. 1,075 of these (about 8%) referred to phishing sites that were still on-line and allowed directory listing when we accessed them. We consider a phishing site to be live if it has an index page that contains (or redirects to a page that contains) a form with at least one input of type “password.” From these sites, we gathered 151 kits. In other words, about 15% of the open listing sites contained phishing kits. One additional kit was obtained from our spam collection infrastructure. In the following, we refer to these kits as “live kits.” All live kits were unique. Two kits contained errors that prevented their correct execution. One had an invalid directive in a .htaccess file, the other contained syntax errors in the code used to transmit the phished information. Thus, our data set contained a total of 503 distinct phishing kits. Table 1 summarizes the results of our analysis.
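The liveness criterion (an index page with a form containing at least one password input) can be checked with a small HTML parser. The Python sketch below uses only the standard library; the class and function names are hypothetical, and following redirects is left to the caller.

```python
from html.parser import HTMLParser


class PasswordFormFinder(HTMLParser):
    """Record whether any <form> contains an <input type="password">."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            if (dict(attrs).get("type") or "").lower() == "password":
                self.found = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def has_password_form(html):
    """Return True if the page contains a form with a password input."""
    finder = PasswordFormFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```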
Targeted organizations. The collected phishing kits targeted a total of 49 organizations, mostly banks and auction sites, but also mail providers and video game portals. The five most common targets of kits found on distribution sites were Bank of America (21 kits), eBay (19), Wachovia (18), HSBC (18), and PayPal (15). Among the 21 organizations targeted by live kits, the five most frequent ones were PayPal (63 kits), followed by Halifax (19), Bank of America (14), Wells Fargo (9), and Royal Bank of Scotland (8). Most of the kits contained files for only one target organization. In fact, we found only two kits that contained copies of multiple target sites (9 in both cases).
Drop mechanisms and backdoors. The information exfiltrated by a phishing kit to phishers is often called a drop. The vast majority of kits use email to transmit drops. Only two live kits stored drops in a file on the compromised server, and only one sent it to an outside server through a POST request.
We consider a kit to be backdoored if it sends the phished information to addresses other than those found in clear in the kit's code. We found 129 of the kits from distribution sites (slightly more than one third) to be backdoored. Among live kits, 61 (40%) are backdoored. Of these, 20 send the phished information to addresses also found in 8 kits obtained from distribution sites. Assuming that authors and users of kits are different individuals, this shows that backdoors are effective. That is, in a significant number of cases, they do not appear to be detected. At the same time, it seems that, when identified, backdoors are updated to send the stolen information to new recipients.
From our automated analysis of the 503 phishing kits, we extracted 379 unique email addresses. They are registered at 60 different domains: gmail.com is the most frequently used (49%), followed by yahoo.com (18%) and hotmail.com (3%). Only 7 addresses are hosted at domains that do not host free mail providers. At least one address was clearly mistyped (the top-level domain was .comr instead of .com). Among the addresses obtained from live kits, 101 were present in multiple kits.
Infrastructure. In the case of live kits, it is interesting to investigate the techniques used to obfuscate the URL pointing to the phishing site. We use the classification proposed by Garera et al.: type I URLs use an IP address in place of the hostname; type II URLs contain a valid-looking domain name and insert the name of the organization being phished in the path; type III URLs include the organization name in the hostname and follow it with a long string; type IV URLs have no apparent relationship with the phished organization. It can be argued that type III URLs are likely to correspond to domains that were explicitly registered to host a phishing site, while type I, II, and IV URLs are more likely to correspond to vulnerable sites (for example, running web applications containing vulnerabilities) that were compromised and used to host phishing pages.
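A rough approximation of this four-way classification is sketched below in Python. It is a simplification made for illustration: it takes the organization name as an input, treats any hostname containing that name as type III without checking for the trailing long string, and would therefore also label the legitimate site itself as type III. The function name is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_url(url, organization):
    """Return "I", "II", "III", or "IV" for a phishing URL and its target name."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    org = organization.lower()
    try:
        ipaddress.ip_address(host)
        return "I"  # an IP address is used in place of the hostname
    except ValueError:
        pass
    if org in host:
        return "III"  # organization name embedded in the hostname
    if org in parts.path.lower():
        return "II"  # organization name appears only in the path
    return "IV"  # no apparent relationship with the organization
```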
Of the 12,567 links that we analyzed, 5% were of type I, 23% of type II, 34% of type III, and 38% of type IV. Live kits were found on type I sites (7%), type II (63%), and type IV (30%). We do not have a definite explanation as to why no kits were found on type III domains. However, since the setup of these domains requires a certain level of planning and technical sophistication, it is plausible that they are primarily used by experienced phishers, who are more effective at hiding their tools and covering their tracks.
Furthermore, 17% of type III URLs resolved to more than one IP address, an indication of the use of fast-flux techniques to improve the life-time of an attack campaign [10, 17].
Finally, on 39 of the live phishing sites, we found PHP shells, which are tools used by attackers to remotely control the vulnerable machine. This hints at the possibility that the same compromised server is used to carry out a number of other malicious activities.
Limitations. The main threat to the validity of the statistics presented above is the problem of the “coverage” of the examined kits, i.e., the variety of the recovered kits. Of course, no methodology can guarantee the recovery of all kits used by phishers. However, we adopt several techniques to maximize the chances of observing the largest possible number of kits.
With regard to kits obtained from distribution sites, we monitored a variety of underground forums where phishing techniques and tools are openly discussed.
Live kits pose a number of challenges. First, live sites have to be identified. To do this, we leverage the PhishTank database, which is considered the “most complete and timely” repository of phishing reports. Second, it is well-known that phishing sites generally have short life-spans. Thus, we aggressively query the PhishTank database and visit a reported URL in a matter of seconds from its recording, without waiting for the validation process to complete.
Phishing kits contain two types of files: those needed to display a copy of the targeted web site, and the scripts used to save the phished information and send it to phishers.
PHP scripts included in the kit handle the forms used to phish information. These scripts collect the provided information and send it to the phisher. As we have seen, drops are almost always transmitted using email. We conjecture that this is because, of all transmission methods, email does not require any additional infrastructure, does not force the attacker to visit the phishing site after the initial seeding, and is as reliable as the mail provider chosen by the phisher. Destination addresses are most often configured by setting a variable in one of the scripts. In three kits, addresses were obtained by requesting a page on a third-party site. In one case, the site was inaccessible. In the remaining two cases, it returned an obfuscated email address.
The code to transfer the phished information to the scammer consists of a few lines of PHP code, which define variables used to store the recipient address, subject, content of the email, and optional headers. The actual mail transmission is performed using the built-in mail() function. Often, comments instruct the phishers how to set their email address in the appropriate place in the code.
The goal of planted backdoors is to send the phished information to recipients other than the intended one. We describe here the various obfuscation techniques used to hide the presence of backdoors. Additional examples are provided in the Appendix.
One requirement of backdoors is to hide or obfuscate email addresses so that they are not immediately identifiable by manual inspection or pattern matching. To do so, kit writers use a variety of techniques, ranging from standard encoding and compression algorithms to simple, custom cryptographic methods.
Base64-encoding is a popular obfuscation choice. The email address is encoded using its base64 representation and the built-in base64_decode() function is used to retrieve its original value. Another commonly-used encoding is ASCII. In this case, the address is obfuscated by substituting each character with the corresponding ASCII value, typically in hexadecimal format. A function mapping a value to the corresponding character (e.g., the built-in pack() function) is then used to recover the email address. Code examples for these techniques are shown in the Appendix.
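Both encodings are trivially reversible. The Python sketch below mirrors the effect of PHP's base64_decode() and of pack() with the H* format on a hexadecimal string; the sample address and function names are placeholders chosen for this example.

```python
import base64


def decode_base64_address(encoded):
    """Reverse a base64-obfuscated address."""
    return base64.b64decode(encoded).decode("ascii")


def decode_hex_address(hex_digits):
    """Reverse an address stored as a string of hexadecimal character codes."""
    return bytes.fromhex(hex_digits).decode("ascii")


# Placeholder address, encoded both ways for demonstration.
sample = "user@example.com"
as_base64 = base64.b64encode(sample.encode("ascii")).decode("ascii")
as_hex = sample.encode("ascii").hex()
```

Because neither encoding involves a secret, an address hidden this way is protected only from casual inspection and simple text searches.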
Among custom techniques, obfuscations based on Caesar ciphers are popular. Each letter of the email address is replaced with the letter that is some fixed number of positions further down in the alphabet. Another common technique is the use of simple permutations. The following snippet is used to obfuscate the address firstname.lastname@example.org:
Less frequently (it occurred in three of the kits we obtained), additional email addresses are obtained by downloading a file from a second web site. Also in this case, ASCII encoding is used as an obfuscation mechanism:
After applying the pack() function on the long numeric string, one obtains http://freescams.3x.ro/email.php. The URL is then retrieved using the built-in function file_get_contents(). Its content is decoded, again using pack(), and the resulting email addresses are ready to be used.
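The custom schemes mentioned in this section, Caesar shifts and simple permutations, are equally easy to reverse once recognized. The following Python sketch illustrates both; it is not derived from any specific kit, and the function names are hypothetical.

```python
import string


def caesar_shift(text, shift):
    """Shift ASCII letters by `shift` positions, wrapping around the alphabet."""
    shifted = []
    for char in text:
        if char in string.ascii_lowercase:
            shifted.append(chr((ord(char) - ord("a") + shift) % 26 + ord("a")))
        elif char in string.ascii_uppercase:
            shifted.append(chr((ord(char) - ord("A") + shift) % 26 + ord("A")))
        else:
            shifted.append(char)  # digits, '@', and '.' are left untouched
    return "".join(shifted)


def permute(text, order):
    """Build a string by reading `text` at the positions listed in `order`."""
    return "".join(text[index] for index in order)


def invert(order):
    """Return the permutation that undoes `order`."""
    inverse = [0] * len(order)
    for new_position, old_position in enumerate(order):
        inverse[old_position] = new_position
    return inverse
```

Undoing a Caesar shift of k is a shift of -k, and undoing a permutation is an application of its inverse, so an analyst who identifies the scheme recovers the address immediately.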
A second goal of backdoors consists of creating new, hidden drops, i.e., covertly sending emails with the phished information to addresses different than the intended ones. Also in this case, various techniques are used to divert suspicion.
Simple misspellings may be enough to evade superficial analyses. For example, the following piece of code saves the phished information in the message variable, which will then be used as the body of the email. However, intermixed with this code, a second variable, named messege, is also initialized. It will contain an email address, email@example.com, that will be used as the recipient parameter of a second mail() invocation. Besides the misspelling, this backdoor also uses the fact that the PHP interpreter automatically initializes undefined string variables (as messege here) to the empty string to blend in with the normal code.
A similar, simple trick is used by the following backdoor. Here, the code leverages the fact that PHP is case-insensitive for function names, but case-sensitive for variable names. Thus, the apparently repeated mail statements have, in reality, two different recipients.
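Both tricks rely on variable names that a reader is likely to confuse. As a detection aid, the Python sketch below flags pairs of PHP variable names that differ only in case or by a single edit; the regular expression, the threshold, and the function names are assumptions made for this example, and very short names will produce false positives.

```python
import re
from itertools import combinations

VAR_RE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def edit_distance(first, second):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(second) + 1))
    for i, char_a in enumerate(first, start=1):
        current = [i]
        for j, char_b in enumerate(second, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def suspicious_pairs(php_source, max_distance=1):
    """Return pairs of variable names that are easy to mistake for each other."""
    names = sorted(set(VAR_RE.findall(php_source)))
    pairs = []
    for first, second in combinations(names, 2):
        same_ignoring_case = first.lower() == second.lower()
        if same_ignoring_case or edit_distance(first, second) <= max_distance:
            pairs.append((first, second))
    return pairs
```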
More sophisticated obfuscation techniques are based on PHP features such as dynamic code creation (through the create_function() function) and evaluation (through the eval() function). In this case, the text of the PHP code that is used to covertly send the email is divided into multiple substrings, which are hidden in unusual locations of the phishing kit. For example, they are disguised as comments or attribute values in an HTML file. At run-time, these strings are extracted from the file and composed together. The resulting string, i.e., the backdoor's program, is dynamically evaluated and the email is sent. An example of this technique is reported in the Appendix.
Phishing kits make extensive use of simple social engineering techniques, in the form of deceptive comments in the code, to divert the attention of a kit's user from a backdoor or to prevent modifications that may disable it. For example, in several kits, the part of the script that transmits the phished information is preceded by the comment:
In other cases, comments sound outright sarcastic. In one instance, the indexes of the array used in a permutation-based obfuscation read “good for your scam.”
Our study is related to two main areas of research: phishing and information security economics. We also report on phishers using treacherous techniques against fellow attackers. While there is a large literature on these subjects, for reasons of space, we will provide here just a brief overview of the proposed approaches and techniques.
Phishing. Phishing has been the subject of much work in recent years. A first line of research has focused on describing the techniques and the psychological processes that make phishing a successful attack [3, 4].
A second area of work consists of the design and implementation of methods to prevent phishing attacks. Some of these techniques are automatic and are based, for example, on the filtering of web page contents, the restriction of information flow [13, 25, 34], or the obfuscation of confidential information. Other prevention techniques require some form of user cooperation, in the form, for example, of reaction to visual cues in the browser [2, 9], or the use of external trusted devices. Several studies have pointed out the limitations of approaches that require human intervention [11, 27, 33].
The detection of spoofed web sites has also received considerable attention, and various techniques have been proposed, based, for example, on the measurement of visual similarity between web pages, anomaly detection techniques, or information retrieval approaches.
A number of studies have focused on the operational aspects of phishing, for example, the impact of take-down actions, the infrastructure used for hosting phishing pages, and the effectiveness of manual assessment of phishing reports.
Finally, new attack vectors have been discussed, for example, the use of homographic domains, picture-in-picture browsers, and trojaned routers.
Different from these studies, our work describes in detail the kits used by phishers, one of the fundamental tools of attackers. We also discuss the techniques used in kits to verify the stolen information and transmit it to fraudsters.
Underground economy. Several recent studies have characterized the cyber underground community and explored its economic behavior, in particular, its shift from a reputation-based society into a profit-driven economy [5, 29].
Treachery. The use of treachery on the part of attackers has received only limited attention so far. Franklin et al. observe that administrators of IRC channels used by fraudsters seem to offer fallacious commands (e.g., to check the validity status of a credit card) to steal sensitive data from naive participants. The use of backdoors by phishers has been reported before in blogs and other online forums. Backdoors inserted into exploit tools have also been found in the past, e.g., in the Sub7 trojan and the more recent anti-CNN tool. Finally, there has been anecdotal evidence of all-out attacks among rival gangs.
In this paper, we provide a more comprehensive report on the use of treachery among phishers, discussing in detail the techniques used to hide backdoors in phishing kits.
The most effective tools available to phishers are phishing kits. These are packages that contain a complete phishing site ready to be deployed on a public web server. In this paper, we have analyzed a large collection of phishing kits obtained from a variety of sources and discussed the kits' technical characteristics. We have also observed that many kits contain backdoors that transmit the phished information to third parties. This work is the first systematic analysis of the different techniques used by kit writers to steal from phishers.
This work has been supported by the Austrian Science Foundation (FWF) under grant P-18764, the FIT-IT Pathfinder Project, Secure Business Austria (SBA), and the National Science Foundation, under grants CCR-0238492, CCR-0524853, and CCR-0716095.
We provide here some additional examples of the obfuscation techniques used in phishing kits.
The following snippet of code shows how base64-encoding is used to hide email addresses. The code defines a new, hidden parameter, Send, which will be used as destination of an email. Its value is the base64-encoding of the email address Mr-Brain@Evil-Brain.Net.
The following code is an example of obfuscations that use ASCII encoding:
The code scans the contents of the login.php file for the pattern 329 and extracts the subsequent 46 bytes (in this case, 70696f6e6565722e627261696e40676d61696c2e636f6d). Then, the standard function pack() interprets this string as a sequence of hexadecimal character codes and decodes them, revealing the address firstname.lastname@example.org. Notice how misspelling is used to disguise the variable erorr for the legitimate error variable.
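The extraction logic described here (find a marker, take a fixed number of bytes after it, decode them as hexadecimal character codes) can be reproduced in a few lines. The Python sketch below is an illustration with hypothetical function names; the default marker and length follow the description above.

```python
def extract_after_marker(contents, marker, length):
    """Return the `length` characters that follow the first `marker`, or None."""
    position = contents.find(marker)
    if position == -1:
        return None
    start = position + len(marker)
    return contents[start:start + length]


def decode_hidden_address(contents, marker="329", length=46):
    """Recover an address stored as hex digits right after `marker`."""
    hex_digits = extract_after_marker(contents, marker, length)
    if hex_digits is None or len(hex_digits) != length:
        return None
    try:
        return bytes.fromhex(hex_digits).decode("ascii")
    except ValueError:
        return None  # not hexadecimal, or not ASCII text
```

Run against the contents of the file holding the hidden data, decode_hidden_address() returns the same string that the kit passes to mail() at run time.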
Finally, the following case demonstrates the use of dynamic evaluation in PHP to covertly send emails:
The function geterrors() is called towards the end of the script, right before error checking is performed. Despite its name, it has a very different task than checking for errors. To understand its real behavior, we need to examine the functions that it invokes. The function getc() returns the contents of the file passed as its only parameter. The function gets() searches for a pattern (specified as its first parameter) in the file details.php and returns the string following this pattern. The function end_of_line() uses getc() and gets() to extract the strings pack (the search pattern is "(2), the string h* (via the pattern (3,) and the string images/style_002.css (through the pattern (((). Similarly, the function clean() extracts the string eval (the search pattern is (1,), creates a function that evaluates its only parameter, and returns the result of applying it to the parameter str. Finally, the function geterrors() combines all these subroutines to obtain:
The file images/style_002.css apparently contains legitimate CSS data, except for a section in the middle of the file that resembles a long alphanumeric string. After applying pack() to the file's contents, one obtains a long string containing unprintable characters at the beginning and at the end. The central section of the file is instead transformed into a snippet of PHP code that, when evaluated by the eval() function, emails the phished information to two additional addresses.