USENIX 2006 Annual Technical Conference Refereed Paper
Cutting through the Confusion:
A Measurement Study of Homograph Attacks
Tobias Holgers, David E. Watson, and Steven D. Gribble
Department of Computer Science & Engineering
University of Washington
1 Introduction
Domain names are crucial to the usability of the Web, but the same
characteristics that make them useful to people also make them
vulnerable to attack. When a user follows a hyperlink, the domain
name within the URL provides her with the first and most important
indication of the identity of the organization with which she will
interact. If the user is fooled into misreading a domain name, she
will believe she is interacting with one organization, but she might
actually be interacting with an attacker. By spoofing the content of
the user's intended destination, the attacker might trick the user
into revealing sensitive information. In this scenario, SSL is no
help to the victim, since the attacker could obtain a valid
certificate for the confused domain name.
A homograph attack is one technique for carrying out this
scheme. A homograph is a letter or string that is visually confusable
with a different letter or string. For example, using most sans-serif
fonts, the Latin letter l (lower case 'el') is visually confusable
with the Latin letter I (upper case 'eye'). Rendered with such a
font, the following are confusable, if not indistinguishable:
paypal.com and paypaI.com.
An attacker who registers the confusable domain name
paypai.com
therefore may be able to lure victims to their site, for example by
sending spam that appears to contain a hyperlink to the authoritative PayPal site.
Web homograph attacks have existed for some time, and the recent
adoption of International Domain Name (IDN) support by browsers and
DNS registrars has exacerbated the problem [Gabr02]. Many
international letters have similar glyphs, such as the Cyrillic letter
р (lower case 'er,' Unicode 0x0440) and the Latin letter p.
Because of the large potential for misuse of IDNs, browser vendors,
policy advocates, and researchers have been exploring techniques for
mitigating homograph attacks [Mozi05, Appl05, Oper05, Mark05].
The problem has received plenty of attention recently, but we are
not aware of any data that quantifies the degree to which Web
homograph attacks are currently taking place. In this paper, we use a
combination of passive network tracing and active DNS probing to
measure several aspects of Web homographs. Our main findings are
four-fold.
First, many authoritative Web sites that users visit have several
confusable domain names registered. Popular Web sites are much more
likely to have such confusable domains registered.
Second, registered confusable domain names tend to consist of single
character substitutions from their authoritative domains, though
we saw instances of five-character substitutions. Most
confusables currently use Latin character homographs, but we
did find a non-trivial number of IDN homographs.
Third, Web sites associated with non-authoritative confusable domains
most commonly show users advertisements. Less common functions
include redirecting victims to competitor sites and spoofing the
content of the authoritative site.
Fourth, during our nine-day trace, none of the 828 Web clients we
observed visited a non-authoritative confusable Web site.
Overall, our measurement results suggest that homograph attacks
currently are rare and not severe in nature. However,
given the recent increases in phishing incidents, homograph attacks
seem like an attractive future method for attackers to lure users to
spoofed sites.
2 Homographs and Confusability
As previously mentioned, a homograph is a letter or string that has
enough of a visual similarity to a different letter or string that the
two may be confused for one another. The precise degree of similarity
necessary to cause confusion is difficult to quantify, as it depends
on the observer, the fonts and font sizes used, and the context in
which the homograph is observed.
There are many different categories of confusable characters. They
may be drawn from the same script, such as the Latin characters '-'
(hyphen) and '–' (en dash). Different scripts may be involved, such
as with the Latin character a and the Cyrillic character
а (small letter a). Font choices can affect confusability;
the Latin characters 'rn,' if rendered with a sans-serif font appear
as rn and can be confused with the Latin character 'm.'
Two characters with very different glyphs may appear to be identical
if a browser does not have support for one of them. For example, an
ä ('a' with an umlaut) might be rendered without the umlaut.
Further compounding the problem is the fact that confusable characters
do not need to be used when constructing confusable strings. The word
recieve may be confused with receive, and even more
complicated misspellings may be overlooked by a casual observer.
Given all of this complexity, in this paper we do not attempt to
establish perceptual thresholds of confusability and rigorously
examine all possible confusable characters. Instead, we make the
simplifying assumption that two characters are confusable if and only
if they are listed as confusable in the Unicode Technical Report on
security considerations [Davi05].
This assumption gives us only a rough approximation to the real-world
notion of confusability. However, as we will show in
Section 3, many registered domain names do have
registered confusable domains that consist of such character substitutions.
In most cases, these confusable domains do not have a legitimate
purpose.
With this assumption in place, we can operationally define the
confusability of two strings: one string is confusable with another
string if and only if they are related by some number of confusable
character substitutions. As an example, consider the following
string, which contains four character substitutions:
The underlined characters are Cyrillic confusables of their Latin
character counterparts. For this particular string, the set of all
confusable strings related to it is enormous (32,459,975,614,080),
since most of the characters in the string have at least one
confusable character associated with them, and we must consider all
possible one, two, three, ..., twenty-one character substitutions.
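The size of this space follows from a simple product: if the character at position i has k_i confusable alternatives, that position can either be left alone or replaced in k_i ways, so the number of strings is the product of (1 + k_i) over all positions. The following is a minimal Python sketch of that count. The table CONFUSABLES is a tiny hypothetical stand-in for the much larger list in [Davi05], and the function name is ours; the sketch excludes the unmodified string from the count, which is a convention we assume rather than one stated in the text.

```python
from math import prod

# Hypothetical, heavily abbreviated confusables table, for illustration only.
# Keys are Latin characters; values are characters assumed confusable with them.
CONFUSABLES = {
    "a": ["\u0430"],        # Cyrillic small letter a
    "p": ["\u0440"],        # Cyrillic small letter er
    "l": ["I", "1"],        # upper case 'eye', digit one
    "o": ["0", "\u043e"],   # digit zero, Cyrillic small letter o
}

def count_confusable_strings(name, table=CONFUSABLES):
    """Count strings reachable from `name` by one or more substitutions."""
    # Each position either keeps its character or takes one of its alternatives.
    choices = [1 + len(table.get(ch, [])) for ch in name]
    # Subtract one to exclude the unmodified string itself.
    return prod(choices) - 1
```

Because the count is a product, it grows exponentially with the length of the string, which is why even a short host name can have an enormous set of related confusable strings.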
Increasing the number of confusable character substitutions in a
string tends to make the string less confusable. Accordingly, in
practice confusable strings tend to contain only one or two
substitutions, though as we will show in
Section 3, some popular domains have
registered confusables with up to five character substitutions.
In this paper, we examine a simple kind of homograph attack, in which
an attacker registers a domain name that is confusable with some other
domain name, presumably to lure victims to their site. In principle,
two registered domains may both be associated with legitimate
organizations, yet still be confusable with each other. In practice,
a given set of confusable domain names tends to consist of a single
authoritative domain, and a collection of non-authoritative,
illegitimate domains. Though authoritativeness is a subjectively
defined characteristic, we have found that in all cases it is simple to
distinguish between the authoritative domain that people intend to
visit, and the non-authoritative confusables that attackers create.
3 Measurement Study
We gathered a nine-day-long trace of the Web activity generated by the
population of clients in the Department of Computer Science and
Engineering at the University of Washington. The department consists
of approximately 40 faculty, 40 staff, 275 graduate students, and 450
undergraduate students.
There is a mixture of static IP assignment and DHCP usage in the
department, but the majority of hosts that rely on DHCP receive the
same IP address in practice. Accordingly, the number of IP addresses
we observed in the trace, 828, is a reasonable (though not perfect)
estimate of the number of hosts that were active during the trace
period.
We installed a passive network tap on the router connecting the
departmental subnets to the campus backbone. This tap allowed us to
observe all packets flowing between department computers and
external hosts. The peak traffic rate through the router was low
enough that our network monitoring host dropped no packets.
Using Snort,
we collected a trace consisting of all outbound HTTP GET requests. We
post-processed the trace to extract the domain name associated with
each request. To perform this extraction, we looked in the "Host"
HTTP header field; this field is required in HTTP/1.1, and is
generated by all modern browsers. Using this field saved us from
having to perform reverse DNS lookups, and it also allowed us to
disambiguate between multiple domains hosted on the same IP address.
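As an illustration of this extraction step, the following Python sketch pulls the "Host" value out of a raw HTTP request. It is not the post-processing code used in the study; the function name extract_host is hypothetical, and the sketch assumes the full request header is available as a byte string and does not handle IPv6 literal hosts.

```python
def extract_host(request_bytes):
    """Return the lower-cased Host header value of a raw HTTP request, or None."""
    # Headers end at the first blank line; ignore any body after it.
    head = request_bytes.split(b"\r\n\r\n", 1)[0]
    lines = head.split(b"\r\n")
    for line in lines[1:]:  # skip the request line, e.g. "GET / HTTP/1.1"
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"host":
            host = value.strip().decode("ascii", errors="replace").lower()
            # Drop an optional port suffix such as ":8080"
            # (assumption: no IPv6 literal hosts appear in the trace).
            return host.split(":", 1)[0]
    return None
```

Header names are matched case-insensitively, and the host is lower-cased because domain names are case-insensitive.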
Given this list of domain names, we calculated the popularity of a
domain name by counting the number of GET requests directed to it. To
transform this per-object popularity measure into an approximate
per-page popularity measure, we excluded requests for image data
types, since otherwise a single page containing many embedded images
would have a higher contribution to domain popularity than a single
page containing few embedded images.
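The popularity computation described above can be sketched in a few lines of Python. This is only an illustration: the text does not say how image requests were recognized, so the sketch assumes they can be identified by file extension, and both IMAGE_EXTENSIONS and domain_popularity are hypothetical names.

```python
from collections import Counter

# Assumed list of image extensions; the study's actual filter is not specified.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico")

def domain_popularity(requests):
    """Count non-image GET requests per domain.

    `requests` is an iterable of (host, path) pairs taken from the trace.
    """
    counts = Counter()
    for host, path in requests:
        # Strip any query string before checking the extension.
        resource = path.split("?", 1)[0].lower()
        if resource.endswith(IMAGE_EXTENSIONS):
            continue  # embedded images should not inflate page popularity
        counts[host.lower()] += 1
    return counts
```

Calling most_common() on the returned counter yields domains in decreasing order of popularity, which gives the popularity rank used later in the analysis.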
It is clear that the Web activity of a computer science department is
not wholly representative of Internet-wide Web activity. However, the
set of popular Web sites within the departmental trace has a
substantial overlap with the set of top 500 global Web properties listed
by Alexa Internet [Alex05]: 31 of the top 50 domains in the Alexa
list appeared in our trace. As we will show in
Section 3.2.2, popular Web sites are more likely to
have confusable domain names registered.
Table 1: Registered confusables for popular domains. This table
lists the registered confusable domains for the 10 most popular
English language Web sites within the Alexa 500 list, as well as two
financial sites.
3.1 Active DNS probing
Once we obtained the list of domain names from the departmental trace,
our next step was to search for registered confusable domain names
associated with each one. To accomplish this, for each traced name,
we generated confusable names by substituting one or more characters
with corresponding confusable characters. Then, we
performed a DNS lookup on each generated name to test whether it was
actually registered.
There is a combinatorial explosion in the number of confusable names
associated with a given string when performing multiple character
substitutions. Because of this, we limited our search to confusable
names with at most three confusable characters. However, to explore
the degree to which this caused us to miss registered confusables with
a greater number of substitutions, we performed an exhaustive search
of the full space for a few of the traced domains for which we found
the most registered confusable names.
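The probing procedure can be sketched as follows in Python. This is a simplified illustration under several assumptions, not the tool used in the study: the confusables table is supplied by the caller, only single-character-for-single-character substitutions are generated, only the first label of the domain is varied, and a successful DNS resolution is treated as a proxy for the name being registered. The function names are hypothetical.

```python
import socket
from itertools import combinations, product

def confusable_labels(label, table, max_subs=3):
    """Yield every variant of `label` with 1..max_subs character substitutions."""
    # Positions whose character has at least one confusable alternative.
    positions = [i for i, ch in enumerate(label) if ch in table]
    for n in range(1, min(max_subs, len(positions)) + 1):
        for chosen in combinations(positions, n):
            # Try every combination of alternatives at the chosen positions.
            for repl in product(*(table[label[i]] for i in chosen)):
                chars = list(label)
                for i, r in zip(chosen, repl):
                    chars[i] = r
                yield "".join(chars)

def resolves(name):
    """Return True if `name` resolves in DNS (assumed proxy for 'registered')."""
    try:
        # IDN names must be converted to their ASCII form before lookup.
        socket.getaddrinfo(name.encode("idna").decode("ascii"), None)
        return True
    except (socket.gaierror, UnicodeError):
        return False

def registered_confusables(domain, table, max_subs=3, lookup=resolves):
    """Return generated confusables of `domain` for which `lookup` succeeds."""
    label, _, suffix = domain.partition(".")
    found = []
    for variant in confusable_labels(label, table, max_subs):
        candidate = f"{variant}.{suffix}"
        if lookup(candidate):
            found.append(candidate)
    return found
```

The max_subs parameter corresponds to the limit of three substitutions; raising it to the length of the label gives the exhaustive search, at the cost of the combinatorial growth described above.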
Since the department trace may be biased towards university and
research topics, we conducted a similar evaluation using the list of
the Top 500 most popular domain names, according to
Alexa [Alex05]. The Alexa list contains domains ordered by a
"traffic rank." This metric is the geometric mean of reach (percent
of Internet users visiting the site) and page views (percentage of all
daily global page views).
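For concreteness, the geometric mean of two quantities is the square root of their product. A minimal Python sketch, assuming both inputs are expressed as fractions between 0 and 1 (the function name traffic_score is hypothetical, and how Alexa converts this score into a rank is not described here):

```python
from math import sqrt

def traffic_score(reach, page_views):
    """Geometric mean of reach and page views, both given as fractions in [0, 1]."""
    return sqrt(reach * page_views)
```

A geometric mean rewards sites that score well on both measures: a site with high reach but very few page views scores lower than it would under an arithmetic mean.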
3.2 Results
Table 2: Overall results. This table provides summary statistics
describing our trace.
In Table 2, we show high-level results from
our study. We observed 828 clients accessing 3,425 different Web
server domain names, issuing a total of 452,654 HTTP GET requests.
All Web sites visited in our trace were authoritative: no client ever
visited a Web site with a non-authoritative, confusable domain name.
However, our DNS probing found 399 registered domains whose names are
confusable with authoritative Web domains visited by our users.
Looking at this data another way, 298 authoritative Web domains have
one or more non-authoritative, confusable, registered domains. None
of our users appeared to have fallen victim to a homograph attack
during our trace period, even though the potential for such an attack
does exist.
For those authoritative domains that had confusable domains
registered, we typically found a very small number of registered
confusable names. Even though a large number of confusable names are
possible for a given authoritative domain name, there are usually just
a handful of confusable domains registered.
In Table 1, we show a list of registered confusable
domains found for the top 10 most popular English language Web sites
within the Alexa 500 list, as well as two financial sites. Note that
this table only reports on registered DNS names with three or fewer
confusable character substitutions, as previously described in
Section 3.1.
3.2.1 Number of character substitutions
Figure 1: # confusable character substitutions. This graph
shows how many registered confusables have one, two, or
three confusable character substitutions.
Intuitively, one should expect that registered confusable domain names
will tend to consist of a small number of confusable character
substitutions. Each confusable character may not always render
identically to the intended character. Accordingly, while one
confusable character in a confusable domain name may escape notice,
two or three such characters may not.
Figure 1 shows that most registered
confusable domain names only contain a single confusable character,
suggesting this intuition is correct. As well, this data validates
our choice of limiting the search space of our DNS probes to
names with no more than three character substitutions: fewer than
3% of the confusable names we found had three substitutions.
To further validate this choice, we performed an exhaustive search for
confusables using the two domain names with the most registered
confusables, microsoft.com and paypal.com. This full search of all
48,552 possible microsoft confusables and 3,456 paypal confusables
found only one confusable domain that our limited search missed: a
microsoft.com confusable with five confusable character substitutions.
3.2.2 Popularity and registered confusables
Figure 2: Popularity vs. registered confusables. This
CDF shows, for a site of a given popularity, the fraction of
registered confusable names found that are associated with
authoritative sites of equal or greater popularity. Popular sites
have more registered confusable names.
Figure 2 shows, for an authoritative site of a
given popularity rank, the fraction of all registered confusable names
found that are associated with authoritative sites of equal or greater
popularity. As well, the figure includes a logarithmic curve fit for
the "UW IDN" data series. The graphs show that popular
authoritative sites have more registered confusable names than
unpopular authoritative sites.
If registered confusable domain names were uniformly distributed
across authoritative sites, these lines would have a constant slope.
Instead, we see that for both UW IDN and UW Latin confusables, popular
authoritative sites have more confusable names registered for them
than unpopular authoritative sites. This effect is most striking for
IDN confusables; 80% of registered IDN confusables found are
associated with the top 30% of authoritative sites. The effect is
less striking for Latin confusables, but we hypothesize that the
effect would reveal itself more prominently with a longer trace that
would include additional unpopular domains.
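The cumulative curve discussed here can be computed as in the following Python sketch. It assumes the input is a list of registered-confusable counts, one per authoritative site, ordered from most to least popular; the function name is hypothetical and the numbers in the usage note are made up for illustration.

```python
def cumulative_confusable_fraction(confusables_by_rank):
    """Return, for each popularity rank, the fraction of all registered
    confusables associated with sites of equal or greater popularity."""
    total = sum(confusables_by_rank)
    if total == 0:
        return [0.0] * len(confusables_by_rank)
    cdf, running = [], 0
    for count in confusables_by_rank:
        running += count
        cdf.append(running / total)
    return cdf
```

With a uniform input such as [1, 1, 1, 1] the result rises in equal steps (a line of constant slope), whereas a skewed input such as [5, 3, 1, 1] rises quickly and then flattens, which is the shape that indicates popular sites attract more registered confusables.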
3.2.3 Latin vs. IDN names
Our search for registered confusable domain names included domains
consisting entirely of Latin character substitutions, and IDN domains
that included some Unicode character substitutions. In
Table 3, we show how many of each of these
exist for both the Alexa 500 list and domains visited in the UW trace.
Table 3: Latin vs. Unicode confusables. This table shows the
number of registered confusable domains found that contain only
Latin confusable characters, and the number of IDN domains that
contain some Unicode confusable characters.
Our results show that most registered confusable domains consist
entirely of Latin characters: IDN confusable domains containing
Unicode characters account for only 15% and 12% of the Alexa
and UW lists, respectively. While a relatively small fraction, IDN
confusable domains do have a noticeable presence, and they can be
expected to grow as browser support for IDN increases. For example,
the upcoming Microsoft Internet Explorer version 7 browser is expected
to have IDN support, making confusable Unicode domain names
potentially more attractive to attackers.
3.2.4 The intent behind confusable domains
Our data shows that many non-authoritative, confusable domain names
have been registered. We now turn our attention to understanding what
goal attackers had when registering them. Homographs can be used to
construct elaborate Web spoofing or phishing attacks, in which the
victim is fooled into revealing sensitive information. However,
attackers may have other less dangerous goals in mind, such as
attracting victims to a site in order to display advertisements.
To understand the attacker's intent behind a confusable domain, and to
gauge the current risk that homograph attacks pose, we manually
examined all non-authoritative confusable domains that we found
registered. Based on our examination, we categorized each site into
one of the following seven categories in decreasing order of
(subjectively assigned) risk to the victim:
- Web spoofing: the confusable site spoofs the content
of the authoritative site.
- Redirect to competitor: the victim is redirected
to a commercial competitor of the authoritative site.
- Advertisement: ads are shown to the victim.
- For sale: the registered confusable domain name is
advertised as being for sale.
- Unrelated: the site has content which is unrelated
to the authoritative site.
- No content: the registered confusable domain name does
not have an active Web server, or the server returns blank pages.
- Redirect to authoritative: the victim is redirected to
the authoritative site, perhaps as a defensive measure put in
place by the authoritative site itself.
A given site may belong in more than one category, such as a site that
is for sale and also shows ads. We attempted to emphasize the more
subtle, and thus potentially more dangerous, uses of homographs and
thus categorized each site only in its highest risk category.
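The rule of assigning each site only its highest-risk category can be expressed compactly. The Python sketch below encodes the seven categories in the order listed above, with lower values meaning higher (subjectively assigned) risk; the names Intent and classify are hypothetical, and the actual categorization in the study was done manually.

```python
from enum import IntEnum

class Intent(IntEnum):
    # Lower value = higher risk, following the order of the list above.
    WEB_SPOOFING = 1
    REDIRECT_TO_COMPETITOR = 2
    ADVERTISEMENT = 3
    FOR_SALE = 4
    UNRELATED = 5
    NO_CONTENT = 6
    REDIRECT_TO_AUTHORITATIVE = 7

def classify(observed):
    """Return the single highest-risk category among those observed for a site."""
    return min(observed)
```

For example, a site that is both for sale and shows ads is classified as Intent.ADVERTISEMENT, since advertisement ranks above for sale in the ordering.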
Table 4: Intent of registered confusables. This table shows the
fraction of registered confusable domains that were observed to
have the listed intent.
Table 4 summarizes the results. Advertising, a
relatively benign function, was overwhelmingly the most popular use
for confusable domain names. There were very few spoofed sites among
registered domains we observed. Additionally, we verified that none
of these spoofed sites attempted to trick the user into submitting
sensitive information. Instead, these spoofed sites either consisted
of parodies of the authoritative site, or they served to warn
potential victims about the dangers of homograph attacks.
4 Related work
Web spoofing attacks were first considered by [Felt97].
[Gabr02] first discussed using homographs as a part of a web
spoofing attack. Early versions of the attack relied on similarities
between Latin letters and numbers. For example, an attacker could
register an address where o is replaced by 0 (zero), or
l by 1 (one).
With the introduction of International Domain Names (IDN) the number
of visually confusable characters has increased dramatically. IDN
attacks have been possible in Mozilla [Mozi05],
Safari [Appl05] and Opera [Oper05] for at least one publicly
available release, though the latest versions have adopted some
defensive mechanisms.
Browser-based solutions to the homograph problem are currently
incomplete, however, as they either rely on trusted registrars or
disable significant portions of the IDN namespace.
Registrars issuing IDN domains have been asked to put in place
policies to prevent two homographic domains from being registered to
different sites [Mark05].
Relying on registrars to help solve the problem has disadvantages,
since registrars must contend with multiple jurisdictions and
potentially conflicting regulatory restrictions. However, this
approach is compatible with other solutions to the Web spoofing
problem. For example, trust bars [Herz04], the eBay
Toolbar [eBay], and SpoofGuard [Chou04] give users immediate
and unforgeable security context information.
[Goth05] evaluates the current rate and cost of phishing scams,
and concludes that while the cost has been reduced in recent years, it
is still costing billions of dollars. [Weny05] discusses using
Web crawlers to look for visually similar Web pages. Other researchers in the usability, cryptography, and anti-phishing
communities have proposed several mechanisms to defend against
phishing attacks. For example, Jakobsson [Jako05] proposes an
economic analysis to quantify the risks of an attack and to develop
methods for defending against them. As another example, Adida et
al. propose the adoption of identity-based ring signatures to provide
digitally signed email to eliminate spam-based phishing
attacks [Adid05]. Dhamija and Tygar propose the concept of
"security skins," a browser extension that allows a remote site to
prove its identity to users in a way that is usable but hard for
attackers to spoof [Dham05].
5 Conclusions
While visually confusable, non-authoritative domains have been
registered in practice, the threat actually posed by these domains
currently does not live up to the potential feared by the
community [Oper05, Mozi05, Appl05]. Many popular Web sites do have
associated confusable domains registered, but the most common
functions of these confusable domains are benign, such as serving
advertisements. However, as support for IDN names grows, homograph
attacks do have the potential to become more common and malicious.
Overall, our results show that: (1) users often visited sites that
have confusable domains registered, but no user visited one of these
non-authoritative domains during our trace; (2) popular sites are much
more likely to have registered non-authoritative confusable domains
than unpopular sites; (3) confusable domains tend to have a single
confusable character within them, and currently only 12-15% of
confusable domains rely on Unicode confusable characters; and (4) most
confusable domains have relatively benign intent, such as showing
advertisements. Though a small fraction do spoof the authoritative
site, even these spoofed sites appear to have relatively benign
intent, such as parody.
Acknowledgments
This work was supported in part by the National Science Foundation
under grants CNS-0430477 and ANI-0132817, by an Alfred P. Sloan
Foundation Fellowship, and by gifts from Intel Corporation and Nortel
Networks.
References
- [Adid05]
- Ben Adida, Susan Hohenberger and Ronald L. Rivest, Separable Identity-based Ring Signatures: Theoretical Foundations for Fighting Phishing Attacks. DIMACS Workshop on Theft in E-Commerce, Piscataway, New Jersey, April 2005.
- [Alex05]
- Alexa Web Search, Alexa Internet Inc.,
Global Top 500 Sites, June 7 2005.
- [Appl05]
- Anonymous, About Safari International Domain Name support, Apple Computer Inc., March 2005,
https://docs.info.apple.com/article.html?artnum=301116
- [Chou04]
- Neil Chou, Robert Ledesma, Yuka Teraguchi,
Dan Boneh and John C. Mitchell, Client-side defense against
web-based identity theft. Proceedings of the 11th Annual Network
and Distributed System Security Symposium (NDSS '04), San Diego, CA,
February 2004.
- [Davi05]
- Mark Davis, Draft Unicode Technical Report
#36, Security Considerations in the Implementation of Unicode
and Related Technology, February 20, 2005.
- [Dham05]
- Rachna Dhamija and J.D. Tygar, The Battle Against Phishing: Dynamic Security Skins. Proceedings of the 2005 ACM Symposium on Usable Security and Privacy, July 2005.
- [eBay]
- eBay Toolbar, available at
https://pages.ebay.com/ebay_toolbar.
- [Felt97]
- Edward W. Felten, Dirk Balfanz, Drew Dean,
and Dan S. Wallach, Web Spoofing: An Internet Con Game,
Technical Report 540-96, Department of Computer Science, Princeton
University, February 1997.
- [Gabr02]
- E. Gabrilovich, A. Gontmakher, The Homograph Attack described, Communications of the ACM, 45(2):128, February 2002
- [Goth05]
- Greg Goth, Phishing Attacks Rising, But Dollar Losses Down, IEEE Security and Privacy, Volume 3 Issue 1, January 2005
- [Herz04]
- A. Herzberg, A. Gbara, TrustBar: Protecting (even Naïve) Web Users from Spoofing and Phishing Attacks, Bar Ilan University, 2004
- [Jako05]
- Markus Jakobsson, Modeling and Preventing Phishing Attacks, Financial Cryptography 2005
- [Mark05]
- Gervase Markham, IDN Update, March 24, 2005,
https://weblogs.mozillazine.org/gerv/archives/007785.html
- [Mozi05]
- Anonymous, Mozilla Foundation Security
Advisory 2005-29, Mozilla Organization, February 17th, 2005.
- [Oper05]
- Anonymous, Advisory: Internationalized
domain names (IDN) can be used for spoofing., Opera Software ASA,
February 25th, 2005.
- [Veri05]
- VeriSign, Inc., i-Nav Internationalized Domain Name browser plug-in, https://www.idnnow.com/index.jsp
- [Weny05]
- Liu Wenyin, Guanglin Huang, Liu Xiaoyue, Zhang Min, Xiaotie Deng, Detection of phishing webpages based on visual similarity, Special interest tracks and posters of the 14th International conference on World Wide Web, May 2005.