Ghulam Memon, Reza Rejaie
University of Oregon
{gmemon, reza}@cs.uoregon.edu
Yang Guo
Corporate Research, Thomson
Yang.Guo@thomson.net
Daniel Stutzbach
Stutzbach Enterprises
daniel@stutzbachenterprises.com
In this paper, we propose the idea of minimally visible monitors to capture the traffic at a large number of peers with minimum disruption to the DHT. We implement and validate our proposed technique, called Montra, on the Kad DHT. We show that Montra accurately captures around 90% of the query traffic while monitoring roughly 32,000 peers and can accurately identify destination peers for 90% of captured destination traffic. Using Montra, we characterize the traffic in Kad and present our preliminary results.
Many characterizations of deployed DHTs require accurate measurement of their traffic. A commonly used approach to capture traffic in a deployed DHT, is to add instrumented peers that passively participate in the DHT and log exchanged messages [8,2]. Deploying a small number of monitoring peers may not provide a representative and sufficiently detailed view of traffic in the system. Alternatively, inserting a large number of instrumented peers artificially increases the number of peers, which may disrupt the target DHT and lead to incorrect results. In addition, this approach requires significant resources. Furthermore, the connectivity of active peers (with real users) and thus observed traffic could be different from passive peers that do not submit any query.
This paper presents a new technique for scalable monitoring of DHT traffic, called Montra. The key idea in Montra is to make monitors minimally visible. This minimizes the disruption of monitors on the system and significantly reduces the required resources for each monitor. We implement Montra in the form of a highly parallel, scalable, python based client and validate it over Kad DHT. We show that our implementation of Montra can concurrently monitor around 32,000 Kad peers using a moderately configured PC (an Intel Core 2 Duo with 1 GB RAM), while dropping 0.009% percent of packets. We use our own crawler [13], running on a separate machine, for continuous discovery of monitored peers. Montra can capture 90% of query traffic observed by monitored peers and identify destination peers for 90% of captured traffic. Finally, we demonstrate the capabilities of Montra by presenting our preliminary characterization of traffic in Kad DHT.
The rest of this paper is organized as follows: Section II provides a brief overview of Kad. In Sections III and IV, we present Montra and validate two angles of its accuracy, respectively. Section V presents our preliminary characterization of Kad traffic using Montra. Section VI reviews the related work and Section VII concludes the paper.
(i) Lookup Phase: A client performs a lookup on the target ID to find several alive peers near the target ID. While publishing, content is replicated at 10 nodes to ensure availability of content even after departure of few peers. While searching, queries are sent to multiple nodes to get maximum unique results. During the lookup phase, the client sends a Request message which carries the target ID, and the requested number of contacts c. The value of reveals whether a lookup message is followed by a publish message ( = 4) or a search message ( = 2). Whenever a peer receives a Request message, it checks its own routing table to find the closest contacts to the target ID. Then, these contacts are embedded in a Response message and sent to the requesting peer. Both publish and search operations start with Request messages.
(ii) Search/Publish Phase: Once several alive peers near the target ID are identified, the requesting Kad client sends either Search or Publish messages to these target peers. Publish messages contain the cryptographic hash for a keyword or a file, and notify the target node that the request originator is a source for the content. Search messages also contain the cryptographic hash for a keyword or a file. The target node responds to search messages with the addresses of the nodes that previously published the same target ID.
To resolve this conundrum, we notice that traffic can be classified into two categories:
While monitoring Kad, the following information can be extracted from the captured Request messages: (i) the type of request (publish or search), (ii) the requested content ID, and (iii) the ID of the destination peer. From this information, however, we are unable to determine whether a request (search or publish) is for a keyword or a file. Neither can we learn about the characteristics of requested files (e.g., size).
We extend our measurement technique to collect more information as shown in Figure 2. When an MVM receives a Request message from the request originator during the lookup phase, it may send a response to the originator. The response does not carry any next hop contacts, but it informs the request originator that the MVM is alive and can receive a request during the second phase. As a result, the MVM receives the (search or publish) requests during the Search/Publish phase that carry the following additional information about the files or keywords being requested: (i) the type of content (file or keyword) being requested/published, and (ii) the size of the file when the requested content is a file.
This extension, shown with dotted lines in Figure 2, results in a slight disruption to regular operation of the system. More specifically, a content that is published at 10 closest peers, is instead published at 9 closest peers. In addition, sending Response to a Request message increases the visibility of the MVM. To minimize the side-effects of this extension, MVMs respond to requests for a given content ID only once, in order to obtain the above additional information. MVMs do not respond to any further request for the previously observed content ID.
This extension is specific to Kad. We believe Montra, in its original form, can be used to capture traffic from any real-world DHT. However, we also envision such DHT-specific extensions to Montra to extract more fine-grained information from different DHTs.
The size of a target zone is the primary factor that affects the accuracy.
Therefore, zone size is the main variable for assessing the accuracy of
Montra.
Note that decreasing the zone prefix length by one bit doubles the size of
the target zone. On average, the approximate number of
peers in a zone can be estimated as
, where is prefix length.
For example, a prefix length of 0 includes the entire system. A
prefix length of 1 includes half the system. A prefix length of 2
includes one-quarter of the system, and so forth. At the time of this
writing, we estimate the total number of non-NAT Kad peers to be around two million. Table I shows the approximate zone size as a
function of the zone prefix length. Due to memory constraint, the MVM monitoring tool is able to monitor a zone with prefix length up to five bits.
|
[a. Percentage of messages that are captured by Montra] |
[b. Percentage of captured messages that correctly identify the destination] |
[a. Publish Keyword Rate] | [b. Publish File Rate] | [c. Search Keyword Rate] | [d. Search File Rate] |
Figure 3a presents the percentage of messages destined to the monitored zone that are correctly captured by Montra in both instrumented source and destination experiments as a function of zone size (i.e., prefix length). The top and bottom lines of each box reflects the confidence interval and the middle line represents the mean. Figure 3a demonstrates that approximately of messages are captured successfully by the monitor regardless of zone size. However, zone size has a more pronounced impact on the accuracy in the instrumented source experiments in Figure 3a. For zones with , and -bit prefixes, around of messages are captured. But the accuracy of instrumented source experiments drops to around of messages for 5-bit zones. To explain this, we note that decreasing the prefix length of a target zone by one bit doubles the population of peers in a zone which in turn increases the time to crawl a zone. Figure 4 depicts the average crawl time of a zone for a given prefix length. Longer crawling time of a zone leads to a longer delay in discovering the new peers and thus a longer delay in attaching an MVM to new peers which in turn leads to a larger error. Zone size does not have the same impact on the instrumented destination experiments, since we only count the messages that are received by the instrumented peer. The instrumented peer is added at the beginning of the experiments and stays in the system for the entire duration of the experiments, masking the impact of peer churn. Figure 3b presents the fraction of monitored messages that are correctly mapped to their destination peers. In most cases, Montra correctly identifies the destinations of approximately of the messages. However, in the instrumented source experiment over large zones (i.e., 5-bit prefix length), this percentage drops to due to the long crawling duration for larger zone size. In summary, the validation experiments show that Montra captures close to of messages for zones up to -bit prefix length. Furthermore, Montra correctly pinpoints the destination peer of around of the captured messages.
Data Sets: Given our validation results in Section IV, we focus on zones with prefix length of to strike a good balance between monitoring accuracy and the number of monitored peers. A bit zone contains approximately 32,000 non-NAT peers on average. Between May 2008 and August 2008, we collected 50 traces by monitoring 44 (out of 64) unique 6-bit Kad zones. Each measured zone was monitored for a 6-hour period since publish file requests in Kad are sent once every 4 hours. We started measurement of each zone at different times of the day (namely 3pm, 9pm, 3am, and 9am PST) in order to observe any potential variability caused by the time-of-day.
To succinctly present the Cumulative Distribution Function (CDF) of desired properties for all 44 monitored zones, we use min-max envelopes of these CDFs. More specifically, for each value, we identify the min and max values across all CDFs and use them to draw the envelopes. The gap between these min and max lines indicate the variability of the CDF across different zones.
Message Rates: We begin by examining the rates at which Keywords and Files are published and searched. Figure 5 separately shows the min-max envelope for the Complementary Cumulative Distribution Functions (CCDFs) of Publish Keyword, Publish File, Search Keyword and Search File across all zones in log-log scale. Figure 5a demonstrates that of published keyword content IDs have a request rate of more than per minute, while have a request rate of more than per minute. For all four types of messages, the distribution is heavily skewed, with the vast majority of requests targeting a tiny fraction of the observed content IDs. This result is consistent with our earlier results from unstructured file-sharing systems [15].
Comparing the request rates across different message types reveals that the publish rates vary over a wider range than search rates. For example, some keywords have publish request rate greater than 100 requests per minute, while the highest observed Keyword Search request rate is less than 2 requests per minute. Presumably, this is in part because publish requests are automatically generated while search requests are triggered mainly by user action. Thus, the majority of request messages are generated by Kad nodes for the purpose of DHT protocol maintenance. The traffic generated by user activity accounts for only a small percentage of total traffic.
Publish requests are sent periodically. Given the observed request rate and the known re-publish interval (from eMule source code), we can estimate the number of Kad nodes in the system that publish a given file or keyword. The estimated number of Kad nodes that publish content is shown as a second -axis on the top of Figures 5a and 5b. For example, some popular files are published at the rate of 30 requests per minute. Since each source sends a Publish File message once every four hours, this rate is equivalent to approximately 7,200 concurrent Kad nodes who possess a given file.
Relation Between Published & Searched Files: A DHT in essence provides a distributed mechanism for users who publish and users who search files to find each other. This section examines the balance between availability and demand for individual files.
We quantify the balance between the publish and search rate for content using the following ratio ( ) for each content ID, where P is the rate of Publish Requests and S is the rate of Search Requests. In order to properly measure availability, we identify content publishers by using IP, port combination and discard duplicate publish requests. We do not discard any search requests because these requests show actual user demand. Figure 6a and 6b depict the CDF of ( ) per content ID for files and keywords, respectively. Interestingly, Figure 6a reveals that 15% of files are searched but never published ( ). It is highly likely that these files recently became popular (e.g., The Dark Knight) and are not widely available in the system yet. On the other hand, 60% of files are published but never searched during our measurement window ( ). The availability of these files show that they were popular at one point but are not searched anymore. Figure 6b shows that 95% of keywords are published but never searched during our measurement window ( ). The significant imbalance between searched and publish rate for keywords indicates that only a small fraction of the keywords associated with files are actually used by users in searches.
[a. Publish vs Search Requests for Files] | [b. Publish vs Search Requests for Keywords] |
Steiner et al. [10] developed a traffic monitoring tool for Kad, called Mistral, which captures traffic by placing a large number of peers into a zone. In the study they place around 65,000 monitoring peers into an 8-bit zone (which normally would contain only around 8,000 peers; see Table I). Additionally, Mistral routes any incoming traffic only to its own monitors, effectively shutting out most natural peers. The technique appears sound, but the study provides no measurements validating the accuracy and comprehensiveness of Mistral for capturing destination traffic.
Mistral and Montra both use tens of thousands of monitors. More precisely, Steiner et al. use twice as many monitors as our study. However, our monitors are more efficiently placed, allowing us to monitor four times as much address space with half the monitors. To monitor an 8-bit zone as they did, we would need only around 8,000 peers. In a nutshell, their approach is to add a large number of monitors to guarantee that the monitors will observe traffic. Our approach is to surgically add a smaller number of monitors to cover the same area.
Since Montra does not significantly disrupt the overlay structure of the existing peers, it provides richer data compared to Mistral. In addition to monitoring the quantity of traffic entering a zone, Montra can determine the final peer that received the message. The additional information could be used to study the distribution of load across peers or to measure how many packets fail to reach the theoretically correct final peer.
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -show_section_numbers -local_icons -no_navigation p.tex
The translation was initiated by Ghulam Memon on 2009-03-24