Many peer-to-peer systems require a way of accurately assessing the bandwidth of their constituent peers. However, nodes cannot be trusted to accurately report their own capacity in the presence of incentives to the contrary and thus self-reporting can lead to poor network performance and security flaws. We present EigenSpeed, a secure and accurate peer-to-peer bandwidth evaluation system based on opportunistic bandwidth measurements, combined using principal component analysis. We test EigenSpeed using data gathered from a real peer-to-peer system and show that it is efficient and accurate. We also show that it is resistant to attacks by colluding groups of malicious nodes.
In many peer-to-peer systems, each node needs an estimate of the available bandwidth to each of its peers in the network. Such estimates are used for many purposes: to load balance routing in the network, to select a peer from which to download a particular resource, to enforce upload/download ratios, or simply to measure network characteristics.
Although each peer is in the best position to determine its own bandwidth, it cannot be trusted to report accurate data when an incentive to lie exists, such as when a network preferentially allocates resources to nodes with higher bandwidths. If self-reported information is used to make security-sensitive decisions, dramatic reductions in overall security can result [2]. Thus, there is a need for a system enabling nodes to evaluate their peers' bandwidths that is resistant to manipulation by malicious nodes.
In this paper, we show how principal component analysis (PCA), a well-studied technique in other areas [8], can be applied to this problem. We show that unmodified PCA contains significant vulnerabilities to malicious nodes, and introduce modifications that exploit the specifics of the bandwidth estimation problem and resist attacks. We show that our final system, EigenSpeed, is secure, efficient, and accurate by testing it with bandwidth data collected from the Tor network [7].
We first define the specifics of the problem we are trying to address. Our goal is to arrive at a consensus evaluation of the bandwidth of each peer in a peer-to-peer network. In a fully general network, each peer would need to maintain bandwidth information about every other peer. However, on the Internet, the bottleneck link on a connection between two hosts is likely to be close to one of the endpoints, rather than in the network core [1,10]. This means that we can treat bandwidth as a node property, rather than a link property, and still obtain useful results. This view allows nodes to pool their observations of a particular node and obtain a more accurate and globally useful bandwidth assessment. A consensus view also helps ensure that all nodes behave similarly, preventing partitioning attacks that draw selected nodes towards a particular behavior for further exploitation.
One naïve method for a node to determine the bandwidth of its peers is simply to track the performance of each peer as they communicate. However, this approach is far from ideal: first of all, it will take a long time for a node to gather accurate information about all of its peers, and decisions made before that time will be far from optimal. Additionally, this scheme is vulnerable to partition attacks: a malicious node that wishes to attract the traffic of one of its peers can reserve all or most of its bandwidth for communication with that peer. The target node will then have a greatly skewed view of the network and preferentially choose the malicious node over honest peers. Using this technique, an attacker can, for example, force a particular user in a peer-to-peer system to download a malicious version of whatever file it requests, while hiding this attack from the rest of the network.
In addition to producing a consensus view, our goal is to satisfy three further requirements: the evaluation must be accurate, it must be efficient (adding little overhead to the network's normal operation), and it must be secure against manipulation by malicious nodes.
In this paper, we focus on an attacker or colluding group of attackers attempting to increase their measured bandwidth. This is a common case in peer-to-peer networks where nodes are provided with service in proportion to the service they provide to others. It is also applicable to the Tor network, where routers are selected with probability proportional to their bandwidth; if attackers can artificially inflate their bandwidth enough, they can with high probability insert themselves into a path of interest and attempt to link sources and destinations, thus violating anonymity.
Our main approach for creating a bandwidth evaluation system is to use opportunistic observations based on regular interactions between peers. This results in low overhead, as measurements are integrated into the normal operation of the P2P system. We further combine the observations across multiple nodes and arrive at a consensus bandwidth value using principal component analysis.
Each node in the overlay network maintains a vector of the observed performance of other nodes; for each node, we store the bandwidth achieved while communicating with that node. Note that this observation vector does not attempt to measure the total bandwidth of the peer nodes, but rather their current, per-flow bandwidth. In addition to being more responsive to fluctuations in available bandwidth, this serves as a rough proxy for the bandwidth an additional flow through that node can expect. This vector provides a starting point for evaluating bandwidth, but it can be too sparse, as a node may communicate with only a subset of its peers. Furthermore, the vectors observed at each node will be different, enabling partitioning of the nodes.
To address the first problem, a node can take into account not only its own observations but those made by other nodes. However, we weight the remote observations based on the observed bandwidth of the remote node. This is because a high-bandwidth node can better evaluate the bandwidth of other nodes, whereas a low-bandwidth node is limited by its own bottleneck link.
If we represent the initial observations as a normalized matrix $T$, with $t_{ij}$ representing the observation of node $j$ by node $i$, then the new observation matrix, taking remote vectors into account, is:

$$T' = T \cdot T$$

This process will eventually converge, so long as the communication graph (i.e., the graph of which nodes talk to each other) is connected and not bipartite. The final (normalized) matrix, $T^\infty = \lim_{k \to \infty} T^k$, will have each of its rows equal to the principal eigenvector of the matrix $T$, $\vec{e}$. We can therefore use this eigenvector to represent the consensus bandwidth evaluation. In most cases, the convergence will be quite fast as well, so $T^\infty$ can be approximated by computing $T^k$ for relatively small $k$.
The general technique of computing (or rather approximating) the principal eigenvector is known as principal component analysis. It has been successfully used by several reputation systems, such as EigenTrust [9] and Google's PageRank [3]. In Section 4, we will describe our modifications to PCA that use properties specific to the problem of bandwidth evaluation and make it more resilient to subversion by malicious nodes.
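To make the computation concrete, here is a minimal sketch of this power-iteration scheme in Python with NumPy; the function name, the tolerance, and the iteration cap are our illustrative choices, not values from the paper:

```python
import numpy as np

def consensus_bandwidth(T, eps=1e-10, max_iters=1000):
    """Approximate the principal eigenvector of the row-normalized
    observation matrix T by power iteration from a uniform vector.
    Assumes every node has at least one observation (no zero rows);
    eps and max_iters are illustrative defaults."""
    n = T.shape[0]
    T = T / T.sum(axis=1, keepdims=True)    # normalize each observation vector
    v = np.full(n, 1.0 / n)                 # uniform starting vector u
    for _ in range(max_iters):
        v_next = v @ T                      # combine remote observation vectors
        if np.abs(v_next - v).sum() < eps:  # the termination condition used below
            break
        v = v_next
    return v_next
```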
We evaluated the basic PCA-based algorithm using a simulated 1000-node network, with node bandwidths based on measurements of the Tor network [7]. We then created a workload consisting of 10000 flows between the nodes. The flows either directly connected two nodes (representing communication between peers) or were relayed by a third, intermediate node (representing anonymous tunnels in Tor); each node was thus, on average, a part of 20 or 30 flows, respectively. In each case, peers allocated their bandwidth to their flows using fair queuing.
We then repeated this experiment over multiple rounds, or ``ticks,'' creating 10000 new flows each time. Peers were chosen either at random or weighted by their available bandwidth, using the bandwidth estimates from the previous round. Multiple observations were combined using an exponentially-weighted moving average (EWMA). We ran the experiment for a total of 1440 ticks, representing a day's worth of one-minute flows. To compute the consensus evaluation, we iterated the matrix multiplication until $|T^k\vec{u} - T^{k-1}\vec{u}| < \epsilon$, where $\vec{u}$ is a uniform vector; the same fixed value of $\epsilon$ was used for all experiments in the paper.
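For concreteness, the EWMA combining step could look like the following sketch; the smoothing parameter is illustrative, as the paper's value did not survive in this copy:

```python
def update_observation(T, i, j, observed_bw, alpha=0.1):
    """Fold node i's newest per-flow bandwidth observation of node j
    into the running estimate with an exponentially-weighted moving
    average. alpha = 0.1 is an assumed value, not the paper's."""
    T[i][j] = alpha * observed_bw + (1 - alpha) * T[i][j]
```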
Figures 1 and 2 show the fractional bandwidth estimated by EigenSpeed plotted against the actual bandwidth of each node. We find that two-hop flows (the ``Tor scenario'') give less accurate results, as a flow's bandwidth can be limited by an unrelated bottleneck; bandwidth-weighted router selection also decreases accuracy due to non-uniform sampling. But even in the worst case, the log-log correlation between EigenSpeed's estimates and actual node capacities remains high.
In some cases, the relative rank of a node is more important than its absolute bandwidth [14]. Figures 3 and 4 plot the estimated rank vs. the actual rank. The accuracy is once again quite good, with a correlation of over 0.99 in the worst case and over 0.9995 in the best case.
Finally, we examined the convergence time of EigenSpeed, i.e., the number of iterations $k$ needed before $|T^k\vec{u} - T^{k-1}\vec{u}| < \epsilon$. Figure 5 plots this for each tick of the simulation. Initially, the observation matrix is sparse, resulting in slower convergence. However, within less than 15 ticks (each of which represents a minute), convergence time is down to 10 iterations or fewer. Our unoptimized Perl implementation of EigenSpeed can perform an iteration for the 1000 simulated routers in approximately one second, so for the medium-sized peer-to-peer networks we are targeting, the expected CPU load of EigenSpeed should be quite low. When the number of routers is increased five-fold, to 5000, the time to compute a single iteration increases to approximately 26 seconds, which is in line with the time complexity of EigenSpeed (a modest number of iterations, each taking $O(n^2)$ time to complete for $n$ routers). For larger but less dense graphs, the per-iteration time can be reduced to $O(m)$ operations for $m$ non-zero entries in the observation matrix by using sparse matrix techniques.
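To illustrate the sparse-matrix point, the per-iteration cost drops to $O(m)$ with a compressed-sparse-row representation; the observation triples below are hypothetical:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (observer i, observed j, bandwidth) triples.
obs = [(0, 1, 120.0), (1, 0, 115.0), (1, 2, 80.0), (2, 1, 78.0)]
rows, cols, vals = zip(*obs)
T = csr_matrix((vals, (rows, cols)), shape=(3, 3))

# Row-normalize; each power-iteration step then touches only the
# m stored entries rather than all n^2 cells.
row_sums = np.asarray(T.sum(axis=1)).ravel()
T = csr_matrix(T.multiply(1.0 / row_sums[:, None]))
v = np.full(3, 1.0 / 3)          # uniform starting vector
v = T.T @ v                      # one O(m) iteration (equivalent to v @ T)
```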
In this section, we extend the basic algorithm to handle nodes joining and leaving (i.e., network churn) and to exploit some features of bandwidth estimation in order to better resist attacks by malicious nodes.
As discussed above, the PCA algorithm functions well with a sparse graph; i.e., one where nodes have communicated with relatively few other nodes. However, it does require that the observation matrix represent a connected graph. This is potentially problematic, since new nodes joining the network have not yet communicated with any other nodes, corresponding to an all-zero row in the matrix. Complicating the situation is the possibility that some new nodes have communicated only with each other, forming a nontrivial disconnected component. EigenSpeed accounts for this by detecting the giant component in the communication graph and labeling nodes outside the component as ``unevaluated,'' leaving them out of the main PCA computation.
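One way this step could be implemented (our sketch; the paper does not give code) is a breadth-first search that finds the largest connected component of the communication graph and labels everything else unevaluated:

```python
from collections import deque

def unevaluated_nodes(adj):
    """adj maps each node to the set of peers it has exchanged
    traffic with (undirected). Returns the nodes that lie outside
    the giant component and should be left out of the PCA step."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        components.append(comp)
    giant = max(components, key=len)
    return set(adj) - giant
```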
Figure 6 shows the effect of churn on the convergence rate of EigenSpeed; the vertical line corresponds to an average node lifetime of 16 minutes that has been observed in P2P networks [11]. The convergence time remains manageable even in the face of high churn rates.
So far we have seen how EigenSpeed works when all nodes are honest. Malicious nodes, however, can skew the results of the basic algorithm. One possible attack is for a node to become a ``sink'' by setting its bandwidth observations of all other nodes to 0. It is easy to see that such a node's bandwidth estimate will increase with each iteration at the expense of the other nodes.
Pure sinks are, of course, easy to detect, but a node can also become a ``near-sink'' by setting its observations of other nodes to be very low. More formally, a near-sink in row $i$ will have $\sum_{j \neq i} t_{ij} \ll t_{ii}$. Such nodes have traditionally been a problem for PCA-based algorithms. PageRank and EigenTrust address this attack by introducing a minimum weight for any observation vector entry by adjusting the transition matrix:

$$t'_{ij} = (1 - \alpha)\, t_{ij} + \frac{\alpha}{n}$$
We take a different approach to this problem, exploiting the fact that the bandwidth between a pair of nodes is observed at each endpoint independently. This means that the matrix $T$ (before normalization) should be symmetric. We enforce this constraint by setting the entries $t_{ij}$ and $t_{ji}$ to be the minimum of the two. We also reset all entries on the diagonal to 0, since we do not trust a node to honestly report its own bandwidth. This means that a node that tries to become a sink will end up cutting itself off from the rest of the network and will be considered ``unevaluated'' by the giant component detector described in the previous section. For near-sinks, the effect is still mitigated by the PairMin approach: Figure 7 shows the effect of a clique with and without PairMin. We simulated a ``best possible'' near-sink clique by setting the observations of other nodes to 0 but turning off giant component detection. With PairMin, the estimated bandwidth for all nodes remains close to the $y = x$ line.
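The PairMin constraint itself is straightforward to enforce on the raw observation matrix; a minimal NumPy sketch:

```python
import numpy as np

def pair_min(T):
    """Enforce symmetry on the raw (unnormalized) observation matrix:
    keep the smaller of the two independent measurements of each pair,
    and zero the diagonal, since self-reports are not trusted."""
    T = np.minimum(T, T.T)    # t_ij = t_ji = min of the two observations
    np.fill_diagonal(T, 0.0)
    return T
```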
With the PairMin modification, the normalized transition matrix represents the transition matrix for a random walk along an undirected, weighted graph, where the edge between nodes $i$ and $j$ has the weight $t_{ij} = t_{ji}$. The stationary distribution of this random walk (equivalently, the principal eigenvector of $T$) is known to have the form:

$$e_i = \frac{\sum_j t_{ij}}{\sum_{j,k} t_{jk}}$$
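One consequence of this closed form is that, for a fully honest network, the consensus vector can be computed directly from row sums, with no iteration at all; a minimal sketch:

```python
import numpy as np

def stationary_distribution(T_sym):
    """For the symmetrized (PairMin) matrix, the stationary
    distribution is each node's total observed bandwidth (its
    weighted degree) divided by the system-wide total."""
    degrees = T_sym.sum(axis=1)
    return degrees / degrees.sum()
```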
The resulting eigenvector is, intuitively, a very good estimate of a node's bandwidth, since it corresponds to the sum of all external observations of a node. In addition, we can compute this eigenvector much more quickly than with the standard PCA approach. However, the eigenvector is subject to manipulation by malicious nodes. PairMin guarantees that a single node cannot meaningfully alter its observation vector without being penalized; however, a group of colluding nodes can inflate their observations of each other arbitrarily. Since the estimated bandwidth is the sum of all observations for a node, setting even a single observation to a very high value can severely affect the sum. This suggests a modification to the clique attack, which we will call a ``fat pipe'' attack: the malicious nodes state a very high bandwidth in their observations of each other, and retain the correct bandwidth values in their observations of other nodes (to address PairMin).
This attack can be slightly mitigated by reducing the maximum reportable bandwidth, though at some point this will start to underweight highly provisioned nodes. It turns out that a better defense can be found in our original approach to estimating the principal eigenvector. Our estimate converges quickly when the graph is fast-mixing, which is true whenever all observations are honest. However, under the above attack, the graph will have two fast-mixing components--the honest nodes and the malicious clique--that have a narrow link between them--the normalized values of $t_{ij}$ where $i$ is malicious and $j$ is honest will be quite low. Thus mixing between the components will happen very slowly and many iterations will be required to reach the stationary vector.
We can exploit this by using a starting vector $\vec{v}$ that concentrates on a small set of trusted nodes, with $v_i = 1/m$ for each of the $m$ trusted nodes and 0 otherwise. (If it is hard to find trusted nodes, each node can in fact use itself as the only trusted node.) We then estimate the consensus vector by performing a limited number of iterations of the PCA algorithm; i.e., we compute $\vec{v}\,T^k$ for a small $k$. If we pick $k$ correctly, $\vec{v}\,T^k$ will have good mixing within each component, but limited flow between the honest and malicious components. Thus, if all the trusted nodes are honest, the consensus vector will put limited weight on the malicious nodes.
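A minimal sketch of this trusted-start, limited-iteration estimate (the function and parameter names are ours); the choice of the iteration count $k$ is discussed next:

```python
import numpy as np

def trusted_consensus(T, trusted, k):
    """Run only k power-iteration steps from a starting vector
    concentrated on a set of trusted node indices. Because mixing
    across the narrow cut to a colluding clique is slow, little
    weight leaks to the attackers within k steps."""
    v = np.zeros(T.shape[0])
    v[list(trusted)] = 1.0 / len(trusted)
    for _ in range(k):
        v = v @ T     # one step of the random walk from the trusted nodes
    return v
```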
In practice, we found that we can estimate a good choice of $k$ by using our normal termination condition, i.e., the smallest $k$ such that $|T^k\vec{u} - T^{k-1}\vec{u}| < \epsilon$. The algorithm turns out not to be very sensitive to the choice of $k$: values somewhat above this threshold still result in quite accurate estimation and still limit the bandwidth of the malicious clique. Figure 9 shows the effect of the limited number of iterations on the fat pipe attack.
As can be seen in Figure 9, malicious nodes still benefit from misreporting their bandwidth: their estimated bandwidth is inflated by a factor of 4. This is a relatively minor problem in absolute terms, but when a node's relative rank is used [14], the effect is more serious: one of the malicious nodes was able to increase its rank from 287 to 3.
To reduce the extent of this problem, we perform ``liar detection.'' Once we obtain an estimate for the consensus bandwidth vector $\vec{e}$, we can calculate a liar metric for node $i$ by comparing $\vec{e}$ with the initial observation vector of that node: $\ell_i = |\vec{t}_i - \vec{e}|$. If the liar metric for a node exceeds a threshold, we call it a liar. We can then either eliminate it from consideration completely or, more conservatively, add it to the unevaluated set. After this, the computation of the consensus vector is repeated, and a new round of liar detection is applied, iterating until no more liars are identified.
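A sketch of the resulting detection loop, where the L1 distance and the pluggable consensus function are our assumed instantiation of the description above:

```python
import numpy as np

def detect_liars(T, consensus_fn, threshold):
    """Repeatedly compute the consensus vector, flag nodes whose
    normalized observation row deviates from it by more than the
    threshold (L1 distance), drop them, and recompute until no new
    liars appear. Assumes every remaining row has some observations."""
    honest = list(range(T.shape[0]))
    while True:
        sub = T[np.ix_(honest, honest)]
        rows = sub / sub.sum(axis=1, keepdims=True)
        e = consensus_fn(sub)                  # e.g., trusted_consensus above
        liars = [honest[i] for i in range(len(honest))
                 if np.abs(rows[i] - e).sum() > threshold]
        if not liars:
            return e, honest
        honest = [h for h in honest if h not in liars]
```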
The final algorithm is shown in Figure 10. In the absence of new or malicious nodes, it reduces to the simple PCA algorithm. Figure 9 shows the effect of liar detection on the fat pipe attack. Seven out of the ten malicious nodes are identified as liars and are removed from the calculation (they do not appear on the graph). The remaining nodes are seen to obtain only a slight advantage over telling the truth. Liar detection works hand-in-hand with limited convergence: a node that severely misrepresents its observation vector will be detected as a liar, whereas a node that deviates only slightly will not obtain much benefit due to limited convergence.
Tor, the second-generation onion-routing project, also uses self-reported bandwidth to do load balancing and increase network performance [6]. Routers altering their self-reported bandwidth have been shown to be able to effectively compromise a large fraction of Tor paths [2].
In previous work, we have proposed using a small sample of randomly selected peers to make routing decisions [14]. This approach is resilient to misreported bandwidths, but it enables partitioning attacks and this has prevented its adoption. A small sample also leads to much lower accuracy than EigenSpeed.
The BitTorrent file-sharing protocol also uses direct observations in its tit-for-tat protocol. However, since upload/download ratios are not shared between peers, attacks such as the ``Large View'' exploit [13] can abuse the optimistic unchoking feature of tit-for-tat to download at high speed from the network without ever uploading.
A related problem is the estimation of latency between peers. Here, the dimensionality of the problem can be reduced by creating virtual coordinates using spring relaxation techniques [5]. Veracity protects these coordinates by introducing checks by a random (but verifiable) set of monitor nodes [12]. The authors suggest that the same approach could be used for bandwidth, but low-bandwidth nodes will have difficulty monitoring high-bandwidth ones. EigenSpeed, in essence, is able to use all peers as monitors without introducing extra traffic.
As mentioned above, PCA is used in many applications, including PageRank [3] and EigenTrust [9]. When resilience to dishonest nodes is needed, the usual approach is to modify the transition matrix to include a small chance of transitioning to a random node. This is necessary because these applications cannot exploit the symmetry of $T$ as we do in this paper. The disadvantages of this approach were discussed in Section 4.2.
We have presented EigenSpeed, a novel algorithm for distributed estimation of bandwidth in a peer-to-peer system that is accurate, efficient, and secure. We demonstrated that EigenSpeed produces accurate results by testing it with a set of peer bandwidths collected from the Tor network, and we showed that it is resilient to both network churn and malicious attacks. EigenSpeed can be used to achieve consensus bandwidth estimation and balance load across all nodes in a peer-to-peer network in a secure fashion while maintaining accuracy similar to that of self-reporting.
In our future work, we will explore how to compute the consensus bandwidth in a decentralized fashion, following an approach similar to that of EigenTrust. We will also evaluate the accuracy of EigenSpeed as extended to support asymmetric upload and download bandwidths and investigate the possibility of new attacks introduced by such a change.
We also plan to perform a whole-system evaluation of EigenSpeed as a component of a peer-to-peer network such as Tor in a testbed environment. Finally, we will derive formal bounds on the influence of malicious peers.