Duc A. Tran
Computer Science Department
University of Dayton
Dayton, OH 45469
duc.tran@notes.udayton.edu
We employ the commonly used Cosine distance function to measure the semantic similarity between two objects $X$ and $Y$:

$$d(X, Y) = 1 - \frac{X \cdot Y}{\|X\| \, \|Y\|}$$

where $X \cdot Y$ is the dot product between vectors $X$ and $Y$ and $\|\cdot\|$ is the Euclidean vector norm. The smaller $d(X, Y)$ is, the more semantically similar $X$ and $Y$ are to each other.
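As a minimal sketch of this computation, assuming term vectors are plain Python lists (the helper name cosine_distance is ours, not the paper's):

```python
import math

def cosine_distance(x, y):
    """Cosine distance d(X, Y) = 1 - (X . Y) / (||X|| ||Y||);
    smaller values mean higher semantic similarity."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / (norm_x * norm_y)

# Vectors with similar term weights are close; orthogonal ones are not.
print(cosine_distance([1.0, 0.5, 0.0], [0.9, 0.6, 0.1]))  # close to 0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))            # exactly 1.0
```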
We consider two types of queries: point queries and range queries. A point query is described by a term vector $Q = (q_1, q_2, \ldots, q_m)$. We expect to return those data objects $X$ such that $d(Q, X)$ is minimum. In some applications, the user may also specify a small constant $\epsilon$ to find those objects $X$ such that $d(Q, X) \le \epsilon$. There are two types of range query, namely simple and composite. A simple range query is described by a hyperrectangular region $[l_1, u_1] \times [l_2, u_2] \times \cdots \times [l_m, u_m]$. A composite range query is a set of simple range queries. For a range query $R$, we expect to return those data objects $X$ that belong to the region $R$. While the system is intended to be fully decentralized, we assume that a new user knows at least one existing user before the former can join the network.
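To make the query types concrete, here is a small sketch of the membership tests a peer could apply to a candidate object; all helper names are ours, and cosine_distance is reused from the sketch above:

```python
def matches_point_query(q, x, epsilon):
    """Point query: accept object x if d(q, x) <= epsilon."""
    return cosine_distance(q, x) <= epsilon

def in_simple_range(x, ranges):
    """Simple range query: x must lie inside the hyperrectangle
    [l1, u1] x [l2, u2] x ... x [lm, um], given as (l, u) pairs."""
    return all(l <= xi <= u for xi, (l, u) in zip(x, ranges))

def in_composite_range(x, simple_queries):
    """Composite range query: a set of simple range queries; x matches
    if it falls inside any of the hyperrectangles."""
    return any(in_simple_range(x, r) for r in simple_queries)
```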
For building the cluster overlay, we propose to use the Zigzag hierarchy, which we originally devised for streaming multimedia [2,3]. The main advantage of Zigzag is its capability to handle the dynamics of P2P networks. We first present Zigzag and then show how similarity search can be performed efficiently using this hierarchy.
An illustration is given in the top diagram of Figure 1, where 52 nodes are organized into a Zigzag-4 hierarchy. Hereafter, we denote by $head(X)$ and $ahead(X)$ the head and associate-head, respectively, of a cluster or a peer $X$. Since a peer may have different associate-heads at different layers, we use the notation $ahead_j(X)$ to refer to the associate-head of $X$ at layer $j$. For instance, in Figure 1, the associate-heads of one layer-0 peer are peer 21 at its own layer and peer 17 one layer above. Below are the terms we use for the rest of the paper:
The above rules guarantee a tree structure including all the peers; we call this tree the Zigzag tree. A Zigzag-$k$ hierarchy of $N$ peers provides the following desirable properties (see [3] for complete proofs): (1) the maximum nodal degree in the Zigzag tree is $O(k^2)$, (2) the maximum height of the Zigzag tree is $2\log_k N$, (3) recovery of a peer failure requires at most $O(k^2)$ peer reconnections, and (4) as peers join and leave, a cluster may be split or merged with another cluster to satisfy the $[k, 3k]$ cluster size constraint; the worst-case number of peer reconnections due to a split or merger is $O(k)$.
As an example, we consider the hierarchy in Figure 1 and suppose that the index zones owned by the 13 layer-0 clusters are $Z_1, Z_2, \ldots, Z_{13}$ (respectively, from left to right). Each peer at a higher layer owns the union of the zones owned by its children and keeps a (child, zone) pair for each child. For instance, because peer 9 has two children (peer 1 and peer 14), peer 9 keeps the value $\{(1, zone(1)), (14, zone(14))\}$ and $zone(9) = zone(1) \cup zone(14)$. The index zone assignments are similar for other peers and shown in Figure 1 (bottom diagram). Since peers other than the heads and associate-heads at layer 0 do not own any index zone, they are not present in this index zone tree.
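This bookkeeping can be pictured with a toy sketch; the fragment below is a hypothetical slice of the tree around peer 9, with zones modeled simply as sets of layer-0 zone labels rather than geometric regions:

```python
# Hypothetical fragment of the index zone tree around peer 9.
children = {9: [1, 14]}          # peer id -> child peer ids
zone = {1: {"Z1"}, 14: {"Z2"}}   # layer-0 heads own their zone directly

def assign_zone(peer):
    """zone(p) of a non-leaf peer is the union of its children's zones;
    the peer also knows (child, zone(child)) pairs for routing."""
    for child in children.get(peer, []):
        if child not in zone:
            assign_zone(child)
    if peer in children:
        zone[peer] = set().union(*(zone[c] for c in children[peer]))
    return zone[peer]

print(assign_zone(9))   # {'Z1', 'Z2'} = zone(1) U zone(14)
```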
The advantage of the zone assignment policy is its support for efficient search. A search query just follows the links in the Zigzag tree to branches that lead to the smallest index zone(s) containing the query. The next subsection details the search algorithm.
Case 1: $X$ is a leaf in the Zigzag tree (e.g., peers 15, 36, 40 in Figure 1) and needs to process query $Q$. Since $X$ does not have any index zone information, it sends the query to its associate-head $Y$. $Y$ computes $Q' = Q \cap zone(Y)$. If $Q' \ne \emptyset$, some results of $Q$, those that correspond to $Q'$, can be found locally. Indeed, $Y$ just needs to broadcast query $Q'$ to all layer-0 clustermates, asking them to return to peer $X$ the objects inside $Q'$. Furthermore, when a clustermate receives $Q'$, if it stores any index (peer_location $p$, term_vector $v$) such that $v \in Q'$, it asks peer $p$ to send the corresponding object to peer $X$. We must also return the results that correspond to the remainder $Q \setminus Q'$, if nonempty, because these results are not in the local cluster. In this case, $Y$ creates a new query $Q'' = Q \setminus Q'$. How $Y$ processes query $Q''$ is similar to the case below.
Case 2: $X$ is a non-leaf node in the Zigzag tree (e.g., peers 22, 37, 42 in Figure 1) and needs to process query $Q$. In this case, $X$ must own a zone $zone(X)$. Query $Q$ is broken into two subqueries $Q_1 = Q \cap zone(X)$ and $Q_2 = Q \setminus zone(X)$, which will be handled in parallel as follows.
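As a rough illustration only (the paper gives the precise parallel handling), here is how the Case 1/Case 2 splitting could look in code, again with zones and queries modeled as sets of layer-0 zone labels; set intersection and difference stand in for geometric operations on hyperrectangles, and all names are ours:

```python
# Toy zone tree: peer 9 covers Z1 and Z2 through children 1 and 14.
zone = {9: {"Z1", "Z2"}, 1: {"Z1"}, 14: {"Z2"}}
children = {9: [1, 14]}
parent = {1: 9, 14: 9}

def route_query(peer, query):
    """Split Q into Q1 = Q intersect zone(X) and Q2 = Q - zone(X);
    push Q1 down the covering branches and Q2 up toward a larger zone."""
    q1 = query & zone[peer]               # answerable in this subtree
    q2 = query - zone[peer]               # residue outside this subtree
    kids = children.get(peer, [])
    if not kids and q1:                   # layer-0: resolve locally (Case 1)
        print(f"peer {peer}: answers for {sorted(q1)} in local cluster")
    for child in kids:                    # Case 2: branch in parallel
        sub = q1 & zone[child]
        if sub:
            route_query(child, sub)
    if q2 and peer in parent:             # a real system would avoid
        route_query(parent[peer], q2)     # revisiting the origin branch

route_query(1, {"Z1", "Z2"})  # peer 1 answers Z1; Z2 climbs to 9, then 14
```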
The search path length is at most the maximum distance in hops between two peers in the Zigzag tree, i.e., $O(\log_k N)$. The search overhead is proportional to the total number of peers contacted for all the subqueries, which depends on the range of the original query. In our performance study, we found that this overhead is indeed very small.
Initially, there is only one peer in the network. It is the head of its self-formed cluster $C$, which grows larger as subsequent peers join. The index zone owned by this cluster is the entire space $[0, 1]^m$, and the ID of this zone is kept at the head node. When the cluster size exceeds $3k$, we need to partition $C$ into two smaller clusters, $C_1$ and $C_2$, whose sizes are in the interval $[k, 3k]$. We propose to partition the zone along a selected dimension $i$ into two halves, $Z_1 = [0, 1/2)$ and $Z_2 = [1/2, 1]$ on that dimension, to be owned by $C_1$ and $C_2$, respectively. It is possible that a peer in cluster $C_1$ has an object in $Z_2$ (similarly, a peer in cluster $C_2$ may have an object in $Z_1$). In this case, we store the index of this object in the other cluster. The number of such objects is called the index migration overhead. We want to minimize this overhead so that (1) the communication overhead due to index relocation is low, and (2) peers in the same cluster have highly similar objects. This is equivalent to minimizing the number of objects indexed on the opposite side of the split from their owner's cluster. The algorithm for this purpose is run by the head of cluster $C$. Upon a request sent by the head, each peer in $C$ submits to the head one tuple per dimension $i \in [1, m]$, summarizing how many of its objects fall into each half of that dimension. Upon receipt of those tuples from all the member peers, the head can run a simple greedy but optimal algorithm to find the best $C_1$, $C_2$, and dimension $i$; its complexity is polynomial in $|C|$ and $m$.
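The following is our own illustrative reconstruction of such a greedy split, not necessarily the paper's exact algorithm; the per-peer report format (lo/hi object counts per dimension) is an assumption we have made for the sketch:

```python
def best_split(reports, k):
    """Greedy split sketch (our reconstruction, not necessarily the
    paper's exact algorithm). reports[p][i] = (lo, hi): how many of
    peer p's objects fall in [0, 1/2) vs [1/2, 1] along dimension i.
    Returns the split with the smallest index migration overhead,
    keeping both new clusters at size >= k."""
    best = None
    m = len(next(iter(reports.values())))        # number of dimensions
    for i in range(m):
        # pref[p] > 0 means most of p's objects sit in the lower half
        pref = {p: r[i][0] - r[i][1] for p, r in reports.items()}
        c1 = {p for p in pref if pref[p] >= 0}   # lower-half cluster
        c2 = set(reports) - c1                   # upper-half cluster
        # repair the size constraint by moving least-committed peers
        while len(c1) < k:
            p = max(c2, key=lambda q: pref[q]); c2.remove(p); c1.add(p)
        while len(c2) < k:
            p = min(c1, key=lambda q: pref[q]); c1.remove(p); c2.add(p)
        # overhead: indices that must be stored in the other cluster
        cost = (sum(reports[p][i][1] for p in c1)
                + sum(reports[p][i][0] for p in c2))
        if best is None or cost < best[0]:
            best = (cost, i, c1, c2)
    return best                                  # (overhead, dim, C1, C2)

# Four peers, two dimensions; dimension 1 separates their objects best.
reports = {"a": [(3, 3), (6, 0)], "b": [(2, 4), (5, 1)],
           "c": [(4, 2), (0, 6)], "d": [(1, 5), (1, 5)]}
print(best_split(reports, k=2))  # -> (2, 1, {'a', 'b'}, {'c', 'd'})
```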
Each cluster randomly selects two nodes as its head and associate-head (the old head of cluster $C$, however, is preferred to remain the head of the newly created cluster it belongs to). The two heads automatically belong to layer 1 and form a new cluster there. Since layer 1 is now the highest layer, only a head needs to be designated for it; this head is randomly chosen between the two member peers. The index zone owned by this layer-1 cluster is the union of the zones owned by its child clusters; in this case, it is $Z_1 \cup Z_2$, i.e., the entire space $[0, 1]^m$.
Having the Zigzag hierarchy initially constructed, we need to maintain it under network dynamics, such as when a peer publishes or removes objects, and joins or quits the network. The detailed solutions to these sub-problems are presented in [4], which shows that removal of a peer requires $O(k^2)$ peer reconnections, while addition of a peer, as well as addition or removal of an object, requires $O(\log_k N)$ peer contacts.
We conducted a simulation study of EZSearch. Peers arrived at a rate of 2 peers per second and might quit the network randomly at any time; thus the contents and indices stored in the network changed dynamically. The results were promising. For instance, Figure 2 shows the effect of the constant $k$ used in the Zigzag-$k$ hierarchy. In all scenarios, the query and any of its subqueries do not travel more than 12 nodes (among 10,003 nodes) before knowing the locations of the answers. Normalized search overhead is computed as $s / (n \cdot v)$, where $s$ is the number of nodes a query and its subqueries visit during the search, $n$ the number of nodes currently in the system, and $v$ the volume of the query (a quick sanity check of this formula follows at the end of this section). For a query of volume 0.2, a broadcast-based search would incur a normalized search overhead of 5. EZSearch has a normalized search overhead always less than 0.6 (on average) and 1.8 (worst-case), and much smaller when $k$ is larger. EZSearch therefore is fast and highly efficient. Our future work includes (1) refining the current algorithms for stronger index locality preservation within each cluster, and (2) considering various distributions of objects over peers.
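The sanity check promised above, with variable names of our choosing:

```python
def normalized_overhead(s, n, v):
    """s = nodes visited by a query and its subqueries,
    n = nodes currently in the system, v = query volume."""
    return s / (n * v)

# A broadcast search visits every node, so for a query of volume 0.2:
print(normalized_overhead(10_003, 10_003, 0.2))  # 5.0, matching the text
```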