Duc A. Tran
Computer Science Department
University of Dayton
Dayton, OH 45469
duc.tran@notes.udayton.edu
We employ the commonly used Cosine distance function to measure the semantic similarity between two objects and
:
where
is the dot product between vectors
and
and
is the Euclidean vector norm. The smaller
is, the more semantically similar are
and
to each other.
We consider two types of queries: point queries and range queries. A point query is described by a term vector
(
,
, ..,
). We expect to return those data objects
such that
is minimum. In some applications, the user may also specify a small constant
to find those objects such that
.
There are two types of range query, namely simple and composite. A simple range query is described by a hyperrectangular region
[
,
]
[
,
]
..
[
,
]. A composite range query is a set of simple range queries. For a range query
, we expect to return those data objects that belong to the region
. While the system is aimed to be fully decentralized, we assume that a new user knows at least one existing user before the former can join the network.
For building the cluster overlay, we propose to use the Zigzag hierarchy, which we originally devised for streaming multimedia [2,3]. The main advantage of Zigzag is its capability to handle the dynamics of P2P networks. We first present Zigzag and then propose how similarity search can be fulfilled efficiently using this hierarchy.
An illustration is given in the top diagram of Figure 1, where 52 nodes are organized into a Zigzag-4 hierarchy. Hereafter, we denote by and
the head and associate-head, respectively, of a cluster or a peer. Since a peer may have different associate-heads at different layers, we use the notation
to refer to the associate-head of
at layer
. For instance, in Figure 1,
= 21,
= 17. Below are the terms we use for the rest of the paper:
![]() |
The above rules guarantee a tree structure including all the peers; we call this tree the Zigzag tree. A Zigzag- hierarchy of
peers provides the following desirable properties: (see [3] for complete proofs):
(1) The maximum nodal degree in the Zigzag tree is
,
(2) The maximum height of the Zigzag tree is 2
,
(3) Recovery of a peer failure requires at most
peer reconnections, and
(4) As peers join and leave, a cluster may be split or merged with another cluster to satisfy the [
,
] cluster size constraint. The worst-case number of peer reconnections due to a split or merger is
.
As an example, we consider the hierarchy in Figure 1 and suppose that the index zones owned by the 13 layer-0 clusters are ,
, ..,
(respectively, from left to right). Thus,
,
,
, etc. Because peer 9 has two children (peer 1 and peer 14), peer 9 keeps the value {(1,
), (14,
)} and
. The index zone assignments are similar for other peers and shown in Figure 1 (bottom diagram). Since peers other than the heads and associate-heads at layer 0 do not own any index zone, they are not present in this index zone tree.
The advantage of the zone assignment policy is its support for efficient search. A search query just follows the links in the Zigzag tree to branches that lead to the smallest index zone(s) containing the query. The next subsection details the search algorithm.
Case 1: is a leaf in the Zigzag tree (e.g., peers 15, 36, 40 in Figure 1) and
needs to process query
. Since
does not have any index zone information, it sends the query to its associate-head
.
computes
. If
, some results of
, that correspond to
, can be found locally. Indeed,
just needs to broadcast query
to all layer-0 clustermates asking them to return to peer
the objects inside
. Furthermore, when
receives
, if it stores any index (peer_location
, term_vector
) such that
,
asks peer
to send object
to peer
. We must also return the results that correspond to query
if
because these results are not in the local cluster. In this case,
creates a new query
. How
processes query
is similar to the case below.
Case 2: is a non-leaf node in the Z-tree (e.g., peers 22, 37, 42 in Figure 1) and
needs to process query
. In this case,
must own a zone
. Query
is broken into two subqueries
and
, which will be handled in parallel as follows.
The search path length is at most the maximum distance in hops between two peers in the Zigzag tree, or . The search overhead is proportional to the total number of peers contacted for all the subqueries, which depends on the range of the original query. In our performance study, we found that this overhead is indeed very small.
Initially, there is only one peer in the network. It is the head of its self-formed cluster , which grows larger as subsequent peers join. The index zone owned by this cluster is
and the ID of this zone is kept at the head node. When the cluster size exceeds
, we need to partition
into two smaller clusters,
and
, whose sizes are in the interval [
,
]. We propose to partition
along a selected dimension
into two halves
[0, 1/2)
and
[1/2, 1]
, each to be owned by
and
. It is possible that a peer in cluster
has an object in
(similarly, a peer in cluster
may have an object in
). In this case, we store the index of this object in the other cluster. The number of such objects is called the index migration overhead. We want to minimize this overhead so that (1) the communication overhead due to index relocation is low, and (2) peers in the same cluster have highly similar objects. This is equivalent to minimizing
where
. The algorithm for this purpose is run by
- the head of cluster
. Upon a request sent by
, each peer
in
submits to
a set of tuples (
,
,
), for all
[1,
]. Upon receipt of those sets from all the member peers, we can devise a simple greedy but optimal algorithm for
to find the best
,
, and dimension
. The complexity of this algorithm is
.
Each cluster randomly selects two nodes as its head
and associate-head
(the old head of cluster
, however, is preferred to remain head of the newly created cluster it belongs to). The heads will automatically belong to layer 1 and form a new cluster. Since layer 1 now is the highest layer, only the head needs to be designated; this head is randomly chosen between the two member peers. The index zone owned by this cluster is the union of the zones owned by its child clusters; in this case, it is
.
Having the Zigzag hierarchy initially constructed, we need maintain it under network dynamics such as when a peer publishes or removes objects, and joins or quits the network. The detailed solutions to these sub-problems are presented in [4], which shows that removal of a peer requires peer reconnections, addition of a peer requires
peer contacts, and addition or removal of an object also requires
peer contacts.
We conducted simulation for EZSearch. Peers arrived at rate 2 peers per second and might quit the network randomly at anytime. Thus the contents and indices stored in the network changed dynamically. The results were promising. For instance, Figure 2 shows the effect of the constant used in the Zigzag-
hierarchy. In all scenarios, the query and any of its subqueries do not travel more than 12 nodes (among 10,003 nodes) before knowing the locations of the answers. Normalized search overhead is computed as
, where
is the number of nodes a query and its subqueries visit during the search,
the number of nodes currently in the system, and
the volume of the query. For a query of volume 0.2, the broadcast-based search would incur a normalized search overhead of
5. EZSearch has a normalized search overhead always less than 0.6 (on average) and 1.8 (worst-case), and much smaller when
is larger. EZSearch therefore is fast and highly efficient. Our future work includes (1) refining the current algorithms for stronger index locality preservation within each cluster, and (2) considering various distributions of objects over peers.
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -show_section_numbers -local_icons main.tex
The translation was initiated by on 2005-02-14