NSDI '04 Abstract
Pp. 211224 of the Proceedings
Hybrid Global-Local Indexing for Efficient
Peer-to-Peer Information Retrieval
Chunqiang Tang and Sandhya Dwarkadas, University of Rochester
Abstract
Content-based full-text search still remains a particularly
challenging
problem in peer-to-peer (P2P) systems. Traditionally, there have been
two index partitioning structurespartitioning
based on the document space or partitioning based on keywords. The
former
requires search of every node in the system to answer a query whereas
the latter transmits a large amount of data when processing multi-term
queries.
In this paper, we propose eSearcha P2P keyword search system
based
on a novel hybrid indexing structure. In eSearch, each node
is responsible for certain terms. Given a document, eSearch uses a
modern information retrieval
algorithm to select a small number of top (important) terms in the
document
and publishes the complete
term list for the document to nodes responsible for those top
terms. This selective replication of term lists allows a
multi-term
query to proceed local to the nodes responsible for query terms.
We also propose automatic query expansion to alleviate the degradation
of
quality of search results due to the selective replication, overlay
source multicast to reduce the cost of disseminating term lists,
and
techniques to balance term list distribution across nodes.
eSearch is scalable and efficient, and obtains search results as
good as
state-of-the-art centralized systems. Despite the use of replication,
eSearch actually consumes less bandwidth than systems based on keyword
partitioning
when publishing metadata for a document. During a retrieval
operation, it searches only a small number of nodes and typically
transmits a
small amount of data (3.3KB) that is independent of the size of the
corpus and
grows slowly (logarithmically) with the number of nodes in the system.
eSearch's efficiency comes at a modest storage cost, 6.8 times that of
systems based on keyword partitioning. This cost can be further
reduced by adopting index compression or pruning techniques.
- View the full text of this paper in HTML and PDF.
The Proceedings are published as a collective work, © 2004 by the USENIX Association. All Rights Reserved. Rights to individual papers remain with the author or the author's employer. Permission is granted for the noncommercial reproduction of the complete work for educational or research purposes. USENIX acknowledges all trademarks within this paper.
- If you need the latest Adobe Acrobat Reader, you can download it from Adobe's site.
|