Check out the new USENIX Web site. USENIX - Summaries


These reports were originally published in ;login:, Vol. 23, No. 2, April 1998.

Our thanks to the Summarizers: Petros Maniatis <maniatis@cs.standord.edu> and Mark Mellis <mkm@mellis.com>

USENIX Symposium on Internet Technologies and Systems

MONTEREY, CALIFORNIA
December 8-11, 1997

KEYNOTE ADDRESS
Puberty ­ The Approach to Maturity
Heiden Heiden, UUNET Technologies

Summary by Petros Maniatis

This talk presented the history of the pubescent Internet, from its early beginnings all the way to the present, also hinting at what its future is likely to look like, although confident predictions when the Internet is concerned were deemed "a joke" by the speaker.

The Internet started in 1969 as ARPANET (with four nodes in Southern California). Before that, all there was were terminals connected to mainframes. In 1975, the Internet became operational and used NCP as the protocol stack. In the same year, the Department of Defense decided it was time for them to have a centralized data network. In 1976, they awarded the contract to TRW Aerospace. In 1979, the ISO OSI (Open Systems Interconnect) reference protocol specifications came out.

In 1981, TRW had their first pertinent IOC (initial operations capabilities) document out. There were lots of doubts about it because this was a network to which users had to figure out how to connect. Would it work after it got built? They commissioned a new network to compete against it in 1982. It was called DDN, was based on ARPANET technology, and used TCP and IP combined. The government had a shootout with AUTODIN II (which they already had developed), and DDN won, becoming the new internetworking champion.

In April of 1982, DDN was mandated as the network of choice for data communications, receiving large amounts of federal funding. Although a network was being designed, the driving idea was clearly the protocols used by that network. If you could put these protocols in the hands of the users, demand would drive the military, the government, and commerce in the right direction. It would be a user-driven network. As a result, in January of 1983, the TCP/IP stack, the stack that DDN was using, was also mandated as the protocol stack of choice. The specifications called for high (99%) availability and low end-to-end delays.

At that point, ARPANET and MINET had been combined. In 1983, they were split into a government-driven component (operated as DDN) and the rest, which was called ARPANET.

In 1984, the DDN topology was pretty large, running throughout the US. Its biggest circuit had a bandwidth of 56 Kbps. In Europe, 9.6 and 2.4 Kbps were still the best bandwidth available, although connectivity was quite good.

As increased computer power was becoming more available and affordable; T1 (1.544 Mbps) lines started getting bought. Then Mosaic came.

At the same time (from the mid 80s to mid 90s, war was waged between TCP/IP and ISO's OSI. Many sides were trying to have TCP/IP established, and, in the end OSI died a slow death.

By the end of the 1980s, the 1984 deregulation was raging. It seems it will take a long time, though. Now, 14 years later, it's still just catching on.

The ingredients for the success of the Internet are:

1. It's user driven. The commercial products have to be released and working fast. Having it driven by the government would allow for longer delays.

2. It doesn't bill according to time and distance of operations. Users pay for the size of their connections and have access to everybody else in the world. The economics of this billing model worked, and commercial enterprises learned how to be profitable from it. We are now trying desperately to hang onto that.

3. It was built on open systems and interconnected existing networks. First it mainly consisted of TCP/IP on top of X25. Now it's TCP/IP on top of Ethernet or ATM.

4. It more or less stuck to a single standard (TCP/IP) everywhere, and a lot of work from many people went into it. It's the one set of protocols that everybody is using, and we are not sure where we would be without TCP/IP as the one and only protocol stack.

The speaker then spoke of the present state of UUNET. Because UUNET's was one of the core components of the public Internet, one can gain a good idea of what the Internet looks like right now.

UUNET is a large wholesaler. It provides a large backbone, to which other Internet service providers connect. Its core business is direct access to the Internet. Resellers offer dialup access and Web services.

UUNET's DS-3 network was connecting Washington state, California, Texas, North Carolina, New York and the Washington DC area in March of 1996. It seems its design had problems, though, because a year later demand increased by 1000% and the existing infrastructure

couldn't keep up. Most large wholesalers have actually started running out of bandwidth and can't buy enough to cover the demand. At the same time, the prices have remained stationary. "All you can eat for $19.95." The price is right, so the Internet is flourishing. At this time, 1.5 DS3 connections are installed per day.

UUNET was bought out twice, first by MFS and then by WorldCom. This was fortunate, because UUNET was now paired with the facilities providers. They could actually tell them what they needed worked out, twice a year, and have them do it. WorldCom in fact more than doubled the size of their network, from 11,000 miles to 25,000 miles. The two busiest areas were Silicon Valley and Washington, DC, as well as east-west connections.

Latency has always been the big issue. The objective was to get the latency down to 100ms. This is still the objective.

The persisting fact is that there is a tenfold increase of demand per year. UUNET's backbone demand doubles every four months, and by December 1999, the network will be (at least) 1,000 times larger. Still, 20% of UUNET's sites have latency higher than 100ms. At the same time, there is an OC12 network overlayed on top of the DS3s, with plans to move on to OC48s in the next few months. As the speaker put it, "If you're not scared by all this, you don't understand what's going on."

The new routers installed are given a year. After that, nobody knows what is going to happen. In other words, the state of the current network is one year ahead, and if no significant improvements are made in a year, sales will have to stop. Five-year planning is a joke.

In the more global view, Europe is three to four years behind (mostly E1/T1 links), but has already been deregulated. Market penetration is not that high yet

(only 5% of European homes have a PC), but Europe is growing fast. The Pacific Rim is three to four years behind Europe.

On the metropolitan dial-in front, the Microsoft Network is a venture funded by Microsoft, but owned by UUNET. It involved about 300,000 modems at the time of the symposium (within the DOW network), with large increases planned. Its design goal is to obliterate blockage.

The future is unclear. The speaker argued that we're not in the middle or the end of this race; we're at the beginning. Commercial companies are still learning how to deploy this technology and be able to pay for it.

Heiden predicted that FAX is going to be the next major Internet application (late 1990s). It's now 50% of all transAtlantic communications. Next year it is going to be on the Internet, incurring 5,000,000 sessions a day on dial-in networks. He also suggested that IP voice will be the next big Internet application after that, while system design (of Intranets/ Extranets) is going to become more active. Boiled down, the argument rests between packet-switched and circuit-switched technology.

The speaker identified the next wave of challenges as:

  • regulatory (applications): IP voice, international regulation
  • infrastructure distribution: Intranets and Extranets
  • technology: really fast switches and routers
Concluding, Heiden predicted that the next five years will be very exciting and neurotic. Anyone whose business lies within this technology will do well.

TECHNICAL SESSIONS

Session: Caching I
Summaries by Mark Mellis

Study of Piggyback Cache Validation for Proxy Caches in the World Wide Web
Balichander Krishnamurthy, AT&T Labs ­ Research and Craig E. Wills, Worcester Polytechnic Institute

Several factors need be considered in the construction of Web caches ­ size of the cache, i.e. the amount of main memory and disk space allocated to it; replacement policy, i.e. choosing which valid data items to keep in the cache, and coherency policy, i.e. ensuring that the data in the cache is consistent with the data on the server. Current Web cache implementations typically use time-to-live (TTL) techniques to maintain cache coherency. Objects that don't have explicit expiration times are flushed from the cache after a time that may be fixed or determined heuristically. The subject of the authors' research, Piggyback Cache Validation (PCV), maintains cache coherency by piggybacking: whenever a cache communicates with a server, it adds validation requests for potentially stale objects "on the back of" the message. In this way, the number of cache-server messages is kept low while the coherency of the cache is improved.

As is the case in much of the work presented at USITS 97, this research was conducted by simulating the performance of a Web proxy cache using large traces of actual proxy traffic. The authors' PCV work showed a 16-17% reduction in number of cache-server messages, 6-8% reduction in average cost, and 57-65% reduction in cache staleness ratio, all in comparison to the best TTL-based policy. Krishnamurthy indicated that there is more to come along this line of research ­ he considered this paper to be the "least publishable unit."

Exploring the Bounds of Web Latency Reduction from Caching and Prefetching
Thomas M. Kroeger and Darrell D.E. Long, University of California, Santa Cruz, and Jeffrey C. Mogul, Digital Equipment Corporation

When examining approaches to optimizing systems, one of the first questions a designer asks is, "Where are the biggest wins, and how big are they?" By developing a sense for the bounds of performance in the ideal case, the designer decides upon implementations that best meet his or her goals. Kroeger presented research examining performance bounds for latency reduction using two techniques: caching and prefetching. Since the point of the research was to find bounds, the authors looked at optimal caches ­ those with infinite size and with complete knowledge of future events. Using these optimal caches (limited in various ways), they used trace-driven simulations to determine that in the caching-only model, latency could be reduced at best by 26%, while in the prefetching-only model, latency could be reduced by at best 57%. A model that used both caching and prefetching could reduce latency by at best 60%.

The principal factor limiting the potential performance improvement of caching was found to be the rapid turnover of information on the Web. The distance into the future that a prefetching cache can predict is a principle factor in its ability to limit latency, with four minutes of prescience being enough to produce substantial performance improvement.

The Measured Access Characteristics of World Wide Web Client Proxy Caches
Brad Duska, David Marwood, and Michael J. Feeley, University of British Columbia

David Marwood presented research results that helped provide a baseline for intuition: what do real client access patterns look like and what implications do those have on existing cache performance?

Using both trace-driven simulation and static analysis, the authors examined some 47 million requests made by nearly 24,000 clients in seven discrete data sets, preserving the privacy of the clients by passing the traces through a one-way function that preserved the uniqueness of each client while obscuring its identity.

Marwood showed that with second-level cache sizes ranging from 2 to 10 gigabytes, hit rates between 24% and 45% can be expected. Between 2% and 7% of accesses are "false misses" caused by weaknesses in the squid and CERN cache coherence algorithms. Sharing is bimodal between widely shared and narrowly shared objects, and widely shared objects also tend to be shared by clients from different traces (everyone reads Dilbert . . .).

More information on the work described in this presentation can be found at <http://www.cs.ubc.ca/spider/marwood/Projects/SPA>

Session: Servers
Summaries by Mark Mellis

A Highly Scalable Electronic Mail Service Using Open Systems
Nick Christianson, Tim Bosserman, & David Beckmeyer, Earthlink

The rise of national Internet service providers has created needs for traditional computing services delivered on scales far larger than those previously contemplated. Christianson presented the email architecture used by Earthlink Network, Inc., which currently accommodates more than 400,000 users with over 560,000 mailboxes. The system throughput approximates 13 million messages per week, with 40 POP sessions initiated per second and 600 active POP daemon processes running at any one time. The email system has a 99.9% uptime record. The architecture described is expected to scale to greater than one million users.

Prime criteria in the design were message integrity, general robustness (uptime), scalability, performance, cost effectiveness, and legacy considerations. The system is designed around four main functional areas.

Front end servers receive inbound SMTP messages from the Internet. These are stock Unix machines running unmodified sendmail. Local delivery is via a custom mail.local program that interacts with the authentication database and delivers messages to mailboxes stored on Network Appliance file servers.

POP servers are another set of machines that handle interaction with subscribers and delivery of subscriber messages to the Internet. The POP servers also interact with the authentication server and mount user mailboxes via NFS.

Mailbox storage is handled by Network Appliance fileservers. Mailboxes are split over several servers and their location is computed by a balanced hash over 319 directories and stored in the authentication database.

Authentication service evolved from standard UNIX password files, through newdb databases, and is now managed by a commercial database product and accessed via SQL. With 400,000 users, it is no longer possible to assign each user a unique UID in the traditional UNIX sense. The same authentication database serves email, access servers, Usenet news, and other services.

One of the most significant challenges has been in file locking, necessary in mail delivery. Few commercial systems have lock tables large enough or lock table lookups fast enough for Earthlink's needs, so they are currently using file system semaphores for locking. This remains an area of ongoing development.

Improving Web Server Performance by Caching Dynamic Data
Arun Iyengar and Jim Challenger, IBM T.J. Watson Research Center

Much work has been reported in the area of client caching static Web pages. This presentation by Arun Iyengar in contrast reports on server-side caching techniques that are well-suited for dynamic content. Dynamic content delivery is often up to two orders of magnitude slower than static content delivery. By caching dynamic pages at the server, substantial performance increases can be gained. The author described the DynamicWeb cache, used by IBM in support of the 1996 Atlantic Olympic Games Web site, where it achieved a hit rate of around 80%. The DynamicWeb cache is part of the forthcoming net.data offering from IBM.

Because of the transient nature of dynamic Web content, explicit communication between the application and a process known as the cache manager is required to ensure that only long-lived content is cached. IBM has developed an explicit application program interface (API) for application-cache manager communication. Although the cache manager can run on the same host as the application, it most often is deployed on a separate host. Furthermore, multiple cache managers can interact with a single application, and a single cache manager can manage several caches. This allows the system designer great flexibility in his task.

DynamicWeb can satisfy several hundred requests for dynamic content per second given typical workloads, and in tests displayed near-optimal performance in many cases and 58% of optimal performance in the worst case.

Measuring the Capacity of a Web Server
Gaurav Banga and Peter Druschel, Rice University

Banga's presentation described limitations in current Web server benchmarking methodologies and presented a method for generating synthetic Web server workload that more closely resembles real life and can economically produce traffic volumes large enough to overload even high capacity servers. He noted that in benchmarks such as WebStone and SPECWeb96, the clients stress the server under test by operating in lock step, with only a single outstanding request per client. Reaching rates of 1100 requests/sec in this scenario requires the use of 74,000 client processes.

The authors developed S-Clients (short for Scalable Clients) to enable them to generate enough workload to produce request rates similar to those "observed in nature," on an affordable number of client computers. Rather than following the structure of a traditional http client, an S-Client is designed to shorten TCP's connection timeout, and to maintain a constant number of unconnected sockets that are trying to establish new connections. These design goals allow S-Clients to saturate the server with a small number of clients and to ensure that the generated connection attempt rate is independent of the rate at which the server can accept new connections.

Banga displayed graphs illustrating the measured degradation in performance of a Web server when pushed into overload by S-Clients, for both steady-state workloads and bursty workloads. While previous test methodologies showed flat server performance once maximum capacity was reached, the S-Client methodology showed marked performance degradation as presented workload increased. During the question and answer session following the formal presentation, a conference attendee from CNN commented that he had observed similar performance degradation in real life.

More information on this work, including source code, can be found at <http://www.cs.rice.edu/CS/Systems/Web-measurment/>.

Session: Potpourri
Summaries by Petros Maniatis

BIT: A Tool for Instrumenting Java Bytecodes
Han Bok Lee and Benjamin G. Zorn, University of Colorado, Boulder

The objective of this work is to characterize the behavior of Java programs, using instrumentation of Java bytecodes. Furthermore, an approach was taken so that the instrumentation tools produced would be easy to modify so that they meet users' changing needs.

Metainstrumentation is the process of creating instrumentation tools. BIT is the first metainstrumentation tool for Java. The idea behind it is to insert calls to methods to keep track of the number of instructions that get executed while the program is running. Its interface was modelled after ATOM.

BIT consists of a set of Java classes that can be used to build customized tools. It takes a class file, modifies it, and outputs another class file. The BIT system contains instrumentation and analysis code.

The instrumentation code in the libraries deals with modifying a class file and producing the output file. This code has been used to accomplish the motivating goals, namely, insertion of calls before each basic block, extraction of basic metrics per basic block, and maintenance of accumulated counters. Obviously, the instrumentation code shouldn't alter the behavior of the program (for instance, it shouldn't change the number of instructions that get executed per basic block).

The analysis code in the libraries specifies what to do when the methods inserted by the instrumentation code actually get invoked. In this instance of the problem, this code just keeps track of the number of instructions in the basic blocks.

BIT works by hierarchically breaking a class file into the following entities: the program, which is divided into methods, each of which is divided into basic blocks, each of which is divided into instructions. The current system allows navigation among these elements, information gathering, and insertion of calls into the analysis methods in the library.

BIT targets any language that targets the Java Virtual Machine (i.e., any language for which there exists a compiler that outputs Java bytecodes). It can be used to create simple customized tools such as:

  • count the number of times a method is invoked
  • count the number of bytecodes during the execution of a program
  • measure the branch probability on a per-branch basis
  • measure the instruction usage
Furthermore, BIT can be used within other, more advanced tools:
  • profiling (flat or hierarchical)
  • calling context trees
  • program optimizations
  • reorganization
  • JIT optimization via annotation and relation to hardware coverage analysis
  • branch prediction

The performance of BIT has been evaluated on two sample tools produced with BIT:

  • a branch counting tool
  • a dynamic instruction counting tool

Five Java applications have been used as input to these tools (one of which was BIT itself). The execution times of the instrumented applications were increased by 1.1 to 2.5 times (compared to their execution times before instrumentation). The execution time increase is mainly due to the fact that calls to the analysis methods have to be executed, in addition to the normal code that constitutes the program. The code sizes of the instrumented applications were increased by 1.1 to 1.4 times (compared to their code sizes before instrumentation), for obvious reasons.

HPP: HTML Macropreprocessing to Support Dynamic Document Caching
Fred Douglis, AT&T Labs ­ Research, Antonio Haro, College of Computing, Georgia Institute of Technology, and Michael Rabinovich, AT&T Labs ­ Research

Dynamic Web pages, especially those produced by search engines, usually have a low percentage of dynamic resources. In other words, the dynamic data in them are usually far smaller in size than the static data, such as formatting, banners, ads, etc. In some cases, such pages display extensive repetitiveness in their structure. Many attempts have been made to take advantage of these characteristics in order to reduce the amount of bits transmitted per such page.

Data compression and delta encoding have been proposed and implemented in the past. Both approaches result in an increase in the amount of work that the Web server has to do. A different approach was introduced with the OBJECT tag in recent HTML specifications, whereby dynamic objects are embedded in a page. The browser then fetches these dynamic objects separately, which causes an excessive number of requests to be received by the server. Especially in the case of search engines, devising templates containing OBJECT tags to accommodate an average random query can be difficult and inefficient.

The approach taken by the authors of this paper extends HTML with macroinstructions. They provide named placeholders within a cacheable template, which the client fetches first. Then the dynamic data are retrieved separately in a single request. Finally, the dynamic data retrieved are placed within the template and presented to the user.

Similar work has been done with SHTML (server side includes) and ASP (active server pages). The argument the authors provide in favor of their system is that, unlike SHTML and ASP, it executes the expansion of the templates at the client, offloading the server significantly and therefore improving the performance of the server, which is the hot spot for heavily used services.

In summary, the benefits of this work include:

  • bandwidth savings (by not sending multiple copies of similar pages)
  • server load reduction
  • template pre-compression (since templates are static and can be cached/compressed)
  • some help with TCP slow start (by overlapping the calculation of the dynamic data at the server with the transmission of the template)
One of the more powerful aspects of this approach is the use of loops to accommodate multiple similar fragments of a resource, as needed in the case of search engines. Nested loops are also supported. Furthermore, conditional macroexpansion and assignment to macrovariables have been included.

In terms of performance, the most striking result is that the latency observed using this system is only slightly worse than delta encoding. However, the comparison was done between a prototype implementation of this system and the fastest known delta-encoding implementation. Moreover, the extra effort is spent at the client, thereby allowing the server to perform other work at the same time.

Session: Security
Summaries by Petros Maniatis

Lightweight Security Primitives for E-Commerce
Yossi Matias, Alain Mayer, and Avi Silberschatz, Bell Laboratories, Lucent Technologies

The objective of this work is to provide a secure mechanism to support electronic commerce that works well with microtransactions and in a mobile environment. This has been accomplished using a client-side proxy service.

The basic motivation was drawn from the emergence and popularity of personalized electronic commerce applications on the Internet that supply a very large number of very low cost transactions (like a stock quote ticker or an online stockbroker). Such applications normally

maintain a long-term relationship between the server and each client (in the previous example, the stock market service provider and the ticker software at the client's computer), normally in the form of a subscription. Every time new quotes arrive or the user asks "How much did I make this morning?" the client has to pay for the service. Such services have, individually, very low monetary value. The effort devoted to authenticating such a low-value transaction shouldn't be disproportionate.

Currently, SSL (the Secure Sockets Layer) is the predominant method of authentication on the Web. It can run on top of any TCP connection and has very wide applicability. Other such mechanisms with similar characteristics are S-HTTP (Secure HypterText Transport Protocol) or SET (Secure Electronic Transactions). However, all these methods are much more elaborate (and expensive) than the average microtransaction, like the one mentioned previously; they do not provide any flexibility with respect to cost and complexity. For instance, SSL needs to do a public key encryption/decryption per transaction, which is quite expensive. Furthermore, users have to maintain states about their subscriptions on a specific client host (in the form of cookies, for instance).

The proposed solution is modular. It supports secure subscription and initialization of information delivery, ensuring data privacy, authenticity, and integrity to a reasonable degree, as well as third-party provability (nonrepudiation).

Upon initialization of the subscription of the client to the services offered by the server, a single shared key is agreed upon. This is computed at the client's side, given the user's identity, a unique identifier for the service and a local secret. This

shared key is then encrypted using the server's public key and transmitted to it. The server maintains a record of the client's identity, along with the associated shared key. Notice that the public key operation just mentioned happens only once, when the subscription is initialized. Subsequent exchanges are encrypted using this single shared key.

Secret key computation on the client's side is a very important component of the system. The function used must be efficient, computable, and consistent across different clients (i.e., it shouldn't require memorization of any secret across sessions). It should also provide modular security ­ guessing a key (computed through this function) shouldn't provide any help to derive other keys. The function used in this work is the Janus function, used in Lucent's Personalized Web Assistant. This function is based on pseudorandom functions and collision-resistant hashing.

In terms of performance, an elaborate HTML page (about 10kb) takes about 0.06 seconds to be encrypted, whereas using RSA public key encryption of a single key takes about 0.12 seconds and a single RSA signature takes about 1 second. In other words, this system encrypts a whole page twice as fast as RSA encrypts a single symmetric key.

This is not supposed to be a standalone solution, especially because SSL already seems to be a standard. However, the results of this work are intended as a supplementary security mechanism, to be used along with SSL. In a typical scenario, the handshake between the client and the server would occur using SSL; subsequent communication could be performed without SSL.

For more information, the interested reader can look at the Lucent Personalized Web Assistant.

Going Beyond the Sandbox: An Overview of the New Security Architecture in the Java Development Kit 1.2
Li Gong, Marianne Mueller, Hemma Prafullchandra, Rolland Schemers, JavaSoft, Sun Microsystems, Inc.

The objective behind this work was the expansion of the security model present in the Java system, as found in the released Java Developers' Kits. The authors have produced a new, more flexible security model, allowing fine-grained access control of system resources.

In its past versions, the Java security model relied on the concept of the Sandbox. Local applications/applets were always trusted and were allowed to do whatever they pleased with system resources. Remote applets were always suspected for foul play and were allowed to run only within a very restricted environment (called Sandbox), without local filesystem or network access, to name a few of the restrictions. These were relaxed with the next version of the JDK, where key-signed applets (using certificates from a third party) could be trusted, even if they were downloaded from the network.

Both approaches are still too cumbersome for the requirements of contemporary Web-based services. For instance, a stock portfolio maintenance applet would need access to local financial records. However, there would be no need to allow such an applet to access other, unrelated files or to open arbitrary network connections. With JDK 1.0.x, such an applet would be unusable unless explicitly installed by the owner of the system. With JDK 1.1, this applet could be installed, if properly certified, but would have far greater privileges than those required to complete its purpose. Furthermore, local code cannot be unconditionally trusted by default, because a naive user could have inadvertently downloaded it from the net and run it.

The new Security Architecture (which is going to appear in JDK 1.2) implements a policy-neutral framework allowing multiple domains of protection to be imposed on any applet. The security policy configures a series of Sandboxes, depending on the code currently executing in the Java Virtual Machine. This new architecture does not have a built-in notion of trusted code, regardless of the code's origin (local or remote).

The basic components in this framework are:

  • security policy
  • typed permissions
  • protection domains
  • multidomain computation (trust among mutually suspicious parties)
Policies can be chosen per site, per user, or per application. Conceptually, they resemble a table indexed by code origin and credentials. Every entry in the table assigns a set of permissions to each origin and signature, although these can be overriden. Permission inference is also implemented, so rules can be used to cover cases not explicitly included in a policy. For instance, if the policy rule allows connecting to "any host in the domain A-Domain.com," the inference module will have to decide whether this applies to "A-Host.A-Domain.com." The reader is referred to the paper for details about protection domains and example scenarios where pieces of code from multiple domains are combined.

In terms of performance, each security call is estimated to take about 200-300 microseconds in midrange systems in the current implementation. Security

computations are evaluated in a "lazy" manner, i.e., nothing is done until a computation is required. The "eager" computation would mean that every time a domain boundary were crossed, the set of security permissions would have to be recomputed. Because domain crossings are more frequent than security computation requests, lazy evaluation is better for the average case.

This new Java security framework will be released soon, along with the rest of the new JDK version 1.2.

Secure Public Internet Access Handler (SPINACH)
Elliot Poger, Sun Microsystems Inc., and Mary G. Baker, Stanford University

The objective of this work is to provide an intermediate-grade, secure access mechanism for public network ports. The concept of a prisonwall is introduced, whereby unauthorized ­ or as yet unauthenticated ­ public port users are confined within a small, protected subnetwork. Only when proper access privileges are certified can they use local network resources and/or the Internet.

In most modern, technologically friendly buildings in the computing industry and elsewhere, network ports often appear in public lounges, hallways, or conference rooms. This is bound to become quite common soon in public libraries and educational institutions. These ports should be usable by all those having permission to use them through a permanent affiliation (as would be the case for students without a permanent office in their departmental building), or a temporary affiliation (as would be the case of visitors in a research facility).

It is imperative that access to public network ports is controlled. Their malicious use could embarrass the host organization (if, say, an unauthorized user initiated an attack against another organization from the host organization's facilities), endanger local operations (if an unauthorized user mounted an attack against the host organization itself, taking advantage of local trust), or even cause the host organization to break contractual obligations (if an unauthorized user, taking advantage of network location, accessed data licensed only to persons directly affiliated with the host organization, like online encyclopedias and such).

The basic idea behind SPINACH and the prisonwall scheme relies on separating public ports from trusted ports (trusted ports would be those within offices and behind locked doors). These public ports are connected to the rest of the network through a "prisonwall" device, which provides a mechanism for users to become authorized. When users first connect to a public port, they are "imprisoned" within the prisonwall. They cannot send or receive packets whose other endpoint lies outside the prisonwall. After users are authenticated with the system, packet delivery works as usual with the outside world. Note that a prisonwall works like a firewall, but turned inside out.

The idea was put to use in the authors' building (Computer Science Department, Stanford University). The design goals were:

  • ease of deployment
  • a system that works with current network infrastructure and existing client hardware and software
  • no effect on trusted ports
  • ease of use
  • availability to both temporary users (visitors) and permanent users (local staff temporarily using a public port). (Authorized users should be allowed to use the network without restraint. The authorization process should be short and easy.)
  • ease of administration
  • maintenance of an audit trail for better damage control (Maintenance of the system should impose minimal administrative overhead.)
In this specific implementation, the prisonwall concept (called SPINACH) uses locally installed Virtual LAN (VLAN) switches to separate public ports from trusted ports into a "public subnet." The SPINACH router selectively forwards traffic to and from the public subnet. It filters ingress traffic based on the hardware address and IP address of packets. It also performs the authorization/authentication functions for access control, using Kerberos.

An important decision that the designers of the system had to make traded off security for flexibility and ease of deployment. SPINACH had to work with existing networking hardware on site, as well as widely available software on most users' laptop computers (such as DHCP and Telnet clients). That limited the sophistication of security mechanisms they could use for the packet switching function at the SPINACH router. This is an issue because IP addresses and hardware Ethernet addresses can be spoofed without significant trouble. However, there are other options, which designers in other installations can elect to take.

Session: Monitoring
Summaries by Petros Maniatis

Web Facts and Fantasy
Stephen Manley, Network Appliance, and Margo Seltzer, Harvard University

The objective of this work was to provide a contemporary, broader characterization of Web servers on the Internet, proposing a taxonomy for them, and analyzing their access patterns, especially those related to growth.

So far, studies have focused on relatively shorter log traces of relatively narrower sets of servers. The authors managed to acquire 38 server logs spanning between one and two years, from multiple sites on the Internet, including Internet service providers (ISPs), universities, commercial sites, adult entertainment sites, free service sites (free software distribution), and informational/governmental sites.

The growth characteristics of all sites were astounding. Although not all growth was positive, growth patterns categorized sites according to the most highly correlated other observed factor. In all cases, however, growth was exponential (either increasing or decreasing).

The free software site was growing along with the number of Web users visiting it. Other sites in this category would be those requiring continual access (such as news feeds, as in ESPN or CNN).

The traditional business site was growing every time it underwent major renovation or redesign. The surges of accesses tend to decrease in time when the renovation becomes regular and expected.

Internet service providers and other sites hosting large numbers of user home pages were growing along with the number of user Web pages they contained (because each user tends to receive a close to constant amount of traffic). Such sites were the academic sites, the free Web-hosting sites like Geocities, etc.

Informational/governmental sites were growing along with the number of similar informational Web pages they could supply per user. Such sites tend to be indifferent to external traffic (and they happened to be the least popular).

Sites relying heavily on favorable treatment by search engines (such as the adult entertainment site) tend to grow along with their position in the result list of popular topics in popular search engines. The adult entertainment site observed had to shut down when its listing in most search engines stopped appearing close to the top.

Finally, sites accessed on a for-fee basis tend to grow inversely proportionally to the fees incurred per access. The organization site observed lost a lot of its traffic when the access price structure became less appealing to the public.

Another major concern on the Web involves latency. Users tend to abort a Web request after a while (estimated around ten seconds). Businesses geared toward Web advertisements are especially interested in decreasing the average latency for a page below the tolerance threshold of Internet users. Traditionally, the blame has been ascribed to the overloaded servers, to excessive Computer -Generated Interaction (CGI scripts), and to the concurrent maintenance of too many TCP connections.

Measurements were taken on location at the traced sites. Logs recorded processing latencies in most cases. Latencies were measured per byte. Most requests (around 90%) attained performance exceeding 1 millisecond per byte. There were, however, byte latencies reaching 100 milliseconds per byte.

As far as CGI-related latency was concerned, servers responded in less than one second to CGI-enabled requests. In most sites, even at different scales (from the most busy to the least busy servers), CGI was indistinguishable from non-CGI traffic and never surpassed 2% of the transferred bytes.

More work on the issue is under way, because it was concluded that the minute-long latencies observed could not be attributed to either excessive server load or high numbers of concurrent TCP connections between a server and its clients. This will mainly focus on a more detailed observation of heavily accessed Web servers so that the blame for portions of the measured latency can be assigned more accurately.

SPAND: Shared Passive Network Performance Discovery
Srinivasan Seshan, IBM T.J. Watson Research Center, Mark Stemm, and Randy H. Katz, Computer Science Division, University of California, Berkeley

The objective of this work was to provide a shared measurement scheme, offering information on expected performance characteristics of a network connection.

On the Internet today, we can find many clouds of local connectivity (which are, in fact, a Local Area Network in some incarnation or another) that can communicate with each other through a number of network hops. The basic problem is figuring out in advance what the net performance is going to be like when communicating with a host in a different domain (i.e., a different cloud).

The ability to change the content fidelity (quantity of information, smaller vs. larger images, loss-y or lossless compression, etc.) according to the available network resources could be very useful. Such

would be the case if our Web browser were able to automatically turn off image loading for slow connections. Another case would be choosing among multiple sites mirroring the same information (similarly to Harvest). Finally, we can picture a Web page where every link is annotated with the expected bandwidth of the network connection it requires.

SPAND utilizes shared network performance measurements, which get propagated to nearby clients. Clients send summaries of their connectivity to the performance servers. The measurements themselves are passive; they are obtained by observing existing traffic on links ­ the only traffic produced by SPAND contains the performance measurement reports shared among neighboring clients.

SPAND has several components. SPAND-aware clients run modified applications, which can extract performance measurements from their existing, ongoing network connections. They subsequently make up performance reports, which they send to the performance servers.

Servers receive summaries of connectivity by specific applications on local clients inside performance reports. They also receive performance report requests, to which they respond with aggregate reports (i.e., reports made up from all performance reports sent by local servers).

Packet capture hosts take on the task of creating performance reports when there are no SPAND-aware clients available. They snoop local traffic and generate performance reports on behalf of unmodified clients. Packet capture hosts are very important, because they make the rapid deployment of the system much easier. This is because they can, for the short term, substitute the existence of modified applications on local clients.

An important issue is that performance measurements are categorized per application. Each application has its own traffic characteristics (in terms of transfer type, bulky or interactive, flow control, congestion control, reliability, etc.), so performance measurements are kept separate per application type.

In practice, the authors have implemented a prototypical Web proxy that uses the SPAND toolkit and appends appropriate icons to links on a Web page to indicate expected performance behavior. The server's turnaround time is very promising, according to the results provided. The warm-up of the performance database is fairly fast. At first, the server can respond to a performance request 70% of the time. Eventually, this scales up to 95%. Also the measurements of the accuracy of the supplied estimates are very promising.

Future work includes the incorporation of better, more elaborate methods to derive aggregate performance reports, utilization of locality information to infer performance expectations from other nearby destinations, and protection from erratic measurements.

Rate of Change and other Metrics: a Live Study of the World Wide Web
Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, AT&T Labs ­ Research, and Jeffrey Mogul, Digital Equipment Corporation ­ Western Research Laboratory

The objective of this work is to characterize change in the World Wide Web and to use this characterization to evaluate the benefits of rigorous Web caching.

Web caches are rapidly gaining popularity and attracting bigger research efforts (Squid/Harvest, Netcache, Cisco Cache Engine, Inktomi Traffic Server, and so on). It is still uncertain, however, how

well they work. The point behind their use is to serve pages to users without having to contact the content provider, thereby eliminating unnecessary network traffic. In the same vein, delta encoding, a scheme according to which only changes to a Web page are transmitted and applied to a previous, base version, is gaining popularity as well. One of the questions this work is attempting to answer is "How often will you see changes that you can apply to the base version?"

To answer many similar questions, the authors studied large logs of Web traffic, to which they applied many metrics, in order to characterize different aspects of change on the World Wide Web. The metrics they actually used are:

  • frequency of reaccess (It affects cacheability directly.)
  • rate of change (the fraction of accesses that involved changed resources, which affects both cacheability and the applicability of delta encoding)
  • age of resources and modification intervals (This helps to detect modification patterns, which affect how expiration times are selected on caches.)
  • duplication of content (mirrors, aliases)
  • changes in semantically meaningful ways (phones, embedded URLs) A few of the basic facts retrieved from this study were that:
  • The more frequently accessed resources change more often (stock tickers are a good example of that).
  • Images are more static.
  • Resource size is relatively unimportant.
  • Commercial sites have more dynamic content.
The data used for this study comprised mainly a 17-day log of port 80 (the http port) between AT&T Labs ­ Research and the world (which contained 465 clients and 21,000 servers and reconstructed the full content of all requests and responses). A second trace, taken at Digital's WRL site, covering two days of Web traffic, supported the results extracted from the first trace.

The change ratio metric indicates the fraction of the total accesses that are, in fact, accesses to changed data. Evaluation of this metric over the two traces showed that images very rarely change. The most rapidly changing resources are octet streams (netcast streams like tickers). HTML is bimodal. About 70% of the HTML resources never change, whereas most of the rest of their remainder change on each access.

Age analysis of the data showed that the most frequently accessed resources are, in fact, the youngest ones. In general, HTML resources are younger than GIFs, and smaller images seem to run older than larger ones. Finally, educational sites tend to be the "oldest."

Another issue the authors addressed was that of duplication. There are several causes for duplicated content:

  • different URLs applied to the same resource (because of session IDs, appearing as parameters to a CGI script)
  • mirroring of entire sets of resources
  • duplication of icons or other images (like the Netscape Now logo)
  • multiple URLs belonging to the same Web advertisement
  • the rate of duplication being surprisingly high (around 18%)
Finally, the authors studied changes in semantically identifiable portions of

retrieved resources, like images, email, or telephone numbers. They parsed the textual resources in the logs and found that HREFs appeared in 74%, IMGs in 72%, and email and telephone numbers in 20%. Email addresses and telephone numbers don't seem to change much, as expected. HREFs and IMGs were unchanged in about half of the examined resources, but changed completely in 3-5%.

In summary, this study showed, through the analysis of data coming from two major organizations, that frequent changes happen in many resources. This might make delta encoding more useful than simple caching for those cases. However, duplication was very common as well. Therefore, the Distribution and Replication Protocol proposed by Marimba and Netscape might also prove useful.

Session: Applications
Summaries by Mark Mellis

RainMan: A Workflow System for the Internet
Santanu Paul, Edwin Park, and Jarir Chaar, IBM T.J. Watson Research Center

In these days of strategic partnerships and geographically dispersed organizations, systems that enable and manage distributed workload and preserve process are increasingly important. Paul presented RainMan, a distributed workflow system built on the RainMaker workflow framework.

RainMan is a loosely coupled system of network connected performers. These performers handle tasks in response to service requests generated by workflow sources. Performers can be individual humans, computer applications, or other organizations. RainMan provides mechanisms for managing performer work lists, a directory service, and a variety of user interfaces that enhance its use in cross organizational environments, including the ability to manage shared work lists.

Paul indicated that a publicly accessible Web page describing RainMan was under construction. Contact him at <santanu@watson.ibm.com>for more information.

Salamander: A Push-Based Distribution Substrate for Internet Applications
G. Robert Malan, Farnam Jahanian, and Sushila Subramanian, University of Michigan

Salamander is a publish/subscribe data dissemination substrate for Internet applications. In use for several years supporting distributed research efforts, it offers a variety of services to its client applications.

Describing one of the principle Salamander applications, Malan told of the Upper Atmospheric Research Collaboratory (UARC) project, where groups of space scientists all over North America and Europe collaborate as frequently as daily on campaigns in realtime over the Internet. UARC's goal for Salamander was to replicate the feel for the participants of working together in "a hut in Greenland," providing access to instrumentation, informal "chat" communications, a shared annotation database, shared text editing, and most importantly, real time access to experimental data.

The Salamander architecture includes a channel subscription service, application-level quality of service policy support, and a lightweight data persistence facility employing caching and archival mechanisms.

More information on Salamander, including source code, is available at <http://www.eecs.umich.edu/~rmalan/salamander>.

Creating a Personal Web Notebook
Udi Manber, University of Arizona

Manber, a self-described "academic that hacks," is the author of the popular programs glimpse and agrep. Here he presents Nabbit, a tool for the construction of a personal Web notebook.

Two existing mechanisms for preserving "precious nuggets of information" from the Web have well-known shortcomings. Adding URLs to a hot list soon results in an unmanageable hot list, while saving entire Web pages locally yields overflowing disks filled with stale data in files with forgettable names. Hence, Nabbit. Nabbit is a WYCIWYG tool ­ What You Click is What You Get. It allows one to select sections of Web pages and paste them into a notebook, complete with time stamp and clickable reference to the original source. It even works with forms ­ a search engine form collected via Nabbit is still functional in the notebook.

Nabbit works by using approximate string matching to locate the selected information within the document source. (Approximate string matching is what Manber does ­ "when you have a hammer . . .") Once the selection is located, only the tags relevant to the selection are extracted, forming a stand-alone snippet of HTML. For this reason, Nabbit only works on HTML pages, not on Javascript. UNIX and Windows versions of Nabbit are available, though the UNIX version is the more functional. More information about Nabbit is available at <http://glimpse.cs.arizona.edu/udi.html>.

Session: Works in Progress
Summaries by Mark Mellis

ScalaServer
Vivek Pai <vivek@cs.rice.edu>

Pai described research into system-level support for scalable network servers using commodity hardware and software. This support includes light weight IO mechanisms, network stack enhancements, cluster support, and content-based request distribution. More information is available at <http://www.cs.rice.edu/cs/systems/ScalaServer>.

Why We Wait
Jeff Mogul <mogul@pa.dec.com>

Mogul discussed his investigation of Web latency based upon network packet traces rather than proxy logs. Examining one real-time hour of data ­ approximately 2 million connections ­ he found that the predominate cause of latency was lost SYN packets. He attributed the worst latency problems to network stacks that failed to include modifications to mediate SYN flooding attacks.

Data Collection with Mobile Agents
Jeremy Hylton <jeremy@cnri.reston.va.us>

The goal of Hylton's research is efficient application-specific data collection. In his approach he pushes the indexing code onto the server. More information is available at <http://www.cnri.reston.va.us/home/koel>.

Evaluating High Performance APIs in AIX
Erich Nahun

Nahun observed that Microsoft achieves significant performance improvements by exploiting particular APIs in its Internet Information Server (IIS) on Windows NT. His research is focused upon exploring the performance benefits of similar APIs in non-Microsoft environments. His approach is to implement the APIs as kernel extensions and then modify servers to use them. Preliminary work shows improvements in process-based servers and the potential for even greater improvements in threads-based servers.

Workstation Authorization
Peter Honeyman <honey@citi.umich.edu>

Honeyman reported on work with goals similar to those of the SPINACH project reported upon in the Security session ­ allowing controlled, authenticated access to campus networks via ethernet for mobile user communities. Honeyman's approach differs from that of SPINACH by requiring special software support on the client computers, and by permitting access by actively controlling the ports on the ethernet hub.

Wisconsin Proxy Benchmark
Pei Cao <cao@cs.wisc.edu>

Cao described her work on a benchmark that uses LAN-connected clients running a trace-derived workload to measure latency and throughput in Web proxies. More information is available at <http://www.cs.wisc.edu/wisweb/>.

Lucent Personalized Web Assistant
Alain Mayer <alain@bell-labs.com>

This work allows one to use personalized Web sites securely and privately while protecting against SPAM-oriented address harvesting and without having to remember a unique password for each site. These goals are achieved by accessing the personalized site via a proxy that mediates requests and manages authentication while protecting the user's identity. The proxy does not maintain per-user state ­ it computes the necessary information using cryptographic techniques. The service can be tested at <http://lpwa.com:8000/>.

Session: Caching II
Summaries by Mark Mellis

Cost-Aware WWW Proxy Caching Algorithms
Pei Cao, University of Wisconsin, Madison, and Sandy Irani, University of California, Irvine

Cao discussed issues influencing selection of Web cache object replacement algorithms, reviewed a number of extant strategies, and introduced a new algorithm, GreedyDual-Size (GD-Size).

GD-Size is a variation on one of a range of algorithms originally proposed by Neal Young. It is implemented by assigning a value H, which is a function of the cost to obtain the object and the size of the object, to each object in the cache, and maintaining the H values in a priority queue. At each object replacement, the object with the lowest H value is removed and the remaining H values are reduced.

The authors demonstrated that GD-Size outperforms other widely-used replacement algorithms in hit ratios, latency reduction, and network cost reduction. More information on GD-Size is available at <http://www.cs.wisc.edu/wisweb/>.

System Design Issues for Internet Middleware Services: Deductions from a Large Client Trace

Steven D. Gribble and Eric A. Brewer, Univeristy of California, Berkeley

Gribble and Brewer performed a 45-day-long packet trace of the network connecting a University of California, Berkeley modem pool to the Internet. The trace included some 24 million requests from six thousand clients. Gribble presented the results of the analysis of that trace. Unlike other work that has been reported upon in the past, this research was performed from a client perspective rather than a server perspective, and analyzed network traffic rather than using server or proxy logs.

The client community consisted mostly of Berkeley students and displayed strong geographic locality, which contributed to the strong diurnal cycle observed in the traffic patterns. Although peak-to-average traffic ratios of 5 to 1 were common on time scales of tens of seconds, the authors did not observe the self-similarity of network traffic described in other studies. This observation was the subject of lively discussion during the question and answer session.

A particular area of interest to the authors was diversity in Web browser client software. They observed 166 discrete clients, varying in type, version, OS, and hardware platform, although 55% of the browsers were Netscape Navigator on Windows95, with Netscape Navigator on MacOS the second most prevalent at about 20%. Considering volume of traffic, the authors observed that 31% of the bytes transferred were JPEGs, 27% were GIFs, and 18% were HTML, with 24% other types.

The traces generated during this study are available from the Internet Traffic Archives at <http://www.acm.org/sigcomm/ITA/>, and more information on this study is available at <http://www.cs.berkeley.edu/~gribble/papers/papers.html>.

Alleviating the Latency and Bandwidth Problems in WWW Browsing
Tong Sau Loon and Vaduvur Bharghavan, University of Illinois, Urbana-Champaign

Tong described research focused on reducing Web access latency by predictive pre-fetching objects according to usage patterns, filtering content prior to delivery to the browser, and caching at the workgroup level. These techniques are especially useful in "slow last link" scenarios. In order to gain the most benefit

from these strategies, content filtering is performed on the Internet side of the slow link, while pre-fetching and caching are performed on the client side of the slow link. Caching at the workgroup level exploits locality of reference by workgroup members with similar interests.

Usage pattern profiling is performed both at the browser by a local profile engine on the user's workstation, and at the group level by a backbone profile engine. Profiles are passed through the filters before being speculatively pre-fetched in order to reduce link utilization.

Pre-fetch performance is improved by the use of a number of heuristics, including exploiting the notion of sessions, hoard walking, performing pre-fetchs when the network is relatively idle, and dynamically determining a document's dependents.

Measured performance improvements of 20­38% across a number of metrics were obtained using this system, with hit ratios of approximately 70% for daily daily usage by a small population. Contact Bharghavan at <bharghav@crhc.uiuc.edu> to obtain the software. Other information on the work is available at <http://timely.crhc.uiuc.edu/>.

Session: Information Retrieval and Searching
Summaries by Petros Maniatis

The Search Broker
Udi Manber and Peter A. Bigon, Department of Computer Science, The University of Arizona

The objective of this work was the creation of a librarian of online libraries in a form similar to a search engine.

According to the author, search engines are very good already, although there is room for improvement. However, they

lack focus. Even when the user is looking for an item within a particular category, search engines return Web pages as results. For instance, if the user is looking for an x-ray of a shoulder, the average search engine will return Web pages containing the key words "x-ray" and/or "shoulder," perhaps close to one another. Still, those returned pages will not very often contain an actual shoulder x-ray. In other words, key word searching is not good for all purposes.

The Search Broker attempts to capitalize on the widespread existence of specialized databases on the Internet. Several hundred such databases, covering quite specialized subjects, from x-rays to hotels to language dictionaries, have been found, connected, and searched through an integrated front end. At the time of the presentation, the Search Broker contained 412 subjects.

The model of the query fits into a two-level paradigm. First the appropriate database is located; this is done by matching the first word against a list of key words identifying topics. The database associated with the topic identified is then used to answer the specific question. For example, to look for a recipe with lentils, a user would have to submit the query "recipe lentils"; "recipe" selects a Web-based recipe database, and then "lentils" is the actual text of the query submitted to the recipe database. If a topic cannot be identified, a regular search engine is used.

Other examples of queries would be "stocks IBM," to find the current price of the IBM stock, "fly tus jfk" to find a flight travelling today from Tucson, AZ to New York City, and "patent Manber" to find all the patents filed whose holder is someone called Manber.

The subject databases are added into the system by a human librarian. This was a decision intended to assure content quality. However, the addition of a new subject is fairly easy and quick (about one minute per new database). Once the URL of a new database is located, a utility reads its front end (assuming it is a form-based interface) and stores a representation for it, assigning default values to secondary fields. Then the librarian has to characterize the database (assign it a topic/category) and add links to related topics.

For more information and to use the actual system, interested readers can look at the Search Broker page.

Using the Structure of HTML Documents to Improve Retrieval
Michal Cutler, Yungming Shih, and Weiyi Meng, Department of Computer Science, State University of New York, Binghamton

The objective of this work was the improvement of the information retrieval techniques used for the World Wide Web, using HTML structural information.

Most automatically populated search engines (unlike those requiring manual classification of URLs) attempt to match a user query to their set of indexed documents. The documents matching the query are ranked according to certain suitability criteria. In most cases, the rank is a measure of the overlap between the concepts of the query and the document text. The main idea in this work is to use hypertext and HTML tag information to improve the retrieval results for such queries.

This is done by associating key words within a document with a classification vector, which contains the number of occurrences of the key word in any one of the predefined structural classes. Such classes can be hyperlink anchors, headers, titles, emphasized text, etc. Through this association, the appearance of a requested key word in an "important" structural class (like the title) of a document can reinforce our belief in the relevance of the document to the query containing the key word. Assigning appropriate weights to these structural classes was an important part of this work.

A technique used to provide better insight to the contents of a Web page is to see how other documents cite it. When we provide a link to a Web page, we try to supply a good description of it. The authors use such descriptions (also referred to as "anchor descriptions") to improve the relevance of the contained key words. Furthermore, anchor descriptions can provide alternate wordings (in the form of synonyms or related terms) of the concepts contained in the cited Web page, thereby alleviating the "language variability problem." Anchor descriptions constituted one of the structural classes mentioned previously. The complete list of structural classes used is:

  • title

  • top headers (Headers 1 and 2)

  • bottom headers (Headers 3 through 6)

  • strong (strong, emphasized, underscored, lists)

  • plain (none of the above)

  • anchor
To prove the concept, the authors built a search engine, called Webor (for Web-based search tool for Organization Retrieval) that indexed the entire Web space of SUNY at Binghampton and computed the term frequency vector per structural class for all key words in all pages. Given ten user queries and manually selecting the most relevant pages among all those indexed, they used hill climbing to derive the weights for the six structural classes (called "class importance values").

The metrics used to evaluate the results were recall and precision, established metrics in the field of information retrieval. They placed the greater significance on precision, because it measures the proportion of "good" documents out of all those retrieved. For the best values found at the time of the presentation (1 for plain text, 1 for bottom headers, 8 for strong and anchor, 6 for top headers, and 4 for title), they managed to improve the average precision by as much as 44% (for a 5-point average).

In the future, the authors plan to use more than ten queries to refine the class importance values derived, use larger page collections, and reevaluate the classes themselves (perhaps more/other classes would give better results).

SASE: Implementation of a Compressed Text Search Engine
Srinidhi Varadarajan and Tzi-cker Chiueh, Department of Computer Science, State University of New York, Stony Brook

The objective of this work is to combine searching and text compression in the same efficient framework.

The motivation behind SASE lies within current key word searching mechanisms used in most popular search engines.

Efficiency tends to become a problem, as the Internet grows. Being able to search through compressed text files without decompressing them first would provide increased flexibility and improved performance characteristics. As with most integration schemes, such an approach has to deal with tradeoffs, because it needs to achieve relative efficiency in both component subsystems (compression and searching).

In the text searching field, the most common approach uses inverted indices. Such an index records the location of every word in a database. It contains a dictionary of all the words, along with a linked list per word of all its locations in the database. When the user enters a query, the word is found in the dictionary, and its locations are retrieved through the linked list.

In the text compression field, the most popular approaches use substitution of repetitive patterns with shorter numerical identifiers. Variable bit length schemes have been used (Huffman codes), as well as dictionary-based schemes (Lempel-Ziv).

For the purposes of this work, an inverted index with a dictionary-based compression scheme has been used. Furthermore, the same dictionary used for the inverted index is used for the compression. The compression granularity is a single word, reducing compression efficiency, but allowing for searches that target word-based patterns.

Inverted indices tend to be very large, customarily about three times the size of the original database. One way to reduce the inverted index size is to introduce the concept of blocking into the system. The compressed word stream is partitioned into blocks. Pointer lists in the dictionary then point to these blocks, as opposed to individual words. During search, a

retrieved block is searched linearly (in compressed form) for the exact location of the word in question. The block size determines the balance in the tradeoff between search time and storage space.

The system has been implemented as a stateless server with a Java front end. State is maintained by the clients (through a "context," which operates like an iterator: along with every response, the server returns the context, indicating which one among a list of multiple responses it just returned. If the client needs more, it resubmits the query along with the context.).

SASE has been evaluated against GZIP (one of the best lossless compression utilities) and GLIMPSE (which also performs compressed text searching) on three databases:

stories from Project Gutenberg (7MB)

Internet RFC database (70MB)

USENET news articles (300MB)

The compression efficiency of GZIP is 7-17% better than SASE, which in turn is 28-31% better than GLIMPSE. In terms of its search performance, searches take between 13 and 120ms, which compares favorably to a fully inverted index scheme. Search times increase linearly with the block size.

Future work plans include incorporating SASE in a news server (NNTP) to avoid transition overheads, using vantage point trees for approximate searching, and making SASE updatable (to avoid unnecessary reconstruction of the indices).


?Need help? Use our Contacts page.
First posted: 13th April 1998 efc
Last changed: 13th April 1998 efc
Symposium index
Proceedings index
USENIX home