
Introduction

The rapid growth of high-speed computer and telephone networks makes tools for analyzing and engineering those networks increasingly important. Network engineers use a combination of hardware and software tools to monitor the network, record various statistics and flows, and analyze the collected data. These tools either operate directly on the live network or record traffic for later offline analysis. For example, one group we have worked with records OC3 links (155 Mb/s) and groups of 16 T1 links (1.5 Mb/s each, 24 Mb/s aggregate). Their tape technology ranges from 8mm tape to 96 GByte ID-1 tapes that transfer data at about 256 Mb/s. The data from a monitoring run ranges from a few gigabytes to hundreds of gigabytes. Network engineers expect these volumes to grow rapidly into the terabyte range as monitoring tools, networks, and storage technologies improve in price and performance.
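As a rough check on the figures above, the following Python sketch works out how much data the quoted link rates could produce and how long one tape takes to stream. It assumes decimal units (1 Mb = 10^6 bits, 1 GByte = 10^9 bytes) and a fully utilized link with every bit recorded, so the results are upper bounds rather than measurements. The function names are illustrative only.

```python
# Back-of-the-envelope figures for the rates quoted in the text.
# Assumption: decimal units (1 Mb = 10**6 bits, 1 GByte = 10**9 bytes).


def gigabytes_per_hour(rate_mbps: float) -> float:
    """GBytes produced in one hour by a link running flat out at rate_mbps."""
    bytes_per_second = rate_mbps * 1e6 / 8
    return bytes_per_second * 3600 / 1e9


def seconds_to_transfer(size_gbytes: float, rate_mbps: float) -> float:
    """Seconds needed to move size_gbytes at a sustained rate_mbps."""
    return size_gbytes * 1e9 * 8 / (rate_mbps * 1e6)


oc3 = gigabytes_per_hour(155)             # one OC3 link
t1_group = gigabytes_per_hour(16 * 1.5)   # a group of 16 T1 links
tape_pass = seconds_to_transfer(96, 256)  # one full 96 GByte ID-1 tape

print(f"OC3: {oc3:.2f} GB/h, 16 x T1: {t1_group:.2f} GB/h, "
      f"tape pass: {tape_pass / 60:.0f} min")
```

Under these assumptions a saturated OC3 link yields about 70 GBytes per hour, and a single sequential pass over a full ID-1 tape takes about 50 minutes, which is consistent with runs reaching hundreds of gigabytes.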

Network traffic engineers use their vast collections of network data to perform such diverse tasks as protocol performance analysis, conformance testing, error monitoring, and fraud detection. In general, each group writes its own ad-hoc programs to examine and analyze the data. Although these programs query large databases of recordings, the traffic engineers avoid using conventional relational database management systems (RDBMSs) for several reasons:

- Both the data and the storage medium are stream-oriented. Fast sequential access to data is crucial; transactional updates, fast access to random records, and concurrency control are not. A highly-tuned C program can outperform a general purpose RDBMS on this workload.
- RDBMSs do not usually handle data on tape well [3]. Non-clustered indices will not work for traffic data. Worse, traffic analysis data is often used only a few times (or once), so load time is a significant cost. Finally, network traffic traces contain many small records with fields a few bits wide, so per-tuple overheads can noticeably increase the database size.
- A network traffic trace is a sequence of time-stamped network protocol headers. The analysts use operators like those found in sequence and temporal DBMSs [13][16]. Traffic analysts usually calculate aggregates on packet inter-arrival times or calculate and compare network utilization over successive time periods or time scales. The traffic applications also use several data-flow operators and pattern matching operators (e.g. demultiplexing and protocol recognition) that are not common in sequence databases.
- Traffic analysts run batches of related queries during a single pass over the data. Users will sometimes intentionally write queries that use partial results generated by a concurrently executing query. The shared single data source means that even otherwise distinct queries will often share subqueries.
- Users must consider the capacity of the analysis hardware. Often the users would rather reformulate an expensive query or drop an expensive query from the mix than overallocate processor or memory resources in the analysis machine. Relational systems run queries as fast as they can but typically do not provide this kind of capacity information.
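To make the workload concrete, here is a minimal Python sketch of two of the aggregates named above, mean packet inter-arrival time and link utilization over successive time periods, computed together in one sequential pass. It is not Tribeca code or the analysts' actual programs. The `Header` record, the `single_pass_stats` function, and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Header(NamedTuple):
    """Hypothetical trace record: arrival time (seconds) and packet length (bytes)."""
    timestamp: float
    length: int


def single_pass_stats(
    trace: Iterable[Header], period: float, link_bps: float
) -> Tuple[Optional[float], Dict[int, float]]:
    """Compute mean inter-arrival time and per-period utilization in one pass."""
    prev_ts = None
    gap_sum = 0.0
    gap_count = 0
    bits_per_period: Dict[int, int] = defaultdict(int)

    for header in trace:  # strictly sequential; no random access to the trace
        if prev_ts is not None:
            gap_sum += header.timestamp - prev_ts
            gap_count += 1
        prev_ts = header.timestamp
        # Both aggregates are updated from the same record.
        bits_per_period[int(header.timestamp // period)] += header.length * 8

    mean_gap = gap_sum / gap_count if gap_count else None
    utilization = {
        p: bits / (period * link_bps) for p, bits in sorted(bits_per_period.items())
    }
    return mean_gap, utilization


# Made-up four-packet trace on an imaginary 16 kb/s link.
trace = [Header(0.0, 500), Header(0.5, 500), Header(1.0, 1000), Header(1.5, 250)]
mean_gap, utilization = single_pass_stats(trace, period=1.0, link_bps=16000.0)
print(mean_gap, utilization)
```

The per-record state is a few counters, but the utilization table grows with the number of periods in the trace. Memory cost of this kind is what an analyst has to weigh against the capacity of the analysis machine.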

Tribeca is a software system for monitoring and analyzing either a live network or recorded network traffic on tape. Tribeca users can write queries to process arbitrarily long streams of information. Like a relational DBMS, Tribeca has a query language that can be compiled and optimized. Like extensible DBMSs [15][5][2], Tribeca has a type system and user-defined operators so it can integrate support for different network protocols and specialized traffic analysis operators. Unlike conventional systems, Tribeca does not support random access to data, transactional updates, conventional indices, or traditional joins.

Tribeca is designed to read a stream of data from a single source (tape or a network interface) and apply compiled queries to the stream. It has a data-flow-oriented query language that allows users to construct large batch queries for a single pass over the data. It also has operations to separate and recombine substreams derived from the source. Finally, Tribeca supports window operators that allow users to compute moving aggregates and to do a very restricted form of join. Both the query language and the optimizer help prevent users from expressing queries that produce intermediate results too large to be stored in main memory. Because of this, query optimization focuses on memory management and predicate ordering rather than traditional I/O optimizations like access path selection and join optimization.
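The combination of substream separation and window operators can be sketched in a few lines of Python. This does not use Tribeca's query language or show its implementation. It only illustrates the idea of demultiplexing a stream by key, keeping a bounded window per substream, and emitting a moving aggregate in arrival order. The function name `demux_moving_average` and the sample stream are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Hashable, Iterable, Iterator, Tuple


def demux_moving_average(
    stream: Iterable[Tuple[Hashable, float]], window: int
) -> Iterator[Tuple[Hashable, float]]:
    """Split a (key, value) stream by key and yield a per-key moving average.

    Each substream keeps only its last `window` values, so memory is bounded
    by window * number_of_keys rather than by the length of the stream.
    Results are yielded in arrival order, which recombines the substreams.
    """
    windows: Dict[Hashable, Deque[float]] = defaultdict(
        lambda: deque(maxlen=window)
    )
    for key, value in stream:
        recent = windows[key]      # demultiplex: pick this key's substream
        recent.append(value)       # oldest value falls out once the window is full
        yield key, sum(recent) / len(recent)


# Made-up stream: key = connection identifier, value = packet size in bytes.
stream = [("a", 10.0), ("b", 100.0), ("a", 20.0), ("a", 30.0), ("b", 200.0)]
result = list(demux_moving_average(stream, window=2))
print(result)
```

Because the function is a generator, it never holds more than the per-key windows in memory, which mirrors the constraint that intermediate results must fit in main memory.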

Several different groups of network analysts used Tribeca over a one-year period and the system performed well. Measurements show that it is only 1-9% slower than a hand-tuned ad-hoc program on simple queries. With Tribeca, our users are also able to construct more complex queries than they would be able to implement easily in their ad-hoc programs. More importantly, they can easily retarget their queries to do similar analysis on different kinds of networks.

This paper describes the Tribeca design and implementation. Section two gives an overview of the query language. Section three outlines the system's implementation and presents performance measurements from our prototype. Section four compares Tribeca to related work and section five gives conclusions.
