Using Coarse-Grain Indices

Next: Implementing Fixed and Moving Up: Implementation and Optimization Issues Previous: Basic Data Management

Using Coarse-Grain Indices

While Tribeca is largely designed so that a stream of data coming from the network and a stream of data coming from tape can be manipulated in the same way, there is one important difference between the two cases: secondary indices can be constructed for recorded data. Traffic analysis users often identify a few hours of packets that are especially interesting and issue repeated queries over this data. Secondary indices are very important for this workload.

When the database is stored on large, high-speed tapes, however, only a limited kind of index is feasible. Our ID-1 tapes are simply not effective random access devices. That means that we cannot support indices on non-sort fields. Even on the sort field, a full index of every packet in the stream is impractical. The index itself cannot be stored on tape, so it must be much smaller than the data.

Therefore, Tribeca supports what we call coarse-grained secondary indices. A coarse-grained index is an approximate index on the sort field of recorded data streams. Users specify the ratio, R, of index size to underlying data size. When building an index, Tribeca inserts a key pointing to one record for every R bytes. The index is always stored on disk.

After the index is constructed, a subsequent query with a qualification on the indexed sort field can use the index to skip over parts of the input data. Because the index is approximate, the scan cursor must be placed on the last indexed record before the scan key. After that, Tribeca applies the qualification to each record until the matching one is found. The search procedure is not as fast as a full index, but it represents a good compromise for the traffic analysis environment. Users can trade off search performance and the size of the index (larger R means smaller indices but a coarser grain search).

Next: Implementing Fixed and Moving Up: Implementation and Optimization Issues Previous: Basic Data Management