Next: Using Coarse-Grain Indices Up: Implementation and Optimization Issues Previous: Implementation and Optimization Issues

Basic Data Management

The stream data model allows Tribeca to perform its basic I/O operations efficiently. I/O in Tribeca consists only of reading the source stream and writing result streams. By design, joins do not affect I/O performance because one join operand must fit in memory. The large sequential read is the dominant I/O cost.
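To illustrate why such a join adds no I/O beyond the sequential scan, here is a minimal sketch (the function and record names are hypothetical; this is not Tribeca's actual operator). The small operand is hashed in memory once, so the only I/O is the single pass over the stream:

```python
def stream_join(stream_records, small_operand, key):
    """Join a large stream against a small in-memory operand.
    The operand is hashed once; the only I/O is the single
    sequential pass over the stream itself."""
    table = {}
    for rec in small_operand:                  # fits in memory by design
        table.setdefault(rec[key], []).append(rec)
    for rec in stream_records:                 # one sequential scan
        for match in table.get(rec[key], []):
            yield rec, match
```

Because the stream side is consumed record by record, the join never forces random I/O or a second pass over the source.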

This simple I/O pattern lets Tribeca take advantage of standard operating system services in a way that a conventional RDBMS cannot. Modern Unix file systems are tuned to detect large sequential reads and serve them efficiently. Because a conventional RDBMS accesses data differently from a standard application program, it must work around operating system resource managers such as the file system and buffer pool. Also, because Tribeca queries do not update data in place, Tribeca needs none of the support and overhead required for concurrency control.

Because of its mix of workloads, a conventional RDBMS faces page-size tradeoffs that Tribeca does not; Tribeca has no fixed page size. At run time, it divides the available I/O buffer space among its source and destination streams according to the expected volume on those streams (typically, the source stream is very large and the result streams are small). All streams are double-buffered so that one buffer can be processed while another is being read or written, and all read and write operations are in units of full buffers.
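The two ideas above can be sketched as follows (a simplification with hypothetical helper names, not Tribeca's code; a real implementation would overlap the pending read with processing via asynchronous I/O or a second thread, which this sequential sketch does not):

```python
import io

def divide_buffer_space(total_bytes, expected_volumes):
    """Split an I/O buffer budget among streams in proportion to
    their expected volumes (hypothetical helper)."""
    total_volume = sum(expected_volumes)
    return [max(1, total_bytes * v // total_volume) for v in expected_volumes]

def double_buffered_reads(stream, buf_size):
    """Alternate between two buffers: each full buffer is handed to
    the caller while the other buffer is the target of the next read."""
    buffers = [bytearray(buf_size), bytearray(buf_size)]
    i = 0
    while True:
        n = stream.readinto(buffers[i])
        if not n:
            break
        yield bytes(buffers[i][:n])   # caller processes a full buffer
        i ^= 1                        # the next read fills the other buffer

# usage: every transfer is a unit of one full buffer (except the tail)
src = io.BytesIO(b"abcdefghij")
chunks = list(double_buffered_reads(src, 4))
```

Dividing the budget by expected volume means the dominant source stream gets most of the space, keeping its sequential reads large.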

Unlike RDBMSs and many other programs, Tribeca cannot preformat its data, so it must separate the source stream into records as the data is processed (relational systems use a page structure, so repeated scans of the same data do not require record parsing). Tribeca supports record-parsing strategies for three kinds of data: fixed-size records, variable-size records, and ``framed'' records. The stream data type tells the Tribeca executor which strategy to use. ``Framed'' streams come from some of the high speed data recorders; valid records in these streams contain framing patterns that distinguish them from periodic bursts of ``noise'' bytes.
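A minimal sketch of the framed-record strategy looks like this (the framing pattern and record length are invented for illustration; the actual recorder formats are not described here):

```python
FRAME_MARK = b"\xAA\x55"   # hypothetical framing pattern
RECORD_LEN = 8             # hypothetical fixed payload length after the mark

def parse_framed(buf):
    """Scan a raw buffer for framed records, skipping noise bytes.
    A valid record is FRAME_MARK followed by RECORD_LEN payload bytes."""
    records = []
    i = 0
    while True:
        i = buf.find(FRAME_MARK, i)            # skip over any noise
        if i < 0 or i + len(FRAME_MARK) + RECORD_LEN > len(buf):
            break
        start = i + len(FRAME_MARK)
        records.append(buf[start:start + RECORD_LEN])
        i = start + RECORD_LEN
    return records
```

The scan resynchronizes on the next framing pattern after each record, which is what makes the strategy robust to interleaved noise bursts.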

The last important data management issue is that Tribeca makes every effort to minimize internal data copying. Intermediate records produced by Tribeca queries are never unnecessarily materialized. Instead, a pointer to the original field value in the input buffer is used every place the projected value appears as an argument. Materialization occurs only when (a) data is copied to an output buffer, (b) data must be aligned correctly for some operator, or (c) non-contiguous values have to be assembled into a hash key. Note also that projections involving user-defined functions cannot normally be eliminated by compilation.
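The pointer-into-the-input-buffer idea can be sketched with Python's buffer protocol (a stand-in for Tribeca's internal pointers; the field offsets are invented):

```python
def project_field(buf, offset, length):
    """Return a zero-copy view of a field inside the input buffer
    instead of materializing a new record."""
    return memoryview(buf)[offset:offset + length]

packet = bytearray(b"\x45\x00\x00\x54" + b"payload...")
version_ihl = project_field(packet, 0, 1)   # aliases the buffer, copies nothing
```

Because the view aliases the original buffer, every operator that takes the projected value as an argument reads the same underlying bytes; only an output copy, an alignment fix, or hash-key assembly forces materialization.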

Multiplex and projection operators can often be removed at compilation, sometimes with the help of an extra level of indirection. For example, in frame relay, IP packet headers can appear at different offsets within the packet (depending on whether the frame relay packet is routed or bridged, for instance). Operators downstream from the multiplex must then read fields through the extra indirection, but the multiplex itself involves no copying.
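The extra level of indirection amounts to resolving an offset instead of copying bytes, roughly like this (the offsets here are hypothetical, not frame relay's actual encapsulation layout):

```python
ROUTED_IP_OFFSET = 4     # hypothetical offset of the IP header in a routed frame
BRIDGED_IP_OFFSET = 12   # hypothetical offset in a bridged frame

def ip_header_view(frame, bridged):
    """Resolve the IP header position through one level of indirection.
    Downstream operators read through the returned view, so the
    multiplex step itself copies nothing."""
    offset = BRIDGED_IP_OFFSET if bridged else ROUTED_IP_OFFSET
    return memoryview(frame)[offset:offset + 20]   # 20-byte base IP header
```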

