Demultiplexing and Remultiplexing Streams


Traffic analysis queries often must partition a stream into substreams, process the substreams, then recombine the results of the substream analysis. The demultiplexing operation partitions records in a stream based on data in each record. The query below partitions s1 into substreams of ATM trace records based on virtual circuits, then finds the max time stamp for each VCI:

stream_demux {s1.atm.vci} p1
stream_agg {p1.ts.max} p2

The pipe p1 is not a single stream but a set of substreams, each holding the cells for one virtual circuit. Likewise, the pipe p2 represents a collection of logical substreams, each containing the max time stamp for one VCI.
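To make the semantics of demux followed by an aggregate concrete, here is a minimal Python sketch. It is an illustration only, not Tribeca code: the record layout (dicts with `vci` and `ts` fields) and the helper names `demux` and `agg_max` are assumptions made for the example.

```python
from collections import defaultdict

def demux(records, key):
    """Partition records into substreams named by key(record), keeping arrival order."""
    substreams = defaultdict(list)
    for rec in records:
        substreams[key(rec)].append(rec)
    return dict(substreams)

def agg_max(substreams, field):
    """Apply a max aggregate to each substream independently."""
    return {name: max(field(rec) for rec in recs)
            for name, recs in substreams.items()}

# Hypothetical ATM trace records (only the fields used here).
s1 = [
    {"vci": 5, "ts": 10},
    {"vci": 7, "ts": 11},
    {"vci": 5, "ts": 14},
    {"vci": 7, "ts": 12},
]

p1 = demux(s1, key=lambda r: r["vci"])      # one substream per VCI
p2 = agg_max(p1, field=lambda r: r["ts"])   # one max time stamp per VCI
print(p2)  # {5: 14, 7: 12}
```

As in the query, `p1` is a set of substreams rather than one stream, and `p2` holds one aggregate value per substream.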

In the example, demux is used like a groupby operator in a relational DBMS. However, the demultiplex operator allows users to apply a series of operations to the demultiplexed streams instead of simply applying aggregates. We can, for example, demultiplex ATM cells by VCI, assemble IP packets from consecutive cell payloads, then apply an aggregate to the IP stream.

In the example below, the query first divides the stream into virtual circuits (p1). Each logical substream on p1 is a sequence of ATM cells for a distinct VCI. Next, consecutive cell payloads from the same circuit are assembled into a stream of IP packets associated with that circuit; usually there are several cells per IP packet. (Assembly is actually more complex than this and is described in detail in the subsection on Windows.) Finally, the IP stream is qualified and an aggregate is applied to each qualified TCP/IP substream.

stream_demux {s1.atm.vci} p1
stream_proj {{p1.assemble_ip}} p2
stream_qual {{p2.ip_type.eq TCP}} p3
stream_agg {{p3.atm.vci p3.count}} p4

The pipe p4 is a set of logical substreams, each containing the count of TCP packets for one virtual circuit.
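The four steps can be sketched in Python as follows. This is a toy model under loud assumptions: each cell carries a hypothetical `last` flag marking the final cell of a packet, payloads are strings with a made-up `TYPE:` prefix standing in for the IP header, and `assemble_ip` here is far simpler than the real assembly the text defers to the subsection on Windows.

```python
from collections import defaultdict

def demux(records, key):
    """Partition records into substreams named by key(record)."""
    substreams = defaultdict(list)
    for rec in records:
        substreams[key(rec)].append(rec)
    return dict(substreams)

def assemble_ip(cells):
    """Toy reassembly: join consecutive payloads until a cell flagged 'last'."""
    packets, buf = [], []
    for cell in cells:
        buf.append(cell["payload"])
        if cell["last"]:
            # Hypothetical header format: "<TYPE>:<body>".
            ip_type, _, body = "".join(buf).partition(":")
            packets.append({"ip_type": ip_type, "body": body})
            buf = []
    return packets

def cell(vci, payload, last=False):
    return {"vci": vci, "payload": payload, "last": last}

# Cells from two circuits, interleaved as they might appear on the wire.
s1 = [
    cell(1, "TCP:he"), cell(2, "UDP:x", last=True), cell(1, "llo", last=True),
    cell(1, "UDP:y", last=True), cell(2, "TCP:a"), cell(2, "b", last=True),
    cell(1, "TCP:z", last=True),
]

p1 = demux(s1, key=lambda c: c["vci"])                         # cells per VCI
p2 = {vci: assemble_ip(cells) for vci, cells in p1.items()}    # packets per VCI
p3 = {vci: [p for p in pkts if p["ip_type"] == "TCP"]          # TCP only
      for vci, pkts in p2.items()}
p4 = {vci: len(pkts) for vci, pkts in p3.items()}              # count per VCI
print(p4)  # {1: 2, 2: 1}
```

Note that assembly only works because it runs per substream: cells from circuits 1 and 2 are interleaved in `s1`, so concatenating payloads before the demux would mix packets from different circuits.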

Tribeca allows users to demultiplex the same stream more than once. A second demux simply divides the original stream into more substreams. For instance, we can demux once by VCI and then, after assembling IP packets, demux again by IP type (UDP, TCP, etc.). These operations produce a distinct substream for each VCI/ip_type pair.

In partitioning the stream, the demux operator also "names" each substream. The substream name is not part of the data stream, but may be referred to in project and aggregate operations. In the example above, p3 is a set of substreams containing 3#3 pairs.
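Repeated demultiplexing and substream naming can be modeled in a few lines of Python. In this sketch (an assumption for illustration, not Tribeca's implementation) a substream name is a tuple that grows by one component with each demux, and it is kept alongside the records rather than inside them.

```python
from collections import defaultdict

def demux(substreams, key):
    """Refine every existing substream; each new name extends the old one."""
    refined = defaultdict(list)
    for name, recs in substreams.items():
        for rec in recs:
            refined[name + (key(rec),)].append(rec)
    return dict(refined)

# Hypothetical assembled IP packets, tagged with the circuit they arrived on.
packets = [
    {"vci": 1, "ip_type": "TCP"},
    {"vci": 1, "ip_type": "UDP"},
    {"vci": 2, "ip_type": "TCP"},
    {"vci": 1, "ip_type": "TCP"},
]

whole = {(): packets}                                   # undivided stream
by_vci = demux(whole, key=lambda p: p["vci"])           # names: (vci,)
by_vci_and_type = demux(by_vci, key=lambda p: p["ip_type"])  # names: (vci, ip_type)

# An aggregate can report the substream name next to its result.
counts = {name: len(recs) for name, recs in by_vci_and_type.items()}
print(counts)  # {(1, 'TCP'): 2, (1, 'UDP'): 1, (2, 'TCP'): 1}
```

Only pairs that actually occur get a substream: there is no `(2, 'UDP')` entry because no such packet appears in the input.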

Unfortunately, the demultiplex operator allows users to express queries that cannot be executed in available memory. The demux implementation uses memory proportional to the cardinality of the demux field, that is, to the number of distinct values and hence of substreams. In practice, however, the number of distinct VCIs, packet types, and so on in our data is small. Instead of removing demux from the language, Tribeca uses statistics and capacity planning support to help users avoid demux where it is inappropriate. For traffic analysis, demux only really breaks down when users want to partition the packet stream into substreams by time stamp. Streams are long and are typically sorted by time stamp, and users often want to apply aggregates to packets grouped by time value. Tribeca's window feature (described below) allows users to group records by the sort field in an efficient way.
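The memory argument can be illustrated with a small Python sketch. It does not reproduce Tribeca's window operator; it only shows, under the assumption that the stream is sorted by time stamp, why grouping on the sort field needs one open group at a time while a demux needs one substream per distinct key. The record layout and the function names are hypothetical.

```python
from itertools import groupby

def demux_state_size(records, key):
    """Number of substreams a demux on this key would have to keep open."""
    return len({key(r) for r in records})

def count_per_interval(records, width):
    """Streaming count over fixed time intervals; requires input sorted by ts.

    groupby only tracks the current run of equal keys, so state stays constant
    no matter how many intervals the stream spans.
    """
    for interval, group in groupby(records, key=lambda r: r["ts"] // width):
        yield interval * width, sum(1 for _ in group)

# Hypothetical trace, already sorted by time stamp.
records = [{"vci": i % 3, "ts": t}
           for i, t in enumerate([0, 1, 4, 10, 11, 12, 25, 26])]

print(demux_state_size(records, key=lambda r: r["vci"]))  # 3
print(demux_state_size(records, key=lambda r: r["ts"]))   # 8
print(list(count_per_interval(records, width=10)))        # [(0, 3), (10, 3), (20, 2)]
```

Demultiplexing by VCI keeps three substreams regardless of trace length, whereas demultiplexing by time stamp keeps one per distinct time value and so grows with the stream.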

