On-Demand View Materialization and Indexing for
Network Forensic Analysis
Roxana Geambasu1, Tanya Bragin1, Jaeyeon Jung2, and Magdalena Balazinska1
1 Department of Computer Science and Engineering, University of Washington, Seattle, WA
{roxana,tbragin,magda}@cs.washington.edu
2 Mazu Networks, Cambridge, MA
jyjung@mazunetworks.com
Abstract
Today, network intrusion detection systems (NIDSs) use custom
solutions to log historical network flows and support forensic
analysis by network administrators. These solutions are expensive,
inefficient, and lack flexibility. In this paper, we investigate
database support for interactive network forensic analysis. We show that an
"out-of-the-box" relational database management system (RDBMS) can
support moderate flow rates in a manner that ensures high query
performance. To enable support for significantly higher data rates,
we propose a technique based on on-demand view materialization and
indexing. In our approach, when an event occurs, the system proactively extracts
relevant historical data and indexes it in preparation for forensic
queries over that data. We show that our approach significantly
improves response times for a large class of queries, while
maintaining high insert throughput.
1 Introduction
Traditionally, a network intrusion detection system (NIDS) monitors
traffic at a network perimeter to block known attacks and
suspicious network behavior (e.g., port scans). However, as attackers
devise new ways to compromise and exploit end hosts (e.g., email
viruses, peer-to-peer malware), many enterprises have begun to
employ more sensors inside their network to monitor internal
network activity. As such, an NIDS has evolved to support internal
network security, e.g., detecting internal worm propagation.
The main role of an NIDS is to detect and flag suspicious network
activity in near real time. Some alerts can trigger an automatic
response such as dropping malicious packets or updating a network
filter. Many alerts, however, go through a manual verification process
in order to sift out false alarms. This verification process typically
involves checking the suspected host's recent network activity and
looking up any services recently run by that host. It may also involve
forensic investigation aimed at finding the root cause of the security
breach and identifying other affected hosts. To support the network
administrator, an NIDS must thus provide not only near-real-time
security event detection, but also convenient and efficient access to
historical network flow data. Near-real-time network event detection
has received significant attention in the
past [9,18,20]. In this paper, we
address the problem of building a historical flow database suitable
for forensic analysis queries. Figure 1 shows the
structure of an NIDS with a historical log database.
Figure 1: Network intrusion detection system with a
historical flow database. The NIDS receives network flow summary
data (such as NetFlow data [8]) from the enterprise's
internal routers, processes it, and generates alerts. While doing so,
it maintains historical flow summary data for network forensic
analysis.
Supporting a historical network flow database in conjunction with an
NIDS raises two important technical challenges. First, because network
traffic monitors generate data continuously and at a high rate, the
database needs to support a high data insertion rate. For example, in
a busy hour, a medium-size enterprise network with several thousand
of hosts may see up to 100,000 new flows per minute internally.
Second, to facilitate human-driven forensic analysis, the database
must answer historical queries quickly (within seconds). Today, NIDSs
try to meet these challenges by implementing their own storage
systems; such custom solutions, however, are expensive to build and
often offer limited and sometimes awkward
APIs [13].
In contrast, relational database management systems
(RDBMSs) [12,14,17,19] offer many features
desirable for conducting network forensic investigations: a powerful
query optimizer, indexes, and a flexible and standard query language
(e.g., SQL). It is a common belief, however, that an "out-of-the-box"
RDBMS is ill-suited to support the high data-rate workload of an
NIDS. Even if an RDBMS can store incoming data fast enough to
keep up with input data rates (by bulk loading the data, for example),
efficient querying requires indexes on multiple attributes, and
indexes are known to significantly decrease database insert
throughput [5]. Hence, it is expected that an
RDBMS would either be too slow to keep up with network flow data rates
or have unacceptably slow query execution times.
In this paper, we first demonstrate this trade-off between insert
throughput and query performance for an off-the-shelf RDBMS using a
network monitoring and forensic analysis workload
(Section 2). We show that an RDBMS has excellent
insert throughput as long as it only keeps one or two indexes on the
data (23,000 flows/s for two indexes in our experiments), but that
such a small number of indexes yields an unacceptably slow response
time for common forensic queries (from a few minutes to several
hours).
To address this shortcoming, we develop a new technique that we call
OVMI, for On-demand View Materialization and Indexing
(Section 3). The OVMI technique enables an NIDS to use
an off-the-shelf RDBMS as the network flow archive, yet ensure
both a high data insert rate and high forensic query
performance. OVMI exploits three properties of forensic analysis
queries to achieve this goal. First, alerts produced by the NIDS
partly determine subsequent forensic queries: i.e., administrators
query the system about the historical activity of the entities
involved in the newly detected event. Second, forensic queries are
"drill-down" queries: subsequent queries refine the selection
predicates of earlier queries. Finally, having a human in the loop
creates a time delay between the time when an event occurs and the
time when a first forensic query is issued. OVMI exploits this delay
to prepare for upcoming queries.
With OVMI, the RDBMS does not index data continuously. Instead, when
an event occurs, the NIDS requests that the RDBMS heavily index
only the recent network activity of the entities involved in the
event. To ensure fast creation of these partial indexes, the RDBMS
first copies all relevant data into a separate table and builds
indexes on that table. The existence of these materialized views is
known to the NIDS but transparent to the user. We show that the OVMI
approach works well for a large class of forensic analysis queries,
where an administrator starts submitting queries within a short time
period (minutes to tens of minutes) of an event and is interested in
the past few hours' worth of network flow data.
OVMI improves query evaluation performance by orders of magnitude
while supporting high data rates (up to 3,000 flows/s in our
experiments). OVMI also helps quickly produce partial results when a
query stretches past the materialized time window of data.
2 Storing Flow Records in an RDBMS
The data rate that an NIDS must handle depends on several factors,
including the number of hosts on the monitored network, the
applications that these hosts are running, and hosts' network activity
levels. Roughly, we estimate that the flow arrival rate can
easily reach 100,000 flows per minute (≈ 1,700 flows/s) for a
medium-to-large enterprise where thousands of active hosts generate
traffic.1 The flow
arrival rate can be even higher for larger organizations or if a
network worm causes infected machines to engage in network
scans.
In this paper, we assume that an NIDS receives streams of flow
records from network sensors or routers, which track each flow,
updating the flow's attributes such as the number of packets, bytes,
and TCP flags. These sensors or routers can be configured to export
a set of flow records to an NIDS for further processing when these flows
expire. Our goal is thus to devise a historical flow database that
will easily handle flow record insert rates in excess of 3,000
flows/s. In this section, we benchmark the performance of an RDBMS in
handling such data rates while maintaining various indexes in order to
provide good query performance.
2.1 Method
We perform all experiments on a PostgreSQL 8.1.5
database [19]. PostgreSQL is one of the most advanced open
source databases. We expect the performance of commercial databases,
however, to exceed that of PostgreSQL.
For our experiments, we use two traces: Trace 1 is a 10-hour
network trace from a medium-size Internet Service Provider (ISP). This
trace is somewhat old (collected in April 2003), but it contains flow
records from two hosts (out of 389) infected by the Code Red worm
that scanned hundreds of thousands of IP addresses, making the trace
suitable for testing RDBMS performance in the presence of security events.
Trace 2 is a 22-day network trace from a small enterprise collected
in October-November 2006. Table 1 shows the main
attributes associated with the network flow data in both traces.
The average flow rates are 27 flows/s for Trace 1 and 10 flows/s for Trace 2.
Since our traces are over limited periods
of time, we cycle them to simulate more extended monitoring
activity. We also replay them at higher speeds to simulate higher data
rates.
Column   | Type        | Description
---------+-------------+---------------------------------
start_ts | timestamp   | start time of flow
protocol | varchar(10) | TCP, UDP, ICMP, etc.
cli_ip   | int         | client IP
srv_ip   | int         | server IP
cli_port | varchar(8)  | client port
srv_port | varchar(8)  | server port
app      | varchar(20) | application
c2s_p    | int(64)     | # packets from client to server
s2c_p    | int(64)     | # packets from server to client
Table 1: Schema of a network flow database. We show only
attributes relevant to our discussion. All attributes can be
obtained from network packet headers, except app (e.g.,
MSExchange, Lotus Notes), which is available via payload
inspection. Flows are generated by app.
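For concreteness, the schema in Table 1 could be declared as in the
following sketch; the PostgreSQL type mappings are our assumption
(e.g., we render int(64) as bigint):

CREATE TABLE Flows (
  start_ts timestamp,    -- start time of flow
  protocol varchar(10),  -- TCP, UDP, ICMP, etc.
  cli_ip   int,          -- client IP
  srv_ip   int,          -- server IP
  cli_port varchar(8),   -- client port
  srv_port varchar(8),   -- server port
  app      varchar(20),  -- application, via payload inspection
  c2s_p    bigint,       -- # packets from client to server
  s2c_p    bigint        -- # packets from server to client
);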
For all benchmarks, we use a machine with two 3 GHz Intel Xeon
processors with hyper-threading enabled, 8 GB of RAM, and a 369 GB
Western Digital SATA disk (8.9 ms seek, 16 MB cache, 7200 RPM). We
believe that this machine is representative of what one would use for
a DBMS server.
2.2 Database Bulk Load Throughput
The database insert throughput determines the maximum average flow
rate that the database can keep up
with. Figure 2 shows the insert throughput for
increasing numbers of indexes. In this experiment, the client
bulk-loads the data, blocking between each bulk. The figure shows
results for 500-tuple bulk sizes. Error bars show maximum throughputs
as the bulk size changes. The indexed attributes are in order:
start_ts, cli_ip, srv_ip, proto,
srv_port, cli_port, and app. All indexes are B-trees to
enable range queries.
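For reference, the following sketch shows how the benchmark client
could be set up; the index names are ours, and the bulk loads use
PostgreSQL's COPY command:

-- Bulk load: the client streams 500-tuple batches, blocking between batches.
COPY Flows FROM STDIN;

-- Indexes are added one at a time; all are default B-trees.
CREATE INDEX flows_start_ts_idx ON Flows (start_ts);
CREATE INDEX flows_cli_ip_idx   ON Flows (cli_ip);
CREATE INDEX flows_srv_ip_idx   ON Flows (srv_ip);
-- ...and similarly for proto, srv_port, cli_port, and app.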
As shown in the figure, without indexes, the database can sustain
rates in excess of 45,000 flows/s. Each added index significantly
cuts the throughput. For Trace 2, even though the throughput decreases,
the database can sustain all 7 indexes and still achieve our 3,000
flows/s goal. However, this is not the case in general. For Trace 1,
the throughput falls below 3,000 flows/s after 5 indexes.
Additionally, the throughput is barely above 3,000 flows/s as soon as
the database maintains more than two indexes. Therefore, maintaining
multiple indexes is not a scalable approach to achieving high input
rates.
Figure 2: Effect of number of indexes on insert throughput.
Indexes significantly cut throughput for both traces.
Figure 3: Effect of type of indexed attribute on insert
throughput. The throughput of a database depends on the properties of the indexed attributes.
srv_ip(int) represents a hash value of the string
srv_ip(str) attribute.
Figure 2 also suggests that the effect of an
additional index on throughput depends on the properties of the
indexed attribute. Figure 3 shows this effect in more
detail by presenting the maximum insert throughput for Trace 1 with a
single index but for different indexed attributes. Attribute
properties that affect performance include the data distribution: e.g.,
start_ts is clustered and monotonically increasing, while
srv_ip(int) is unclustered and has a large number of distinct
values due to port scans by the infected hosts. Another important
property is the attribute's data type: e.g., indexing integers
(srv_ip(int)) is significantly faster than indexing strings
(srv_ip(str)). Figure 3 also shows that the database
can maintain almost the same throughput whether it has a single
multi-attribute index (e.g., (cli_ip,srv_ip)) or an
equivalent number of single-attribute indexes (e.g.,
cli_ip and srv_ip). Grouping indexes into multi-attribute
indexes thus provides minimal performance gains.
Properties other than the ones we observed can also affect
performance. However, because the choice of indexes must be based on
workload and not index performance, and because the distribution of
input data cannot be controlled, we expect the maximum possible number
of indexes to remain low in practice.
2.3 Network Forensics Queries
Suppose a host X is detected as having been scanning
hundreds of hosts on TCP port 25 at a slow rate for the past two
days. This event may lead the administrator to issue the following
queries to identify the root cause of this scanning behavior (e.g.,
email virus propagation or unregistered email server):
(Q1) What client applications has X been running
and how much traffic was generated for each application?
SELECT app, count(*) FROM Flows
WHERE start_ts >= 'now - T' AND cli_ip = X
GROUP BY app
Q1 is one example of a forensic analysis query. In general,
administrators need to issue several such queries, using different
attributes in their search. For example, to investigate the same alarm,
the administrator may further need to see what applications X
has been running as a server instead of
as a client (i.e., WHERE srv_ip = X) (Q2) or both as client
and server (i.e., WHERE cli_ip = X OR srv_ip = X) (Q3).
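For concreteness, Q3 could be written as the following sketch, which
mirrors Q1's template:

(Q3) SELECT app, count(*) FROM Flows
WHERE start_ts >= 'now - T' AND
(cli_ip = X OR srv_ip = X)
GROUP BY app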
After running some initial queries, the administrator often needs to
execute additional queries to "drill-down" on the problem. Let us
assume that the administrator executed query Q3, which returned the
following result:
WEB          52563
MSExchange    2231
NETBIOS139     476
...
Given the NETBIOS and MSExchange traffic, the result suggests that the
host in question is running a Microsoft Windows operating system. The
administrator is now suspicious of a possible infection by a MyDoom
email virus. To verify her suspicion, she checks whether
X has received any traffic on port 1034, which is known to
be a backdoor port used by MyDoom.
(Q4). Has X received any traffic on port 1034?
SELECT s2c_p, c2s_p FROM Flows
WHERE start_ts >= 'now - T' AND
srv_ip = X AND srv_port = 1034
The administrator may also submit more sophisticated SQL queries.
An example is to look for backdoor intrusions: the administrator
wants all pairs of malicious "triggers", in which a first
attack flow causes the victim to initiate a new flow
back to the attacker to register the success of the exploit (Q5).
SELECT * FROM Flows a, Flows b
WHERE a.srv_ip = b.cli_ip AND
a.cli_ip = b.srv_ip AND
a.srv_port = 1034 AND
a.start_ts < b.start_ts AND
b.start_ts - a.start_ts < interval '600 seconds'
To support a variety of such queries effectively, an RDBMS needs to
maintain a variety of indexes on the different attributes that appear
in query predicates. With only one or two indexes, a large fraction of
queries will require scanning all historical data within the window
T, yielding unacceptably poor performance. For example, for an
average flow rate of 3,000 flows/s and with only an index on
start_ts, Q2 takes about 19s for T=1 h, about 6 min for T=6 h,
about 57 min for T=1 day, and as much as 5.1 h for T=2 days. With
an additional index on srv_ip, the same queries take under
one second.
Fortunately, forensic queries have several properties that we can
leverage to support them efficiently:
- Forensic queries are highly predictable. By definition, these
queries request historical information about the recent network
activity of entities involved in alerts. Alert attributes (e.g.,
alert time, IP address of the machine generating the abnormal
network activity) thus determine the bulk of upcoming forensic
queries.
- Forensic queries look at a tiny fraction of all the network flow
data. In our traces, we find that even the busiest server is
responsible for less than 3% of all flows.
- Forensic analysis is usually composed of one or more drill-down
queries: subsequent queries refine the selection predicates of
earlier queries, requesting a subset of previously selected tuples.
- There is typically a time lag between the time an alert occurs
and the time an administrator investigates it.
In what follows, we propose a scheme that exploits these properties to
efficiently support network forensics queries in an off-the-shelf
relational database.
3 On-Demand View Materialization and Indexing (OVMI)
To support network forensics queries, we propose to maintain only a
full index on start_ts and create additional indexes over the
historical flow log on-demand (i.e., when an event occurs) and
partially (i.e., only index data that is relevant to the
alert). Since our scheme (OVMI) does not rely on multiple full
indexes, the database can sustain a high average insert rate of network flow
data. At the same time, because OVMI still indexes the relevant data,
it provides fast query execution. Given that forensic queries are
predictable (Section 2.3), a simple rule-based
deterministic algorithm can be used to select the predicates and
indexes for each newly detected event.
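As one illustration of such a rule base (entirely hypothetical; we
only require that the mapping from alerts to predicates and indexes
be deterministic), the NIDS could consult a small table keyed by
alert type:

-- Hypothetical rule table; names and contents are our illustration.
CREATE TABLE ovmi_rules (
  alert_type  varchar(32),  -- e.g., 'port_scan'
  predicate   text,         -- e.g., 'cli_ip = $host OR srv_ip = $host'
  index_attrs text          -- e.g., 'srv_ip, srv_port, app'
);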
3.1 Partial Materialization and Indexing
PostgreSQL already supports partial indexes [19], where an
index is built only over tuples that match a particular predicate.
However, when building a partial index, PostgreSQL does not utilize
existing indexes that match the predicate but instead scans the
entire underlying relation. This approach is untenable in our
setting because the historical log is large and grows with time.
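For reference, a native partial index would look like the following
sketch (the index name and constants are ours; PostgreSQL requires
constant expressions in the predicate). Building it scans all of
Flows, which is exactly the cost OVMI avoids:

CREATE INDEX scan_event_1_pidx ON Flows (srv_port)
WHERE start_ts >= '2006-11-01 00:00'            -- constant window bound
AND (cli_ip = 167772161 OR srv_ip = 167772161); -- host X as an int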
Instead, we propose to use a materialized view. When an alert occurs,
OVMI reacts by copying into a temporary table all network flow data
that falls within some predefined time window (T) and matches a predicate determined by the
alert. In doing so, OVMI utilizes the index on start_ts and avoids
scanning the whole relation. The following example shows the
materialized view scan_event_1, created in response to a scan alert involving host X.
SELECT * INTO scan_event_1 FROM Flows
WHERE start_ts >= 'now - T' AND
start_ts <= now AND
(cli_ip = X OR srv_ip = X)
               |              Window Size
               | 1 hour | 6 hours |    1 day | 2 days
Scan Window    |   19 s | 6.2 min |   56 min |    5 h
Materialize 5% |   24 s | 6.5 min | 58.4 min |  5.3 h
Table 2: Scanning and materializing various time
windows. View materialization is largely dominated by the time
it takes to scan the window. We assume a flow rate of 3,000
flows/s and materialize 5% of the window.
Figure 4: Time to index the materialized view. We show the
time to create one to three indexes concurrently
and compare it with the time to create them sequentially. Concurrent
indexing is significantly faster than sequential indexing, and its
cost is small compared to that of view materialization.
As mentioned in Section 2.3, typical forensic queries
only look at a small percentage of the tuples in the time window (T).
Table 2 shows the time it takes to scan a window T and
to materialize 5% of the tuples in T, assuming a flow rate of 3,000
flows/s.2 We see that the view
materialization time is largely dominated by the time it takes to scan
the window. For example, materializing 5% of a 1-day window incurs
only about a 4.2% penalty over scanning the window.
5% of a large time window can still represent a significant amount
of data if the flow rate is high. For example, six hours of flow data
at 3,000 flows/s represents 65 million tuples. While our simple
queries from Section 2.3 take on the order of seconds to
complete even on the 5% view of a 2-day window (e.g., query Q1 takes
about 80 sec), in the absence of indexes, more complex queries can take
significantly longer because the database can only access the data by
scanning it.
To achieve high query performance, indexes must be built on the
materialized view. Figure 4 shows the time to
build between one and three indexes simultaneously on the
materialized view. For 5% of one day, even three simultaneous indexes
take only about 10 min to build. Index creation drops to 1 min
for a 6-hour window. Indexing the materialized view thus adds a
small overhead compared with materializing the view in the
first place.
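Concretely, the concurrent index builds amount to issuing statements
such as the following sketch on separate connections (the index names
are ours):

-- Issued concurrently, one statement per connection:
CREATE INDEX scan_event_1_srv_ip_idx   ON scan_event_1 (srv_ip);
CREATE INDEX scan_event_1_srv_port_idx ON scan_event_1 (srv_port);
CREATE INDEX scan_event_1_app_idx      ON scan_event_1 (app);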
Overall, the cost to materialize and index the data
in preparation for forensic analysis queries is dominated by the time
to scan the historical window of data. The time varies greatly
depending on the size of this window. Therefore, OVMI can serve to
prepare a shorter or longer window of historical data, depending on
flow rates and the expected delay between the time an alert occurs and
the time the administrator starts the investigation. With 3,000
flows/s and for a typical scenario where the time lag is on the order
of minutes or tens of minutes, the OVMI approach enables the system to
easily pre-process over 6 hours' worth of data. Additionally, as we
discuss in the following subsection, OVMI also benefits queries with a
window of interest larger than that of the materialized view.
One limitation of this approach is that potentially important network
traffic that follows the initial alert is not included in the view.
To address this limitation, we can extend the above approach as
follows. When the NIDS triggers the alert, the system materializes and
indexes a given window of data. It then continues to insert new data
into the materialized view until the administrator indicates that the
alert is no longer interesting. While updating the materialized view,
the database throughput would decrease. The decrease, however, would
be temporary. Once the threat is handled, the materialized view would
no longer need to be updated, and the database throughput would go
back to its pre-alert value.
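One simple way to realize this extension, sketched below under the
assumption of a periodic refresh, is to append newly arrived flows
that match the alert predicate to the view:

-- Run periodically until the administrator dismisses the alert;
-- 'last refresh' stands for the timestamp of the previous refresh.
INSERT INTO scan_event_1
SELECT * FROM Flows
WHERE start_ts > 'last refresh' AND
(cli_ip = X OR srv_ip = X);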
3.2 Partitioned Query Execution
In some scenarios, forensic queries may request more information than
is available in the materialized view. For these scenarios, OVMI is
still helpful, because the materialized view that it prepares can help
answer a subset of the query faster. There exists a vast literature
on exploiting materialized views to improve query execution
performance [10,11]. In our
approach, although we support arbitrary queries over the materialized
view and the raw archive, we currently only split selection queries on
their time predicates, which is enough to demonstrate the benefit of
OVMI.
More specifically, given a query that overlaps a materialized view, we
split it into two subqueries based on time: one subquery runs on the
view exploiting its indexes, and the other one accesses the data via a
scan of the remaining time window in the Flows relation. For
example, given the sample materialized view from
Section 3.1 over the last T=6 h and query Q4 over
T=7 h, the query is split into the following subqueries:
(Q4.1) SELECT s2c_p, c2s_p FROM scan_event_1
WHERE srv_ip = X AND srv_port = 1034
(Q4.2) SELECT s2c_p, c2s_p FROM Flows
WHERE srv_ip = X AND srv_port = 1034
AND start_ts >= 'now - 7h'
AND start_ts < 'now - 6h'
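If a single combined answer is desired, the two subqueries can also
be stitched together with a UNION ALL, as in the sketch below;
running them separately, however, preserves the fast first results
noted above:

SELECT s2c_p, c2s_p FROM scan_event_1
WHERE srv_ip = X AND srv_port = 1034
UNION ALL
SELECT s2c_p, c2s_p FROM Flows
WHERE srv_ip = X AND srv_port = 1034
AND start_ts >= 'now - 7h' AND start_ts < 'now - 6h';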
Table 3 shows the performance of partitioned
Q4 queries if a materialized view indexed on srv_port
exists. We vary the overlap of the query's time interval and the
view's window. In addition to the virtually zero delay in retrieving
the first results from the indexed materialized view, splitting
the query also improves its overall performance: the larger the window covered
by the materialized view, the greater the gain. For example, a gain of about 94%
is achieved in the favorable case where the view covers 5 hours and only
1 hour lies outside it. In summary, OVMI greatly improves performance
not only for queries that match the materialized window of data, but
also for queries that span a larger time interval.
However, there are cases in which splitting a query penalizes its
total execution time. For example, if we only materialize flows with
cli_ip = X within a certain time window, splitting the query
SELECT * FROM Flows WHERE start_ts > T AND (cli_ip = X OR
cli_ip = Y) will yield worse performance than executing the query
only on the Flows relation. For this reason, we currently do not
perform query splitting for such queries, although splitting would
still improve user-perceived latency by returning the first results fast.
Hrs. Inside Mat. View | Time: Results from Mat. View | Unsplit
  + Hrs. Outside      |   + Results from Flows       |  Query
5h + 1h               | 0.02 s + 21 s                | 6.3 min
1h + 5h               | 0.02 s + 4.8 min             | 6.3 min
Table 3: Performance of split queries. We show the
execution time for both subqueries of the split query, for
different window sizes covered by or outside the materialized
view. Splitting queries improves query performance and allows the
user to get the first results fast.
4 Related Work
Traditional NIDSs detect intrusions, known attacks, and
suspicious network activity [18,20]. In this
paper, we focus on a newer class of NIDSs that provide both
intrusion detection functionality and network forensic analysis by
keeping historical flow logs [1,2]. Existing NIDS
solutions use custom databases whose performance results are not
publicly available. We show that it is possible to use an open source
relational database to store network flow information, and we propose an
on-demand materialization and indexing technique that effectively
speeds up forensic queries while maintaining a high insert
throughput.
The increased popularity of monitoring applications, including network
monitoring and network intrusion detection, has led the database
community to develop a new class of general-purpose data management
systems, called stream processing engines (SPEs).3 Examples of SPEs include
Aurora [3], Borealis [4],
Gigascope [9], STREAM [15], and
TelegraphCQ [6]. These SPEs process flow data
directly as it arrives, without storing it. As such, they provide
high-performance continuous processing, but do not support forensic
analysis queries. Our approach is thus complementary to these efforts.
The Seaweed [16] peer-to-peer data management system
can serve as a historical flow database. Seaweed's focus, however, is
on the availability and scalability challenges of accessing highly
distributed data sets. It provides no mechanism for speeding up
historical queries at each site, which is the focus of our approach.
Answering queries using materialized
views [10,11] and caching query
results [23] are known techniques for speeding up query
execution. Our contribution lies in applying these techniques to the
domain of network forensic analysis and proposing a new scheme that
materializes relevant network flow data on-demand, when an event
occurs, to speed up upcoming queries.
Partial indexes, which enable a database to index only tuples matching
a predicate, have been investigated in the
past [21,22,24].
Our approach can be thought of as building partial indexes
on-demand. However, instead of using the native partial index
mechanism, we materialize data matching the predicate to avoid the
overhead of scanning the entire relation while building the index.
FELIX [5] is an approach for temporarily
suspending the indexing of archived stream data during overload. In contrast,
in our application domain, the database is unable to maintain full
indexes under normal load conditions.
Handling large amounts of historical data
is not a new problem in databases. However, most research in this area
has focused on warehousing scenarios, where data is archived, indexed
and queried offline, with plenty of time to optimize reads for data
warehousing applications [7].
5 Conclusion
Forensic analysis is a critical task in network monitoring
applications. In this paper, we proposed OVMI, an On-demand View
Materialization and Indexing technique that enables an NIDS to use an
off-the-shelf RDBMS for its historical log. With OVMI, the system can
sustain high network flow rates, while achieving high performance for
forensic analysis queries. We showed that with a data rate of 3,000
flows/s, OVMI can prepare the system within minutes to handle forensic
analysis queries over several hours of data preceding the alert. We
view OVMI as an important step in the direction of using RDBMSs
instead of custom solutions in network monitoring applications.
One limitation of OVMI is that it improves performance for forensic
analysis queries over only the most recent several hours' worth of
historical data. While this amount of data is frequently sufficient,
administrators sometimes need to access even older data. In future
work, we plan to investigate additional techniques to further improve
RDBMS support for applications within the network monitoring domain,
including efficiently querying larger volumes of historical data. We
also plan to investigate archiving and efficiently querying data other
than just flow data.
References
[1] Mazu Networks. http://www.mazunetworks.com/, 2007.
[2] NetScout. http://www.netscout.com/, 2007.
[3] Abadi et al. Aurora: A new model and architecture for data stream
management. VLDB Journal, 12(2), Sept. 2003.
[4] Abadi et al. The design of the Borealis stream processing engine.
In Proc. of the CIDR Conf., Jan. 2005.
[5] S. Chandrasekaran. Query processing over live and archived data
streams. Ph.D. thesis, Univ. of California, Berkeley, 2005.
[6] Chandrasekaran et al. TelegraphCQ: Continuous dataflow processing
for an uncertain world. In Proc. of the CIDR Conf., Jan. 2003.
[7] S. Chaudhuri and U. Dayal. An overview of data warehousing and
OLAP technology. SIGMOD Rec., 26(1), 1997.
[8] Cisco Systems. Cisco IOS NetFlow. http://www.cisco.com/go/netflow, 2007.
[9] C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope:
A stream database for network applications. In Proc. of the 2003
SIGMOD Conf., June 2003.
[10] J. Goldstein and P.-A. Larson. Optimizing queries using
materialized views: A practical, scalable solution. In Proc. of the
2001 SIGMOD Conf., 2001.
[11] A. Y. Halevy. Answering queries using views: A survey. The VLDB
Journal, 10(4), 2001.
[12] IBM. DB2. http://www.ibm.com/db2, 2006.
[13] J. Snyder, D. Newman, and R. Thayer.
http://www.networkworld.com/reviews/2003/1013idsrev.html?page=3, 2003.
[14] Microsoft. SQL Server. http://www.microsoft.com/sql/, 2006.
[15] Motwani et al. Query processing, approximation, and resource
management in a data stream management system. In Proc. of the CIDR
Conf., Jan. 2003.
[16] D. Narayanan, A. Donnelly, R. Mortier, and A. Rowstron. Delay
aware querying with Seaweed. In Proc. of the 32nd VLDB Conf., Sept. 2006.
[17] Oracle. http://www.oracle.com/, 2006.
[18] V. Paxson. Bro: A system for detecting network intruders in
real-time. Computer Networks, 31(23-24):2435-2463, 1999.
[19] PostgreSQL. http://www.postgresql.org/, 2006.
[20] M. Roesch. Snort: Lightweight intrusion detection for networks.
In Proc. of the 13th Large Installation System Administration
Conference (LISA), pages 229-238, 1999.
[21] C. Sartori and M. R. Scalas. Partial indexing for nonuniform data
distributions in relational DBMS's. IEEE Transactions on Knowledge
and Data Engineering, 6(3):420-429, 1994.
[22] P. Seshadri and A. N. Swami. Generalized partial indexes. In
Proc. of the 11th ICDE Conf., 1995.
[23] J. Shim, P. Scheuermann, and R. Vingralek. Dynamic caching of
query results for decision support systems. In Proc. of SSDBM '99, 1999.
[24] M. Stonebraker. The case for partial indexes. Technical Report
UCB/ERL M89/17, EECS Department, University of California, Berkeley, 1989.
Footnotes:
1 We have seen on average 50,000 new flows per minute from a network
with about 400 hosts during business hours.
2 The results shown are for Trace 2. We found almost identical results
when using Trace 1.
3 These engines are also called data stream management systems
(DSMSs) [3,15], continuous query processors [6], and stream
databases [9].