We applied the k-Nearest Neighbor classifier to the 1998 DARPA data. The 1998 DARPA Intrusion Detection System Evaluation program provides a large sample of computer attacks embedded in normal background traffic [18]. The TCPDUMP and BSM audit data were collected on a network that simulated the network traffic of an Air Force Local Area Network. The audit logs contain seven weeks of training data and two weeks of testing data. There were 38 types of network-based attacks and several realistic intrusion scenarios conducted in the midst of normal background data.
We used the Basic Security Module (BSM) audit data collected from a victim Solaris machine inside the simulation network. The BSM audit logs contain information on the system calls produced by programs running on the Solaris machine. See [19] for a detailed description of BSM events. We recorded only the names of the system calls. Other attributes of BSM events, such as the arguments to the system call, the object path and attributes, and the return value, were not used here, although they could be valuable for other methods.
The DARPA data was labeled with session numbers. Each session corresponds to a TCP/IP connection between two computers. Individual sessions can be programmatically extracted from the BSM audit data. Each session consists of one or more processes, and a complete ordered list of system calls is generated for every process. A sample system call list is shown below; it is read down each column, from the leftmost column to the rightmost. The first system call issued by Process 994 was close, followed by execve, then open, mmap, open, and so on. The process ended with the system call exit.
Process ID: 994
close | munmap | open | munmap | chmod |
execve | mmap | mmap | open | close |
open | mmap | mmap | ioctl | close |
mmap | close | munmap | access | close |
open | open | mmap | chown | close |
mmap | mmap | close | ioctl | close |
mmap | close | close | access | exit |
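As an illustration of how such per-process lists might be assembled, the following minimal sketch groups a stream of audit events by process ID while preserving their order. The (pid, name) record format, the function name calls_by_process, and Process 995 are hypothetical; the actual BSM record layout is richer and is not reproduced here.

```python
from collections import defaultdict


def calls_by_process(events):
    """Group (pid, syscall_name) pairs into one ordered list per process.

    `events` is assumed to be in audit-log order, so appending keeps
    each process's system calls in the order they were issued.
    """
    lists = defaultdict(list)
    for pid, name in events:
        lists[pid].append(name)
    return dict(lists)


# Hypothetical interleaved events from two processes of one session.
events = [
    (994, "close"),
    (995, "open"),
    (994, "execve"),
    (994, "open"),
    (995, "exit"),
]

per_process = calls_by_process(events)
```

Here per_process[994] holds the first three calls of the sample list above, in order.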
The number of occurrences of each system call during the execution of a process was counted. Text weighting techniques were then applied to transform the process into a vector. We used Equation (2) to encode the processes.
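The counting step can be sketched as follows for the sample list of Process 994, taken in execution order. This is only an illustration: the function name count_vector and the choice of vocabulary (here, just the calls seen in this one process, sorted alphabetically) are assumptions, and the weighting of Equation (2) is not applied.

```python
from collections import Counter

# System calls of Process 994 in execution order (sample list above,
# read column by column, as described in the text).
CALLS_994 = [
    "close", "execve", "open", "mmap", "open", "mmap", "mmap",
    "munmap", "mmap", "mmap", "close", "open", "mmap", "close",
    "open", "mmap", "mmap", "munmap", "mmap", "close", "close",
    "munmap", "open", "ioctl", "access", "chown", "ioctl", "access",
    "chmod", "close", "close", "close", "close", "close", "exit",
]


def count_vector(calls, vocabulary):
    """Return the raw occurrence count of each vocabulary entry in `calls`."""
    counts = Counter(calls)
    return [counts[name] for name in vocabulary]


# Assumed vocabulary: the distinct calls of this process, sorted by name.
# In practice it would cover every system call seen in the data set.
vocabulary = sorted(set(CALLS_994))
vector = count_vector(CALLS_994, vocabulary)

for name, count in zip(vocabulary, vector):
    print(f"{name:8s} {count}")
```

The resulting raw counts (for example, 10 occurrences of close and 9 of mmap) are the input to the weighting step; a call that never appears in a process simply receives a count of zero.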
In our off-line data analysis, the data set included the system calls executed by all processes except those of the Solaris operating system itself, such as inetd and the shells, which usually spanned several audit log files.