We have just shown that the kNN classifier algorithm can be used for effective abnormality detection. The overall running time of the kNN method is O(N), where N is the number of processes in the training data set (k is usually a small constant). When N is large, this method could still be too computationally expensive for some real-time intrusion detection systems. To detect attacks more effectively, kNN anomaly detection can easily be integrated with signature verification: malicious program behavior can be encoded into the training set of the classifier. After carefully studying the 35 attack instances within the seven-week DARPA training data, we generated a data set of 19 intrusive processes. This intrusion data set covers most attack types in the DARPA training data. It includes the most clearly malicious processes, such as ejectexploit, formatexploit, and ffbexploit.
For the improved kNN algorithm, the training data set includes 606 normal processes as well as the 19 aforementioned intrusive processes. The 606 normal processes are the same as the ones in Section 4.2. Each new test process is first compared to the intrusive processes. Whenever there is a perfect match, i.e., the cosine similarity is equal to 1.0, the new process is labeled as intrusive behavior (one could also check for near matches). Otherwise, the anomaly detection procedure in Figure 1 is performed. Because the number of intrusive processes in the training data set is small, this modification of the algorithm adds only minor computation for normal testing processes.
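The two-stage procedure described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: processes are assumed to be already encoded as equal-length numeric vectors, the names `cosine_similarity`, `classify`, and the tolerance `tol` are hypothetical, and the second stage assumes that the Figure 1 procedure compares the average similarity of the k nearest normal neighbors against the threshold.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(process: Vector,
             intrusive: Sequence[Vector],
             normal: Sequence[Vector],
             k: int = 10,
             threshold: float = 0.8,
             tol: float = 1e-9) -> str:
    """Return "intrusive", "normal", or "abnormal" for one test process."""
    # Stage 1: signature verification against the small intrusive set.
    # `tol` (an assumption) absorbs floating-point error in the "equal to 1.0" test.
    for signature in intrusive:
        if cosine_similarity(process, signature) >= 1.0 - tol:
            return "intrusive"

    # Stage 2 (assumed form of the kNN anomaly detection step):
    # average the k largest similarities to the normal training processes.
    sims = sorted((cosine_similarity(process, n) for n in normal), reverse=True)
    top_k = sims[:k]
    if not top_k:
        return "abnormal"  # no normal profile to compare against
    average = sum(top_k) / len(top_k)
    return "normal" if average > threshold else "abnormal"
```

Because the first loop runs over only the intrusive processes (19 in the setup above), a process that matches a signature is labeled without being compared to any of the normal processes, while a process that matches no signature incurs only those few extra similarity computations.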
The performance of the modified kNN classifier algorithm was evaluated on 24 attacks within the two-week DARPA testing audit data. The DARPA testing data contains some known attacks as well as novel ones. Some duplicate instances of the eject attack were not included in the test data set. The false positive rate was evaluated with the same 5285 normal testing processes as described in Section 4.2. Table 3 presents the attack detection accuracy for k=10 and a threshold of 0.8. At this threshold, the false positive rate is 0.59% (31 false alarms).
The two missed attack instances both belonged to a new denial-of-service attack called process table. They exactly matched one of the normal training processes, which made them impossible for the kNN algorithm to detect. The process table attack was implemented by establishing connections to the telnet port of the victim machine every 4 seconds and exhausting its process table so that no new process could be launched [21]. Since this attack consists of abuse of a perfectly legal action, it did not show any abnormality when we analyzed individual processes. However, because it is characterized by an unusually large number of active connections on a particular port, this denial-of-service attack could easily be identified by other intrusion detection methods.
Of the other 22 attacks, all of which were detected, eight were captured by signature verification. These eight attacks could have been identified without signature verification as well. With signature verification, however, we did not have to compare them with each of the normal processes in the training data set.
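As a quick arithmetic check, the rates implied by the counts reported in this section can be recomputed as below. The variable names are illustrative, and the detection rate is derived here from the stated counts (24 attacks, two missed) rather than quoted from the text.

```python
# Counts taken from the text of this section.
total_attacks = 24
missed_attacks = 2
normal_test_processes = 5285
false_alarms = 31

detected_attacks = total_attacks - missed_attacks          # 22
detection_rate = detected_attacks / total_attacks          # fraction of attacks detected
false_positive_rate = false_alarms / normal_test_processes # fraction of normal processes flagged

print(f"detected attacks:    {detected_attacks}")
print(f"detection rate:      {detection_rate:.1%}")
print(f"false positive rate: {false_positive_rate:.2%}")
```

The false positive rate rounds to the 0.59% reported above.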