Steven Cheung and Alfonso Valdes
Computer Science Laboratory, SRI International
Initial experimental results indicate that our approach is useful in facilitating automated IDS alert pattern discovery, and in characterizing malware that manifests as multiple attack steps. Also, it may be used in identifying redundant signatures, enabling IDS performance tuning. Specifically, we examined the Snort rule identifiers (SIDs) of the alerts generated by the BotHunter tool, developed in the Cyber-Threat Analytics project, considering which SIDs co-occur pertaining to the same identified bot instance. Our exploratory analysis indicates that IDS alerts corresponding to bots can be expressed in terms of a small number of factors. Also, some bot families have distinguishing factor patterns.
Honeynets address part of this difficulty by allowing researchers to harvest traffic strongly suspected of being malicious. However, much traffic to a honeynet may consist of unsuccessful infection attempts. Dialog-based correlation, a technique introduced in BotHunter [5], can reliably identify instances of successful bot infections by correlating IDS alerts pertaining to different phases of bot infections. To model the infection process, BotHunter detects five common types of (loosely ordered) communication flows between an internal host and a set of external hosts that may be observed during bot infections:
This paper examines bot infection profiles collected via BotHunter on a high-interaction honeynet. We analyzed 106 infection instances, which triggered a total of 26 unique Snort rules in various combinations. We examined the triggered rules in the various instances using the statistical technique of factor analysis (e.g., [11]), described in Section 3. To validate the discovered rule patterns, we harvested the corresponding malware binaries and submitted them to www.virustotal.com for labeling.
This paper presents the feasibility and utility of the factor analysis technique as applied to Snort SIDs. We believe that this approach may be applicable to other event types from a collection of heterogeneous sensors, such as alerts from a variety of IDSs and firewalls. The technique uses very little information, and in particular does not use the IP addresses of attack traffic. As such, it is suitable for analysis of large-scale, privacy-preserving alert repositories such as that of the Cyber-Threat Analytics (Cyber-TA) project.
The contributions of this paper are summarized as follows. We present a novel alert correlation approach based on the factor analysis statistical technique for characterizing malware. This approach identifies patterns in the IDS alerts triggered by malware, facilitating rapid and consistent identification and classification of existing and emerging malware threats. This is accomplished without predefined attack models, which are typically labor and time intensive to construct. Moreover, the method discovers potentially unknown relationships among event types. Our approach can also facilitates IDS performance tuning by uncovering redundant IDS rules, which exhibit as co-occurring alert IDs for malware instances, and are highly correlated with the same factors. We conducted an experiment to validate the usefulness of the approach. Our initial results are promising. For a set of bot instances we analyzed, we achieve a significant dimensionality reduction by using factor analysis: over 89% of the variance in the alerts (of 26 types) can be explained by five factors. Every malware instance in the dataset can be expressed succinctly by a few factors. To examine the consistency of our approach, we compare our results with the labels assigned to the binaries of the bot instances by existing malware detection products.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 describes our approach and reviews the factor analysis technique. Section 4 presents the results of performing factor analysis on the BotHunter alerts. We observed that five factors are sufficient to enable us to characterize the dataset containing 26 different Snort signatures. Moreover, there are only seven patterns of (significant) factor loadings. We are able to attribute some of these patterns to command and control activity, and others to specific malware species. Section 5 describes an attempt to validate the experimental results using malware labels obtained from third-party virus research and classification efforts. Section 6 discusses our findings. Section 7 concludes and presents our plan for future work.
To better comprehend alerts generated by IDS and firewalls to achieve situation awareness, a significant amount of research has been performed on alert correlation.
By analyzing potentially a large volume of alerts (possibly from a myriad of heterogeneous sensors), identifying the security-critical ones and discounting the false alarms, and inferring the relationships among the alerts, alert correlation aims to generate high-level digests that summarize and explain the alerts.
Previous alert correlation work includes Goldman et al.'s Scyllarus correlation framework [4], Julisch's work using alarm clustering to perform root cause analysis [6], Ning et al.'s abstraction-based intrusion detection approach [8], Porras et al.'s M-Correlator [9], a mission-impact-based correlation system that uses common-attribute-based aggregation, network topology analysis, and mission criticality specification to perform alert fusion and ranking, Qin's and Lee's statistical causality analysis approach for alert correlation [10], Staniford et al.'s graph-based approach for detecting large-scale attacks [12], Valdes's and Skinner's probabilistic approach for correlating IDS alerts based on (possibly imperfect) matching of attributes [16], and Yang et al.'s distributed coordinated attack detection system called CARDS [19].
For detecting complex, multistep attacks, an attack modeling approach based on specifying the pre- and post-conditions of the individual attack steps has been proposed in Cuppens' and Ortalo's LAMBDA [3], Michel's and Mé's ADeLE [7], Ning et al.'s abstraction-based approach [8], and Templeton's and Levitt's Jigsaw [14]. Valeur et al. [17] proposed a state-transition-based modeling framework for detecting multistep attacks. To facilitate multistep attack modeling, Cheung et al. [2] presented attack patterns to ease reuse of attack module specifications. The effectiveness of these approaches strongly relies on the accuracy and the coverage of the attack models developed. Our work presents a complementary approach for detecting multistep attacks by discovering alert patterns without depending on predefined attack models.
Viinikka et al. [18] presented a time-series-based approach for modeling the regularities of and eliminating the background noise in the alert flows. Using that approach, security analysts can focus their resources on the more interesting alerts. Our work has some similarities with [18] in that both approaches attempt to extract patterns from the alert stream using event IDs. A main difference is that our approach focuses on patterns across different event IDs, whereas [18] focuses on the trends for individual event IDs.
Bailey et al. [1] examined the weaknesses of existing antivirus products with respect to their consistency, completeness, and conciseness in classifying malware, and presented an automated malware classification scheme based on their behavior, such as file accesses and process creation, to address those issues. Our work shares some of the objectives of [1], but is based on correlating network IDS alerts using the factor analysis statistical technique.
Our approach is motivated by the observation that some bot profiles (i.e., the sets of IDS alerts generated for individual bot instances) have significant overlap, but are not necessarily identical. Moreover, some bot profiles exhibit distinguishing characteristics.
We hypothesized that different instances of a bot or different bot variants belonging to the same family may exhibit similar observable behavior from the network IDS viewpoint. Moreover, we hypothesized that different bot species or families have distinguishing network IDS alert profiles, as they may exploit different vulnerabilities, use different means to coordinate with command-and-control servers and other bots, and employ different techniques to propagate.
Factor analysis (and the closely related principal component analysis) are statistical techniques that may reduce a complex dataset containing measurements of a number of quantities to one with a lower dimension. These factors (or components) correspond to some commonalities among the quantities, and can sometimes reveal simplified, high-level structures that may not be observed directly.
A main objective of this study is to test our hypothesis through the use of a small set of factors that can be used to account for (most of) the variability of the observed IDS alerts. These factors, each corresponding to a set of alert IDs, may be used to succinctly characterize the malware instances. In other words, we can express the set of IDS alerts for each malware instance in terms of a small number of factor patterns.
To illustrate factor analysis, let us consider an example from [13]. In the example, we randomly selected a set of people from the population, and measured the sizes of different parts of their bodies, e.g., height, leg, waist, fingers. We would expect that many of the measurements would be correlated, and could be explained by an underlying common factor of body size, which is a high-level quantity not directly measured. Using the body size factor alone may not be sufficient to account for all (or most) of the variability of the measurements, and we may need additional factors such as lankiness. Factor analysis enables us to achieve economy of description (or dimensionality reduction), describing the measurements using a smaller number of components, without losing too much information.
In this paper, we gathered the sets of alert IDs, which are Snort rule identifiers in our experiment as described in Section 4, pertaining to malware instances. We converted the malware profiles to a matrix, with rows corresponding to malware instances, and columns to Snort rule identifiers. For example, a malware instance may trigger the following set of alerts:
alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"E2[rb] NETBIOS SMB-DS Session Setup NTMLSSP unicode asn1 overflow attempt"; ... sid:3003; rev:4;) alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"E2[rb] SHELLCODE x86 inc ebx NOOP"; ... sid:99998; rev:2;) alert ip $EXTERNAL_NET $SHELLCODE_PORTS -> $HOME_NET any (msg:"E2[rb] REGISTERED FREE SHELLCODE x86 inc ebx NOOP"; ... sid:1390; rev:5;) alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg: "E2[rb] BLEEDING-EDGE EXPLOIT MS04-007 Kill-Bill ASN1 exploit attempt"; ... sid:2001944; rev:3;) alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"E3[rb] BotHunter MALWARE executable upload"; ... sid:3000006; rev:99;) alert tcp $EXTERNAL_NET !20 -> $HOME_NET any (msg:"E3[rb] BLEEDING-EDGE Malware Windows executable sent from remote host"; ... sid:2001683; rev:3;) alert tcp $EXTERNAL_NET !20 -> $HOME_NET any (msg:"E3[rb] BotHunter Malware Windows executable (PE) sent from remote host"; ... sid:5001684; rev:3;)This malware profile is transformed to a row of the matrix with the entries in columns pertaining to SIDs 3003, 99998, 1390, 2001944, 3000006, 2001683, and 5001684 equal 1, and the rest of the entries equal 0.
Formally, factor analysis is a statistical technique that examines the covariance or correlation structure among a number of variables, and then extracts factors that explain most of the information in the matrix. Briefly, for a set of observed variables (which in our analysis correspond to counts of Snort SIDs for each bot infection instance), the common factors , and the unique factors , the variables may be expressed as linear functions of the factors
The factor coefficients are referred to as factor loadings, and these are generally scaled so that the vector of factor loadings has unit L2 norm. Factor analysis is frequently used as a dimensionality reduction technique, where the most significant factors are selected based on the fraction of the total variance explained. The loading on any individual variable can be positive or negative, and the absolute value is indicative of the importance of the underlying variable in the respective factors. Readers are referred to [13] and [11] for more information about the factor analysis technique.
The data records for the factor analysis consist of the counts of each of the SIDs per infection instance. We observed that in most cases, this was 1 or 0. We have performed factor analysis that considers rule incidence rather than count--in other words, the data record would be binary, with a 1 indicating that the corresponding SID was triggered in the infection instance. The result for this ``binary'' dataset is essentially the same as that of the original dataset.
Visual examination of the correlation matrix revealed interesting structure. First, we observed that some SID pairs have a correlation of 1.0, indicating redundancy and the potential for rule pruning for performance reasons. While it may be the case that perfectly correlated SIDs in our data are not perfectly correlated in general, we are limiting our analysis to likely successful infection instances as identified by dialog-based correlation. Thus, this observed perfect correlation appears to hold in some interesting cases. SID subsets for which we observed perfect correlation are {1390, 2001944, 3000006, 99998}, {1444, 3001441}, {2000046, 99906}, {2000047, 2001056}, and {2001184, 2002029}. For some of these subsets, potentially one could select the maximum coverage rule, and disable the others as redundant.1
We used Matlab to perform factor analysis [15]. The input of factor analysis is an data matrix--where is the number of measurements and is the number of variables--and generates a set of factors and the corresponding loadings and variance.
We used the latent root criterion2for selecting factors, which turned out to be five.
After applying the varimax factor rotation,
we obtained the factor loading table as shown
in Table 1.
The first column in the table corresponds to SIDs. Columns 2 through 6 represent the loadings for the selected factors, respectively. The last column corresponds to the communality (i.e., the fraction of the total variance captured by the selected factors). The larger the communality value is (i.e., closer to 1), the more variability of the corresponding variable (or SID) is accounted for using the common factors. On the other hand, a small communality value (i.e., closer to 0) means that the common factors cannot effectively explain the variability of the corresponding variable.
The results indicate that all observed variables have significant loadings on one or two factors. Moreover, the factor loadings often correspond to distinct patterns over the variables in question, which can be used to group the variables. Based on the factor loadings, we can form the following groups.
1390: alert ip $EXTERNAL_NET $SHELLCODE_PORTS -> $HOME_NET any (msg:"E2[rb] REGISTERED FREE SHELLCODE x86 inc ebx NOOP"; ... 2001944: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg: "E2[rb] BLEEDING-EDGE EXPLOIT MS04-007 Kill-Bill ASN1 exploit attempt"; ... 3000006: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"E3[rb] BotHunter MALWARE executable upload"; ... 3003: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg: "E2[rb] NETBIOS SMB-DS Session Setup NTMLSSP unicode asn1 overflow attempt"; ... 99998: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"E2[rb] SHELLCODE x86 inc ebx NOOP"; ...
2000032: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg: "E2[rb] BLEEDING-EDGE EXPLOIT LSA exploit"; ... 2000033: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg: "E2[rb] BLEEDING-EDGE EXPLOIT MS04011 Lsasrv.dll RPC exploit (WinXP)"; ... 2001569: alert tcp $HOME_NET any -> any 445 (msg: "E5[rb] BLEEDING-EDGE Behavioral Unusual Port 445 traffic, Potential Scan or Infection"; ... 2466: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"E2[rb] NETBIOS SMB-DS IPC$ unicode share access"; ... 3000000: alert tcp $EXTERNAL_NET any -> $HOME_NET 1030:1040 (msg: "E3[rb] BotHunter HTTP-based .exe Upload on backdoor port"; ... 3000003: alert tcp $HOME_NET 1028:1040 -> $EXTERNAL_NET any (msg: "E3[rb] BotHunter HTTP-based .exe Upload on backdoor port"; ... 99913: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"E2[rb] SHELLCODE x86 0x90 unicode NOOP"; ...
2000345: alert tcp $HOME_NET any -> $EXTERNAL_NET !6661:6668 (msg: "E4[rb] BLEEDING-EDGE ATTACK RESPONSE IRC - Nick change on non-std port"; ... 2001184: alert tcp $EXTERNAL_NET any -> $HOME_NET any (msg: "E4[rb] BLEEDING-EDGE WORM RXBOT / RBOT Vulnerability Scan"; ... 2002024: alert tcp $HOME_NET any -> $EXTERNAL_NET !25 (msg: "E4[rb] BLEEDING-EDGE TROJAN IRC NICK command"; ... 2002025: alert tcp $HOME_NET any -> $EXTERNAL_NET any (msg: "E4[rb] BLEEDING-EDGE TROJAN IRC JOIN command"; ... 2002028: alert tcp $HOME_NET any -> $EXTERNAL_NET any (msg: "E4[rb] BLEEDING-EDGE TROJAN IRC PONG response"; ... 2002029: alert tcp $EXTERNAL_NET any -> $HOME_NET any (msg: "E4[rb] BLEEDING-EDGE TROJAN BOT - channel topic scan/exploit command"; ...
2000047: alert tcp $EXTERNAL_NET any -> $HOME_NET 9996 (msg: "E3[rb] BLEEDING-EDGE VIRUS Sasser Transfer _up.exe"; ... 2001056: alert tcp $EXTERNAL_NET any -> $HOME_NET any (msg: "E2[rb] BLEEDING-EDGE VIRUS W32/Sasser.worm.b -NAI-)"; ...
2000046: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg: "E2[rb] BLEEDING-EDGE EXPLOIT MS04011 Lsasrv.dll RPC exploit (Win2k)"; ... 3000004: alert tcp $HOME_NET any -> $EXTERNAL_NET any (msg: "E3[rb] BotHunter Scrip-based Windows egg download .exe"; ... 99906: alert tcp $EXTERNAL_NET any -> $HOME_NET 445 (msg:"E2[rb] SHELLCODE x86 0x90 unicode NOOP"; ...
1444: alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] TFTP GET from external source"; ... 3001441: alert udp $HOME_NET any -> $EXTERNAL_NET 69 (msg:"E3[rb] TFTP GET .exe from external source"; ...
2001683: alert tcp $EXTERNAL_NET !20 -> $HOME_NET any (msg:"E3[rb] BLEEDING-EDGE Malware Windows executable sent from remote host"; ...
We used the groups to characterize malware instances based on the IDS alerts generated for them. For example, a bot profile (shown in Section 3) containing alerts with the SIDs 3330, 99998, 1390, 2001944, 3000006, and 2001683 may be characterized using Groups 1 and 7. In the next section, we will evaluate the grouping results obtained in this experiment.
The dataset has 106 malware instances. For each instance, we obtain the malware label specified by AntiVir (and by Symantec when AntiVir cannot provide a specific malware label3) and identify the corresponding alert patterns in terms of the above-mentioned groups.
Table 2 shows all distinct mappings between the malware labels and the groups identified by factor analysis for the malware instances. For example, the malware instance corresponding to the label ``Worm/Sasser.C'' triggered three out of seven rules in Group 2, both of two rules in Group 4, and one out of three rules in Group 5. The malware labels in the table were provided by AntiVir, except that those with a * prefix were provided by Symantec. The question marks correspond to malware instances that cannot be labeled by AntiVir and Symantec.
|
Our experimental results show that the malware instances in our dataset can be explained using a small number of factors. These factors may or may not have human-level semantics; they are computed mechanically to reflect which Snort IDs tend to co-occur.
Let us examine the Snort ID groupings induced by the factors to attempt to understand what they mean from the intrusion detection viewpoint. Group 1 consists of rules for detecting external to internal infection, and one for downloading malware. From the correlation matrix among the SIDs of Group 1, they all have very large (pairwise) correlation coefficients (of 0.9 or above), as we expected. Also, we found that two rules in Group 1 have almost identical detection coverage (SIDs 1390 and 99998), except that they differ in their source/destination port specification. Group 2 is the largest group, consisting of seven SIDs. They correspond to Snort rules for detecting several phases of bot infection: exploit, binary acquisition, and internal-to-external outbound infection scanning. Group 3 corresponds to rules for detecting command-and-control channel activities, e.g., IRC transactions. Group 4 contains two rules for detecting the Sasser worm. Moreover, Group 6 contains two rules for detecting TFTP GET activities, possibly for downloading malware from external sources. These two rules have virtually identical detection coverage. Thus, one of them may be removed, for performance purposes.
From Table 2, we observe that malware instances belonging to the same family generally share similar alert patterns, although there are exceptions. For example, variants of the Korgo family have similiar factor pattern characterization. Moreover, the Sasser family has a distinguishing pattern, which involves Group 4 alerts.
Examination of the rules in Groups 1 and 6 reveals that some IDS rules are virtually identical in detection coverage. Thus, some of them can be removed from the rulebase of the IDS to improve its runtime performance without reducing its detection capabilities. For most IDS deployments, the number of rules is generally quite large (e.g., in the order of 1000's) and it is resource intensive to manually find redundant rules. Our approach facilitates the identification of redundant IDS rules based on runtime behavior.
We note that the alert patterns for W32/Virut.AM and W32/Virut.AO appear to be quite different. We hypothesize that it may partly depend on the classification schemes used by the antivirus companies. For instance, the malware instances labeled as W32/Virut.AM by AntiVir were labeled as W32.Korgo.S by Symantec, and the factor pattern for these malware instances resembles those for the instances labeled as Korgo.
Another interesting experimental result pertains to malware instances that cannot be labeled by the existing antivirus products. As shown in Table 2, there are four such samples. Based on the similarities and differences between the factor patterns for the unlabeled bot instances and those for the labeled ones, one may infer new bot varieties or common lineage between the labeled ones and the unlabeled ones. For example, the factor pattern of the unlabeled sample involving Group 6 SIDs is quite different from those of the labeled ones, and could correspond to a new bot or to a major variant of the known bots. The factor group pattern corresponding to the fifth unlabeled malware instances is similar to those of Rbot.328262 and SdBo.100864.22, and thus they could be related malware species.
Not all of the experimental results are positive. We have noticed a few limitations or weaknesses of our approach:
This exploratory study indicates that the factor-analysis-based approach provides a promising means for behavioral characterization of bot infection instances. We were able to achieve such a characterization of 106 bot instances, harvested on a honeynet, which triggered 26 unique Snort signatures, as attributable to five factors. The factor loadings for the triggered signatures revealed seven patterns. These patterns, expressed in terms of a small number of high-level, mechanically derived quantities (or factors), can be used for malware characterization. Malware instances belonging to the same family generally have similiar factor patterns, and some malware instances correspond to distinguishing patterns (e.g., Sasser). We also identified a case in which rules can be safely removed from the IDS to improve performance with no loss in detection capability.
Based on the labels assigned by antivirus tools to the malware binaries studied in this paper, we observed that variants of a particular species of malware or malware families often exhibited similarities in the pattern of factor loadings. We also found some similarities between labeled malware and unlabeled malware, which could indicate an unknown common lineage or a new variant.
Because the analysis is achieved without using the source or destination address of the attack, it is suitable for use in a privacy-preserving repository such as the Cyber-TA repository.
We envision that a collaborative malware repository may collect and analyze the malware profiles contributed by various organizations using the techniques presented in this paper. Moreover, subscribers would be able to query the results in near real time. As such, they may be able to learn the characterization of a piece of malware according to its pattern of factor loadings, rather than waiting for a time- and analyst-intensive decompilation and labeling of new malware species. Potentially, this enables rapid deployment of defenses to counter the emerging malware.
Finally, we plan to refine our approach to address its weaknesses identified in Section 6, and investigate the usefulness of our approach for analyzing non-bot traffic and reports generated in various types of sensors.
This document was generated using the LaTeX2HTML translator Version 2008 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -show_section_numbers -local_icons -no_navigation Cheung_Valdes_LEET09.tex
The translation was initiated by Steven Cheung on 2009-03-26