Tarem Ahmed, Boris Oreshkin and Mark Coates
Department of Electrical and Computer Engineering
McGill University
Montreal, QC, Canada
A network anomaly is a sudden and short-lived deviation from the normal operation of the network. Some anomalies are deliberately caused by intruders with malicious intent such as a denial-of-service attack in an IP network, while others may be purely an accident such as an overpass falling in a busy road network. Quick detection is needed to initiate a timely response, such as deploying an ambulance after a road accident, or raising an alarm if a surveillance network detects an intruder.
Network monitoring devices collect data at high rates. Designing an effective anomaly detection system consequently involves extracting relevant information from a voluminous amount of noisy, high-dimensional data. It is also important to design distributed algorithms as networks operate under bandwidth and power constraints and communication costs must be minimised.
Different anomalies exhibit themselves in network statistics in different manners, so developing general models of normal network behaviour and of anomalies is difficult. Model-based algorithms are also not portable across applications, and even subtle changes in the nature of network traffic or the monitored physical phenomena can render the model inappropriate. Non-parametric, learning algorithms based on machine learning principles are therefore desirable as they can learn the nature of normal measurements and autonomously adapt to variations in the structure of ``normality''.
Most methods of network anomaly detection are based on network traffic models. Brutlag uses as an extension of the Holt-Winters forecasting algorithm, which supports incremental model updating via exponential smoothing [1]. Hajji uses a Gaussian mixture model, and develops an algorithm based on a stochastic approximation of the Expectation-Maximization (EM) algorithm to obtain estimates of the model parameters [2]. Yamanishi et al. also assume a hierarchical structure of Gaussian mixtures in developing the ``SmartSifter'' tool, but uses different algorithms for updating the model parameters [3]. They use a variant of the Laplace law in the discrete domain, and a modified version of the incremental EM algorithm in the continuous domain. They test their algorithm to detect network intrusion on the standard ACM KDD Cup 1999 dataset. Lakhina et al. apply Principal Component Analysis (PCA) to separate IP network data into disjoint ``normal'' and ``anomalous'' subspaces, and signal an anomaly when the magnitude of the projection onto the anomalous subspace exceeds a threshold [4,5,6]. Huang et al. build on Lakhina's centralised PCA method of anomaly detection from [6], and develop a framework where local PCA analysis and stochastic matrix perturbation theory is used to develop an adaptive, distributed protocol [7].
Researchers have recently begun to use machine learning techniques to detect outliers in datasets from a variety of fields. Gardener et al. use a One-Class Support Vector Machine (OCSVM) to detect anomalies in EEG data from epilepsy patients [8]. Barbará et al. have proposed an algorithm to detect outliers in noisy datasets where no information is available regarding ground truth, based on a Transductive Confidence Machine (TCM) [9]. Transduction is an alternative to induction, in that instead of using all the data points to induce a model, one is able to use a small subset of them to estimate unknown properties of test points. Ma and Perkins present an algorithm using support vector regression to perform online anomaly detection on timeseries data in [10]. Ihler et al. present an adaptive anomaly detection algorithm that is based on a Markov-modulated Poisson process model, and use Markov Chain Monte Carlo methods in a Bayesian approach to learn the model parameters [11].
An example of a machine learning approach to network anomaly detection is the time-based inductive learning machine (TIM) of Teng et al. [12]. Their algorithm constructs a set of rules based upon usage patterns. An anomaly is signalled when the premise of a rule occurs but the conclusion does not follow. Singliar and Hauskrecht use a support vector machine to detect anomalies in road traffic [13]. They use statistics collected by a sophisticated network of sensors including microwave loops and lasers, and design a detector for road traffic incidents.
Our objective in this paper is to show the applicability and need for learning algorithms in detecting anomalous behaviour in a distributed set of network measurements. From the wide variety of machine learning techniques available, we choose the One Class Neighbor Machine (OCNM) proposed by Muñoz and Moguerza in [14], and the recursive Kernel-based Online Anomaly Detection (KOAD) algorithm that we developed in [15]. We present our case via two examples: sequences of images from Transports Quebec's camera network, and IP timeseries data from the Abilene backbone network. We demonstrate that both algorithms are effective in detecting anomalies and motivate the development of more advanced, fully adaptive and fully distributed, learning algorithms.
The rest of this paper is organized as follows. Section II defines the problem we address. Section III describes the Transports Quebec and Abilene datasets and Section IV reviews the OCNM and KOAD algorithms. Section V presents our results and Section VI summarises our conclusions and discusses the need for distributed, learning algorithms for network anomaly detection.
The anomaly detection problem can be formulated as follows. A continuous stream of data points constitutes a collection of measurements governed by a probability distribution . Although measurements correspond to certain physical events in the event space , the mapping between them may not be known. We assume that can be divided into two subspaces corresponding to normal and anomalous physical conditions. In many practical situations it is of interest to infer the membership of an event in a particular subspace using the corresponding measurement. As the probability distribution governing measurements is unknown, some mechanism should facilitate learning its volumetric representation from the collection . A general approach to the aforementioned problem of learning such a representation consists of constructing a Minimum Volume Set (MVS) with probability mass with respect to distribution for a volume measure [16]:
Real multidimensional data exhibit distributions which are highly sparse. Moreover, distributions of raw data may lack invariance with respect to generating events. That is, physical events pertaining to the same region in may generate measurements in completely different regions of . Therefore, it is often desirable to reduce the dimensionality of raw data via some feature extraction mechanism where , that is robust to sparsity and variance induced by the transition . We then construct a minimum volume set from the features and not from the raw data.
We use two different datasets to advocate the applicability of machine learning algorithms to network anomaly detection.
Transports Quebec maintains a set of webcams over its major roads [18]. These cameras record still images every five minutes. We collected images recorded by cameras over a period of four days (Sep. 30 to Oct. 03, 2006) on Quebec's Autoroute 20. Each 5-minute interval constitutes a timestep.
Anomaly detection in a sequence of images relies mainly on the extraction of appropriate information from the sequence. There are two fundamental reasons for this. First, the large dimensionality inherent to image processing leads to dramatic increase in implementation costs. Second, large variation in operating conditions such as brightness and contrast (which are subject to the time of day and weather conditions) and colour content in the images (which is subject to season), can cause abrupt and undesirable changes in the raw data.
We decided to use the discrete wavelet transform (DWT) to process the images. The DWT is known for its ability to extract spatially localised frequency information. We perform the two-dimensional DWT on every image and average the energy of transformation coefficients within each subband to achieve approximate shift invariance of the feature extractor. We expect that the appearance of a novel image in the sequence will manifest itself as a sudden change in the power in the frequency content of the vector of subband intensities. At each timestep, we construct a wavelet feature vector from each image obtained by each camera node.
The Abilene backbone network in the US contains core routers. A backbone flow consists of packets entering the Abilene network at one particular core router and exiting at another. The data constitute a timeseries of the entropies of the main packet header fields (source IP address, destination IP address, source port number and destination port number) in each of backbone flows pertaining to each timestep. The entropy for each backbone flow, at each timestep, for each header field, is computed after constructing an empirical histogram of the relevant header field distribution for that backbone flow during that timestep. The four component entropies are finally concatenated to obtain an entropy timeseries of the backbone flows. Physical anomalous phenomena cause changes in the distributions of packets belonging to the responsible backbone flow, and Lakhina et al. showed in [6] that these changes are best captured by changes in the entropies. The duration of a timestep is again five minutes, and the length of the Abilene timeseries is timesteps pertaining to one week (Dec. 15 to 21, 2003).
The OCNM algorithm proposed by Muñoz and Moguerza provides an elegant means for estimating minimum volume sets [14]. It assumes a sample set comprising , -dimensional data points, . The algorithm requires the choice of a sparsity measure, denoted by . Example choices of a sparsity measure are the -th nearest neighbour Euclidean distance and the average of the first nearest-neighbour distances. The OCNM algorithm sorts the values of the measure for the set of points , and subsequently identifies those points that lie inside the minimum volume set (MVS) with the smallest sparsity measure , up to a specified fraction of the number of points in .
If the -th nearest neighbour distance function is used as the sparsity measure, the OCNM algorithm involves calculating the distance from every point to every other point in the sample set. As each point is -dimensional and there are timesteps, the complexity is .
[Camera 5] [Camera 6] |
Consider a set of multivariate measurements . In an appropriately chosen feature space with an associated kernel function , the features corresponding to the normal traffic measurements should cluster. Then it should be possible to describe the region of normality using a relatively small dictionary of approximately linearly independent elements [19]. Here represent those that are entered into the dictionary and we expect the size of the dictionary ( ) to be much less than , leading to computational and storage savings. Feature vector is said to be approximately linearly dependent on with approximation threshold , if the projection error satisfies the following criterion:
The Kernel-based Online Anomaly Detection (KOAD) algorithm operates at each timestep on a measurement vector . It begins by evaluating the error in projecting the arriving onto the current dictionary (in the feature domain). This error measure is then compared with two thresholds and , where . If , we infer that is sufficiently linearly dependent on the dictionary, and represents normal traffic. If , we conclude that is far away from the realm of normal behaviour, and immediately raise a ``Red1'' alarm to signal an anomaly.
If , we infer that is sufficiently linearly independent from the dictionary to be considered an unusual event. It may indeed be an anomaly, or it may represent an expansion or migration of the space of normality. In this case, we do the following: raise an ``Orange'' alarm, keep track of the contribution of the relevant input vector in explaining subsequent arrivals for timesteps, and then make a firm decision on it.
At timestep , we re-evaluate the error in projecting onto dictionary corresponding to timestep . Note that the dictionary may have changed between timesteps and , and the value of at this re-evaluation may consequently be different from the at timestep . If the value of after the re-evaluation is found to be less than , we lower the orange alarm and keep the dictionary unchanged.
If the value of is found instead to be greater than after the re-evaluation at timestep , we perform a secondary ``usefulness'' test to resolve the orange alarm. The usefulness of is assessed by observing the kernel values of with , . If a kernel value is high (greater than a threshold ), then is deemed close to . If a significant number of the kernel values are high, then cannot be considered anomalous; normal traffic has just migrated into a new portion of the feature space and should be entered into the dictionary. Contrarily if almost all kernel values are small, then is a reasonably isolated event, and should be heralded as an anomaly. We evaluate:
The KOAD algorithm also deletes obsolete elements from the dictionary as the region of normality expands or migrates, thereby maintaining a small dictionary. In addition, it incorporates exponential forgetting so that the impact of past observations is gradually reduced.
Assuming a dictionary containing elements, the computational complexity of the KOAD algorithm is for every standard timestep, and on the rare occasions when an element removal occurs. The KOAD complexity is thus independent of time, making the algorithm naturally suited to online applications. Our experiments have shown that high sparsity levels are achieved in practice, and the dictionary size does not grow indefinitely. See [15] for details regarding the KOAD algorithm.
We propose two monitoring architectures: a distributed approach and a centralised approach. In the distributed architecture, the detection algorithms are locally run at each node. After each timestep, each node makes a local decision regarding the presence or absence of an anomaly at that timestep, and transmits a binary result to the Central Monitoring Unit (CMU). The CMU then makes a decision on the location of an anomaly in time and space, if at least of the nodes individually signalled an anomaly. The idea behind this -out-of- detection scheme is that in many applications, such as in a road network, bona fide anomalies such as an untimely traffic congestion are simultaneously evident to multiple nodes. Individual flags are likely to be caused by comparably less important and independent events such as a single camera malfunctioning.
In the centralised architecture, all measurements are communicated to the Central Monitoring Unit. The CMU then runs the detection algorithm. The centralised approach is often desirable and necessary to detect anomalies that exhibit themselves through distributed changes in the global measurement vector. The works of Lakhina et al. have shown that traffic volume distributions in large backbone IP networks exhibit a covariance structure, and detecting a break in this structure enables one to unearth a wide range of anomalies [4,5,6].
In this section, we study the effectiveness of OCNM and KOAD in detecting unusual events in Quebec's road network. We apply OCNM and KOAD to image sequences from Transports Quebec webcam network. This is an example application of the distributed monitoring architecture described in Section IV-C. Our data consists of a series of timesteps corresponding to daylight hours during the Sep. 30 to Oct. 03, 2006 period. We use the averaged energy of DWT coefficients (see Section III-.1) from subbands from each of the cameras (nodes). In our application with cameras, we used as the central decision rule.
We present illustrative images from the -camera network corresponding to a traffic congestion in Fig. 1. Given the normal flow of traffic during the length of our dataset, short periods of congestion constitute an example of a road network anomaly. This timestep was flagged in all nodes by both OCNM and KOAD as anomalous, for most representative algorithm parameter settings.
Fig. 2(top panel) shows the results of wavelet analysis of the image sequence from one camera. We selected one of the six cameras for preliminary assessment of feature extraction quality. It can be seen that the high-frequency components of the feature vector show the expected variation in the vicinity of traffic jams. Abrupt changes to the position of a camera generate sudden spikes in the feature vector components. Fig. 2 (middle panel) shows the distance measures obtained using the OCNM algorithm with set to and using to signal the outliers. Fig. 2(bottom panel) shows the variations in KOAD projection error . We ran the KOAD algorithm here with the thresholds and , a Gaussian kernel having standard deviation , and default settings for all other parameters (see [15]) which includes the orange alarm being resolved after timesteps (i.e. ). We begin our analysis of Fig. 2 at with the previous timesteps constituting the training period in this application.
Fig. 3 presents the receiver operating characteristics (ROC) curves showing the variation in the probability of detection ( ) with the probability of false alarms ( ), for the OCNM and KOAD algorithms applied to the Transports Quebec dataset. In our experiments, we used OCNM with the nearest-neighbour parameter set to , and varying from to to signal between and outliers. We ran KOAD with the thresholds set to , and using a Gaussian kernel where the standard deviation of the kernel function is varied between and . The other KOAD parameters are retained at their default values (see [15]), with an orange alarm resolved after timesteps (i.e. ).
Although our experiments were performed on a limited data set, this result provides a preliminary assessment of the anomaly detection algorithms based on wavelet feature extraction mechanism and machine learning data clustering approaches. It can be clearly seen from Fig. 3 that the KOAD detector outperforms the OCNM detector.
In this subsection we present the results of applying OCNM and KOAD to the Abilene dataset. Here we want to also detect those anomalies that cause sudden changes in the overall distribution of traffic over the network, as opposed to affecting a single link, during a particular timestep. Thus in this application we implement the centralised architecture proposed in Section IV-C. For discussions on the wide range of anomalies seen in IP networks, refer to the works of Lakhina et al. [4,5,6]. Here we also compare our results with those obtained by Lakhina et al. using the PCA subspace method of anomaly detection.
Fig. 4(a) shows the variations in obtained using the KOAD algorithm with , a Gaussian kernel of standard deviation , and default settings for all other parameters (see [15]). We start our analysis at , with the previous timesteps disregarded as constitute the training period in this application. Fig. 4(b) shows the magnitude of the energy in the residual components using the PCA subspace method of anomaly detection [6]. We used principle components for the normal subspace, in accordance with [6]. Fig. 4(c) shows the distance measures obtained using OCNM with , together with the threshold indicating the minimum volume set. The spike positions in Figs. 4(a-c) indicate the anomalies signalled by KOAD, PCA and OCNM, respectively. Fig. 4(d) isolates for comparison the positions of the anomalies detected by each individual algorithm.
It is evident from Fig. 4(c) that the OCNM -th nearest neighbour distance metric experiences an upward trend during the one-week period. This phenomenon was observed for values of that ranged from to . Although the positions of the spikes (with respect to the immediate surrounding timesteps) in Fig. 4(c) largely correspond with those in Fig. 4(a-b), we see that most of the outliers signalled by OCNM lie in the latter part of the plot.
The increasing distance metric suggests that the space of normal traffic expands over the recorded time period. The KOAD algorithm is best able to detect anomalies in such a situation, as the dictionary in this algorithm is dynamic, with obsolete elements being removed and new, more relevant elements added as necessary. Indeed, we noticed in our experiments with this particular dataset that the dictionary members change significantly over the reported period.
Fig. 4(c) also argues the need for a sequential or block-based version of OCNM where outliers may be incrementally reported after every timestep or block of timesteps. When we ran OCNM on the first data points only, it flagged the same anomalies as KOAD.
Our preliminary results of the application of machine learning techniques to network anomaly detection indicate their potential and highlight the areas where improvement is required. The non-stationary nature of the network measurements, be they network traffic metrics or recordings of physical phenomena, makes it imperative that the algorithms be able to adapt over time. To make the algorithms portable to different applications and robust to diverse operating environments, all parameters must be learned and autonomously set from arriving data. The algorithms must be capable of running in real-time despite being presented with large volumes of high-dimensional, noisy, distributed data. This means that the methods must perform sequential (recursive) calculations with the complexity at each timestep being independent of time. The computations must be distributed amongst the nodes in the network, and communication, which consumes network resources and introduces undesirable latency in detection, must be minimised.
The KOAD algorithm satisfies some of these requirements, being recursive in nature and adapting to changing data streams. However, in its current form it does not learn appropriate parameter settings, exhibits sensitivity to threshold choices, and has no natural distributed form. The distributed architecture we have described in this paper is a sub-optimal system. Our current and future research focuses on rectifying these deficiencies in the KOAD approach and exploring other promising learning-based alternatives to the network anomaly detection challenge.
The authors thank Sergio Restrepo for processing the Transports Quebec data and Anukool Lakhina for providing the Abilene dataset. This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and industrial and government partners, through the Agile All-Photonic Networks (AAPN) research network. The research was also supported by Mathematics of Information Technology and Complex Systems (MITACS) under the NSERC Network of Centres of Excellence (NCE) program.