Anomaly detection is an important first step in responding to unexpected problems and in assuring high performance and security in IP networks. We introduce a framework and a powerful class of algorithms for network anomography, the problem of inferring network-level anomalies from widely available data aggregates. The framework contains novel algorithms, as well as a recently published approach based on Principal Component Analysis (PCA). Moreover, owing to its clear separation of inference and anomaly detection, the framework opens the door to the creation of whole families of new algorithms. We introduce several such algorithms here, based on ARIMA modeling, the Fourier transform, wavelets, and Principal Component Analysis. We also introduce a new dynamic anomography algorithm, which tracks routing and traffic change, so as to alert with high fidelity on intrinsic changes in network-level traffic, yet not on internal routing changes. An additional benefit of dynamic anomography is that it is robust to missing data, an important operational reality. To the best of our knowledge, this is the first anomography algorithm that can handle routing changes and missing data. To evaluate these algorithms, we used several months of traffic data collected from the Abilene network and from a large Tier-1 ISP network. To compare performance, we use the methodology put forward earlier for the Abilene data set. The findings are encouraging: the new algorithms introduced here achieve high detection accuracy (few false negatives and few false positives) and high robustness (little performance degradation in the presence of measurement noise, missing data and routing changes).
The first step in fixing a problem is knowing it exists. This is no less true in networking than anywhere else - we need to know about a problem before we can repair it. Networking vendors typically build alarms into network equipment to facilitate fast, accurate detection and diagnosis of problems. However, in practice, there are many problems for which explicit alarms are either absent (for new or uncommon problems), or intrinsically hard to produce. In these cases we must infer the problem from other data sources. For instance, many types of network problems cause abnormal patterns to appear in the network traffic. Such traffic anomalies may be caused by problems ranging from security threats such as Distributed Denial of Service (DDoS) attacks and network worms, to unusual traffic events such as flash crowds, to vendor implementation bugs, to network misconfigurations. We refer to the problem of inferring anomalies from indirect measurement as network anomography (combining ``anomalous'' with ``tomography,'' a general approach to such inference problems).
Network tomography [31] bears some resemblance to anomography, in that both involve the solution of a linear inverse problem. Examples include inference of individual link performance characteristics from path performance characteristics, and inference of traffic matrices from individual link load measurements. For example, the traffic matrix estimation problem arises because the obvious source of data for direct measurement (flow-level data) can be hard to obtain network-wide [5,14,22,23,28,31,34,35]. On the other hand, Simple Network Management Protocol (SNMP) data on individual link loads is available almost ubiquitously. Fortunately, the link loads and traffic matrices are simply related by a linear equation of the form $b = Ax$, where $b$ is the vector of link loads, $A$ the routing matrix, and $x$ the traffic matrix written as a vector.
The anomography problem is different and somewhat more complex. First, note that anomaly detection is performed on a series of measurements over a period of time, rather than from a single snapshot. In addition to changes in the traffic, the solution must build in the ability to deal with changes in routing. Second, note that the anomalies that we wish to infer may have dramatically different properties from a traffic matrix, and so different methods than those used for network tomography may be called for. Indeed, we find that simple extensions to network tomography methods perform fairly poorly here. Techniques that transform the measurements prior to attempting to solve the inverse problem are preferable.
As a simple example, imagine trying to detect an anomalous traffic pattern caused by a flash crowd or DDoS attack on a web site. This type of event will cause increases in traffic flows headed towards a particular set of destinations. It may be hard to rapidly identify which of the tens of thousands of ingress links on a large network might be primarily responsible, as large surges at a network egress link may arise from small surges on several ingress links. We must infer the change in the pattern of traffic to the particular site from the complete set of link data, considered together, rather than as individual time series. This illustrates an important feature of anomography - it extends anomaly detection to network-level problems (automatically building in correlation across the network) where link-level anomaly detection might be inadequate or unreliable.
Many approaches to anomography are possible. In pioneering work, Lakhina et al. introduced a novel approach based on Principal Component Analysis (PCA) [19]. Our paper makes three major contributions to understanding and solving anomography problems: a general framework that cleanly separates the extraction of anomalous traffic from the inference step and thereby accommodates whole families of algorithms; a dynamic anomography algorithm that handles routing changes and missing data; and an extensive evaluation of these algorithms on months of data from the Abilene network and a large Tier-1 ISP network.
The paper is organized as follows. Section 2 summarizes background and related work. In Section 3 we describe our framework, and the anomography algorithms examined in this paper, in the context of fixed routing. In Section 4 we extend the Box-Jenkins anomography to the case where routing may change over time. In Section 5 we describe our evaluation methodology, and Section 6 presents the results. Section 7 provides final remarks.
Network tomography describes several problems: inferring link performance of a network from end-to-end measurements, or inferring Origin-Destination (OD) traffic demands from link load measurements. These problems can be written as linear inverse problems in which one seeks the unknowns $x$ from measurements $b$ given the linear relationship $b = Ax$ (1), where $A$ is the routing matrix. For a network with $m$ links and $n$ OD flows, we define the routing matrix to be the $m \times n$ matrix $A$ in which $A_{ij}$ indicates the fraction of traffic from flow $j$ that appears on link $i$.
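To make the relationship $b = Ax$ concrete, the following minimal sketch (in Python, using a made-up three-node topology; all names and values are illustrative and not taken from the paper) builds a routing matrix and the link loads it implies:

```python
import numpy as np

# Hypothetical 3-node network with links 0->1 and 1->2, and OD flows
# (0->1), (0->2), (1->2); flow (0->2) traverses both links.
# Rows = links, columns = OD flows; entries are the fraction of a flow
# carried on a link (0 or 1 here, i.e., single-path routing).
A = np.array([[1.0, 1.0, 0.0],   # link 0->1 carries flows (0->1) and (0->2)
              [0.0, 1.0, 1.0]])  # link 1->2 carries flows (0->2) and (1->2)

x = np.array([5.0, 2.0, 3.0])    # OD flow volumes (unknown in practice)
b = A @ x                        # observed link loads: [7.0, 5.0]
print(b)
```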
SNMP provides link measurements of traffic volumes (bytes and packets), typically at 5 minute intervals (this data is described in more detail in, for example [34]). We shall assume data of this type is the input to our algorithms, and we wish to infer anomalous traffic matrix elements, but note that anomography is not limited to this problem, and could equally be applied to inferring anomalous link performance from end-to-end measurements. An additional source of data used here comes from the routing protocols used to build the forwarding tables within each router. We use routing data (e.g., gathered from a route monitor as in [27]) along with a route simulator (as in [10]) to predict the results of these distributed computations, and determine the network routing.
The problem of inferring the OD traffic matrix has been much studied recently (e.g., [5,14,22,23,28,31,34,35]). The problem's key characteristic is that it is massively underconstrained: for a network of $N$ nodes there are roughly $N^2$ OD flows to estimate but only $O(N)$ link measurements. Hence tomography methods seek to introduce additional information, often in the form of some kind of traffic model (for instance a Poisson model in [31,28], a Gaussian model in [5], or a gravity model in [34,35]). Anomography problems are also highly underconstrained, but the models used to describe traffic are inappropriate for anomalies -- by definition these events are generated by completely different processes from normal network traffic. Moreover, in anomography we combine detection with inference, whereas in standard network tomography we seek only to infer a set of traffic matrix elements. Hence there are important differences between this paper and network tomography.
It is also important to note that routing matrices change over time. In much previous work, routing matrices are taken to be constant (an exception being [23], where the traffic is assumed to be somewhat constant, while the routing varies), but it is important (see [29]) to allow for the fact that routing is not constant, and neither is the traffic. In order to allow for variable routing, we index not just the traffic measurements over time, but also the routing matrix. Given these, we may write the relationship between the link traffic and the OD traffic matrix at time interval $j$ as $b_j = A_j x_j$ (2), where $b_j$ is the vector of link loads, $A_j$ the routing matrix, and $x_j$ the OD traffic matrix written as a column vector.
Lakhina et al. carried out the pioneering work in the area of inference of anomalies at network level, [19,18,20], and adapted Principal Components Analysis (PCA) to this setting. Donoho [8,9] introduced a powerful mathematical treatment for tomography-like problems, wherein one seeks solutions that maximize sparsity (intuitively, solutions with fewest explanations). These papers inspired our development of the new methods introduced here, and our development of a framework in which a very wide class of methods all fit.
Anomaly detection is a burgeoning field. A great deal of research in network anomaly detection relies on some type of inference step, taking a set of alarms [13,15,16,25,30] as input. While anomography includes methods of this type, our results indicate that it is better to delay alarm generation until after the inference step. In that way, a single constructive alarm may be generated, rather than a storm of redundant alarms. Moreover, in delaying the alarm generation until after the inference step, we can in some cases greatly improve the sensitivity of detection, as was demonstrated in [19].
We approach the network anomaly detection problem from the standpoint of detecting unknown anomalous behavior, rather than looking for particular signatures in the data, the focus of much work in the security community. A large component of the work on machine learning, signal processing and time-series analysis is devoted to detecting outliers or anomalies in time series. This literature has been applied to networks in a number of cases; for examples see [1,4,15,17,30,32]. These methods range in sophistication from [4], which suggests the use of the standard Holt-Winters forecasting technique for network anomaly detection, to [1], which uses a sophisticated wavelet-based method with great potential. These methods focus on single time series rather than the multi-dimensional time series that arise in anomography.
Most earlier work ignores noise or provides weak tests of robustness to noise (which can destroy utility). A strength of the work presented here is that we provide tests of effectiveness of the methods in the presence of noise, always a factor in practice.
In this section, we shall assume that the routing matrices are time-invariant and are denoted by $A$. (We will extend our work to time-varying routing in Section 4.) Under this assumption, we can combine all the linear systems (2) into a single equation using matrix notation: $B = AX$ (3), where $B = [b_1 \, b_2 \cdots b_t]$ is the matrix of link loads and $X = [x_1 \, x_2 \cdots x_t]$ the matrix of OD flows over $t$ time intervals.
We identify two basic solution strategies to network anomography: (i) early inverse, and (ii) late inverse. Early-inverse approaches may appear more intuitive. The early-inverse approach tackles the problem in two steps. The first is the network tomography step, where the OD flow data $x_j$ at each interval are inferred from the link load measurements $b_j$ by solving the ill-posed linear inverse problem (2). Given the estimated OD flow data $\hat{x}_j$ at different time points $j$, in the second step anomaly detection can then be applied to the $\hat{x}_j$. For this step, there are many widely used spatial and temporal analysis techniques, which we will describe later in this section.
Early-inverse methods, although conceptually simple, have an obvious drawback -- errors in the first step, which are unavoidable due to the ill-posed nature of the inference problem, can contaminate the second step, sabotaging overall performance. Another disadvantage is that early-inverse methods apply a potentially computationally expensive anomaly detection step to high-dimensional data: on a network of $N$ nodes, one must perform this step on all $O(N^2)$ OD pairs. As we will see, late-inverse performs anomaly detection on only $O(N)$-dimensional data. We focus on late-inverse methods in this paper for these reasons, though we shall provide some comparisons between early- and late-inverse methods.
The idea of the late-inverse method is to defer ``lossy'' inference to the last step. Specifically, late-inverse approaches extract the anomalous link traffic $\tilde{B}$ from the link load observations $B$, then form and solve a new set of inference problems $\tilde{B} = A\tilde{X}$ (4), in which the anomalous OD flows $\tilde{X}$ are the unknowns.
While the new inference problems (4) share the same linear-inverse structure as in network tomography (3), the characteristics of the unknowns are very different, and so is the solution strategy, which we will explore in Section 3.4.
We now introduce a simple framework for late-inverse anomography methods. In this framework, the anomalous link traffic $\tilde{B}$ is formed by multiplying the link load matrix $B$ with a transformation matrix $T$. Depending on whether we use a left- or right-multiplying transformation matrix, we can further divide the framework into two classes: spatial anomography, which uses a left-multiplying transformation ($\tilde{B} = TB$) to combine measurements across links, and temporal anomography, which uses a right-multiplying transformation ($\tilde{B} = BT$) to combine measurements across time.
Our framework encompasses a number of analysis techniques for extracting anomalous traffic from the link load observations $B$, as we next illustrate.
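As a rough illustration of the two classes (a sketch only; the matrices below are placeholders, and the simple differencing filter is just one possible right-multiplying transform):

```python
import numpy as np

rng = np.random.default_rng(0)
m, t = 4, 100
B = rng.random((m, t))           # link loads: m links over t intervals

# Spatial (left-multiplying) transform: an m x m matrix combining
# different links at the same time, e.g., a PCA subspace projection.
T_left = np.eye(m)               # placeholder projection
B_tilde_spatial = T_left @ B

# Temporal (right-multiplying) transform: a t x t matrix combining
# different times on the same link, e.g., simple differencing.
T_right = np.eye(t) - np.diag(np.ones(t - 1), k=1)
B_tilde_temporal = B @ T_right   # column j is b_j - b_{j-1}
```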
Recall that $B$ is the $m \times t$ matrix collecting the link traffic data on $m$ links over $t$ time intervals, where each row ($1 \le i \le m$) is the time series of the $i$-th link and each column ($1 \le j \le t$) represents an instance of all the link loads at time interval $j$. The principal components can be computed iteratively, each new axis capturing the maximum variance remaining after the contribution of the previous axes has been removed.
We collect into a matrix the principal axes belonging to the anomalous subspace, beginning with the first axis that fails to pass the threshold test.
We call the above method spatial PCA because it exploits the correlation between traffic on different links (across space). Later in Section 3.3.4, we will describe temporal PCA, which exploits temporal correlation by applying PCA to identify dominant patterns across time.
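The following sketch illustrates spatial PCA anomography under simplifying assumptions: the normal subspace is taken to be a fixed number of leading axes (the paper instead uses a threshold test), and the function name and parameters are ours, not the paper's.

```python
import numpy as np

def spatial_pca_anomalies(B, k_normal):
    """Project link loads onto the anomalous subspace.
    B is m x t (links x time); k_normal is the number of leading
    principal axes treated as the 'normal' subspace."""
    Bc = B - B.mean(axis=1, keepdims=True)            # center each link's series
    U, _, _ = np.linalg.svd(Bc, full_matrices=False)  # principal axes (columns of U)
    P = U[:, :k_normal]                               # normal-subspace axes
    T_left = np.eye(B.shape[0]) - P @ P.T             # left-multiplying transform
    return T_left @ Bc                                # anomalous link traffic
```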
Univariate time series. The Box-Jenkins methodology, or AutoRegressive Integrated Moving Average (ARIMA) modeling technique [2,3], is a class of linear time-series forecasting techniques that capture the linear dependency of the future values on the past. It is able to model a wide spectrum of time-series behavior, and has been extensively used for anomaly detection in univariate time-series.
An ARIMA model includes three order parameters: the autoregressive order ($p$), the number of differencing passes ($d$), and the moving-average order ($q$). In the notation introduced by Box and Jenkins, models are summarized as ARIMA$(p,d,q)$. For example, a model described as ARIMA$(0,1,q)$ contains no (zero) autoregressive parameters and $q$ moving-average parameters, computed for the time series after it has been differenced once ($d = 1$).
A general ARIMA model of order $(p,d,q)$ can be expressed as $z_t = \sum_{i=1}^{p} \phi_i z_{t-i} + e_t - \sum_{j=1}^{q} \theta_j e_{t-j}$, where $z_t$ is the series obtained by differencing the original series $d$ times, $e_t$ is the forecast error at time $t$, and the $\phi_i$ and $\theta_j$ are the autoregressive and moving-average coefficients.
Many commonly used smoothing models are special instances of ARIMA models. For example, the Exponentially Weighted Moving Average (EWMA) is equivalent to ARIMA$(0,1,1)$; linear exponential smoothing, also known as Holt-Winters, is equivalent to ARIMA$(0,2,2)$. See [26] for detailed equations for various smoothing models and their equivalence with ARIMA models.
There are well known techniques for estimating the parameters $\phi_i$ and $\theta_j$ for a given time series [2,3], and given the parameters, the model is simply applied to obtain a prediction $\hat{b}_t$ of $b_t$ (using for instance the Durbin-Levinson algorithm [3]). The prediction errors $\tilde{b}_t = b_t - \hat{b}_t$ then form our anomalous traffic (the traffic which does not fit the model). In practice the parameters used in the ARIMA model are sometimes chosen to meet particular goals intended by the implementor (see [4] for some discussion of these choices), rather than being estimated from the data set, because the parameters of a data set may change over time. However, we prefer to use adaptive techniques to overcome this problem.
If we consider the time series to be vectors of length $t$, then the above results can be written in matrix form. Taking the measurements $B$, we can obtain the errors $\tilde{B}$ via right-multiplication by a transformation matrix $T$. Specifically, let $I$ denote the $t \times t$ identity matrix, let $S$ denote the ``back shift'' matrix, whose only non-zero entries are ones immediately above the diagonal ($S_{j-1,j} = 1$), so that the $j$-th column of $BS$ is $b_{j-1}$, and let $\mathbf{1}$ denote the unit matrix (all entries equal to one).
The result of differencing $d$ times is then $\tilde{B} = B(I - S)^d$; in particular, simple differencing ($d = 1$) gives $\tilde{B} = B(I - S)$, whose $j$-th column is $b_j - b_{j-1}$.
Extending ARIMA based models to multivariate time series is straightforward. As noted earlier, we construct the matrix $B$ with the measurements $b_j$ at each time interval as its columns. Applying the above transformations to every row (that is, to each link's time series simultaneously), we obtain the matrix of forecast errors $\tilde{B} = BT$.
ARIMA based anomography. Substituting the matrix form of (6), we see that the computation of the forecast errors is indeed a transformation given by right-multiplying $B$ with a matrix $T$. In fact, any linear filtering of the elements of a time series can be modeled by a right-multiplying matrix transformation.
To get back to anomaly detection, we simply identify the forecast errors as the anomalous link traffic $\tilde{B}$. That is, traffic behavior that cannot be well captured by the model is considered anomalous.
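As a concrete (and deliberately simple) instance of this family, the sketch below applies an EWMA forecaster, i.e., ARIMA(0,1,1), independently to each link's time series and treats the forecast errors as the anomalous link traffic. The smoothing weight alpha and the function name are illustrative choices, not values from the paper.

```python
import numpy as np

def ewma_anomalies(B, alpha=0.5):
    """Forecast errors of an EWMA (ARIMA(0,1,1)) model, applied to each
    row of the m x t link load matrix B; returns B_tilde of the same shape."""
    m, t = B.shape
    B_tilde = np.zeros_like(B)
    forecast = B[:, 0].copy()                    # initialize with the first sample
    for j in range(1, t):
        B_tilde[:, j] = B[:, j] - forecast       # anomalous link traffic at time j
        forecast = alpha * B[:, j] + (1 - alpha) * forecast  # EWMA update
    return B_tilde
```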
For a discrete-time signal $x_0, x_1, \ldots, x_{t-1}$, the Discrete Fourier Transform (DFT) is defined by $f_n = \sum_{k=0}^{t-1} x_k e^{-2\pi i k n / t}$, for $n = 0, 1, \ldots, t-1$.
The Inverse Discrete Fourier Transform (IDFT) is used to reconstruct the signal in the time domain by $x_k = \frac{1}{t} \sum_{n=0}^{t-1} f_n e^{2\pi i k n / t}$, for $k = 0, 1, \ldots, t-1$.
An efficient way to implement the DFT and IDFT is through an algorithm called the Fast Fourier Transform (FFT). The computational complexity of the FFT is $O(t \log t)$.
FFT based anomography. The idea of using the FFT to extract the anomalous link traffic $\tilde{B}$ is to filter out the low-frequency components in the link traffic time series. In general, low-frequency components capture the daily and weekly traffic patterns, while high-frequency components represent the sudden changes in traffic behavior. Working in the frequency domain provides us with the opportunity to distinguish these two kinds of behaviors.
We summarize FFT based anomography as follows: apply the DFT to each link's time series, set the frequency components below a chosen cut-off frequency to zero, and apply the IDFT; the (real part of the) result is taken as the anomalous link traffic $\tilde{B}$.
The DFT and IDFT may be represented as right-matrix products. In setting columns of the frequency-domain matrix $F$ to zero and performing the IDFT, we are taking a linear combination of the columns of $F$, which in turn are linear combinations of the columns of $B$. Hence, the overall process above can be modeled as a right-multiplying matrix transformation $\tilde{B} = BT$. Note also that in thresholding at the cut-off frequency we preserve the conjugate symmetry of $F$, and so although $T$ may contain complex elements, the resulting transform $\tilde{B}$ will be real.
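A minimal sketch of FFT based anomography follows. The cut-off (the number of low-frequency bins removed) is an illustrative parameter; the zeroing is done symmetrically so the inverse transform is real, as discussed above.

```python
import numpy as np

def fft_anomalies(B, low_freq_cutoff=10):
    """Remove low-frequency (daily/weekly) components from each link's
    time series and keep the remainder as anomalous traffic."""
    F = np.fft.fft(B, axis=1)                    # DFT of each row over time
    t = B.shape[1]
    F[:, :low_freq_cutoff] = 0                   # zero low frequencies ...
    F[:, t - low_freq_cutoff + 1:] = 0           # ... and their conjugates
    return np.real(np.fft.ifft(F, axis=1))       # anomalous link traffic
```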
In [1], Barford et al. have developed a wavelet-based algorithm for detecting anomalies in the link traffic data. It shares the same principle as the FFT based approaches -- exposing anomalies by filtering low frequency components. More specifically, it uses wavelets to decompose the original signal into low-, mid-, and high-frequency components and then detects anomalies by close examination of the mid- and high-frequency components.
Below we compute $\tilde{B}$ as the high-frequency components of the link traffic $B$. We can also compute the mid-frequency components of $B$ in essentially the same way.
It is easy to verify that the process of WAVEDEC and WAVEREC only involves linear combinations of the columns of $B$. As a result, the $\tilde{B}$ derived through wavelet based anomography can also be modeled as a right-multiplying matrix transformation.
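A sketch of the wavelet variant, assuming the PyWavelets package is available; the wavelet family, decomposition level and function name are illustrative choices rather than those used in [1]:

```python
import numpy as np
import pywt  # PyWavelets, assumed available for this sketch

def wavelet_anomalies(B, wavelet="db1", level=3):
    """Drop the coarse (low-frequency) approximation of each link's time
    series and reconstruct from the finer detail coefficients."""
    coeffs = pywt.wavedec(B, wavelet, level=level, axis=1)
    coeffs[0] = np.zeros_like(coeffs[0])         # remove the approximation part
    B_tilde = pywt.waverec(coeffs, wavelet, axis=1)
    return B_tilde[:, :B.shape[1]]               # trim any padding
```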
Consider the link load matrix $B$. We can think of each row (one link's time series) as a $t$-dimensional vector. What we are looking for is a new coordinate system such that the projections of the links onto its axes have their energy concentrated on the first several axes. This is exactly what PCA provides. The only difference is that we now apply PCA to $B^T$, as opposed to $B$ (as used in spatial PCA). Then we follow the same procedure to define an anomalous subspace and to extract anomalies that have projections in the anomalous subspace. In this way, we obtain a left-multiplying transformation matrix $T'$, i.e., $\tilde{B}^T = T' B^T$. Taking the transpose on both sides of the equation, we have $\tilde{B} = B T$, where $T = T'^T$ is a right-multiplying transformation matrix that extracts anomalies from $B$.
Once we obtain the matrix of link anomalies $\tilde{B}$, the next step is to reconstruct the OD flow anomalies $\tilde{X}$ by solving a series of ill-posed linear inverse problems $\tilde{b}_j = A\tilde{x}_j$. For example, Lakhina et al. [19] proposed to find the single largest anomaly in each time interval by applying a greedy algorithm. We present below three common inference algorithms for solving these problems. All three algorithms deal with the underconstrained linear system by searching for a solution that minimizes some notion of vector norm: the $\ell_2$ norm (the pseudoinverse solution), the $\ell_0$ norm (sparsity, approximated by a greedy algorithm), and the $\ell_1$ norm (a convex surrogate for sparsity).
A standard solution to $\tilde{b} = A\tilde{x}$ is the pseudoinverse solution $\tilde{x} = A^+ \tilde{b}$, where $A^+$ is the pseudoinverse (or Moore-Penrose inverse) of the matrix $A$. It is known that $\tilde{x} = A^+ \tilde{b}$ is the solution that minimizes the $\ell_2$ norm of the anomaly vector, i.e., it solves:
minimize $\|\tilde{x}\|_2$ subject to $\|\tilde{b} - A\tilde{x}\|_2$ being minimal.   (8)
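In code, the pseudoinverse solution is a one-liner (a sketch; the function name is ours):

```python
import numpy as np

def pseudoinverse_solution(A, b_tilde):
    """Minimum L2-norm solution of A x_tilde = b_tilde; b_tilde may be a
    single vector or the whole matrix of link anomalies (one column per time)."""
    return np.linalg.pinv(A) @ b_tilde
```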
In practice, we expect only a few anomalies at any one time, so typically $\tilde{x}$ has only a small number of large values. Hence it is natural to proceed by maximizing the sparsity of $\tilde{x}$, i.e., solving the following $\ell_0$-norm minimization problem: minimize $\|\tilde{x}\|_0$ subject to $\tilde{b} = A\tilde{x}$.   (9)
One common approach to approximate $\ell_0$-norm minimization is to convexify (9) by replacing the $\ell_0$ norm with the $\ell_1$ norm, so that we seek a solution to: minimize $\|\tilde{x}\|_1$ subject to $\tilde{b} = A\tilde{x}$.   (10)
In the presence of measurement noise, the constraints may not always be satisfiable. In this case, we can add to the objective a penalty term, weighted by a parameter $\lambda$, on the $\ell_1$ norm of the constraint violation $\tilde{b} - A\tilde{x}$, and reformulate (10) as the unconstrained $\ell_1$ minimization problem (11).
We can cast (11) into an equivalent Linear Programming (LP) problem, for which solutions are available even when the problem size is very large, owing to modern interior-point linear programming methods.
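The sketch below shows one natural LP encoding of this idea using scipy.optimize.linprog: it minimizes the L1 norm of the anomaly vector plus lam times the L1 norm of the constraint violation. The exact weighting, the parameter value and the function name are assumptions for illustration, not necessarily the paper's exact formulation of (11).

```python
import numpy as np
from scipy.optimize import linprog

def sparsity_l1(A, b_tilde, lam=0.01):
    """Solve min ||x||_1 + lam * ||b_tilde - A x||_1 as an LP."""
    m, n = A.shape
    # Variables z = [x (n), u (n), v (m)], with |x| <= u and |b - A x| <= v.
    c = np.concatenate([np.zeros(n), np.ones(n), lam * np.ones(m)])
    I_n, I_m = np.eye(n), np.eye(m)
    Z_nm, Z_mn = np.zeros((n, m)), np.zeros((m, n))
    A_ub = np.block([
        [ I_n, -I_n,  Z_nm],   #  x - u <= 0
        [-I_n, -I_n,  Z_nm],   # -x - u <= 0
        [ A,    Z_mn, -I_m],   #  A x - v <= b
        [-A,    Z_mn, -I_m],   # -A x - v <= -b
    ])
    b_ub = np.concatenate([np.zeros(2 * n), b_tilde, -b_tilde])
    bounds = [(None, None)] * n + [(0, None)] * (n + m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n]
```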
Another common heuristic solution for $\ell_0$-norm minimization is to apply a greedy algorithm. For example, the greedy heuristic has been successfully applied to wavelet decomposition, where it goes by the name of Orthogonal Matching Pursuit (OMP) [24]. In the same spirit here, we develop a greedy solution to maximize the sparsity of $\tilde{x}$. The algorithm starts with an empty set $S$ of non-zero positions for $\tilde{x}$ and then iteratively adds new non-zero positions to $S$. During each iteration, for each candidate position $k \notin S$, the algorithm tests how much it can reduce the residual $\tilde{b} - A\tilde{x}$ by including $k$ as a non-zero position. More specifically, let $S' = S \cup \{k\}$. The algorithm estimates the values of the non-zero elements of $\tilde{x}$ (those in the positions $S'$) by solving the following least squares problem:
minimize $\|\tilde{b} - A_{S'}\tilde{x}_{S'}\|_2$,   (13)
where $A_{S'}$ denotes the submatrix of $A$ formed by the columns in $S'$ and $\tilde{x}_{S'}$ the corresponding entries of $\tilde{x}$. The candidate position yielding the smallest residual is added to $S$, and the process repeats until the residual is sufficiently small.
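A sketch of the greedy (OMP-style) procedure follows; the stopping rules (a maximum support size and a residual tolerance) and the function name are illustrative additions, not parameters specified in the paper.

```python
import numpy as np

def sparsity_greedy(A, b_tilde, k_max=5, tol=1e-6):
    """Build the support of x_tilde one position at a time, re-solving a
    least-squares problem restricted to the chosen columns of A."""
    n = A.shape[1]
    support, x = [], np.zeros(n)
    residual = b_tilde.copy()
    for _ in range(k_max):
        if np.linalg.norm(residual) <= tol:
            break
        best_j, best_err, best_coef = None, np.inf, None
        for j in range(n):
            if j in support:
                continue
            cols = support + [j]
            coef = np.linalg.lstsq(A[:, cols], b_tilde, rcond=None)[0]
            err = np.linalg.norm(b_tilde - A[:, cols] @ coef)
            if err < best_err:
                best_j, best_err, best_coef = j, err, coef
        if best_j is None:
            break
        support.append(best_j)
        x = np.zeros(n)
        x[support] = best_coef
        residual = b_tilde - A @ x
    return x
```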
Up to this point, we have assumed that the routing matrices are constant. However, we wish to allow for dynamic routing changes, and so we must allow $A_j$ to vary over time. In IP networks, routing changes occur as part of the normal ``self-healing'' behavior of the network, and so it is advantageous to isolate these from traffic anomalies and only signal traffic anomalies. In addition, if some measurements are missing (say at time $j$), we may still form a consistent problem by setting the appropriate rows of $A_j$ to zero. Thus, for realistic SNMP measurements where missing data are often an issue, we still wish to allow $A_j$ to vary even under static routing. Routing measurements may be obtained using a route monitor, which provides accurate, up-to-date measurements of routing (at least at the time scale of SNMP measurements, e.g. minutes).
Where the tomography step can be done separately at each time interval (for instance see [34,35]), it is simple to adapt early-inverse methods to dynamic network anomography by inverting (2) at each time step. Since the early-inverse approach is thus straightforward, we seek here to generalize late-inverse methods to dynamic network anomography.
When the routing matrix is non-constant, there is no reason to believe that the measurements should follow a simple model such as an ARIMA model. Even where the traffic itself follows such a model, a simple routing change may change a link load measurement by 100%, for instance by routing traffic completely away from a particular link. If we were to apply the ARIMA model to the measurements $b_j$, we would see such a change in routing as a level-shift anomaly. However, its cause is not an unknown change in the traffic (to be discovered), but rather a known change in the routing matrices $A_j$. Likewise, it no longer makes sense to apply spatial correlations that arose under one particular routing to the case of another routing.
However, it is no less reasonable to approximate the traffic matrix by an ARIMA model (than when the routing is constant), even when routing may change. Under such a modeling assumption, we can write the traffic anomalies as a linear transformation of the (unknown) OD flows, just as in the constant-routing case. We know also that the measurements are given by (2). A reasonable approach to the solution is therefore to seek a solution which is consistent with these equations, but also minimizes one of the norms (described above) at each time step. We choose to minimize the $\ell_1$ norm here because (i) it allows us to naturally incorporate link load constraints at multiple time intervals, and (ii) it is more accurate than both the pseudoinverse and the greedy algorithms for static anomography (as we will show in Section 6).
Unfortunately, for transform based methods (the Fourier, wavelet and PCA methods) the number of constraints becomes very large (as $t$ grows). On the other hand, the set of constraints for the ARIMA model can be written in a form such that it does not grow with $t$. Hence, in the following we concentrate on generalizing the ARIMA approach. We present the algorithm for one subclass of ARIMA models (Section 4.2); we have also extended the algorithm to handle the remaining ARIMA models, though we omit this treatment here for brevity (as it is a straightforward extension). Due to space limits, we also leave out the discussion of model selection and parameter estimation, two important issues when applying ARIMA-based anomography. Interested readers can find this in our technical report [33].
We seek solutions that are consistent with the measurements $b_j = A_j x_j$ for each time interval $j$, and with an ARIMA model that gives $\tilde{X} = XT$, where $T$ is the same transformation matrix implicitly defined by (6) and (7). Importantly, we do not wish to have to estimate $X$ itself (or we may as well use an early-inverse method). The advantage of the ARIMA model is that we do not need to know $X$, but only linear combinations of the $x_j$.
Let $\mathcal{B}$ be the backshift operator, whose effect on a process can be summarized as $\mathcal{B} x_j = x_{j-1}$. Let the AR polynomial be $\phi(\mathcal{B}) = 1 - \phi_1 \mathcal{B} - \cdots - \phi_p \mathcal{B}^p$.
As an example, consider the simplest ARIMA model, ARIMA$(0,1,0)$. In this case the anomaly at each time interval is simply the difference between successive traffic vectors, so we have $\tilde{x}_j = x_j - x_{j-1}$.   (17)
As in Section 3.4.2, we can accommodate measurement noise by incorporating penalty terms into the objective to penalize violation of constraints (14) and (15). We can then solve the resulting $\ell_1$-norm minimization problem by reformulating it as an equivalent LP problem. We omit such details in the interest of brevity.
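To illustrate the shape of the dynamic formulation for the simplest ARIMA(0,1,0) case, here is a sketch that couples two consecutive intervals with possibly different routing matrices and minimizes the L1 norm of the implied anomaly. It uses exact (noise-free) link load constraints; the paper's noisy variant would replace them with penalty terms as described above. The function name, variable layout and the non-negativity assumption on traffic are ours.

```python
import numpy as np
from scipy.optimize import linprog

def dynamic_anomography_l1(A_prev, b_prev, A_cur, b_cur):
    """Find x_{t-1}, x_t with A_{t-1} x_{t-1} = b_{t-1} and A_t x_t = b_t,
    minimizing ||x_t - x_{t-1}||_1; returns the inferred anomaly vector."""
    m_prev, n = A_prev.shape
    m_cur = A_cur.shape[0]
    # Variables z = [x_prev (n), x_cur (n), u (n)] with u >= |x_cur - x_prev|.
    c = np.concatenate([np.zeros(2 * n), np.ones(n)])
    I_n = np.eye(n)
    A_eq = np.block([
        [A_prev,                np.zeros((m_prev, n)), np.zeros((m_prev, n))],
        [np.zeros((m_cur, n)),  A_cur,                 np.zeros((m_cur, n))],
    ])
    b_eq = np.concatenate([b_prev, b_cur])
    A_ub = np.block([
        [-I_n,  I_n, -I_n],      #  (x_cur - x_prev) - u <= 0
        [ I_n, -I_n, -I_n],      # -(x_cur - x_prev) - u <= 0
    ])
    b_ub = np.zeros(2 * n)
    bounds = [(0, None)] * (3 * n)   # traffic volumes and u are non-negative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    x_prev, x_cur = res.x[:n], res.x[n:2 * n]
    return x_cur - x_prev
```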
We have also developed two techniques that significantly reduce the size of the above minimization problems by exploiting the fact that changes in routing matrices tend to be infrequent (i.e., not in every time interval) and local (i.e., affecting only a small subset of rows). Interested readers may refer to our technical report [33] for details.
The primary data inputs for our anomaly diagnosis are the time series of link loads (bytes across interfaces) for the entire network, gathered through SNMP. We use flow-level data, where available, for validation. As is often the case, the flow data is incomplete. The flow data are collected at the edge of the network, where data packets are sampled and aggregated by the IP source and destination address and the TCP port number. Adjusted for sampling rate and combined with BGP and ISIS/OSPF routing information, these sampled IP flow statistics are then aggregated into a real traffic matrix [11], where each element is an OD flow with the origin and destination being the ingress and egress point of the flow to/from the network. Consistent with [19], we aggregate these measurements into bins of 10 minutes to avoid any synchronization issues that could have arisen in the data collection.
Ideally, to evaluate the methods, one would like complete flow level data, SNMP link load measurements, and continuous tracking of routing information, providing a consistent, comprehensive view of the network in operation. Unfortunately, we do not have the complete set of flow level data across the edge of the network (due to problems in vendor implementations or in data collection), and our routing information is only ``quasi-'' real time (we rely on snapshots available from table dumps carried out every 8 hours). As a result, inconsistencies sometimes arise between these measurements. To overcome these problems and provide a consistent means for evaluating the algorithms, we adopt the method in [34] and reconstruct the link traffic data by simulating the network routing on the OD flow traffic matrix generated from the available set of flow level data. Note that we use derived link load measurements for validation purposes only. In practice, our methods are applicable to direct measurement of traffic data as obtained from SNMP.
In Section 6.4.2 we step away from the simple anomaly detection algorithm used to test the inference component, and compare the complete set of anomography methods described in Section 3. As before, we use the detection rate to measure whether an anomaly detection method produces similar results when applied to the OD pairs directly and when applied to the link load data together with an inversion method -- we use the Sparsity-L1 method (the best performing of the methods tested using the methodology above). In other words, we benchmark each anomography method against the anomalies seen in direct analysis of the OD flows.
Since different methods may find different sets of benchmark anomalies, we need an objective measure for assessing the performance of the methods. Ideally, we would like to compare the set of anomalies identified by each of the methods to the set of ``true'' network anomalies. However, isolating and verifying all genuine anomalies in an operational network is, although important, a very difficult task. It involves correlating traffic changes with other data sources (e.g., BGP/OSPF routing events, network alarms, and operator logs), an activity that often involves case-by-case analysis. Instead, we perform pair-wise comparisons, based on the top ranked anomalies identified by each of the anomography methods, an approach also taken in Lakhina et al. [19].
Specifically, for each of the anomography methods, we apply the underlying anomaly detection method directly to the OD flow data. We take the top ranked anomalies so identified as the benchmark set for that anomaly detection method. For each of the anomography methods, we then examine the set of largest anomalies inferred from the link load data. To help understand the fidelity of the anomography methods, we consider the overlap between each benchmark set and each anomography method's set, across all the benchmarks and all the anomography methods. We allow a small amount of slack (within one ten-minute time shift) in the comparison between events, in order that phase differences between methods not unduly impact the results.
We are interested in understanding both false positives and false negatives: a false positive is an anomaly reported by an anomography method that does not appear in the benchmark set, and a false negative is a benchmark anomaly that the anomography method fails to report.
For our reports in the next section, we choose the smaller of the two set sizes to be 30, since this roughly represents the number of traffic anomalies that network engineers might have the resources to analyze deeply on a weekly basis. We would like to show comparative results as the larger set size varies, but cannot within a reasonable amount of space, and so show results for one fixed value. It is important to note that the results we obtained for other values of these parameters change none of our qualitative conclusions.
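For concreteness, a small sketch of the pair-wise comparison metric: it counts how many benchmark anomalies are matched by an inferred anomaly for the same OD flow within one time bin of slack. The (flow, time-bin) event representation and the example values are illustrative assumptions.

```python
def detection_rate(benchmark, detected, slack=1):
    """Fraction of benchmark (flow, time) anomalies matched by a detected
    anomaly for the same flow within +/- slack time bins."""
    matched = 0
    for flow, t in benchmark:
        if any(f == flow and abs(t - t2) <= slack for f, t2 in detected):
            matched += 1
    return matched / len(benchmark)

# Toy example: 2 of 3 benchmark events matched within one 10-minute bin.
bench = [("odA", 10), ("odB", 42), ("odC", 77)]
found = [("odA", 11), ("odC", 78), ("odD", 5)]
print(detection_rate(bench, found))   # ~0.67
```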
We obtained six months (03/01/04-09/04/04) of measurements for the Abilene network and one month (10/06/04-11/02/04) for the Tier-1 ISP network. We partitioned the data into sets spanning one week each, and evaluated the methods on each data set. Due to space limits, we present only one set of representative results - Tier-1 ISP (10/6/04-10/12/04). In our technical report [33], we also report results in other weeks for the Tier-1 ISP network as well as for the Abilene network. These results are qualitatively similar to those reported here.
We first compare different solution techniques for the inference problem $\tilde{b} = A\tilde{x}$. More specifically, we consider three late-inverse algorithms: Pseudoinverse (Section 3.4.1), Sparsity-Greedy (Section 3.4.2), and Sparsity-L1 (Section 3.4.2), and one early-inverse technique: Early Inverse-Tomogravity. We choose the tomogravity method [35] as the early-inverse technique since it has demonstrated high accuracy and robustness for estimating traffic matrices in real operational networks [14,35].
Figure 1 plots the sizes of the top 50 anomalies (the forecast errors) of the OD flows (the solid lines) and the corresponding values diagnosed by the different inference techniques (the points) for 10/6/04 to 10/12/04, for the Tier-1 ISP network. The y-axis gives the size of the anomalies normalized by the average total traffic volume on the network. The x-axis is the rank by the size of the anomalies computed directly from the OD flows. We observe that there are very few large changes - among more than 6 million elements (roughly 6000 OD flows at 1007 data points), there is one instance where the size of the anomaly is more than 1% of total traffic and there are 18 cases where the disturbances constitute more than 0.5% of total traffic. This agrees with our intuition on the sparsity of network anomalies.
We see that Pseudoinverse significantly underestimates the size of the anomalies. Intuitively, Pseudoinverse finds the least-squares solution, which distributes the ``energy'' of the anomaly evenly across all candidate flows that may have contributed to it, under the link load constraint. This is directly opposed to the sparsity maximization philosophy. Among the sparsity maximization techniques, Sparsity-L1 performs the best: it always finds solutions close to the real anomalies. Sparsity-Greedy, in general, is more effective than Pseudoinverse, although it sometimes overestimates the size of anomalies. As a representative of the early-inverse techniques, Tomogravity also performs well. With few exceptions, Tomogravity finds solutions that track the real OD flow anomalies. Intuitively, when a proportionality condition holds, i.e., when the sizes of the anomalies are proportional to the sizes of the OD flows, then early-inverse methods work well. However, where the proportionality condition does not hold, the error can be significant.
Figure 2 presents the detection rate for the different inference techniques. We observe that for the Tier-1 ISP network, Sparsity-L1 and Tomogravity, which have about 0.8 detection rate, significantly outperform other methods.
Due to space limits, we will consider only Sparsity-L1 and Tomogravity in the rest of the evaluation, as these methods demonstrate the greatest performance and flexibility in dealing with problems such as missing data and routing changes.
$\lambda$ in Sparsity-L1. Sparsity-L1 involves a parameter $\lambda$ in its formulation (Eq. 11). Figure 3 investigates the sensitivity to this parameter choice. Specifically, Figure 3 plots the detection rate of Sparsity-L1 for several values of $\lambda$; all values in this range achieve good performance. This is reassuring, since it suggests that little training or parameter tuning is needed to match the method to a different network or traffic pattern.
Measurement Noise. Thus far, we have assumed perfect link load information for anomaly detection. However, in real networks, SNMP byte counts are collected from all routers across the network. Inevitably, measurement issues such as lack of time synchronization may introduce noise. Below we evaluate the impact of measurement noise by multiplying white noise terms with each element of the link load, and then using the result as input to our inference algorithms.
Figure 4 compares how well the methods perform with no noise to how well they do at two higher noise levels. Note that measurement errors near 1% throughout the network are quite significant, since the sizes of the largest anomalies are themselves near 1% of the total traffic (Figure 1). It is a challenging task to accurately diagnose anomalies given a comparable level of noise. Nevertheless, we find that both Sparsity-L1 and Tomogravity are quite robust to measurement noise. For the Tier-1 ISP network, the detection rate remains above 0.8 for the largest anomalies (small benchmark sizes) and above 0.7 for the top 50 anomalies. These results demonstrate the strength of our algorithms in dealing with imperfect measurements.
Missing Data. Missing measurement data, arising from problems such as packet loss during data collection, is common in real networks. Indeed, this can be tricky to deal with, since the loss of link load data has the effect of producing time varying routing matrices in the anomography formulation. Fortunately, as discussed in Section 4, our extended Sparsity-L1 algorithm can handle this situation.
Figure 5 shows the performance of the inference algorithms with up to 5% of the data missing - missing values are selected uniformly at random. We see that both Sparsity-L1 and Tomogravity suffer only minor (almost negligible) performance impact, in terms of detection rate. The low sensitivity to missing data is an important feature of these methods, which is critical for real implementation.
Routing Changes. In an operational network, the routing matrix is unlikely to remain unchanged over a few days. Hardware failures, engineering operations, maintenance and upgrades all may cause traffic to be rerouted on alternative paths. Here we evaluate the impact of routing changes on the performance of our algorithms. We introduce routing changes by simulating faults on internal links.
Figure 6 presents results where we have randomly failed/repaired up to 3 links at each time instance. We observe that Sparsity-L1 is very robust to such a disturbance in the routing structure, while Tomogravity suffers significant performance impact. It appears that Tomogravity suffers here because errors in the (early) inference step, being computed from different routing matrices, add to become comparable to the anomalies themselves. This demonstrates another advantage of the late-inverse over the early-inverse approach.
Thus far, we have compared the performance of Sparsity-L1 and Early Inverse-Tomogravity, under the simple temporal model (forecasting the next data point using the current value). We found that Sparsity-L1 in general outperforms the Early Inverse approach. We also observed that Sparsity-L1 is robust to measurement noise, is insensitive to parameter choice, and is able to handle missing data and route changes. We now evaluate overall performance when applying Sparsity-L1 with other temporal and spatial anomography methods. In particular, we compare FFT (Section 3.3.2), Wavelet (Section 3.3.3), PCA (Section 3.2.1), TPCA (Section 3.3.4), and four ARIMA based methods, Diff (the simple forecasting model of the last section), Holt-Winters, EWMA, and general ARIMA, which determines the appropriate model using the method in [33].
As noted in Section 5, for each model considered, we compute the anomalies $\tilde{X}$ directly from the OD flow traffic data and use them as the benchmark. Next, we compute $\tilde{B}$ from the link loads with the same anomography model and construct the inference problem. We compare the solution derived through Sparsity-L1 with the benchmark. Figure 7 presents the detection rate for these approaches. To avoid overcrowding the graph, we divide the anomography methods into two groups. Figure 7 (a) plots the results for the ARIMA family of anomography approaches and Figure 7 (b) plots the results for the rest. We observe that for all the ARIMA based approaches, Sparsity-L1 finds very good solutions. With the traffic data aggregated at the 10-minute level, even simple Diff and EWMA suffice to extract the anomalous traffic and support a solution that maximizes the sparsity of the anomalies. Holt-Winters produces better performance than Diff and EWMA. This is because the model is more sophisticated, and thus is able to capture more complex temporal trends exhibited in the traffic data. Further sophistication, as incorporated in general ARIMA, however, does not significantly improve performance. In the family of ARIMA models, Holt-Winters appears to provide the best complexity-performance trade-off.
From Figure 7 (b), we observe that Sparsity-L1 can also achieve a high detection rate under FFT, Wavelet and TPCA. However, it does not work well with spatial PCA. This can be explained as follows. When we apply spatial PCA to the real traffic matrix $X$ and to the link load matrix $B$, we obtain two linear transformations, $T_X$ and $T_B$, respectively. These two transformation matrices may differ significantly because the spatial correlation among link loads and that among OD flows are rather different. Even if we extract $\tilde{B} = T_B B$, we cannot ensure that $\tilde{B} = A \tilde{X}$ for the $\tilde{X} = T_X X$ obtained on the OD flows. (Note that this last comment applies to spatial anomography methods in general.) Thus, the spatial PCA anomography solution is not expected to completely overlap with the anomalies identified by directly applying spatial PCA to the OD traffic flows. In contrast, the temporal anomography methods are self-consistent: given $\tilde{B} = BT$, if we apply the same transformation to $X$ and obtain $\tilde{X} = XT$, then $B = AX$ guarantees that $\tilde{B} = A\tilde{X}$.
We now turn to comparing the various anomography methods. To do so, we use a set of benchmarks, as described in Section 5, each derived from applying an anomaly detection algorithm directly to the OD flows. For each benchmark, we report on the success of all of the anomography methods. The hope is that methods emerge that achieve both low false positives and low false negatives for nearly all of the benchmarks.
(Table 1: number of false positives in the pair-wise comparison of the anomography methods.)
From the table, we observe from the upper left quadrant that the ARIMA, FFT and Wavelet approaches tend to have relatively low false positives among detected anomalies. Thus, the top 30 ranked anomalies derived through these approaches indeed appear to be anomalous traffic events that are worth investigating.
The PCA based approaches, however, exhibit higher false positives when benchmarked against the other approaches. This appears to be partially due to PCA identifying anomalies of a different type than those identified by the other methods. Consider, for example, a sudden increase of traffic for an OD flow that persists for a couple of hours. PCA methods may identify every instance within the two-hour period as anomalous. ARIMA based approaches detect abrupt traffic changes, and hence likely extract only the ``edges'' - the first and last instance - of the two-hour period. Another factor contributing to PCA's false positives may be its lack of self-consistency: anomalies present in the OD pairs may not be detected by the method in the link loads. In addition, unlike ARIMA, FFT, or wavelet based anomography, both spatial PCA and temporal PCA cannot fully utilize the temporal ordering information in the measured time series: for example, any reordering of the time series $b_1, b_2, \ldots, b_t$ does not affect the outcome of the algorithm.
Table 2 presents the number of false negatives, where we are interested in the number of large benchmark anomalies that are not identified by each approach. We observe that the ARIMA, FFT and Wavelet anomography approaches have superb performance - the number of false negatives is very low. This indicates that very few important traffic anomalies pass undetected by these approaches. The PCA based approaches, however, identify only about half of the anomalies.
In this paper, we introduced network anomography, the problem of inferring network-level anomalies from widely available data aggregates. Our major advances are: a framework that cleanly separates anomaly extraction from the inference step and thereby encompasses a wide family of algorithms (including ARIMA, Fourier, wavelet and PCA based methods); a dynamic anomography algorithm that handles routing changes and missing data; and an extensive evaluation on months of data from the Abilene network and a large Tier-1 ISP network, demonstrating high detection accuracy and robustness.
While we believe our work represents a significant advance in the state of the art, we recognize that the ultimate test of performance is significant operational experience: utility lies in bringing to light, in the field, new anomalies that were ``flying under the radar'' of other techniques, while producing very few false alarms. Our larger goal in future work is to explore the feasibility and performance of automated traffic management systems, which incorporate anomaly detection, root cause diagnosis, and traffic and route control for operational networks.