Next: Testing Different Feature Construction
Up: Analyzing System Logs: A
Previous: Ranking Messages by Exceptionality
Using Rank Correlation for Feature Construction
In the original feature-set of our dataset, system log
is represented as the message count vector
defined above. There are
message types in our dataset, hence this results in a
matrix. This matrix is very sparse; only about 0.6% of the entries are non-zero. The high dimensionality of the data and its sparseness make k-means clustering impractical for this representation.
Since our objective is message ranking, we propose a new feature construction scheme of the system-log dataset that measures the difference in the ranking of messages between system logs. Two known rank correlation measures can be used to achieve this: The Spearman rank correlation [12] and Kendall's tau rank correlation [12].
Let
and
be vectors of dimension
. Let
and
be vectors of ranks for
and
, i.e.
if
is the
'th largest number in
, and similarly for
.3
The two correlation measures are defined as follows:
Definition 1 (Spearman Rank Correlation)
Let
data:image/s3,"s3://crabby-images/85069/850694bb3748c0fd547e414f8874c317ed2b9851" alt="$ \vec{d} \stackrel{def}{=}\vec{r_x} - \vec{r_y}$"
.
The Spearman rank Correlation between
data:image/s3,"s3://crabby-images/4d7ef/4d7ef2b9ea72fdb7c0b40e2fcc1666f7b1a3ca8e" alt="$ \vec{x}$"
and
data:image/s3,"s3://crabby-images/8f64c/8f64c8d49fec27be3a709fa9c9e5991ff1de1de5" alt="$ \vec{y}$"
is defined by:
data:image/s3,"s3://crabby-images/5a4e3/5a4e3e273760b9f608b220fb11ca17c06b7028a8" alt="$\displaystyle \rho(\vec{x},\vec{y}) \stackrel{def}{=}1-\frac{6\Vert\:\vec{d}\:\Vert^2}{N(N^2-1)}$" |
(1) |
Definition 2 (Kendall's Tau Rank Correlation)
Let
data:image/s3,"s3://crabby-images/c867b/c867be432e58e06ffde571d9759bdda8e6eb4031" alt="$ P(\vec{x},\vec{y})$"
be the number of pairs
data:image/s3,"s3://crabby-images/5f77c/5f77ca661374c70247b4900c673faf458270479d" alt="$ i,j$"
such that both
![$ r_x\!\left[i\right]>r_x\!\left[j\right]$](img32.png)
and
![$ r_y\!\left[i\right]>r_y\!\left[j\right]$](img33.png)
.
Kendall's tau rank correlation between
data:image/s3,"s3://crabby-images/4d7ef/4d7ef2b9ea72fdb7c0b40e2fcc1666f7b1a3ca8e" alt="$ \vec{x}$"
and
data:image/s3,"s3://crabby-images/8f64c/8f64c8d49fec27be3a709fa9c9e5991ff1de1de5" alt="$ \vec{y}$"
is defined by:
data:image/s3,"s3://crabby-images/8b78d/8b78dc5d224d24bd6ede0c38383763472c3677ff" alt="$\displaystyle \tau(\vec{x},\vec{y}) \stackrel{def}{=}\frac{4P(\vec{x},\vec{y})}{N(N-1)}-1$" |
(2) |
We first define a new Spearman-based feature-set. In this feature-set, system log
is represented by the vector
, where
is the number of samples in our dataset. A Kendall's-tau-based feature-set can be generated in an analogous way. The resulting matrix is a sample correlation matrix, and the new feature-set has
dimensions instead of the much larger
(the number of different message types in the dataset). It is generally expected that in a typical dataset,
would be much larger than
as it is in our own dataset, because of the diversity of possible messages in a computer system.
It is interesting to note that using either the Spearman-based feature-set or the Kendall's-tau-based feature set generates a kernel similarity matrix for the original dataset. This opens the possibility of using these correlation measures in kernel-based algorithms [9]. We prove that both sample correlation matrices are kernel matrices, using the fact that a kernel matrix is a Positive Semi-Definite (PSD) matrix. A matrix
is PSD if for any non-zero vector
,
[9].
We first prove that the Pearson Sample Correlation matrix [12] is PSD, and then conclude that so are the Spearman rank correlation matrix and the Kendall's-tau sample correlation matrix.
Definition 3 (Pearson Sample Correlation)
Let
data:image/s3,"s3://crabby-images/4d7ef/4d7ef2b9ea72fdb7c0b40e2fcc1666f7b1a3ca8e" alt="$ \vec{x}$"
and
data:image/s3,"s3://crabby-images/8f64c/8f64c8d49fec27be3a709fa9c9e5991ff1de1de5" alt="$ \vec{y}$"
be vectors of the same dimension.
The Pearson correlation coefficient is defined by:
data:image/s3,"s3://crabby-images/863b0/863b034f377cf7256e0233d1d3591b43a31eca96" alt="$\displaystyle r(\vec{x},\vec{y}) = \frac{cov(\vec{x},\vec{y})}{\sqrt{var(\vec{x})\cdot var(\vec{y})}}$" |
(3) |
where
data:image/s3,"s3://crabby-images/e6fa1/e6fa1d0ea378d2c6d16adc9b86707bc9a10d6417" alt="$ cov$"
is the sample covariance function and
data:image/s3,"s3://crabby-images/3796b/3796bd449968a8045e31e43fa6fe4fe18237078b" alt="$ var$"
is the sample variance function.
Theorem 1
A Pearson sample correlation matrix is PSD.
Proof.
Let
data:image/s3,"s3://crabby-images/a1d1a/a1d1ab4e8d26be3033b4e605a871e4d48f9e01b3" alt="$ X$"
be a matrix in which each row is a sample. Let
data:image/s3,"s3://crabby-images/3d5da/3d5dafd7341bd1754bbc8e05cf0f788f1566ec08" alt="$ S$"
be a diagonal matrix such that entry
data:image/s3,"s3://crabby-images/03c58/03c5814bd685e11417d8e200f3327e729f93900d" alt="$ \left(i,i\right)$"
is the variance of row
data:image/s3,"s3://crabby-images/18aa5/18aa53b686891bad525940c2f39d8cebc075ef0b" alt="$ i$"
in
data:image/s3,"s3://crabby-images/a1d1a/a1d1ab4e8d26be3033b4e605a871e4d48f9e01b3" alt="$ X$"
. Assume, without loss of generality, that the mean of each sample in
data:image/s3,"s3://crabby-images/a1d1a/a1d1ab4e8d26be3033b4e605a871e4d48f9e01b3" alt="$ X$"
is zero. Then the Pearson correlation matrix can be written in vector form as:
For any non-zero vector
data:image/s3,"s3://crabby-images/4d7ef/4d7ef2b9ea72fdb7c0b40e2fcc1666f7b1a3ca8e" alt="$ \vec{x}$"
, the expression
data:image/s3,"s3://crabby-images/9c771/9c771044742fe46506d77067d03bf5e7210ed17a" alt="$ \vec{x}'R\vec{x}$"
can be written as:
data:image/s3,"s3://crabby-images/87510/875102753803ce7b05082834fea5bf079c7e255f" alt="$\displaystyle \vec{x}'R\vec{x} =\vec{x}'\left( S^{-\frac12} X X' S'^{-\frac12}\right)\vec{x} = \left(\vec{x}' S^{-\frac12} X \right) ^2$" |
(4) |
The rightmost element is a square term, hence it is greater than or equal to zero. Therefore
data:image/s3,"s3://crabby-images/04837/0483753dd2f48b06615a7d242cc6826e9916aa7b" alt="$ R$"
is PSD.
Theorem 2
A Spearman rank correlation matrix is PSD.
Proof.
Spearman correlation is Pearson correlation applied to ranks [
4]. Therefore, the Spearman rank correlation matrix is PSD.
Theorem 3
A Kendall's-tau correlation matrix is PSD.
Proof.
For a vector
data:image/s3,"s3://crabby-images/4d7ef/4d7ef2b9ea72fdb7c0b40e2fcc1666f7b1a3ca8e" alt="$ \vec{x}$"
of dimension
data:image/s3,"s3://crabby-images/f26c3/f26c31a4532df53a136854140debc1a875f32f40" alt="$ N$"
, let
data:image/s3,"s3://crabby-images/a2547/a25472d73dfa58cb383feeb031d4b09d48d02139" alt="$ \vec{x_K}$"
of dimension
data:image/s3,"s3://crabby-images/5a170/5a170d9b7919b5ce51140cbbd0bc5b08ae2c9a4a" alt="$ N^2$"
be defined by:
![$\displaystyle x_K\!\left[j+(i-1)\cdot N\right] =
\left\{
\begin{array}{ll}
...
...i\right]>r_x\!\left[j\right]
0 & \textrm{otherwise}
\end{array}
\right.$](img52.png) |
(5) |
Then
![$ P(\vec{x},\vec{y}) = \sum_{k=1}^{N^2}x_K\!\left[k\right]\cdot y_K\!\left[k\right]$](img53.png)
, and it can be easily verified that Kendall's tau correlation of
data:image/s3,"s3://crabby-images/4d7ef/4d7ef2b9ea72fdb7c0b40e2fcc1666f7b1a3ca8e" alt="$ \vec{x}$"
and
data:image/s3,"s3://crabby-images/8f64c/8f64c8d49fec27be3a709fa9c9e5991ff1de1de5" alt="$ \vec{y}$"
is the Pearson correlation of
data:image/s3,"s3://crabby-images/3cab2/3cab2ed36d06163eeac0adcc3af9ce052ace53fa" alt="$ c\cdot\vec{x_K}$"
and
data:image/s3,"s3://crabby-images/2c9f1/2c9f15936d7f3fcdbe725bbdac405014a452eea2" alt="$ c\cdot\vec{y_K}$"
, for
data:image/s3,"s3://crabby-images/0e8e3/0e8e354e731dc0cf1c6a50f382d4caaeaa78fa0b" alt="$ c$"
a constant that depends on
data:image/s3,"s3://crabby-images/7371c/7371c45c479e75f11b8dee0b4363b704a2502ccf" alt="$ n$"
. Hence, Kendall's tau correlation matrix is also a Pearson correlation matrix, and so it is PSD.
Next: Testing Different Feature Construction
Up: Analyzing System Logs: A
Previous: Ranking Messages by Exceptionality
2007-03-12