 
 
 
 
 
 
   
We represent system log $i$ by a vector $\vec{c_i}=(c_i[1],\ldots,c_i[n])$, where $n$ is the number of possible message types, and $c_i[m]$ is the number of instances of message $m$ in system log $i$.$^1$ Also, let $P=\{p_1,\ldots,p_n\}$ be a set of cumulative distribution functions $p_m\!:\!\mathbb{N}\rightarrow[0,1]$, where $p_m(c)$ is the probability that message $m$ would appear $c$ or fewer times in a system log. If the probability of getting more than $c_i[m]$ instances of message type $m$ is low, then the number of appearances of message $m$ is more than expected, and therefore message $m$ should be ranked higher. Therefore, the ranking of messages should approximate an ordering of the messages by $(p_1(c_i[1]),\ldots,p_n(c_i[n]))$, with larger values ranked higher.
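
To make the link between the tail probability and $p_m$ explicit, write $C_m$ for the random number of instances of message $m$ in a system log. The symbol $C_m$ is introduced here only for illustration and is not part of the original notation.

$$p_m(c) = \Pr[C_m \le c], \qquad \Pr\bigl[C_m > c_i[m]\bigr] = 1 - p_m(c_i[m]).$$

A low probability of seeing more than $c_i[m]$ instances therefore corresponds to a value of $p_m(c_i[m])$ close to 1.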
 
Given a large enough dataset of system logs from actual computer systems, we can estimate $P$ from the empirical distribution $\hat{P}=\{\hat{p}_1,\ldots,\hat{p}_n\}$ of the number of instances of each message type in each system. We define the Score of message type $m$ in a log $i$ to be $\hat{p}_m(c_i[m])$, and use this score to rank the messages within the log.$^2$ The messages that are top-ranked by this method usually indicate important problems in the system. This is illustrated in the ranked log view in Table [*], which was generated from one of the samples in our dataset.
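
The Score can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the toy dataset is invented, only message types that actually occur in the log are ranked, and ties keep message-type order.

```python
from bisect import bisect_right

def empirical_cdfs(count_vectors):
    """For each message type m, collect the sorted counts c_j[m] over all logs j.
    A sorted list is all that is needed to evaluate the empirical CDF."""
    n = len(count_vectors[0])
    return [sorted(vec[m] for vec in count_vectors) for m in range(n)]

def score(sorted_counts, c):
    """Empirical CDF at c: fraction of logs with c or fewer instances."""
    return bisect_right(sorted_counts, c) / len(sorted_counts)

def rank_messages(count_vector, cdfs):
    """Message types present in the log, highest Score first.
    Assumption: types with zero instances are left out of the ranking."""
    present = [m for m, c in enumerate(count_vector) if c > 0]
    # sorted() is stable, so tied scores keep message-type order.
    return sorted(present, key=lambda m: score(cdfs[m], count_vector[m]), reverse=True)

# Toy dataset (invented): four logs, three message types.
dataset = [
    [5, 0, 1],
    [6, 0, 0],
    [4, 1, 0],
    [5, 0, 0],
]
cdfs = empirical_cdfs(dataset)

# Type 0 is common everywhere, so 5 instances are unremarkable;
# types 1 and 2 are rare, so seeing them at all scores high.
ranking = rank_messages([5, 3, 1], cdfs)
print(ranking)
```

Keeping a sorted list of counts per message type makes each Score lookup a binary search, which is convenient when many logs are ranked against the same dataset.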

Estimating $P$ using the empirical distribution of the entire population is based on the implicit assumption that the population of computer systems in our dataset is homogeneous enough to treat all of them as generated from the same distribution. In practice, different computer systems are used for very different purposes. Each purpose dictates a use-model that results in a different message distribution. For example, a computer system that serves as a file-server would probably be more likely to issue 'File Not Found' messages than a personal workstation. On the other hand, a personal workstation might issue more system-restart messages.

To improve the accuracy of our estimation of $P$, we group the computer systems in our dataset into sets of systems with a similar use-model, and estimate $P$ separately for each set. We group the systems using k-means clustering [1] on the system log dataset. To generate the ranked log view for a given system, we first find the cluster it belongs to, and then rank its log messages based on the estimation of $P$ for that cluster. In the following section, we present a new feature construction scheme for the system log dataset. This scheme achieves a significantly better clustering than the original feature-set.
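
The cluster-then-rank procedure can be sketched as follows. This is an illustrative sketch only: it runs a plain Lloyd-style k-means on the raw count vectors rather than on the constructed features described in the following section, and the function names, the toy dataset, the choice of `k=2`, and the fallback for an empty cluster are all assumptions made for the example.

```python
import random
from bisect import bisect_right

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda k: sq_dist(point, centroids[k]))

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations, started from k randomly sampled points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def rank_in_cluster(log, dataset, centroids):
    """Rank the message types present in log, scoring each against the
    empirical CDF of the cluster that log falls into."""
    cluster = nearest(log, centroids)
    # Assumption: fall back to the whole dataset if the cluster has no members.
    peers = [v for v in dataset if nearest(v, centroids) == cluster] or dataset

    def score(m):
        counts = sorted(v[m] for v in peers)
        return bisect_right(counts, log[m]) / len(counts)

    present = [m for m, c in enumerate(log) if c > 0]
    return sorted(present, key=score, reverse=True)

# Toy dataset (invented). Columns: 'File Not Found', system restart, other.
dataset = [
    [40, 1, 2], [50, 0, 3], [45, 2, 2],   # file-server-like systems
    [2, 10, 2], [1, 12, 3], [3, 9, 2],    # workstation-like systems
]
centroids = kmeans(dataset, k=2)

# A file-server-like log with an unusual number of restarts.
ranking = rank_in_cluster([48, 9, 2], dataset, centroids)
print(ranking)
```

In this toy data, nine restarts are unremarkable across the whole population but exceptional among the file-server-like systems, so the restart message type ranks first only when the estimate is made per cluster.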
 
 
 
 
