The Role Correlation Algorithm


The correlation algorithm operates by comparing the results of two executions of the grouping algorithm. Let $P_{t-1}$ and $P_t$ be the group sets generated by the grouping algorithm at times t-1 and t, respectively. The correlation algorithm updates the ID set of $P_t$ so that $ID_{G_t} = ID_{G_{t-1}}$, where $G_t \in P_t$ and $G_{t-1} \in P_{t-1}$, if and only if $G_t$ and $G_{t-1}$ are considered to represent the same logical role, that is, if the connection patterns of the members of $G_t$ and those of $G_{t-1}$ are very similar. By correlating $ID_{G_t}$ and $ID_{G_{t-1}}$ in a meaningful manner, the algorithm allows applications to preserve data specific to a particular group.

The role correlation algorithm:

Isolates the primary events, such as node arrivals and removals, that directly affect the connection habits of groups,
Identifies nodes that have not changed their neighbors,
Heuristically computes the time-varying similarity between the connection habits of two groups formed at times t and t-1, and assigns $ID_{G_t} = ID_{G_{t-1}}$ if and only if the role of the hosts in $G_{t-1}$ (in terms of their connection patterns) can be considered identical to the role of the hosts in $G_t$.

First, the correlation algorithm eliminates the differences between the two host sets, $I_t$ and $I_{t-1}$, so that it can compare connection patterns meaningfully. The algorithm computes the set of nodes that existed at time t-1 but have been removed by time t ($I_{t-1} \setminus I_{t}$), and the set of nodes that appear only at time t ($I_{t} \setminus I_{t-1}$). All new nodes are removed from $I_t$, and all deleted nodes are removed from $I_{t-1}$. Thus, any remaining change in the connection set of a host is a direct result of changing connection patterns between that host and neighbors that exist at both times.
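The host-set normalization can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the representation of connections as a dictionary from host identifier to a set of neighbor identifiers are assumptions, as is the choice to also drop removed and new hosts from the surviving hosts' connection sets.

```python
def restrict_to_common_hosts(conn_prev, conn_curr):
    """Drop hosts that exist at only one of the two times.

    conn_prev, conn_curr: dict mapping host id -> set of neighbor host ids
    (a hypothetical representation of the connection sets at t-1 and t).
    """
    hosts_prev, hosts_curr = set(conn_prev), set(conn_curr)
    removed = hosts_prev - hosts_curr   # I_{t-1} \ I_t
    added = hosts_curr - hosts_prev     # I_t \ I_{t-1}
    common = hosts_prev & hosts_curr

    def prune(conn):
        # Assumption: connections to removed/new hosts are discarded too,
        # so the two snapshots are compared over the same host population.
        return {h: {n for n in nbrs if n in common}
                for h, nbrs in conn.items() if h in common}

    return prune(conn_prev), prune(conn_curr), removed, added
```

After this step, both snapshots describe the same hosts, so any difference between a host's two connection sets reflects a change in behavior rather than a change in population.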

Second, the algorithm heuristically identifies the set $H_{\textit{same}}$ of nodes that are very likely to play the same logical role from t-1 to t. We say that two nodes $h_t$ and $h_{t-1}$ are highly likely to be the same machine (i.e., the machine has not changed its logical role) if they have identical connection sets. Specifically, $H_{\textit{same}} = \{ h_t \vert \exists h_{t-1} \in I_{t-1}, C(h_t) = C(h_{t-1})\}$. We will explain shortly how membership of a host in $H_{\textit{same}}$ is used to our advantage when computing the time-varying similarity measure.
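A literal reading of the set definition above can be sketched as follows. The function name and the dictionary representation of the connection sets C(h) are assumptions made for illustration; the sketch follows the formula as written, so a host at time t qualifies if any host at time t-1 has an identical connection set.

```python
def hosts_with_unchanged_connections(conn_prev, conn_curr):
    """Return H_same: hosts at time t whose connection set equals the
    connection set of some host at time t-1.

    conn_prev, conn_curr: dict mapping host id -> set of neighbor host ids
    (hypothetical representation of C(h) at t-1 and t).
    """
    # Freeze the sets so they can be stored in a set for O(1) lookup.
    prev_sets = {frozenset(nbrs) for nbrs in conn_prev.values()}
    return {h for h, nbrs in conn_curr.items() if frozenset(nbrs) in prev_sets}
```

A stricter variant would additionally require the two hosts to share the same identifier; the text applies that extra condition later, when matching neighbors.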

The role correlation algorithm determines whether two groups $G_t$ and $G_{t-1}$ are the same group by heuristically computing the time-varying similarity measure and comparing it against a predefined threshold. The algorithm operates as follows:

1. For each group $G_t$, identify $G_t$ and $G_{t-1}$ as the same group if i) $G_{t-1}$ has the strongest time-varying similarity with $G_t$ among all the groups in $P_{t-1}$, and ii) the average number of connections of $G_t$ is within $T^{hi}$ percent of the average number of connections of $G_{t-1}$.
2. For each group pair $(G_t, G_{t-1})$ that remains uncorrelated, decide whether $G_t$ and $G_{t-1}$ represent the same logical group based on how similar the connection patterns between $G_t$ and its neighbor groups are to those between $G_{t-1}$ and its neighbor groups.
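The first step can be sketched as follows, assuming the pairwise similarity values and per-group connection averages have already been computed. All names are hypothetical. Two further assumptions are made: "within T^hi percent" is read as a relative difference measured against the previous group's average, and a best match with zero similarity is rejected. The sketch does not resolve the case where two current groups select the same previous group, which the text does not address.

```python
def correlate_by_similarity(groups_curr, groups_prev, similarity,
                            avg_conn_curr, avg_conn_prev, t_hi):
    """Map each current group to the previous group whose ID it inherits.

    similarity: dict keyed by (current group, previous group) -> float.
    avg_conn_curr, avg_conn_prev: dict group -> average number of connections.
    t_hi: threshold in percent.
    """
    inherited = {}
    for g in groups_curr:
        if not groups_prev:
            break
        # Condition i): the previous group with the strongest similarity.
        best = max(groups_prev, key=lambda p: similarity[(g, p)])
        if similarity[(g, best)] <= 0:
            continue  # assumption: no shared behavior, so no correlation
        # Condition ii): averages agree to within t_hi percent.
        allowed = t_hi / 100.0 * avg_conn_prev[best]
        if abs(avg_conn_curr[g] - avg_conn_prev[best]) <= allowed:
            inherited[g] = best
    return inherited
```

Groups absent from the returned mapping are the uncorrelated pairs that the second step then examines through their neighbor groups.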

Step 1 decides whether the two groups $G_t$ and $G_{t-1}$ are identical based on the time-varying similarity measure. As in Section 4.2, we compute the similarity measure based on the average number of connections between the groups and their common neighbors. However, finding the common neighbor set of $G_t$ and $G_{t-1}$ is not trivial, because we cannot simply assume that a neighbor $h_t \in C(G_t)$ and a neighbor $h_{t-1} \in C(G_{t-1})$ are the same host even if they have the same host identifier. We use the following techniques to identify the common neighbor set:

If a neighbor $h_t$ of $G_t$ shares the same host identifier with a neighbor $h_{t-1}$ of $G_{t-1}$, and both are considered highly likely to be the same host (i.e., $h_t, h_{t-1} \in H_{\textit{same}}$), we assume that $h_t$ is a neighbor of $G_t$ in the same way that $h_{t-1}$ is a neighbor of $G_{t-1}$.
For each neighbor pair $(h_t, h_{t-1})$ that is not considered highly likely to be the same host, we assume that $h_t$ is a neighbor of $G_t$ in the same way that $h_{t-1}$ is a neighbor of $G_{t-1}$ if and only if the connection set size of $h_{t-1}$ is within $T^{hi}$ percent of that of $h_t$ and no other neighbor of $G_{t-1}$ has a connection set size closer to that of $h_t$.
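The two matching rules above can be sketched as follows. The function and parameter names are hypothetical. The sketch also assumes that a previous neighbor already paired with one host is not reused for another, and it iterates in sorted order only to make ties deterministic; neither choice comes from the text.

```python
def match_neighbors(nbrs_curr, nbrs_prev, size_curr, size_prev, h_same, t_hi):
    """Pair neighbors of G_t with neighbors of G_{t-1}.

    nbrs_curr, nbrs_prev: sets of neighbor host ids of G_t and G_{t-1}.
    size_curr, size_prev: dict host id -> connection set size at t and t-1.
    h_same: set of host ids considered highly likely to be the same host.
    t_hi: threshold in percent.
    Returns a dict mapping h_t -> h_{t-1}.
    """
    pairs, used_prev = {}, set()
    # Rule 1: same identifier and membership in H_same.
    for h in sorted(nbrs_curr):
        if h in nbrs_prev and h in h_same:
            pairs[h] = h
            used_prev.add(h)
    # Rule 2: closest connection set size, within t_hi percent.
    for h in sorted(nbrs_curr):
        if h in pairs:
            continue
        candidates = [p for p in sorted(nbrs_prev) if p not in used_prev]
        if not candidates:
            continue
        best = min(candidates, key=lambda p: abs(size_prev[p] - size_curr[h]))
        if abs(size_prev[best] - size_curr[h]) <= t_hi / 100.0 * size_curr[h]:
            pairs[h] = best
            used_prev.add(best)
    return pairs
```

Neighbors of $G_t$ that end up unpaired are simply left out of the common neighbor set.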

The algorithm then computes the time-varying similarity measure for each neighbor pair $(h_t, h_{t-1})$ that meets the requirements above as the minimum of the average number of connections between $h_t$ and $G_t$ and the average number of connections between $h_{t-1}$ and $G_{t-1}$. If the sum of the similarity measures over all common neighbor pairs is within the bounds of the specified thresholds, the algorithm declares that $G_t$ and $G_{t-1}$ represent the same group.
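The summation can be sketched as follows. The names are hypothetical, and the sketch assumes that "the average number of connections between a host and a group" means the fraction of group members connected to that host; the text does not spell out this definition, and the final threshold test is omitted because its bounds are not specified here.

```python
def avg_connections(host, group, conn):
    """Assumed definition: fraction of group members connected to host.

    conn: dict mapping host id -> set of neighbor host ids.
    """
    if not group:
        return 0.0
    return sum(1 for m in group if host in conn.get(m, ())) / len(group)


def time_varying_similarity(pairs, group_curr, group_prev, conn_curr, conn_prev):
    """Sum, over common neighbor pairs (h_t, h_{t-1}), of the smaller of the
    two host-to-group connection averages.

    pairs: dict mapping h_t -> h_{t-1} (the common neighbor pairs).
    """
    return sum(
        min(avg_connections(h_t, group_curr, conn_curr),
            avg_connections(h_prev, group_prev, conn_prev))
        for h_t, h_prev in pairs.items()
    )
```

Taking the minimum means a neighbor contributes only as much similarity as both snapshots support, so a neighbor that is heavily connected at one time and barely connected at the other adds little to the total.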


Godfrey Tan 2003-04-01