Conceptual framework

Next: Bundling algorithm Up: Bundling: a hybrid approach Previous: Dividing a collection of

Conceptual framework

We chose to view the collection of files as a fully connected weighted graph. Each file is represented by a node in the graph. The weight of an edge from file A to file B represents the desirability of placing A and B in the same bundle.

Given a set of profiles, there are many ways to determine the edge weights. We chose to use frequency correlation as the basis for the edge weights. The frequency correlation of two files A and B is defined as t/n, where t is the number of profiles in which both A and B are loaded, and n is the number of profiles in which either A or B is loaded. A frequency correlation of 1.0 indicates that A and B are always loaded together.

The bundling algorithm considers edges according to an order produced by the edge comparator. We chose to use weight as the primary criterion for the edge comparator, and average distance as the secondary criterion. (The average distance of an edge connecting files A and B is the average distance between A and B in the input profiles.) This sort order helps ensure that edges connecting files usually loaded close to each other are considered before edges connecting files usually loaded far apart.

While frequency correlation is a good measure of how often two files are loaded together, it is not a measure of whether those files are generally loaded near each other in the profiles. If we put two files which are generally loaded far apart from each other in the same bundle, if a request is made for one of the files the other will be loaded too early. To ensure that all of the files in a bundle are loaded near each other, we limit the maximum bundle spread. The bundle spread for a proposed bundle b is the maximum over all profiles p of

lastMoment(b, p) - firstMoment(b, p) - size(b) + 1

The `moment' of a file in a profile is the ordinal value indicating when it was loaded relative to the other files in the profile. (I.e., the first file in the profile is moment 0, the second is moment 1, etc.) The best possible spread for a bundle is 0, indicating that for any profile in the input set, after any file in the bundle is requested all of the files in the bundle will be used before any file not in the bundle.

In addition to determining which files to bundle together, the bundling algorithm must also determine the order of the files within each bundle. This order is determined by a bundle sort comparator. We chose to use average position as the criterion for the bundle sort comparator. Given a bundle b and a profile p, the position of each file is found in relation to the other files in b. For example, if A is loaded in p before any other files in b, then its position is 0. The average position of file A in a bundle b is simply its average position over all input profiles. Using average position as the criterion for the bundle sort comparator helps ensure that files are placed in the order in which they will be needed, based on the order in which they occurred in the input profiles.

Next: Bundling algorithm Up: Bundling: a hybrid approach Previous: Dividing a collection of

David Hovemeyer
Tue Feb 27 18:23:07 EST 2001