Check out the new USENIX Web site. next up previous
Next: Cycle Breaking Up: Design Previous: A Simple Example

Encoding Dependencies

Our final implementation of ip-rsync uses a graph-based algorithm that detects and resolves dependencies, preventing corruption of the target file. This comes at the cost of a delay to generate and process the conflicts, losing some of the benefit of overlapping network transfer with checksum generation.

In ip-rsync, the generator (steps 1 and 2) mirrors the corresponding actions of traditional rsync. The process which builds the hash table in step 3 also remains unmodified. However, during the scan in step 4, ip-rsync's sender buffers ADD and COPY commands in memory, instead of sending them immediately. Buffering all commands allows ip-rsync to detect dependencies and reorder commands prior to sending data.

Buffering COPY and ADD commands consumes much less space than traditional rsync requires. For a COPY command, ip-rsync needs space to store the source block and the target offset only. ADD commands require an extra field for the length of the data. The algorithm does not store the raw data for ADD commands in memory. Rather, ip-rsync stores the meta-data in memory and reads data directly from the source file when sending an ADD command.

The algorithm constructs a directed graph in which edges represent ordering constraints among commands (Figure 3). A topological sort of the graph determines an execution order for processing COPY commands on the target. When a total topological ordering proves impossible because of cycles in the graph, ip-rsync converts nodes that copy data in the file to commands that add the data explicitly. This results in lost compression. We later describe several heuristics for breaking cycles.

Figure 4: Pseudo-code for depth-first search. Initially nodes are in the UNVISITED state.
\begin{figure}\begin{center}
\par\vspace{-5pt}
\par\begin{pseudocode}{DFS}{grap...
...
\ENDMAIN\\
\end{pseudocode}\end{center}\par\vspace{-35pt}
\par\par\end{figure}

Figure 5: Modified depth-first search.
\begin{figure}\vspace{-5pt}
\par\begin{center}
\begin{pseudocode}{modified-DFS}...
...Visit}{node}
\ENDMAIN\\
\end{pseudocode}\end{center}\vspace{-35pt}
\end{figure}

The source puts buffered nodes into a structure that points to the COPY command and contains fields necessary for dependency detection, topological sorting, transmission to the receiver, and deallocation. The fields of the graph node are a pointer to the COPY command that the node represents, a ``visited'' field used to encode states during topological sort, a reference counter for deallocation, a pointer to the next node in topological order (initially NULL), and finally a pointer to the first in a linked list of outgoing edges.

The sender also buffers ADD commands. ADD commands are self-describing and, therefore, require no data from the old version. As a result, ADD commands need no reordering.

After all commands have been buffered, the COPY graph is passed to a dependency-detection function (step 5). This function detects all dependencies with an $O(n\lg n)$ algorithm, where $n$ is equal to the number of bytes in the larger of the source file or the target file [2]. Each generated edge has two fields: the target node, and a pointer to the next outgoing edge from the source. With the full graph constructed, the sender topologically sorts the nodes using a modified depth first search (DFS) algorithm.

The algorithm used to topologically sort the graph in step 6 modifies DFS to make it operate on graphs that contain cycles. We present pseudo-code for depth-first search (Figure 4) and the modified depth-first search used in our algorithm (Figure 5). DFS (for acyclic graphs) is a ``stack'' algorithm, pushing nodes onto a stack as they are visited. In DFS, nodes take on two states - UNVISITED and VISITED. A node remains UNVISITED until the algorithm pops it off the STACK during its traversal. The algorithm marks the completed node VISITED. Modified-DFS for topological sort uses an additional STACK state to record the order in which the algorithm visits each node. The STACK state helps detect and break cycles. Modified-DFS also adds a DELETED state to encode nodes that have been deleted. Just as in DFS, each node begins UNVISITED. Analogous to DFS, the algorithm calls Visit on each node. Visiting a node places it on the stack and marks it in the STACK state. Subsequently, the algorithm Visits any neighbors of the node that are not VISITED or DELETED. When done with the neighbors, the algorithm removes the node from the stack, marks the node VISITED and places it on the front of an output list. If the Visit procedure finds a neighbor already marked STACK, a cycle exists in the graph. The algorithm marks this node DELETED, which breaks the cycle. A compensating ADD command is created in place of the DELETED COPY. The topological sort algorithm is $O(V+E)$ as it examines each vertex at least once and traverses every edge. We will examine alternative metrics and procedures for resolving cycles in the next section.

Ip-rsync sends the COPY commands to the target in topologically sorted order. Upon sending a COPY command, it deallocates the corresponding node along with its remaining edges. After sending all COPY commands, ip-rsync sends ADD commands according to the ADD list, including the ADD commands that correspond to deleted COPY commands (step 7). If necessary, the target truncates the file to the new size and the synchronization is complete.


next up previous
Next: Cycle Breaking Up: Design Previous: A Simple Example
2003-04-08