We now digress to consider a simple yet flawed algorithm that one might propose as an alternative to the in-place rsync algorithm. The following algorithm is a trivial modification of rsync, and its flaw results in significant compression loss; this example motivates the need for the complexity (buffering and dependency detection) of our solution. The simple algorithm runs rsync and employs the following heuristic: if any COPY command refers to a block location that precedes the current write offset, then that block has already been overwritten with new data and can no longer be used for COPY commands. The lost data is resent over the network link as an ADD command. The advantages of this algorithm are its similarity to the original algorithm and its ability to overlap I/O with the computation of checksums.
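To make the heuristic concrete, the following is a minimal sketch of how a sender might apply it, assuming a delta represented as a list of COPY and ADD tuples emitted in write order for the new file; the function name and tuple representation are illustrative, not the actual rsync encoding.

    def naive_inplace_delta(delta, new_file):
        # Walk the delta in write order, tracking the next write offset in the
        # target file. A COPY whose source block starts before that offset is
        # assumed to read data that has already been overwritten, so it is
        # demoted to an ADD carrying the corresponding bytes of the new file.
        write_offset = 0
        out = []
        for cmd in delta:
            if cmd[0] == "COPY":
                _, src_offset, length = cmd
                if src_offset < write_offset:
                    out.append(("ADD", new_file[write_offset:write_offset + length]))
                else:
                    out.append(cmd)
                write_offset += length
            else:  # ("ADD", data)
                out.append(cmd)
                write_offset += len(cmd[1])
        return out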
This simple algorithm is sensitive to changes that insert data into the source file. Such changes cause it to generate many ADD commands, resulting in large compression losses. Consider synchronizing a pair of files that are identical except that the source file has a small amount of data (as little as 1 or 2 bytes) inserted at the beginning. Because the inserted bytes shift all matching data to higher offsets, every COPY reads from an offset in the old file that precedes the offset at which it writes in the new file; the heuristic therefore rejects every COPY, and these few bytes prevent any data from being copied.
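The following toy run of the sketch above illustrates this case with a four-byte block size (the block size and file contents are chosen only for illustration): prepending two bytes leaves every COPY's source offset behind the write offset, so no COPY survives.

    B = 4                                       # toy block size
    old = b"aaaabbbbccccdddd"
    new = b"xy" + old                           # two bytes inserted at the front
    delta = [("ADD", b"xy")] + [("COPY", i, B) for i in range(0, len(old), B)]

    result = naive_inplace_delta(delta, new)
    assert all(cmd[0] == "ADD" for cmd in result)   # every COPY was demoted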
Reordering commands is necessary despite the rarity of cycles in the dependency graphs. To quantify the benefit of ordering, we examine the compression achieved by the naive algorithm. Run on our data set, the simple algorithm replaces 52% of the COPY commands with ADD commands, each of which sends extra bytes over the network link. The naive algorithm cuts compression in half when compared with traditional rsync. The complexity of our final algorithm thus proves necessary to avoid a drastic loss in compression.