Although we anticipate that most re-executions will be successful (execute completely and pass the validation), we have to prepare for the possibility that they may fail (fail to execute completely or fail the validation). We therefore ensure that re-executions are abortable transactions, so that failed re-executions have no lasting effect. We implement this as follows.
Upon receiving a UserOpPropagate RPC from the client, Venus on the surrogate will temporarily put the affected volume into write-disconnected mode, and then re-execute the user operation via a spawned child called the re-executor. During the re-execution, since the volume is write-disconnected, input files can be retrieved from the servers, but file-system updates are not written immediately to the servers. These updates are tagged with the identity of the re-execution and are logged in the CML. If the re-execution later passes the validation, the surrogate will re-connect the volume with the servers and reintegrate the updates. At the end, and only when the reintegration succeeds, the surrogate will locally commit the updates and indicate success to the client in an RPC reply. On the other hand, failures may happen at any time during the re-execution, the validation, or the reintegration. If any such failure occurs, the surrogate will discard all the updates and indicate failure to the client in an RPC reply.
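The control flow can be summarized by the following sketch. All types and helper names (write_disconnect, spawn_reexecutor, and so on) are hypothetical stand-ins for the corresponding Venus internals, not the actual Coda code:

    /* Hypothetical stand-ins for the corresponding Venus internals. */
    typedef struct Volume Volume;
    typedef struct OpLog  OpLog;

    int  write_disconnect(Volume *v);                 /* enter write-disconnected mode  */
    int  spawn_reexecutor(Volume *v, const OpLog *l); /* fork the re-executor child     */
    int  validate(Volume *v, const OpLog *l);         /* compare results (Section 3.5)  */
    int  reintegrate(Volume *v);                      /* re-connect and ship the CML    */
    void local_commit(Volume *v);                     /* commit the tagged updates      */
    void discard_updates(Volume *v);                  /* abort: drop the tagged updates */

    int handle_user_op_propagate(Volume *vol, const OpLog *log)
    {
        write_disconnect(vol);                  /* updates now go to the CML only */
        if (spawn_reexecutor(vol, log) != 0 ||  /* re-execution did not complete  */
            !validate(vol, log)          ||     /* result differs from original   */
            reintegrate(vol) != 0) {            /* servers rejected the updates   */
            discard_updates(vol);               /* failed runs leave no trace     */
            return -1;                          /* RPC reply: failure             */
        }
        local_commit(vol);                      /* only after reintegration       */
        return 0;                               /* RPC reply: success             */
    }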
It is possible that a reintegration completes successfully at the servers, but the RPC response fails to arrive at the client in spite of retransmission. This can happen when there is an untimely failure of the surrogate or the communication channels. We make use of the existing Coda mechanism for ensuring the atomicity of reintegrations [13] to handle this kind of failure. In short, the client presumes a reintegration has failed if it does not receive a positive response from the surrogate, and it will retry the reintegration. At the same time, the server retains the information necessary to detect whether a record has already been reintegrated earlier. If a client attempts such a redundant retry, the server will reply with the error code EALREADY (Operation already in progress), and the client will then know that the records were already successfully reintegrated in a previous attempt, and it will simply commit those records locally.
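A minimal sketch of the client side of this exchange, assuming hypothetical helpers ship_records(), local_commit_records(), and transient() (EALREADY itself is a standard errno value):

    #include <errno.h>

    typedef struct Records Records;

    int  ship_records(Records *r);         /* 0, EALREADY, or another errno */
    void local_commit_records(Records *r); /* locally commit the records    */
    int  transient(int err);               /* is this error worth retrying? */

    int propagate_with_retry(Records *recs)
    {
        for (;;) {
            int rc = ship_records(recs);
            if (rc == 0 || rc == EALREADY) {
                /* EALREADY: an earlier attempt succeeded at the server but
                 * the reply was lost; the records are safe to commit.      */
                local_commit_records(recs);
                return 0;
            }
            if (!transient(rc))
                return rc;    /* permanent error: give up            */
            /* otherwise presume the reintegration failed and retry  */
        }
    }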
A re-execution of a user operation is accepted when it produces a result that is identical to that of the original execution. We say the re-execution is repeating the original execution. Only these repeating re-executions are useful to operation shipping. We know that a deterministic procedure will produce the same result in different runs provided that it has the same input and the same environment in each run. Our file system makes use of this principle to facilitate repeating re-executions.
First, the re-executor runs with four Unix-process attributes identical to those of the original execution: (1) the working directory, (2) the environment-variable list, (3) the command-line arguments, and (4) the file-creation mask [23].
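For illustration, the following sketch restores the four attributes before re-running the command; the struct op_ctx layout is hypothetical, but chdir(), umask(), and execve() are the standard Unix calls involved (we also assume here that argv[0] holds the full path of the program):

    #include <sys/stat.h>
    #include <unistd.h>

    struct op_ctx {
        const char  *cwd;    /* (1) working directory                       */
        char *const *envp;   /* (2) environment-variable list               */
        char *const *argv;   /* (3) command-line arguments; argv[0] = path  */
        mode_t       cmask;  /* (4) file-creation mask                      */
    };

    int reexecute(const struct op_ctx *ctx)
    {
        if (chdir(ctx->cwd) != 0)   /* (1) same working directory       */
            return -1;
        umask(ctx->cmask);          /* (4) same file-creation mask      */
        /* (2), (3): same environment and arguments as the original run */
        execve(ctx->argv[0], ctx->argv, ctx->envp);
        return -1;                  /* execve returns only on error     */
    }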
Second, our file system expects that the surrogate machine has software and hardware configurations identical to those of the client machine. Two machines are said to be identically configured if they have the same CPU type, the same operating system, the same system-header files, the same system libraries, and the same settings for any other aspects of the system environment that can affect the outcomes of computations on the two machines.
Third, if a re-execution requires an input file stored in Coda, we can rely on the file system to ensure that the client and the surrogate will use an identical version of the input file. Coda can ensure this because a client ships updates to its servers in temporal order, and the surrogate will always retrieve the latest version of a file from the servers. For example, consider a user issuing three user operations successively on a client machine: (Op1) update a source file using an editor, (Op2) compile the source file into an object file using a compiler, and (Op3) update the source file again. When the surrogate re-executes Op2, the client must have shipped Op1 but not Op3, and the re-executor will see exactly the version updated by Op1.
We emphasize that our file system does not guarantee that all re-executions will be repeating their original executions; it just increases the probability of that happening. More importantly, it ensures that only repeating re-executions will be accepted. This is achieved by the procedure of validation (Section 3.5).
In the evaluation section, we will see that many applications exhibit repeating re-executions. Although some other applications exhibit non-repeating side effects during re-executions, these side effects can be handled in simple ways. Therefore, we believe we can use operation shipping with a large number of applications.
In addition to updating contents of files, user operations also update some meta-data (status information). Some of the meta-data are internal to the file system and are invisible to the users (e.g., every STORE operation has an identity number for concurrency control); some are external and are visible to the users (e.g., the last-modification time of a file). In both cases, to make a re-execution's result identical to that of the original execution, the values of the meta-data of the re-execution should be reset to those of the original execution. Therefore, the client needs to pack these meta-data as part of the operation log, and the surrogate needs to adjust the meta-data of the re-execution to match the supplied values.
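The following sketch illustrates the idea with a simplified, hypothetical record layout (the actual CML records carry more fields):

    #include <time.h>

    /* Meta-data shipped with the operation log for one STORE record;
     * a hypothetical, simplified layout.                              */
    struct store_meta {
        unsigned int store_id; /* internal: identity number for concurrency control */
        time_t       mtime;    /* external: last-modification time seen by users    */
    };

    struct cml_record {
        unsigned int store_id;
        time_t       mtime;
        /* ... other fields elided ... */
    };

    /* After re-execution, overwrite the re-execution's meta-data with
     * the values recorded by the client, so the two results match.   */
    void apply_meta(struct cml_record *rec, const struct store_meta *m)
    {
        rec->store_id = m->store_id;
        rec->mtime    = m->mtime;
    }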
We focus on applications that perform deterministic tasks, such as compiling binaries or formatting texts, and exclude applications such as games and probabilistic search that are randomized in nature. In an early stage of this project, we expected the re-executions of these applications to repeat their original executions completely. However, we found that some common applications exhibited non-repeating side effects. So far we have found two types of such side effects: (1) time stamps in output files, and (2) temporary names of intermediate files. Fortunately, we are able to handle these side effects automatically, so we are still able to use operation shipping with these applications. We will discuss the handling methods in the following two subsections.
We also anticipate a third possible kind of side effect: external side effects. For example, if an application sends an email message as the last step of execution, then the user may be confused by the additional message sent by re-execution. To cope with this kind of side effect, we plan to allow users to disable the use of operation shipping for some applications.
Some applications, such as rp2gen, ar, and latex, put time stamps into the files that they produce. rp2gen generates stub routines for remote procedure calls, ar builds library files, and latex formats documents. They use time stamps for various reasons. Because of the time stamps, a file generated by a re-execution will differ slightly from the version generated by the original execution. Observing that only a few bytes differ, we can treat the changed bytes as ``errors'' and use the technique of forward error correction (FEC) [6,4] to ``correct'' them. (We are indebted to Matt Mathis for suggesting this idea to us.)
Our file system, therefore, does the following. Venus on the remote client computes an error-correction code (we use the Reed-Solomon code) for each updated file that is to be propagated by operation shipping, and it packs the code as a part of the operation log. Venus on the surrogate, after re-executing the operation, uses the code to correct the time-stamp discrepancies that may be present in the re-generated version of the file.
Note that this is a novel use of the Reed-Solomon code. Usually, data blocks are sent together with parity blocks (the error-correction code); but our clients send only the parity blocks. The data blocks are instead re-generated by the surrogate. Whereas others use the code to correct communication errors, we use it to correct some minor re-execution discrepancies. If a discrepancy is so large that our error-correction procedure cannot correct it, our file system simply falls back to value shipping. This entails a loss of performance but preserves correctness.
The additional network traffic due to the error-correction code is quite small. We choose to use a (65535, 65503) Reed-Solomon block code over GF(2^16). In other words, the symbol size is 16 bits, and each block has 65,503 data symbols (131,006 bytes) and 32 parity symbols (64 bytes). The code can correct up to 16 symbol errors (32 bytes) in each data block.
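The parity-only flow can be sketched as follows; rs_parity() and rs_correct() are hypothetical wrappers around a Reed-Solomon codec over GF(2^16), not a particular library's API:

    #include <stdint.h>

    #define RS_DATA_BYTES   131006  /* 65,503 16-bit data symbols */
    #define RS_PARITY_BYTES 64      /* 32 16-bit parity symbols   */

    /* Client: compute the parity for one block of the updated file;
     * only these 64 bytes are packed into the operation log.        */
    void rs_parity(const uint8_t data[RS_DATA_BYTES],
                   uint8_t parity[RS_PARITY_BYTES]);

    /* Surrogate: correct up to 16 symbol errors in the block it has
     * re-generated; returns nonzero if the discrepancy is too large. */
    int rs_correct(uint8_t data[RS_DATA_BYTES],
                   const uint8_t parity[RS_PARITY_BYTES]);

    int reconcile_block(uint8_t regen[RS_DATA_BYTES],
                        const uint8_t parity[RS_PARITY_BYTES])
    {
        if (rs_correct(regen, parity) != 0)
            return -1;  /* too many differing bytes: fall back to value shipping */
        return 0;       /* time stamps now match the client's version            */
    }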
However, the additional CPU time due to encoding and decoding is not small. We discuss this in more detail in Section 4.4. Also, the Reed-Solomon code cannot correct discrepancies that change length (for example, the two timestamps ``9:17'' and ``14:49'' have different lengths). The rsync algorithm [24] can handle length change, but we favor the Reed-Solomon code because it has a smaller overhead on network traffic.
The program ar is an example of an application that uses temporary files. Figure 2 shows the two CMLs on the client and the surrogate after the execution and re-execution of the following user operations:
ar rv libsth.a foo.o bar.o

The user operation builds a library file libsth.a from two object modules, foo.o and bar.o.
Note that ar used two different temporary files, sta09395 and sta16294, in the two executions. The names of the temporary files are generated from the identity numbers (PIDs) of the processes executing the application, and hence they are time-dependent. Our validation procedure (Section 3.5) might naively reject the re-execution ``because the records are different.''
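A toy illustration of why the names differ (this is not ar's actual code, just the PID-based naming scheme it implies):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char name[16];
        /* e.g. "sta09395" on the client, "sta16294" on the surrogate */
        snprintf(name, sizeof name, "sta%05d", (int)getpid());
        puts(name);
        return 0;
    }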
Temporary files appear only in the intermediate steps of the execution. They will either be deleted or renamed at the end, so their names do not affect the final file-system state. An application uses temporary files to provide execution atomicity. For example, ar writes intermediate computation results to a temporary file, and it renames the file to the target filename only if the execution succeeds. This measure ensures that the target file will not be destroyed accidentally by a failed execution.
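The idiom looks roughly like this (a sketch, not ar's actual code):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int write_atomically(const char *target, const char *data, size_t len)
    {
        char tmp[] = "staXXXXXX";        /* temporary, time-dependent name */
        int fd = mkstemp(tmp);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len) {
            close(fd);
            unlink(tmp);                 /* failed run: drop the temporary */
            return -1;
        }
        close(fd);
        return rename(tmp, target);      /* commit only on success         */
    }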
If a temporary file is created and subsequently deleted during the execution of a user operation, its modification records will be deleted by the existing identity cancellation procedure [7]. They will not appear in the two CMLs and will not cause naive rejections of re-execution.
However, if a temporary file is created and subsequently renamed during the execution of a user operation, its modification records will be present in the CMLs and might cause our validation to reject the re-execution. Our file system uses a procedure of temporary-file renaming to compensate for this side effect. The procedure runs after the re-executor has finished the re-execution and before the surrogate begins the validation.
The idea of temporary-file renaming is to scan the two CMLs and identify all the temporary files as well as their final names. We identify temporary files by the fact that they are created and subsequently renamed within a user operation. For each temporary file used by the surrogate, our file system determines the temporary file name N used by the client in the original execution, and it renames the surrogate's temporary file to N. In our ar example, the temporary file sta16294 will be renamed to sta09395.
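One way to realize this is to rewrite the name fields in the surrogate's CML records. A minimal sketch, assuming a simplified record layout in which each record carries the file name it refers to (the real records are richer):

    #include <string.h>

    struct cml_rec {
        char name[256];   /* file name this record refers to */
        /* ... other fields elided ... */
    };

    /* Rewrite every surrogate CML record that refers to the temporary
     * name chosen during re-execution (e.g. "sta16294") so that it
     * carries the temporary name from the client's original execution
     * ("sta09395"); the two logs can then be compared record by record. */
    void rename_temporary(struct cml_rec *recs, int n,
                          const char *surrogate_tmp, const char *client_tmp)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(recs[i].name, surrogate_tmp) == 0) {
                strncpy(recs[i].name, client_tmp, sizeof recs[i].name - 1);
                recs[i].name[sizeof recs[i].name - 1] = '\0';
            }
    }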