David Brumley, Juan Caballero, Zhenkai Liang, James Newsome, Dawn Song
Carnegie Mellon University
{dbrumley,jcaballero,zliang,jnewsome,dawnsong}@cmu.edu
We propose a novel approach for automatically detecting deviations in the way different implementations of the same specification check and process their input. Our approach has several advantages: (1) by automatically building symbolic formulas from the implementation, our approach is precisely faithful to the implementation; (2) by solving formulas created from two different implementations of the same specification, our approach significantly reduces the number of inputs needed to find deviations; (3) our approach works on binaries directly, without access to the source code.
We have built a prototype implementation of our approach and have evaluated it using multiple implementations of two different protocols: HTTP and NTP. Our results show that our approach successfully finds deviations between different implementations, including errors in input checking, and differences in the interpretation of the specification, which can be used as fingerprints.
Finding these deviations in implementations is important for several applications. In particular, in this paper we show 1) how we can automatically discover these deviations, and 2) how we can apply the discovered deviations to two particular applications: error detection and fingerprint generation.
First, finding a deviation between two different implementations of the same specification may indicate that at least one of the two implementations has an error, which we call error detection. Finding such errors is important to guarantee that the protocol is correctly implemented, to ensure proper interoperability with other implementations, and to enhance system security since errors often represent vulnerabilities that can be exploited. Enabling error detection by automatically finding deviations between two different implementations is particularly attractive because it does not require a manually written model of the protocol specification. These models are usually complex, tedious, and error-prone to generate. Note that such deviations do not necessarily flag an error in one of the two implementations, since deviations can also be caused by ambiguity in the specification or when some parts are not fully specified. However, automatic discovery of such deviations is a good way to provide candidate implementation errors.
Second, such deviations naturally give rise to fingerprints, which are inputs that, when given to two different implementations, will result in semantically different output states. Fingerprints can be used to distinguish between the different implementations and we call the discovery of such inputs fingerprint generation. Fingerprinting has been in use for more than a decade [25] and is an important tool in network security for remotely identifying which implementation of an application or operating system a remote host is running. Fingerprinting tools [8,11,15] need fingerprints to operate and constantly require new fingerprints as new implementations, or new versions of existing implementations, become available. Thus, the process of automatically finding these fingerprints, i.e., the fingerprint generation, is crucial for these tools.
Automatic deviation discovery is a challenging task--deviations usually happen in corner cases, and discovering deviations is often like finding needles in a haystack. Previous work in related areas is largely insufficient. For example, the most commonly used technique is random or semi-random generation of inputs [20,43] (also called fuzz testing). With this approach, random inputs are generated and sent to different implementations to observe whether they trigger a difference in outputs. The obvious drawback of this approach is that it may take many such random inputs before a deviation is found.
In this paper, we propose a novel approach to automatically discover deviations in input checking and processing between different implementations of the same protocol specification. We are given two programs $P_1$ and $P_2$ implementing the same protocol. At a high level, we build two formulas, $f_1$ and $f_2$, which capture how each program processes a single input. Then, we check whether the formula $(f_1 \wedge \neg f_2) \vee (\neg f_1 \wedge f_2)$ is satisfiable, using a solver such as a decision procedure. If the formula is satisfiable, it means that we can find an input which will satisfy $f_1$ but not $f_2$, or vice versa, and which may therefore lead the two program executions to semantically different output states. Such inputs are good candidates to trigger a deviation. We then send such candidate inputs to the two programs and monitor their output states. If the two programs end up in two semantically different output states, then we have successfully found a deviation between the two implementations, along with the corresponding input that triggers the deviation.
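To make this concrete, the sketch below (our illustration, not part of the system) uses the Z3 solver's Python bindings with two hand-written stand-ins for $f_1$ and $f_2$ over a single symbolic input byte; the two validation rules are hypothetical.

```python
# Illustrative sketch: hand-written stand-ins for the formulas f1 and f2
# that would be extracted from two binaries (hypothetical input checks).
from z3 import BitVec, And, Or, Not, Solver, sat

b = BitVec('input_byte', 8)                 # one symbolic input byte

f1 = And(b >= ord('A'), b <= ord('Z'))      # impl. 1: accepts any uppercase letter
f2 = (b == ord('G'))                        # impl. 2: accepts only 'G'

# Combined formula: inputs accepted by exactly one of the two implementations.
combined = Or(And(f1, Not(f2)), And(Not(f1), f2))

s = Solver()
s.add(combined)
if s.check() == sat:
    print("candidate deviation input:", chr(s.model()[b].as_long()))
else:
    print("no candidate deviation input for these formulas")
```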
We have built a prototype implementation of our approach. It handles both Windows and Linux binaries running on an x86 platform. We have evaluated our approach using multiple implementations of two different protocols: HTTP and NTP. Our approach has successfully identified deviations between servers and automatically generated inputs that triggered different server behaviors. These deviations include errors and differences in the interpretation of the protocol specification. The evaluation shows that our approach is accurate: in one case, the relevant part of the generated input is only three bits. Our approach is also efficient: we found deviations using a single request in about one minute.
The remainder of the paper is organized as follows. Section 2 introduces the problem and presents an overview of our approach. In Section 3 we present the different phases and elements that comprise our approach and in Section 4 we describe the details of our implementation. Then, in Section 5 we present the evaluation results of our approach over different protocols. We discuss future enhancements to our approach in Section 6. Finally, we present the related work in Section 7 and conclude in Section 8.
The output states need to be externally observable. We use two methods to observe such states: (a) monitoring the network output of the program, and (b) supervising its environment, which allows us to detect unexpected states such as program halt, reboot, crash, or resource starvation. However, we cannot simply compare the complete output from both implementations, since the output may be different but semantically equivalent. For example, many protocols contain sequence numbers, and we would expect the output from two different implementations to contain two different sequence numbers. However, the output messages may still be semantically equivalent.
Therefore, we may use some domain knowledge about the specific
protocol being analyzed to determine when two output states are semantically
different. For example, many protocols, such as HTTP, include a status code in the response to provide feedback about the
status of the request. We use this
information to determine if two output states are semantically equivalent or
not. In other cases, we observe the effect of a particular query
in the program, such as program crash or reboot. Clearly these cases are
semantically different from a response being emitted by the program.
Our goal is to find an input $x$ such that $P_1(x) \neq P_2(x)$, where $P_i(x)$ denotes the output state that program $P_i$ reaches when processing input $x$. Finding such an input through random testing is usually hard.
However, in general it is easy to find an input $x$ such that $P_1(x) = P_2(x) = s$, i.e., most inputs will result in the same protocol output state $s$ for different implementations of the same specification. Let $f_i$ be the formula representing the set of inputs $x$ such that $P_i(x) = s$. When $P_1$ and $P_2$ implement the same protocol differently, there may be some input where $f_1$ will not be the same as $f_2$:

$(f_1 \wedge \neg f_2) \vee (\neg f_1 \wedge f_2)$
The intuition behind the above expression is that when an input $x$ satisfies $f_1 \wedge \neg f_2$, then $P_1(x) = s$ (because $x$ satisfies $f_1$) while $P_2(x) \neq s$ (because $x$ does not satisfy $f_2$), thus the two implementations reach different output states for the same input $x$. Similarly, $\neg f_1 \wedge f_2$ indicates when $P_2(x) = s$, but $P_1(x) \neq s$. We take the disjunction since we only care whether the implementations differ from each other.
Given the above intuition, the central idea is to create the formula $f$ using the technique of weakest precondition [19,26]. Let $Q$ be a predicate over the state space of a program. The weakest precondition $\mathrm{wp}(P, Q)$ for a program $P$ and post-condition $Q$ is a boolean formula $f$ over the input space of the program. In our setting, if $f(x)$ is true, then $P(x)$ will terminate in a state satisfying $Q$, and if $f(x)$ is false, then $P(x)$ will not terminate in a state satisfying $Q$ (it either ``goes wrong'' or does not terminate). For example, if the post-condition $Q$ is that $P$ outputs a successful HTTP reply, then $f = \mathrm{wp}(P, Q)$ characterizes all inputs which lead $P$ to output a successful HTTP reply. The boolean formula output by the weakest precondition computation is our formula $f$.
Furthermore, we observe that the above method can still be used even if we do not consider the entire program and only consider a single execution path (we discuss multiple execution paths in Section 6). In that case, the formula $f$ represents the subset of protocol inputs that follow one of the execution paths considered and still reach the protocol output state $s$. Thus, $f(x) \Rightarrow (P(x) = s)$: if an input satisfies $f$ then for sure it will make program $P$ go to state $s$, but the converse is not necessarily true--an input which makes $P$ go to state $s$ may not satisfy $f$. In our problem, this means that the difference between $f_1$ and $f_2$ may not necessarily result in a true deviation, as shown in Figure 2. Instead, an input in the difference between $f_1$ and $f_2$ is a good candidate, which we can then test to validate whether it triggers a true deviation.
Next, in the deviation detection phase, we use a solver (such as a decision procedure) to find differences in the two formulas $f_1$ and $f_2$. In particular, we ask the solver whether $(f_1 \wedge \neg f_2) \vee (\neg f_1 \wedge f_2)$ is satisfiable. When it is satisfiable, the solver returns an example satisfying input. We call these inputs the candidate deviation inputs.
Finally, in the validation phase we evaluate the candidate deviation inputs obtained in the deviation detection phase on both implementations and check whether the implementations do in fact reach semantically different output states. This phase is necessary because the symbolic formula might not include all possible execution paths: an input that satisfies $f_i$ is guaranteed to make $P_i$ reach an output state semantically equivalent to the one reached on the original input, but an input that does not satisfy $f_i$ may also make $P_i$ reach a semantically equivalent output state. Hence, the generated candidate deviation inputs may actually still cause both implementations to reach semantically equivalent output states.
If the implementations do reach semantically different output states, then we have found a deviation triggered by that input. This deviation is useful for two things: (1) it may represent an implementation error in at least one of the implementations, which can then be checked against the protocol specification to verify whether it is truly an error; (2) it can be used as a fingerprint to distinguish between the two implementations.
In this section, we describe the details of the three phases in our approach, the formula extraction phase, the deviation detection phase, and the validation phase.
The goal of the formula extraction phase is the following: given an input $x$ such that $P_1(x) = P_2(x) = s$, where $s$ is the output state reached when executing input $x$ on the two given programs, we would like to compute two formulas, $f_1$ and $f_2$, such that for each program $P_i$ and any input $x'$, $f_i(x') = \mathrm{true}$ implies $P_i(x') = s$.
Unfortunately, calculating the weakest precondition over an entire real-world binary program can easily result in a formula that is too big to solve. First, there may be many program paths which can lead to a particular output state. We show that we can generate interesting deviations even when considering a single program path. Second, we observe that in many cases only a small subset of instructions operate on data derived from the original input. There is no need to model the instructions that do not operate on data derived from the original input, since the result they compute will be the same as in the original execution. Therefore we eliminate these instructions from the WP calculation, and replace them with only a series of assignments of concrete values to the relevant program state just before an instruction operates on data derived from the input.
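The following sketch illustrates this pruning, assuming a hypothetical trace record format in which each entry is already annotated with whether its operands derive from (are tainted by) the input:

```python
# Illustrative sketch (not our actual trace format): keep only statements
# that operate on input-derived data; for the rest, record concrete
# assignments of the values they produced in the logged execution.
from collections import namedtuple

# Hypothetical trace record: the IR statement, whether any operand derives
# from the input (tainted), and the destination with its concrete value.
TraceRecord = namedtuple("TraceRecord", "stmt tainted dest value")

def prune_trace(trace):
    symbolic_stmts = []
    concrete_setup = []
    for rec in trace:
        if rec.tainted:
            symbolic_stmts.append(rec.stmt)
        else:
            # The result is independent of the input, so replace the
            # computation by the concrete value observed in the trace.
            concrete_setup.append((rec.dest, rec.value))
    return concrete_setup, symbolic_stmts
```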
Hence, in our design, we build the symbolic formula in two distinct steps. We first execute the program on the original input, while recording a trace of the execution. We then use this execution trace to build the symbolic formula.
In order to generate the symbolic formula, we perform the following steps:
The output of this phase is the symbolic formula $f$. Below we describe these steps in more detail.
The advantage of our IR is that it allows us to perform subsequent steps over the simpler IR statements, instead of the hundreds of x86 instructions. The translation from an x86 instruction to our IR is designed to correctly model the semantics of the original x86 instruction, including making otherwise implicit side effects explicit. For example, we insert code to correctly model instructions that set the eflags register, single instruction loops (e.g., rep instructions), and instructions that behave differently depending on the operands (e.g., shifts).
Our IR is shown in Table 1. We translate each x86 instruction into this IR. Our IR has assignments ($r := v$), binary and unary operations ($r := r_1 \square_b r_2$ and $r := \square_u r_1$, where $\square_b$ and $\square_u$ are binary and unary operators), loading a value from memory into a register ($r := *(r_1)$), storing a value ($*(r_1) := r_2$), direct jumps (jmp $\ell$) to a known target label, indirect jumps to a computed value stored in a register (ijmp $r$), and conditional jumps (if $r$ then jmp $\ell_1$ else jmp $\ell_2$).
The output of this step is an IR program consisting of a sequence
of IR statements.
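As an illustration of making side effects explicit (the statement syntax below is a simplification of ours, not the actual IR), the x86 instruction add eax, ebx could expand into IR statements that update the affected eflags bits alongside the destination register:

```python
# Illustrative sketch: expanding "add eax, ebx" into IR-like statements
# that make the eflags updates explicit (only ZF and SF shown here).
def translate_add_eax_ebx():
    return [
        "t0  := eax + ebx",          # 32-bit addition into a temporary
        "ZF  := (t0 == 0)",          # zero flag
        "SF  := (t0 >> 31) & 1",     # sign flag
        # the carry, overflow, and parity flags would be modeled similarly
        "eax := t0",
    ]

for stmt in translate_add_eax_ebx():
    print(stmt)
```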
The post-condition specifies the desired protocol output state, such as what kind of response to a request message is desired. In our current setting, an ideal post-condition would specify that ``the input results in an execution that reaches an output state semantically equivalent to the output state reached when processing the original input.'' That is, we want our formula to be true for exactly the inputs that are considered ``semantically equivalent'' to the original input by the modeled program binary.
In our approach, the post-condition specifies that the output state should be the same as in the trace. In order to keep the overall formula size reasonable, we add additional constraints to the post-condition which restrict the formula to the same program path as the one taken in the trace. We do this by iterating over all conditional jumps and indirect jumps in the IR and, for each jump, adding a clause to the post-condition that ensures that the final formula only considers inputs that also result in the same destination for the given jump. For example, if in the trace the statement if $e$ then jmp $\ell_1$ else jmp $\ell_2$ was evaluated and the next instruction executed was at $\ell_2$, then $e$ must have evaluated to false, and we add the clause $\neg e$ to the post-condition.
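A minimal sketch of this post-condition construction is shown below, assuming a hypothetical trace representation that records, for each conditional jump, its symbolic branch predicate and the direction taken:

```python
# Illustrative sketch: conjoin one clause per conditional jump in the trace
# so that the formula only covers inputs that follow the same program path.
from z3 import And, Not, BoolVal

def path_postcondition(branch_records, base_postcondition=BoolVal(True)):
    """branch_records: list of (branch_expr, taken) pairs, where branch_expr
       is the symbolic predicate of a conditional jump (a Z3 expression) and
       taken records whether the true branch was followed in the trace.
       The record format is our own; the real system works over its IR trace."""
    clauses = [base_postcondition]
    for branch_expr, taken in branch_records:
        clauses.append(branch_expr if taken else Not(branch_expr))
    return And(*clauses)
```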
In some programs, there may be multiple paths that reach the same
output state. Our techniques can be generalized to handle this case,
as discussed in Section 6. In practice, we have found
this post-condition to be sufficient for finding interesting
deviations. Typically, inputs that cause the same execution path to
be followed are treated equivalently by the program, and result in
equivalent output states. Conversely, inputs that follow a different
execution path often result in a semantically different output state
of the program. Although more complicated and general post-conditions
are possible, one interesting result from our experiments is that the
simple approach was all that was needed to generate interesting deviations.
We describe the steps for computing the weakest precondition below.
Step 4a: Translating into single assignment form. We translate the IR program from the previous step into a form in which every variable is assigned at most once. (The transformed program is semantically equivalent to the input IR.) We perform this step to enable additional optimizations described in [29,36,19], which further reduce the formula size. For example, this transformation will rewrite the program x := x+1; x := x+1; as x1 := x0+1; x2 := x1+1;. We carry out this transformation by maintaining a mapping from the variable name to its current incarnation, e.g., the original variable x may have incarnations x0, x1, and x2. We iterate through the program and replace each variable use with its current incarnation. This step is similar to computing the SSA form of a program [39], and is a widely used technique.
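A minimal sketch of the incarnation renaming follows, over a toy statement representation of our own (each statement is a destination plus the list of variables it reads):

```python
# Illustrative sketch: rename variables so each is assigned at most once,
# e.g. "x := x + 1; x := x + 1"  becomes  "x1 := x0 + 1; x2 := x1 + 1".
def to_single_assignment(stmts):
    """stmts: list of (dest, uses) pairs; uses is the list of variables read."""
    incarnation = {}                      # variable name -> current incarnation

    def cur(v):
        return f"{v}{incarnation.get(v, 0)}"

    out = []
    for dest, uses in stmts:
        renamed_uses = [cur(u) for u in uses]   # read the current incarnations
        incarnation[dest] = incarnation.get(dest, 0) + 1
        out.append((cur(dest), renamed_uses))   # write a fresh incarnation
    return out

print(to_single_assignment([("x", ["x"]), ("x", ["x"])]))
# [('x1', ['x0']), ('x2', ['x1'])]
```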
Step 4b: Translating to GCL. The translation to GCL takes as input the single assignment form from step 4a, and outputs a semantically equivalent program $P_g$ in the Guarded Command Language (GCL). We perform this step because the weakest precondition is calculated over the GCL language [26]. The weakest precondition is then calculated in a syntax-directed manner over $P_g$.
The GCL language constructs we use are shown in Table 2. Although GCL may look unimpressive, it is sufficiently expressive for reasoning about complex programs [26,29,28,24]. Statements in our GCL programs mirror statements in assembly, e.g., store, load, assign, etc. GCL has assignments of the form $lhs := e$, where $lhs$ is a register or memory location and $e$ is a side-effect-free expression. assume $e$ assumes a particular (side-effect-free) expression $e$ is true. An assume statement is used to reason about conditional jump predicates, i.e., we add ``assume $e$'' for the true branch of a conditional jump, and ``assume $\neg e$'' for the false branch of the conditional jump. assert $e$ asserts that $e$ must be true for execution to continue, else the program fails; in other words, the weakest precondition cannot be satisfied if assert false is executed. skip is a semantic no-op. $S_1; S_2$ denotes a sequence where first statement $S_1$ is executed and then statement $S_2$ is executed. $S_1 \,\square\, S_2$ is called a choice statement, and indicates that either $S_1$ or $S_2$ may be executed. Choice statements are used for if-then-else constructs.
For example, the IR:

    if e then jmp L1 else jmp L2
    L1: S1
    L2: S2

will be translated as:

    (assume e; S1) □ (assume ¬e; S2)

The above allows calculating the WP over multiple paths (we discuss multiple paths in Section 6). In our setting, we only consider a single path. For each branch condition $e$ evaluated in the trace, we could add the GCL statement assert $e$ if $e$ evaluated to true (or assert $\neg e$ if it evaluated to false). In our implementation, using assert in this manner is equivalent to adding a clause for each branch predicate to the post-condition (e.g., making the post-condition $Q \wedge e$ when $e$ evaluated to true in the trace).
Step 4c: Computing the weakest precondition. We compute the weakest precondition for $P_g$ from the previous step in a syntax-directed manner. The rules for computing the weakest precondition are shown in Table 2. Most rules are straightforward, e.g., to calculate the weakest precondition $\mathrm{wp}(S_1; S_2, Q)$, we calculate $\mathrm{wp}(S_1, \mathrm{wp}(S_2, Q))$. Similarly, $\mathrm{wp}(\text{assume } e, Q) = e \Rightarrow Q$. For assignments $lhs := e$, we generate a let expression which binds the variable name $lhs$ to the expression $e$. We also take advantage of a technical transformation which can further reduce the size of the formula by using the single assignment form from Step 4a [29,36,19].
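The sketch below shows the syntax-directed computation over a toy GCL abstract syntax of our own, with formulas kept as strings for readability; it illustrates the standard rules rather than our actual implementation:

```python
# Illustrative sketch: syntax-directed weakest precondition over a toy GCL.
def wp(stmt, Q):
    kind = stmt[0]
    if kind == "skip":
        return Q
    if kind == "assign":                     # ("assign", lhs, e)
        _, lhs, e = stmt
        return f"(let {lhs} = {e} in {Q})"
    if kind == "assume":                     # ("assume", e)
        return f"({stmt[1]} => {Q})"
    if kind == "assert":                     # ("assert", e)
        return f"({stmt[1]} & {Q})"
    if kind == "seq":                        # ("seq", s1, s2)
        return wp(stmt[1], wp(stmt[2], Q))
    if kind == "choice":                     # ("choice", s1, s2)
        return f"({wp(stmt[1], Q)} & {wp(stmt[2], Q)})"
    raise ValueError(f"unknown GCL statement: {kind}")

prog = ("seq", ("assume", "input < 10"), ("assign", "x", "input + 1"))
print(wp(prog, "x == 5"))
# (input < 10 => (let x = input + 1 in x == 5))
```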
If the instruction accesses memory using an address that is derived from the input, then in the formula the address will be symbolic, and we must choose what set of possible addresses to consider. In order to remain sound, we add a clause to our post-condition to only consider executions that would calculate an address within the selected set. Considering more possible addresses increases the generality of our approach, at the cost of more analysis.
We achieve good results considering only the address that was actually used in the logged execution trace and adding the corresponding constraints to the post-condition to preserve soundness. In practice, if useful deviations are not found from the corresponding formula, we could consider a larger range of addresses, obtaining a more descriptive formula at the cost of performance. We have implemented an analysis that bounds the range of symbolic memory addresses [2], but have found that we get good results without performing this additional step.
As with memory reads, we achieve good results only considering the address that was actually used in the logged execution trace. Again, we could generalize the formula to consider more values, by selecting a range of addresses to consider.
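A minimal sketch of this concretization, assuming a hypothetical interface in which each memory access provides its symbolic address expression and the concrete address observed in the trace:

```python
# Illustrative sketch: concretize a symbolic memory access to the address
# observed in the trace, recording the constraint that keeps this sound.
from z3 import BitVecVal

def concretize_access(addr_expr, concrete_addr, postcondition_clauses):
    """addr_expr: symbolic address expression (a 32-bit Z3 bitvector term);
       concrete_addr: the address actually used in the logged execution.
       Appends the soundness constraint and returns the address to model."""
    postcondition_clauses.append(addr_expr == BitVecVal(concrete_addr, 32))
    return concrete_addr
```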
We then query the solver whether the combined formula $(f_1 \wedge \neg f_2) \vee (\neg f_1 \wedge f_2)$ is satisfiable, and if so, to provide an example input that satisfies the combined formula.
If the solver returns an example, then we have
found an input that satisfies one program's
formula, but not the other. If we had perfectly and fully modeled each
program, and perfectly specified the post-condition to be that ``the
input results in a semantically equivalent output state'', then this
input would be guaranteed to produce a semantically equivalent output
state in one program, but not the other. Since we only consider one
program path and do not perfectly specify the post-condition in this
way, this input is only a candidate deviation input.
Finally, we validate the generated candidate deviation inputs to determine whether they actually result in semantically different output states in the two implementations. As illustrated in Figure 2, it is possible that while an input does not satisfy the symbolic formula generated for a server, it actually does result in an identical or semantically equivalent output state.
We send each candidate deviation input to the implementations being examined, and compare their outputs to determine whether they result in semantically equivalent or semantically different output states.
In theory, this testing requires some domain knowledge about the protocol implemented by the binaries, to determine whether their outputs are semantically equivalent. In practice, we have found deviations that are quite obvious. Typically, the server whose symbolic formula is satisfied by the input produces a response similar to its response to the original input, and the server whose symbolic formula is not satisfied by the input produces an error message, drops the connection, etc.
After sending candidate inputs to both implementations, we determine the output state by looking at the response sent from the server. For those protocols that contain some type of status code in the response, such as HTTP in the Status-Line, each different value of the status code represents a different output state for the server. For those protocols that do not contain a status code in the response, such as NTP, we define a generic valid state and consider the server to have reached that state, as a consequence of an input, if it sends any well-formed response to the input, independently of the values of the fields in the response.
In addition, we define three special output states: a fatal state that includes any behavior that is likely to cause the server to stop processing future queries, such as a crash, reboot, halt, or resource starvation; a no-response state that indicates that the server is not in the fatal state but still did not respond before a configurable timer expired; and a malformed state that includes any response from the server that is missing mandatory fields. This last state is needed because servers might send messages back to the client that do not follow the guidelines in the corresponding specification. For example, several HTTP servers, such as Apache or Savant, might respond to an incorrect request with a raw message written into the socket, such as the string ``IOError'', without including the expected HTTP Status-Line such as ``HTTP/1.1 400 Bad Request''.
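The sketch below illustrates this classification for HTTP; the state names follow the definitions above, but the parsing and the crash/timeout signals are simplified assumptions of ours:

```python
# Illustrative sketch: map a server's observable behavior to an output state.
import re

def http_output_state(response_bytes, crashed=False, timed_out=False):
    if crashed:
        return "fatal"                      # crash, reboot, halt, starvation
    if timed_out or response_bytes is None:
        return "no-response"
    first_line = response_bytes.split(b"\r\n", 1)[0].decode("latin-1", "replace")
    m = re.match(r"HTTP/\d\.\d (\d{3}) ", first_line)
    if not m:
        return "malformed"                  # e.g. a raw "IOError" string
    return m.group(1)                       # status code, e.g. "200" or "400"

print(http_output_state(b"HTTP/1.1 400 Bad Request\r\n\r\n"))   # 400
print(http_output_state(b"IOError"))                            # malformed
```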
The original inputs, which we need to send to the servers during the formula extraction phase to generate the execution traces, were obtained by capturing a network trace from one of our workstations and selecting all the HTTP and NTP requests that it contained. For each HTTP request in the trace, we send it to each of the HTTP servers and monitor its execution, generating an execution trace as output. We proceed similarly for each NTP request, obtaining an execution trace for each request/server pair. In Section 5.1, we show the deviations we discovered in the web servers, and in Section 5.2, the deviations we discovered in the NTP servers.
Table 4 summarizes our results when sending the HTTP GET request in Figure 3 to the three servers. Each cell of the table represents a different query to the solver, that is, half of the combined formula for each server pair. Thus, the table has six possible cells. For example, the combined formula for the Apache-Miniweb pair is shown as the disjunction of Cases 1 and 3.
Out of the six possible cases, the solver returned unsatisfiable for three of them (Cases 1, 5, and 6). For the remaining cases, where the solver was able to generate at least one candidate deviation input, we show two numbers in the format X/Y. The X value represents the number of different candidate deviation inputs we obtained from the solver, and the Y value represents the number of these candidate deviation inputs that actually generated semantically different output states when sent to the servers in the validation phase. Thus, the Y value represents the number of inputs that triggered a deviation.
In Case 2, none of the five candidate deviation inputs returned by the solver were able to generate semantically different output states when sent to the servers, that is, no deviations were found. For Cases 3 and 4, all candidate deviation inputs triggered a deviation when sent to the servers during the validation phase. In both cases, the Miniweb server accepted some input that was rejected by the other server. We analyze these cases in more detail next.
Figure 5 shows one of the deviations found for the Savant-Miniweb pair. It presents one of the candidate deviation inputs obtained from the solver in Case 4, together with the responses received from both Savant and Miniweb when the candidate deviation input was sent to them during the validation phase. Again, the candidate deviation input has a different value on the fifth byte, but in this case the response from Savant is only a raw ``File not found'' string. Note that this string does not include the HTTP Status-Line, the first line in the response that carries the response code, which is required by the HTTP specification [27]; the response can therefore be considered malformed. Thus, this deviation identifies an error, though in this case both servers (i.e., Miniweb and Savant) are deviating from the HTTP specification.
Figure 6 shows another deviation found in Case 4 for the Savant-Miniweb pair. The HTTP specification mandates that the first line of an HTTP request must include a protocol version string. There are three possible valid values for this version string: ``HTTP/1.1'', ``HTTP/1.0'', and ``HTTP/0.9'', corresponding to different versions of the HTTP protocol. However, the candidate deviation input produced by the solver instead uses a different version string, ``HTTP/ b.1''. Since Miniweb accepts this request, it indicates that Miniweb is not properly verifying the value received in this field. On the other hand, Savant sends an error to the client indicating an invalid HTTP version, which shows that it is properly checking the value it received in the version field. This deviation reveals another error in Miniweb's implementation.
To summarize, in this section we have shown that our approach is able to discover multiple inputs that trigger deviations between real protocol implementations. We have presented a detailed analysis of three of them, and confirmed the deviations they trigger as errors. Out of the three inputs analyzed in detail, two can be attributed to implementation errors in Miniweb, while the remaining one reflects an implementation error in both Miniweb and Savant. The discovered inputs that trigger deviations can potentially be used as fingerprints to differentiate among these implementations.
The important difference is on byte 0, which is presented in detail on the right-hand side of Figure 7. Byte 0 contains three fields: ``Leap Indicator'' (LI), ``Version'' (VN), and ``Mode'' (MD). The difference from the original request is in the Version field. The candidate deviation input has a decimal value of 0 for this field (note that the field is only 3 bits long), instead of the original decimal value of 4. When this candidate deviation input was sent to both servers, Ntpd ignored it, choosing not to respond, while NetTime responded with a version number of 0. Thus, this candidate deviation input leads the two servers into semantically different output states.
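The bit-level layout involved is easy to reproduce; the sketch below (our illustration) unpacks the LI/VN/MD fields of byte 0 and rewrites only the 3-bit Version field, assuming LI=0 and MD=3 for a typical client request:

```python
# Illustrative sketch: the first byte of an NTP packet packs
# LI (2 bits) | VN (3 bits) | MD (3 bits), most significant bits first.
def unpack_byte0(b):
    return {"LI": (b >> 6) & 0x3, "VN": (b >> 3) & 0x7, "MD": b & 0x7}

def set_version(b, vn):
    return (b & ~(0x7 << 3)) | ((vn & 0x7) << 3)

original = 0x23                  # assumed byte: LI=0, VN=4, MD=3 (client request)
print(unpack_byte0(original))    # {'LI': 0, 'VN': 4, 'MD': 3}
deviant = set_version(original, 0)
print(unpack_byte0(deviant))     # {'LI': 0, 'VN': 0, 'MD': 3}
```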
We checked the specification for this case and found that
a zero value for the Version field is reserved, and according to
the latest specification should no longer be supported by
current and future NTP/SNTP servers [38].
However, the previous specification
states that the server should copy
the version number received from the client in the request,
into the response, without
dictating any special handling for the zero value.
Since both implementations seem to be following different versions of the
specification, we cannot definitively assign this error to one of the
implementations.
Instead, this example shows that we can identify inconsistencies or ambiguity
in protocol specifications.
In addition, we can use this query as a
fingerprint to differentiate between the two implementations.
In Table 6, we show the time used by the solver in the deviation detection phase to produce a candidate deviation input from the combined symbolic formula. The results show that our approach is very efficient in discovering deviations. In many cases, we can discover deviation inputs between two implementations in approximately one minute. Fuzz testing approaches are likely to take much longer, since they usually need to test many more examples.
Although these techniques are useful, our approach is quite different. Instead of comparing an implementation to a manually defined model, we compare implementations against each other. Another significant difference is that our approach works directly on binaries, and does not require access to the source code.
Compared to fuzz testing, our approach is more efficient at discovering deviations since it requires testing far fewer inputs. It can detect deviations by comparing how two implementations process the same input, even if this input leads both implementations to semantically equivalent states. In contrast, fuzz testing techniques need observable differences between implementations to detect a deviation.
There is a line of research using model checking to find errors in protocol implementations. Musuvathi et al. [41,40] use a model checker that operates directly on C and C++ code and apply it to check for errors in TCP/IP and AODV implementations. Chaki et al. [22] build models from implementations and check them against a specification model. Compared to our approach, these approaches need reference models to detect errors.
In this paper, we have presented a novel approach to automatically detect deviations in the way different implementations of the same specification check and process their input. Our approach has several advantages: (1) by automatically building the symbolic formulas from the implementation, our approach is precisely faithful to the implementation; (2) by automatically identifying deviations through solving formulas generated from the two implementations, we can find the needle in the haystack without having to try each straw (input) individually, yielding a tremendous performance gain; (3) our approach works on binaries directly, i.e., without access to the source code. We have also shown how to apply our automatic deviation detection techniques to two applications: error detection and fingerprint generation.
We have presented our prototype system to evaluate our techniques, and have used it to automatically discover deviations in multiple implementations of two different protocols: HTTP and NTP. Our results show that our approach successfully finds deviations between different implementations, including errors in input checking, and differences in the interpretation of the specification, which can be used as fingerprints.
This material is based upon work partially supported by the National Science Foundation under Grants No. 0311808, No. 0433540, No. 0448452, No. 0627511, and CCF-0424422. Partial support was also provided by the International Technology Alliance, and by the U.S. Army Research Office under the Cyber-TA Research Grant No. W911NF-06-1-0316, and under grant DAAD19-02-1-0389 through CyLab at Carnegie Mellon.
The views and conclusions contained here are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ARO, NSF, or the U.S. Government or any of its agencies.