Disassembly Challenges

Next: Disassembly Engine Implementation Up: Binary Disassembly Previous: Binary Disassembly

Disassembly Challenges

To accurately locate the procedure boundary, one needs to identify each instruction in the binary through a disassembler. There are two main classes of disassembly algorithms [22]. A linear sweep algorithm starts with the first byte in the code section and proceeds by decoding each byte, including any intermediate data byte, as code, until an illegal instruction is encountered. A recursive traversal algorithm starts at the program's main entry point and proceeds by following each branch instruction encountered in a depth-first or breadth-first manner, essentially a control flow analysis. Neither approach is 100% precise. The chief impediments to accurate disassembly are:

Data embedded in the code regions,
Variable instruction size,
Indirect branch instructions,
Functions without explicit CALL sites within the executable's code segment,
Position independent code (PIC) sequences, and
Hand crafted assembly code.

1) and 2) render the linear sweep algorithm less effective than ideal, whereas 3), 4) and 5) degrade the efficacy of the control flow analysis used in the recursive traversal algorithm. Distinguishing code from data in a binary file is a fundamentally undecidable problem. Because the linear sweep algorithm decodes each byte as code as long as it looks like a legitimate code byte, it ends up interpreting many data bytes as instructions. The reason for this behavior is that, in the Intel x86 instruction set, 248 out of 256 possibilities can be a legitimate starting byte for an instruction, making it more likely to mistake data for instruction. The fact that the Intel x86 instruction set allows variable instruction size further aggravates the problem of code/data distinction. Consider the following example sequence of bytes:


    0x0F 0x85 0xC0 0x0F  0x85 .....

If we consider 0x0F as a code byte then we'll end up with the following disassembly:


    jne offset

On the other hand if we consider 0x0F as a data byte and 0x85 as a code byte, then we get something like:


    0x0F    // data
    test eax, eax
    jne offset

Thus a single disassembly error could result in many subsequent bytes being interpreted incorrectly, with the extent of error potentially unbounded. In contrast, a fixed-instruction-size architecture exhibits a self-correcting property: an interpretation error for one instruction word does not propagate to the next instruction word. The recursive traversal algorithms cannot obtain 100% accurate disassembly results, either, because it is difficult to construct a complete control flow of an input binary in the presence of indirect branch instructions such as call/jmp reg32 (e.g. call eax) or call/jmp m32 (e.g. jmp dword[esp + xx]). One solution to this problem is to perform additional data flow analysis such as inter-procedural slicing and/or constant propagation [23] to figure out at compile time the value of the register or memory location used in indirect control transfer instructions. Apart from being difficult to implement, such an approach tends to greatly increase the disassembly time and itself does not guarantee 100% accuracy. Procedures for which no explicit call sites in the input program can be identified include exception or signal handlers, callback functions, which is rife in GUI applications, and procedures all calls to which are through indirect branch instructions. Because there is no identifiable call to these functions, they cannot be discovered through control flow analysis, and as a result may be misclassified as data. In practice, signal/exception handlers pose few problems because their entry points are included in the program header in some cases. The addressing in position independent code (PIC) does not rely on any particular position in the program's address space. Thus PIC code and jump table never have absolute address references. Instead the references are in the form of offsets with respect to a base value that is known at run time, mostly through the eip value. For example,


    10: call 10
    15:                    
          :
    25: pop eax   // gets the return addr 
                  // value 15 into eax
    26: call dword[eax + 20]  // call foo
          :
          :
    35: // foo

In this case, in spite of having explicit CALL sites, standard control flow analysis cannot discover the target location of the function foo(). Hand crafted assembly code makes it difficult to identify procedure boundaries because they do not necessarily follow the code conventions established by standard compilers. These conventions provide useful hints to resolve potential ambiguities. As an example of code convention violation, some assembly code programs jump from one function into another function without going through the latter's main entry point.

Next: Disassembly Engine Implementation Up: Binary Disassembly Previous: Binary Disassembly

Manish Prasad
2003-04-05