In the initial port, we defined a ``bare bones'' grammar that described a straightforward translation of the 124 LIR operators into MIR operators using 142 rules and 7 non-terminals. As the performance work progressed, the grammar tripled in size. The current grammar includes additional patterns for optimizing floating point computations and conditional branching, exploiting memory operands, and other miscellaneous enhancements such as recognizing complex addressing modes, exploiting special instructions such as LEA, TEST, INC, etc., and avoiding needless sign extension of byte/short loads. Table 1 reports the contribution of each group of enhancements to the size of the rules.
In addition to extending the grammar as described above, we also enhanced our BURS driver with heuristics to reduce register pressure. We label each tree node with an estimate of the number of live values required to compute it, using the algorithm from section 9.10 of the Dragon book [1]. The BURS driver uses this numbering to choose which child node to emit first when generating code for a given tree and to select from the set of ready trees12which tree to emit next. In both cases, choosing the node/tree with the largest number of registers first tends to reduce the number of simultaneously live values and thus reduce register pressure. As reported in section 4.3, this heuristic made a small but measurable difference in overall performance and was extremely simple to implement.
While the PowerPC architecture supports three-operand ALU operations (a=b+c), IA32 supports two-operand ALU operations(x+=y). The IA32 compiler converts the three-operand LIR to two-operand form in a pre-pass to BURS, which tripped some subtle issues. Initially, this pre-pass took obvious actions to avoid introducing useless move instructions. For example, it would transform a=a+b to a+=b. In other cases, it would use local liveness information to avoid inserting moves. For example, if b is dead after a=b+c and this is the only definition of a, then the pass would emit b+=c and the replace uses of a with uses of b. This second optimization tripped a subtle difficulty; it can hinder instruction selection of other expression trees within the basic block by introducing additional inter-tree anti and output dependencies and by extending the live range of b beyond the current basic block. Thus, in some cases it is actually better to translate a=b+c into b+=c; a=b and rely on the register allocator to coalesce away the move instruction.
Table 2 shows the performance improvements obtained by adding all of the enhancements to the basic grammar. The most important enhancement was adding rules to exploit the floating point stack (mpegaudio, mtrt), but the other enhancements also resulted in modest gains. The impact of instruction selection and register allocation on floating point performance is explored in more detail below.