Performance Optimizations and Analysis

The level of performance that Jupiter currently achieves depends on a number of optimizations. In this section, we briefly describe these optimizations and comment on how Jupiter's flexible structure facilitated their implementation. The optimizations are:

bottombased: the Frame interface was changed to use bottom-based operand stack indexing instead of top-based indexing to speed up stack accesses [16].
threaded: the "loop-and-switch" interpreter was replaced by a threaded version to improve performance, as described in Section 3.5.
fieldsize: the field's size (either 4 or 8 bytes) was cached inside the Field pointer, relieving the interpreter from having to traverse data structures to find this information.
bytecode substitution: the implementations of getfield, putfield and invokevirtual were changed so they replace themselves in the bytecode stream with respective quick versions the first time they execute. These quick versions assume that the field in the opcode has been resolved, and are specialized for the appropriate field size. Furthermore, the opcodes are re-written so that the field offset is stored directly in the bytecode stream, avoiding constant pool access. Subsequent executions of the same bytecode will find the quick instructions, which execute faster.
register: a CPU register was assigned to store the location of the currently executing instruction, eliminating the need to load this value from memory in order to read and execute each instruction.
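The bytecode substitution item above can be illustrated with a minimal sketch. The opcode numbers, the `PoolEntry` type, and the one-byte operand encoding are assumptions made for this example and are not taken from Jupiter's source. The slow form of the instruction resolves its field through a stand-in constant pool, rewrites itself in place, and is then re-dispatched as the quick form, which reads the offset directly from the bytecode stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcodes; not Jupiter's actual encoding. */
enum { OP_GETFIELD = 1, OP_GETFIELD_QUICK4 = 2, OP_HALT = 3 };

/* Stand-in for a constant-pool entry describing a field. */
typedef struct {
    uint8_t offset;   /* byte offset of the field within the object */
    int     resolved; /* set once the slow lookup has run */
} PoolEntry;

static int slow_lookups = 0; /* counts constant-pool accesses */

/* Slow path: consult the constant pool to find the field offset. */
static uint8_t resolve(PoolEntry *entry) {
    entry->resolved = 1;
    slow_lookups++;
    return entry->offset;
}

/* Runs code until OP_HALT and returns the sum of all 4-byte fields loaded. */
static int32_t run(uint8_t *code, PoolEntry *pool, const uint8_t *object) {
    int32_t sum = 0;
    size_t pc = 0;
    for (;;) {
        switch (code[pc]) {
        case OP_GETFIELD: {
            /* The operand is a constant-pool index. Resolve it once. */
            uint8_t offset = resolve(&pool[code[pc + 1]]);
            /* Rewrite in place: quick opcode, operand becomes the offset. */
            code[pc] = OP_GETFIELD_QUICK4;
            code[pc + 1] = offset;
            break; /* pc is unchanged, so the quick form executes next */
        }
        case OP_GETFIELD_QUICK4: {
            /* Fast path: no constant-pool access, size fixed at 4 bytes. */
            int32_t value;
            memcpy(&value, object + code[pc + 1], sizeof value);
            sum += value;
            pc += 2;
            break;
        }
        case OP_HALT:
            return sum;
        default:
            assert(!"unknown opcode");
            return sum;
        }
    }
}
```

Running the same code array twice exercises both paths: the first run performs one slow lookup and patches the instruction, and the second run finds only the quick instruction.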

**Figure 12:** Effect of each optimization on benchmark execution times.

**Figure 13:** Bytecode execution profile: (a) before optimization; (b) after optimization.

The impact of these optimizations is shown in Figure 12, which charts the execution-time reduction due to each optimization. The optimizations vary in how much they benefit each application. For instance, the fieldsize and bytecode substitution optimizations, which target the getfield and putfield opcodes, were especially effective on compress, which spends a large portion of its time executing these two instructions. Similarly, threaded reduces the overhead of every opcode, so it has a larger impact on applications that execute many "lightweight" opcodes, such as mpegaudio, than on applications that execute relatively fewer but more time-consuming opcodes, such as javac and jess.
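To make the per-opcode overhead concrete, the following is a minimal sketch of one common way to build a threaded interpreter, using the "labels as values" extension of GNU C. It assumes a toy three-opcode instruction set invented for this example. Whether Jupiter's threaded interpreter uses this exact mechanism is not stated in this section, so the sketch should be read as an illustration of the general technique only. Each handler ends by jumping straight to the next handler, so there is no return to a central loop and no shared switch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy instruction set; values index the handler table. */
enum { OP_PUSH = 0, OP_ADD = 1, OP_HALT = 2 };
enum { STACK_MAX = 16 };

/* Executes code and returns the top of the operand stack at OP_HALT. */
static int32_t run_threaded(const uint8_t *code) {
    /* One label address per opcode (GNU C extension, gcc and clang). */
    static void *const handlers[] = { &&do_push, &&do_add, &&do_halt };
    int32_t stack[STACK_MAX];
    int sp = 0;
    const uint8_t *pc = code;

/* Fetch the next opcode and jump directly to its handler. */
#define DISPATCH() goto *handlers[*pc++]

    DISPATCH();

do_push:
    assert(sp < STACK_MAX);
    stack[sp++] = (int32_t)*pc++; /* one-byte immediate operand */
    DISPATCH();

do_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();

do_halt:
    assert(sp >= 1);
    return stack[sp - 1];

#undef DISPATCH
}
```

For a cheap opcode such as `OP_ADD` in this sketch, the dispatch is a large share of the work, which is consistent with the observation that applications dominated by lightweight opcodes gain the most.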

The above optimizations required minimal effort to implement in Jupiter, attesting to the flexibility of our system. For example, Jupiter's stronger encapsulation confines the modifications required for the fieldsize optimization to roughly half a dozen lines of code within the Field module. In contrast, Kaffe's equivalent of opcodeSpec explicitly tests the field type and calls one of a number of "load_offset" macros, passing only the field offset as a parameter. To take advantage of cached field size information, all implementations of the load_offset macros must be modified to pass the field size in addition to its offset, even those that are not affected by this optimization. Hence, unlike in Jupiter, the lack of encapsulation within Kaffe causes the scope of this modification to encompass a large number of unrelated modules.
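The encapsulation argument can be sketched as follows. The `Field` layout, the `FieldType` enumeration, and the function names are hypothetical and are not Jupiter's actual interface. The point of the sketch is that the interpreter only ever calls `field_size`, so moving from a computed size to a cached one changes code inside this module and nowhere else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical Field module; names and layout are illustrative only. */
typedef enum { TYPE_INT, TYPE_FLOAT, TYPE_LONG, TYPE_DOUBLE } FieldType;

typedef struct {
    FieldType type;
    size_t    offset; /* byte offset within the object */
    size_t    size;   /* cached when the field is created: 4 or 8 bytes */
} Field;

/* Derives the size from the type; after caching, called once per field. */
static size_t size_of_type(FieldType type) {
    return (type == TYPE_LONG || type == TYPE_DOUBLE) ? 8 : 4;
}

static Field field_create(FieldType type, size_t offset) {
    Field f = { type, offset, size_of_type(type) };
    return f;
}

/* The only accessor that callers use. Before the optimization it could
   have returned size_of_type(f->type); callers cannot tell the difference. */
static size_t field_size(const Field *f) {
    assert(f != NULL);
    return f->size;
}
```

Because the signature of `field_size` is unchanged, no caller needs to pass or receive any additional parameter, which is the property the comparison with the load_offset macros relies on.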

We believe that further improvements to the performance of Jupiter are still possible, which will bring its performance closer to that of the Sun Microsystems JDK. We profiled the amount of time consumed executing each kind of opcode, grouped into the following categories: Invocation/Return, Conditional Branch, Local Variable Access, Allocation, Object Access, Array Access, Integer, Floating-point, and Other. The resulting execution profile is depicted in Figure 13, before and after the optimizations described above. The charts show that our optimizations targeted mostly object access overhead, and that more performance-improving opportunities remain. For example, the two slowest benchmarks, 202_jess and 213_javac, have similar profiles, with large proportions of invocation/return and allocation opcodes. We will target these opcodes in future work.

Furthermore, Jupiter's coding style relies heavily on function inlining to achieve good performance, so a weakness in the compiler's inlining ability can have a substantial impact. For example, our examination of gcc-generated assembly code for getfield indicates that the code can be further optimized with nothing more than common subexpression elimination. The exact reason that gcc did not perform this optimization is hard to determine. However, it appears that the implementation of getfield places too heavy a load on the inlining facility of gcc, hindering its ability to apply the optimization. Applying the optimization manually in the assembly code improves the performance of getfield by approximately 15%. The overall performance improvement due to the optimization depends on the application. For example, object access accounts for only 11% of 201_compress, so even if all of that time benefited, the overall improvement would be about 0.15 × 0.11 ≈ 1.7%, that is, less than 2%, and even less for other applications³. Similar improvements are possible for other opcodes, and the accumulation of the individual improvements may be substantial.
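The estimate above can be reproduced with a short calculation. This sketch assumes that "approximately 15%" means a 15% reduction in the time spent in the affected opcodes, and it treats the whole object access share as benefiting, which makes the result an upper bound. The function name is hypothetical.

```c
#include <assert.h>

/* fraction:  share of total run time spent in the optimized opcodes (0..1)
   reduction: fractional reduction of that time (0..1)
   Returns the resulting fractional reduction in total run time. */
static double overall_reduction(double fraction, double reduction) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(reduction >= 0.0 && reduction <= 1.0);
    return fraction * reduction;
}
```

With the figures quoted for 201_compress, `overall_reduction(0.11, 0.15)` evaluates to about 0.0165, which is under the 2% bound stated in the text.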

Tarek S. Abdelrahman
2002-05-27