Achieving Jupiter's current level of performance required implementing a number of optimizations. In this section, we briefly describe these optimizations and comment on how Jupiter's flexible structure facilitated their implementation. The optimizations are:
The impact of these optimizations is shown in Figure 12, which charts the execution-time reduction due to each optimization. The optimizations vary in the degree to which they benefit each application. For instance, the fieldsize and substitution optimizations, which target the getfield and putfield opcodes, were especially successful on compress, which spends a large portion of its time executing these two instructions. Similarly, the threaded optimization reduces the dispatch overhead of each opcode, so it has a larger impact on applications that execute many ``lightweight'' opcodes, such as mpegaudio, than on applications that execute relatively fewer, but more time-consuming, opcodes, such as javac and jess.
The above optimizations required minimal effort to implement in Jupiter, attesting to the flexibility of our system. For example, Jupiter's stronger encapsulation confines the modifications required to implement the fieldsize optimization to a half-dozen lines of code within the Field module. In contrast, Kaffe's equivalent of opcodeSpec explicitly tests the field type and calls one of a number of ``load_offset'' macros, passing only the field offset as a parameter. To take advantage of cached field size information, every implementation of the load_offset macros must be modified to pass the field size in addition to its offset--even those that are not affected by the optimization. Hence, in contrast to Jupiter, the lack of encapsulation within Kaffe causes the scope of this modification to encompass a large number of unrelated modules.
We believe that further improvements to the performance of Jupiter are still possible, which will bring its performance closer to that of the Sun Microsystems JDK. We profiled the amount of time consumed executing each kind of opcode, grouped into the following categories: Invocation/Return, Conditional Branch, Local Variable Access, Allocation, Object Access, Array Access, Integer, Floating-point, and Other. The resulting execution profile is depicted in Figure 13, before and after the optimizations described above. The charts show that our optimizations mostly targeted object access overhead, and that more opportunities for performance improvement remain. For example, the two slowest benchmarks, 202_jess and 213_javac, have similar profiles, with large proportions of invocation/return and allocation opcodes. We will target these opcodes in future work.
Furthermore, Jupiter's coding style relies heavily on function inlining to achieve good performance, so a weakness in the compiler's inlining ability can have a substantial impact. For example, our examination of gcc-generated assembly code for getfield indicates that the code can be further optimized with nothing more than common subexpression elimination. The exact reason that gcc did not perform this optimization is hard to determine. However, it appears that the implementation of getfield places too much stress on gcc's inlining facility, hindering its ability to apply the optimization. Applying the optimization manually in the assembly code improves the performance of getfield by approximately 15%. The overall performance improvement due to this optimization depends on the application. For example, object access accounts for only 11% of the execution time of 201_compress, so a 15% speedup of getfield would yield less than a 2% overall improvement, and even less for other applications. Similar improvements are possible for other opcodes, however, and the accumulation of the individual improvements may be substantial.