
  
Performance

We ran performance tests to evaluate the impact of introducing reflective capabilities into a Java interpreter. Like the few other papers on reflection that provide performance data, we measured the overhead of reflection on each individual operation rather than running standard benchmarks. In fact, there are no standard benchmarks for evaluating the impact of reflection. Existing general-purpose benchmarks usually focus either on the optimization of complex patterns of control flow, which would not be affected by the introduction of interception of object operations, or on calculations over large arrays, which would incur a huge overhead.


 
 
Table 1: Description of the platforms.

Tag    Description
i586   100 MHz Pentium running RedHat Linux 5.1
i686   233 MHz Pentium Pro running RedHat Linux 5.0
spu1   167 MHz SPARC Ultra 1 running Solaris 2.6
spu2   200 MHz SPARC Ultra Enterprise 2 running Solaris 2.5

This table describes the platforms on which the performance tests were run.

Our tests have been performed on four different platforms, listed in Table 1. On the Solaris platforms, the tests were run in real-time scheduling mode, so as to ensure that no other processes would affect the measured times. On the GNU/Linux platforms, this scheduling mechanism was not available, so we just ensured that the tested hosts were as lightly loaded as possible.

On each host, we have run the same Java program, compiled with Sun JDK's Java compiler, without optimization, to prevent method inlining. The produced bytecodes were executed by different interpreters under different configurations.

We have used Guaraná 1.4.1 and the snapshot of Kaffe 1.0.b1 distributed with it, using the JIT compiler and the interpreter engines. Kaffe and Guaraná were compiled with EGCS 1.1b, with default optimization levels. The program used to perform the tests was the one distributed with Guaraná 1.4.1.


 
 
Table 2: Description of the tests.

Operation         Description
emptyloop         No reflective operation.
synchronized      Empty block synchronized on an arbitrary object.
invokestatic      Invoke an empty static method that takes no arguments and returns void.
invokespecial     Invoke a non-static private do-nothing method that returns void and takes only the implicit this as argument. The same bytecode is used to invoke constructors and, in some cases, final methods.
invokevirtual     Invoke an empty method that takes only the implicit this as argument, and returns void. Dynamic binding, performed with a dispatch table, occurs before interception test.
invokeinterface   Invoke the same method, but through an object reference of interface type. Dynamic binding is much slower in this case.
getstatic         Load a static int field into a variable.
putstatic         Store a zero-valued variable in a static int field.
getfield          Load a non-static int field into a variable.
putfield          Store a zero-valued variable in a non-static int field.
arraylength       Load the length of an array of int into a variable.
iaload            Load the first element of an array of int into a variable.
iastore           Store a zero-initialized variable in the first element of an array of int.
println           Print the line "Hello world!" to System.err, which was redirected to /dev/null before starting the Virtual Machine. It is a first attempt to estimate the overall impact of introducing interception abilities.
compile           Compile the test program itself. Section 5.1 contains a detailed description and analysis.

This table describes the operation(s) performed within a loop in our performance tests.
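As an illustration only, the operations in Table 2 could correspond to Java source along the following lines. This is a hypothetical sketch, not the test program distributed with Guaraná: the names EmptyOp, OpsSketch and runOnce are invented here, and the bytecode named in each comment is what a compiler of that era would typically emit (newer compilers may, for instance, choose a different invoke instruction for private methods).

```java
/** Hypothetical interface used only to obtain a reference of interface type. */
interface EmptyOp {
    void call();
}

/** Hypothetical stand-ins for the loop bodies described in Table 2. */
final class OpsSketch implements EmptyOp {
    static int staticField;
    int field;
    final int[] array = new int[1];

    static void emptyStatic() {}      // empty static method, no arguments
    private void emptyPrivate() {}    // private do-nothing method
    public void call() {}             // empty method taking only the implicit this

    /** Performs each operation once and returns the sum of the values read. */
    int runOnce() {
        int zero = 0;
        int v;
        synchronized (this) {}        // synchronized: monitor enter and exit
        emptyStatic();                // invokestatic
        emptyPrivate();               // invokespecial (with older compilers)
        this.call();                  // invokevirtual
        EmptyOp viaInterface = this;
        viaInterface.call();          // invokeinterface
        v = staticField;              // getstatic
        staticField = zero;           // putstatic
        v += field;                   // getfield
        field = zero;                 // putfield
        v += array.length;            // arraylength
        v += array[0];                // iaload
        array[0] = zero;              // iastore
        System.err.println("Hello world!"); // println
        return v;
    }
}
```

In the actual tests each operation is timed in its own loop; the sketch groups them in one method only to keep the example short.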

For each configuration, we timed several different operations, described in Table 2. Each operation was timed by running it repeatedly inside a loop, after running it once outside the loop before starting the timer. This ensures that, before the loop starts, any JIT compilation has already taken place, all the data and code have been brought into the cache and, unless the test involves object allocation, the garbage collector will not run.

This inner loop is run repeatedly, with the iteration count adjusted at every outer iteration, aiming at a running time longer than 1 second. Since the operations that read the clock at the beginning and at the end of each inner loop take less than 1 microsecond to run, and the clock resolution is 1 millisecond, a total running time of 1 second is enough to eliminate any effect they might have on the outcome of the tests.

The inner-loop iteration count starts at 1, and is repeatedly multiplied by 10 until the loop runs long enough to be measurable with the clock resolution. As soon as this happens, the elapsed time and the iteration count are used to estimate the running time of one iteration. If the total elapsed time of an execution of the inner loop is longer than one second, the estimate is the final result of the test. Otherwise, it is used to compute the iteration count for the next execution of the inner loop, aiming at a total execution time of 1100 milliseconds.
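The calibration procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the harness shipped with Guaraná: the names CalibratedTimer, Clock and timePerIterationMs are hypothetical, and the clock is passed in as a parameter only so that the sketch can be exercised with a simulated clock.

```java
import java.util.function.LongConsumer;

/** Hypothetical calibration harness; all names are illustrative. */
final class CalibratedTimer {
    static final long MIN_TOTAL_MS = 1000; // a run longer than this is final
    static final long TARGET_MS = 1100;    // total time aimed at for the next run

    /** Millisecond clock, abstracted so a simulated clock can be supplied. */
    interface Clock {
        long millis();
    }

    /**
     * innerLoop.accept(n) must execute the operation n times.
     * Returns the estimated time of one iteration, in milliseconds.
     */
    static double timePerIterationMs(LongConsumer innerLoop, Clock clock) {
        innerLoop.accept(1); // run once outside the timed loop, before timing
        long count = 1;
        while (true) {
            long start = clock.millis();
            innerLoop.accept(count);
            long elapsed = clock.millis() - start;
            if (elapsed <= 0) {
                count *= 10; // not yet measurable with the clock resolution
                continue;
            }
            double estimate = (double) elapsed / count;
            if (elapsed > MIN_TOTAL_MS) {
                return estimate; // long enough: this estimate is the result
            }
            // Otherwise choose a count aiming at about TARGET_MS in total.
            count = Math.max(count + 1, (long) Math.ceil(TARGET_MS / estimate));
        }
    }
}
```

With a real clock, a call might look like CalibratedTimer.timePerIterationMs(n -> { for (long i = 0; i < n; i++) { /* operation */ } }, System::currentTimeMillis). The sketch does not guard against overflow of the iteration count for operations that never become measurable.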

With the exception of the tests println and compile, this mechanism selected an iteration count between 50,000 and 100,000,000, for the final execution of the inner loop of each test. In the case of println, the iteration count was never smaller than 500. The compile test was run stand-alone, not within this framework.

Each test case was run 50 times on each configuration and platform, and the average times of the runs were used to compute the relative overheads presented in Table 3 and Table 4. Although we have introduced the ability to intercept operations, no actual interception took place during those tests.
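For concreteness, a relative overhead of this kind can be computed from two sets of run times as sketched below. The formula is an assumption on our part, namely the difference between the mean time with interception support and the mean time without it, divided by the latter; the names Overhead, mean and relativeOverheadPercent are hypothetical.

```java
/** Hypothetical helper; the overhead formula is an assumed definition. */
final class Overhead {
    /** Arithmetic mean of the measured times. */
    static double mean(double[] samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /**
     * Relative overhead in percent, rounded to the nearest integer:
     * 100 * (mean(modified) - mean(baseline)) / mean(baseline).
     */
    static long relativeOverheadPercent(double[] baseline, double[] modified) {
        double b = mean(baseline);
        double m = mean(modified);
        return Math.round(100.0 * (m - b) / b);
    }
}
```

Under this definition, a negative value means that the modified virtual machine ran the operation faster than the unmodified one.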


 
 
Table 3: Overhead on interpreter.

Operation         i586   i686   spu1   spu2
emptyloop         -41%   -15%    -0%    -0%
synchronized       -0%    +1%    +0%    +4%
invokestatic      +13%    +0%    +4%    -8%
invokespecial     +30%    +8%   +38%   -10%
invokevirtual     +17%    -0%    +7%    -9%
invokeinterface    -3%    -7%   +20%   -10%
getstatic          -3%    -2%   +20%    -0%
putstatic         -23%    -3%   +24%    +4%
getfield          -22%    -2%   +19%    -0%
putfield          -26%    -2%   +25%    +6%
arraylength       -18%    -9%    +2%   +12%
iaload            -64%    -6%    +1%    -0%
iastore           -14%    -3%    +1%    +1%
println            +6%    +4%    +3%    -2%
compile            +5%    +2%    -2%    -3%

No interception occurs in these tests; they just measure the overhead imposed on the interpreter to introduce the ability to intercept operations.


 
 
Table 4: Overhead on JIT compiler.

Operation          i586    i686    spu1    spu2
emptyloop           +0%     +1%     +0%     +0%
synchronized       +12%    +10%    +27%     +3%
invokestatic       +91%    +20%    +23%    +34%
invokespecial     +119%     +8%    +19%    +28%
invokevirtual      +30%   +158%     -6%     +0%
invokeinterface     +7%     +2%     +3%     +2%
getstatic          +68%   +148%   +163%   +163%
putstatic         +180%    +97%    +90%    +90%
getfield          +293%    +86%   +149%   +149%
putfield          +103%    +96%    +66%    +66%
arraylength       +258%    +86%   +140%   +150%
iaload            +191%    +98%    +55%    +95%
iastore           +236%    +55%    +41%    +45%
println            +45%     +6%     +5%    +12%
compile            +36%    +42%    +32%    +29%
compile-JIT       +105%   +112%    +81%    +54%
compile-diff       +16%    +17%    +20%    +20%

No interception occurs in these tests; they just measure the overhead imposed on the JIT compiler and the code it produces to introduce the ability to intercept operations.



 