
Interaction of Threads with Memory System

Thread        Cycles   Records  IPC     Instructions     L1 Instructions              L1 Data
                                                         References      Misses       References      Misses
Warehouse     59.12%      1995  0.481   11,261,031,705   6,080,171,921   89,966,131   5,885,067,463   461,297,214
GC            13.96%        11  0.524    2,885,482,384   1,447,776,154      482,694   1,113,092,487   281,135,697
Compilation    6.46%       271  0.565    1,531,568,783     905,536,702    5,757,690     769,831,082    89,280,559
Controller     0.11%       141  0.212       11,867,348       8,730,297      161,096       8,496,326       835,170
TraceWriter    0.05%        82  0.097        2,411,860       1,395,686       68,802       1,599,610       447,691
Organizer      0.03%       141  0.166        2,037,304       1,322,770       39,568       1,307,913       145,696

Table: L1 data and instruction cache references and misses for different Java threads.

We used the Performance Explorer to understand the references and miss rates at the different memory hierarchy levels for both data and instructions of each Java thread. In particular, we compute the L1 data cache miss rate by dividing the number of load misses in L1 (LD_MISS_L1) by the number of load references to L1 (LD_REF_L1). Unfortunately, the POWER4 HPM support does not provide miss and reference events for the other levels in the memory hierarchy, but it does provide access counts: the number of times data is accessed from a particular cache. Therefore, we compute the number of references to a memory hierarchy level as the sum of the accesses to that level and to all lower levels, and the number of misses at a level as the sum of the references at the levels below it; the miss rate for a level is its misses divided by its references. For example, the POWER4 HPM provides DATA_FROM_X events that count the number of times data is accessed from level X of the memory hierarchy. Thus, the L3 miss rate is DATA_FROM_MM / (DATA_FROM_L3 + DATA_FROM_MM).
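The derivation above can be sketched in a few lines. The DATA_FROM_X event names are the real POWER4 HPM events mentioned in the text, but the counter values below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Sketch of the miss-rate derivation described above.  The event names are
# the POWER4 HPM events from the text; the counts are made-up examples.
accesses = {
    "DATA_FROM_L2": 50_000_000,  # data satisfied from L2
    "DATA_FROM_L3": 4_000_000,   # data satisfied from L3
    "DATA_FROM_MM": 1_000_000,   # data satisfied from main memory
}

# References to a level = accesses satisfied at that level plus accesses
# satisfied at all lower (more distant) levels.
ref_l2 = accesses["DATA_FROM_L2"] + accesses["DATA_FROM_L3"] + accesses["DATA_FROM_MM"]
ref_l3 = accesses["DATA_FROM_L3"] + accesses["DATA_FROM_MM"]

# Misses at a level = references to the next lower level.
miss_l2 = ref_l3
miss_l3 = accesses["DATA_FROM_MM"]

l2_miss_rate = miss_l2 / ref_l2
l3_miss_rate = miss_l3 / ref_l3
print(f"L2 miss rate: {l2_miss_rate:.2%}")  # 5,000,000 / 55,000,000 = 9.09%
print(f"L3 miss rate: {l3_miss_rate:.2%}")  # 1,000,000 /  5,000,000 = 20.00%
```

Note that the L3 miss-rate expression in the last line matches the DATA_FROM_MM / (DATA_FROM_L3 + DATA_FROM_MM) formula given above.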

Table * presents the reference and miss metrics for the L1 instruction and data caches. The Thread column identifies the Java threads. The Cycles column gives the percentage of total cycles spent executing each thread. The Records column reports the number of trace records captured for the thread. The IPC column reports instructions per cycle. The Instructions column reports the average number of instructions executed for each time quantum in which the Java thread runs. The next two columns, labeled L1 Instructions, report the references and misses to the L1 instruction cache; the last two columns report the same values for the L1 data cache. The reference and miss-rate metrics for the other levels of the memory hierarchy can be computed from Table * and the subsequent Figures *-*.
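The L1 instruction miss rates discussed below follow directly from the table's References and Misses columns; as a quick sketch, dividing the two columns for the three busiest threads reproduces, for example, the 0.033% GC instruction miss rate cited later in the text:

```python
# L1 instruction-cache (references, misses) pairs, copied from the table above.
l1_icache = {
    "Warehouse":   (6_080_171_921, 89_966_131),
    "GC":          (1_447_776_154, 482_694),
    "Compilation": (905_536_702, 5_757_690),
}

for thread, (refs, misses) in l1_icache.items():
    # Miss rate = misses / references; GC prints 0.0333%, matching the
    # 0.033% figure quoted in the discussion of GC instruction locality.
    print(f"{thread}: {misses / refs:.4%}")
```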

Figure: Instruction references to levels in the memory hierarchy normalized to all instruction references.

Figure: Data references to levels in the memory hierarchy normalized to all data references.

Figures * and * illustrate the references to L2, L3, and main memory as a percentage of all references, for both instructions and data. The horizontal axis shows the Java threads and the vertical axis shows the percentage of all references for a particular thread. For each thread, there are three bars, one for each memory hierarchy level: L2, L3, and MM. Figure * shows that the three Java threads (warehouse, GC, and compilation), which together consume 79% of all cycles, have a 98% hit rate in the L1 instruction cache, indicating excellent instruction locality. The daemon threads, which execute infrequently and for short intervals, have worse L1 instruction locality. This is expected: because they execute infrequently, little or none of their code is still in the cache when they run, and because their execution intervals are short, there is little time over which to amortize the cache misses. Of the three daemon Java threads, TraceWriter stands out as having the worst instruction locality, with almost 10% of its instructions not found in the L1 cache; half of that 10% is found in the L2 cache. TraceWriter is the least frequently executed daemon thread.

For data references, illustrated in Figure *, the story is better and remarkably consistent across all threads. At most 5% of data is not found in the L1 cache for any of the Java threads. Both the warehouse and GC threads have the best locality, with less than 3% of their data not found in the L1 cache. This implies that the working sets of the Java threads fit into the 64KB data cache. As expected, the lower the memory level, the fewer the references: L2 receives fewer than 3.4% of all references, and main memory fewer than 0.2%, across all the Java threads.
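The normalized-reference percentages plotted in the figures follow the same accumulation rule as the miss rates: references reaching a level are divided by all L1 references. A minimal sketch, using the real LD_REF_L1 and DATA_FROM_X event names but hypothetical counter values:

```python
# Sketch of the normalization used in the figures.  Event names are the
# POWER4 HPM events from the text; the counts are made-up examples.
ld_ref_l1 = 2_000_000_000  # all L1 data references (LD_REF_L1)
data_from = {"L2": 60_000_000, "L3": 3_000_000, "MM": 1_000_000}

# References reaching level X = accesses satisfied at X plus accesses
# satisfied at all lower levels.
ref = {
    "L2": data_from["L2"] + data_from["L3"] + data_from["MM"],
    "L3": data_from["L3"] + data_from["MM"],
    "MM": data_from["MM"],
}

for level, r in ref.items():
    # One bar per level in the figures: the level's share of all references.
    print(f"{level}: {r / ld_ref_l1:.2%} of all data references")
```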

Figure: Instruction miss rates to levels in the memory hierarchy normalized to L1.

Figure: Data miss rates to levels in the memory hierarchy.

Figures * and * illustrate the miss rates at the different levels of the memory hierarchy for both instructions and data. The figures show that both the instruction and data L1 miss rates are small: less than 5% for instructions and less than 3.2% for data across all the Java threads. Furthermore, the figures show that the L3 is of little help for all but warehouse instructions: the L3 miss rate ranges from 33.57% to 84.01%. In general, the L2 miss rate is better for data than for instructions. However, because references to L3 and main memory are rare, the high L2 and L3 miss rates are not of great concern.

Figure * shows that GC incurs a 0.033% instruction miss rate in L1. There are two reasons for this: (i) GC runs in uninterruptible mode, where other Java threads cannot preempt its execution and evict its working set from the cache; and (ii) GC spends most of its time in a small inner loop that easily fits in the cache. The instruction miss rates for the GC thread in L2 and L3 are not particularly relevant to performance, since GC makes so few instruction references to L2 and L3.

