
5.2.1 WebBench

Figure 9 shows the WebBench results for the following three kernels: the standard kernel (vanilla), the 32 coloring kernel (c32), and the multi-queue scheduler kernel (MQ). Table 4 shows the performance gains of c32 and MQ relative to vanilla.

Figure 9: WebBench Performance (# of httpd processes: 256)


Table 4: Speedup over the vanilla scheduler (WebBench)

              1 CPU   2 CPUs  3 CPUs  4 CPUs  5 CPUs  6 CPUs  7 CPUs  8 CPUs
32 coloring    8.2%   13.3%   15.8%   17.6%   18.0%   20.6%   22.6%   23.3%
Multi-Queue   -3.0%   13.6%   18.2%   21.7%   23.8%   26.9%   28.7%   30.6%


This graph shows that both c32 and MQ perform better than vanilla. As Table 4 shows, c32 and MQ achieve maximum improvements of 23.3% and 30.6%, respectively, on an 8-way system. MQ thus shows a larger performance improvement than c32, but it requires more extensive changes to implement. Although it requires fewer modifications, c32 nearly matches the performance improvement of MQ. Furthermore, c32 achieves a speedup for every CPU count, whereas MQ's performance drops on a single-CPU system.

Table 4 also shows that c32 gains a larger speedup as the number of CPUs increases. With one CPU the performance gain is 8.2%; with eight CPUs it reaches 23.3%. The major reason is the improvement in the lock statistics: the reduction of cache misses in the scheduler shortens the run queue traversal and dramatically decreases the hold time of the run queue lock. Thus, our coloring method is more effective on an SMP system than on a uniprocessor system.
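
The traversal in question is the goodness()-based run queue scan of the stock 2.4-series schedule(). The condensed fragment below is an illustrative sketch of that scan, not our modified code: declarations are included only for readability, and prev and this_cpu come from the enclosing schedule(). Every iteration dereferences another task_struct, so each cache miss in the loop adds directly to the time runqueue_lock is held.

    /* Condensed sketch of the 2.4-series run queue scan in schedule().
     * Each iteration touches another task_struct, so a cache miss per
     * task directly extends the runqueue_lock hold time. */
    struct task_struct *p, *next;
    struct list_head *tmp;
    int c;

    spin_lock_irq(&runqueue_lock);

    next = idle_task(this_cpu);      /* default choice               */
    c = -1000;                       /* best goodness value seen yet */

    list_for_each(tmp, &runqueue_head) {
            p = list_entry(tmp, struct task_struct, run_list);
            if (can_schedule(p, this_cpu)) {
                    int weight = goodness(p, this_cpu, prev->active_mm);
                    if (weight > c)
                            c = weight, next = p;
            }
    }

    spin_unlock_irq(&runqueue_lock);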

To confirm this, we also measured the following two characteristics.

L2 cache miss ratio:
the L2 cache miss ratio during the execution of the list_for_each() loop. The miss ratio is calculated by Equation (1), as described in Section 4.2. We used the performance monitoring counters built into the Pentium processor (a sketch of such counter-based instrumentation follows this list).

Lock hold time and lock contention:
the following two statistics: (1) the time for which the run queue lock is held, and (2) the fraction of lock requests that found the run queue lock busy when it was requested. These statistics are measured with the Lockmeter tool [10].
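
A minimal sketch of such counter-based instrumentation around the traversal is shown below. It is illustrative only: it assumes a P6-family processor whose performance counters 0 and 1 have already been programmed (through the PerfEvtSel MSRs) to count L2 requests and L2 misses, and the read_pmc() helper, the counter assignment, and the misses-over-requests form assumed for Equation (1) are simplifications rather than the actual measurement code.

    /* Illustrative sketch: read two pre-programmed P6 performance
     * counters immediately before and after the run queue traversal.
     * Counter 0 is assumed to count L2 requests, counter 1 L2 misses. */
    static inline unsigned long long read_pmc(int ctr)
    {
            unsigned int lo, hi;
            __asm__ __volatile__("rdpmc" : "=a" (lo), "=d" (hi) : "c" (ctr));
            return ((unsigned long long) hi << 32) | lo;
    }

    unsigned long long req0, miss0, req, miss;

    req0  = read_pmc(0);             /* L2 requests before the loop */
    miss0 = read_pmc(1);             /* L2 misses before the loop   */

    list_for_each(tmp, &runqueue_head) {
            p = list_entry(tmp, struct task_struct, run_list);
            /* goodness() computation as in the fragment above */
    }

    req  = read_pmc(0) - req0;       /* requests issued by the loop */
    miss = read_pmc(1) - miss0;      /* misses incurred by the loop */

    /* The per-loop deltas are accumulated; the ratio miss/req (the
     * form assumed here for Equation (1)) is computed off-line.     */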

The results are shown in Tables 5 and 6, respectively. With the vanilla kernel, the list_for_each() loop incurs a large number of cache misses, as seen in Table 5: the cache miss ratio exceeds 90%. In contrast, 32 coloring reduces it substantially, to roughly 10% for any number of CPUs. Moreover, as Table 6 shows, both the mean lock hold time and the lock contention are greatly reduced for any number of CPUs by 32 coloring. The lock contention drops from 45.6% to 23.4% on an 8-way system. This leads to better scalability than the vanilla kernel.


Table 5: L2 cache miss ratio during run queue traversal (WebBench)

              1 CPU   2 CPUs  3 CPUs  4 CPUs  5 CPUs  6 CPUs  7 CPUs  8 CPUs
vanilla       92.4%   98.0%   98.0%   92.4%   96.2%   94.6%   92.3%   93.9%
32 coloring    6.0%    7.5%    6.8%    7.2%    6.6%    7.4%   11.1%   13.7%



Table 6: Lock statistics for runqueue_lock (WebBench)

                             2 CPUs  3 CPUs  4 CPUs  5 CPUs  6 CPUs  7 CPUs  8 CPUs
vanilla      Contention        9.2%   15.4%   19.7%   22.9%   27.2%   35.9%   45.6%
             Hold Mean [us]      42      40      40      41      43      43      36
             Hold Max [us]      133     137     143     157     153     153     156
32 coloring  Contention        2.3%    4.1%    6.0%    7.2%    9.3%   14.0%   23.4%
             Hold Mean [us]      12      12      13      13      13      11      13
             Hold Max [us]      112     129     113     135     132      78      80


