In summary, the three major optimizations applied are as follows. First,
Lance-related I/O port accesses from the virtual machine are handled
in the VMM whenever possible. Second, during periods of heavy network
activity, packet transmissions are merged and sent during
IRQ-triggered world switches; this reduces the number of world
switches, the number of virtual IRQs, and the number of host IRQs
taken while executing in the VMM world. Finally, the VMNet driver is
augmented with shared memory that allows the VMApp to avoid calling
select() in some circumstances.
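As an illustration of the third optimization, the following is a minimal
sketch of how the VMApp might consult shared memory exported by the VMNet
driver before falling back to select(). The structure layout and the names
(VMNetShared, rxPending, VMAppPollVMNet) are assumptions made for
illustration, not taken from the actual implementation.

    /*
     * Hypothetical sketch of the select() avoidance path. Assumes the VMNet
     * driver exports a shared page (mapped here as vmnetShared) containing a
     * flag it sets whenever receive packets are queued; all names are
     * illustrative, not the actual VMware sources.
     */
    #include <stdint.h>
    #include <sys/select.h>

    typedef struct VMNetShared {
       volatile uint32_t rxPending;  /* set by the VMNet driver on packet arrival */
    } VMNetShared;

    static VMNetShared *vmnetShared; /* mapped from the VMNet driver via mmap() */
    static int vmnetFd;              /* file descriptor for the VMNet device */

    /* Returns nonzero if a packet is ready to be read from the VMNet device. */
    static int
    VMAppPollVMNet(void)
    {
       /* Fast path: consult shared memory written by the VMNet driver. */
       if (vmnetShared->rxPending) {
          return 1;
       }

       /* Slow path: fall back to a non-blocking select() on the device. */
       fd_set readSet;
       struct timeval zero = { 0, 0 };

       FD_ZERO(&readSet);
       FD_SET(vmnetFd, &readSet);
       return select(vmnetFd + 1, &readSet, NULL, NULL, &zero) > 0;
    }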
Figure 6 shows that these optimizations reduce
CPU overhead enough to allow VM/PC-733 to saturate a 100 Mbit
Ethernet link, and the throughput for VM/PC-350 more than doubles.
Table 2 lists the CPU overhead breakdown from the
time-based sampling measurements on VM/PC-733 with the
optimizations in place. Overall, the profile shows that the majority
of the I/O-related overhead is gone from the VMM and that the guest OS
now spends time idling. Additionally, guest context-switch
virtualization overheads become significant as the guest switches
between nettest and its idle task.
The ``Guest idle'' and ``Host IRQ processing while guest idle''
categories in Table 2 are derived with input from
direct measurements presented in Section 3.5. A
sample-based measurement of idle time indicates that 41.1% of VMM
time is spent idling the guest and taking host IRQs while idling.
However, discriminating between host IRQ processing time and guest idle
time via time-based sampling alone is difficult because of synchronized
timer ticks and the heavy interrupt rate produced by the workload. We
therefore use a direct measurement showing that 21.7% of total time is
spent in the guest idle loop to arrive at the idle-time breakdown in
Table 2.
The most effective optimization is handling IN and OUT accesses to
Lance I/O ports directly in the VMM whenever possible. This
eliminates world switches on Lance port accesses that do not require
real I/O. Additionally, Table 1 indicates that
accessing the Lance address register consumes around 8% of the VMM's
time, and taking advantage of the register's memory semantics has
completely eliminated that overhead from the profile, as shown in
Table 2.
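As a rough illustration of this optimization, the sketch below shows how a
VMM-side port handler might absorb Lance address-register writes into shadow
state and fall back to a world switch only for accesses with real side
effects. The port offsets, structure, and function names are hypothetical,
and the side-effect test is deliberately simplified.

    /*
     * Hypothetical sketch of in-VMM handling of Lance port accesses. The RAP
     * (register address port) only selects which register a later data-port
     * access will touch, so an OUT to it can be satisfied by updating
     * VMM-private state instead of switching worlds to the VMApp. Offsets and
     * names are illustrative, not taken from the actual VMware sources.
     */
    #include <stdint.h>
    #include <stdbool.h>

    #define LANCE_RDP_OFFSET 0x10   /* register data port (illustrative) */
    #define LANCE_RAP_OFFSET 0x12   /* register address port (illustrative) */

    typedef struct LanceVMMState {
       uint16_t rap;                /* currently selected register index */
       uint16_t csr[128];           /* shadow copies of emulated registers */
    } LanceVMMState;

    /*
     * Called from the VMM's OUT handler. Returns true if the access was
     * handled entirely in the VMM; false means a world switch to the VMApp
     * is still required (e.g. a write that may trigger real I/O).
     */
    static bool
    LanceHandleOut(LanceVMMState *s, uint16_t portOffset, uint16_t value)
    {
       switch (portOffset) {
       case LANCE_RAP_OFFSET:
          /* Pure memory semantics: just remember the selected register. */
          s->rap = value & 0x7f;
          return true;
       case LANCE_RDP_OFFSET:
          /* Writes without side effects can be absorbed into shadow state. */
          if (s->rap != 0) {        /* CSR0 writes may start a transmit */
             s->csr[s->rap] = value;
             return true;
          }
          return false;             /* defer to the VMApp via a world switch */
       default:
          return false;
       }
    }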
An interesting observation is that the time to transmit a packet via
the VMNet does not change noticeably; all of the gains come along
other paths. Instrumenting the optimized version in appropriate
locations shows that the average cycle count on the path to transmit a
packet onto the physical NIC is within 100 cycles of the totals from
Figure 5. However, this appears to contradict the times in
Table 2 for sending via the VMNet driver. The
disagreement stems from transmitting more than one packet at a time.
When individual packets are sent and timed, the baseline and
optimized transmits look very similar, but with send combining active,
up to 3 packets are sent back to back. This increases the chance of
taking a host transmit IRQ from a prior transmit while in the VMNet
driver. Since Table 2 reports the time from the
start to finish of the call into the VMNet driver, it also includes
the time the host kernel spends handling IRQs.
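For concreteness, here is a minimal sketch of the send-combining idea
discussed above: the VMM queues guest transmits and flushes them, up to three
at a time, on the next switch to the host world. The names, the fixed queue
depth, and the flush policy are illustrative assumptions rather than the
actual VMware code.

    /*
     * Hypothetical sketch of send combining. Instead of forcing a world
     * switch for every guest transmit, the VMM queues the packet and flushes
     * the queue the next time a world switch happens anyway (e.g. for a host
     * IRQ). All names and the queue-depth policy are illustrative.
     */
    #include <stdint.h>
    #include <string.h>

    #define MAX_COMBINED_SENDS 3
    #define MAX_PACKET_SIZE    1536

    typedef struct PendingSend {
       uint32_t len;
       uint8_t  data[MAX_PACKET_SIZE];
    } PendingSend;

    static PendingSend sendQueue[MAX_COMBINED_SENDS];
    static unsigned    numQueued;

    /* Provided elsewhere: hand one packet to the VMApp/VMNet transmit path. */
    extern void VMNetTransmit(const uint8_t *data, uint32_t len);
    /* Provided elsewhere: switch to the host world, which flushes the queue. */
    extern void ForceWorldSwitch(void);

    /* Called in the VMM when the guest asks the virtual Lance to transmit. */
    void
    LanceQueueTransmit(const uint8_t *data, uint32_t len)
    {
       if (numQueued == MAX_COMBINED_SENDS) {
          ForceWorldSwitch();       /* queue is full; cannot defer any longer */
       }

       PendingSend *p = &sendQueue[numQueued++];
       p->len = len < MAX_PACKET_SIZE ? len : MAX_PACKET_SIZE;
       memcpy(p->data, data, p->len);
    }

    /* Called on every switch to the host world, whatever triggered it. */
    void
    LanceFlushTransmits(void)
    {
       for (unsigned i = 0; i < numQueued; i++) {
          VMNetTransmit(sendQueue[i].data, sendQueue[i].len);
       }
       numQueued = 0;
    }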