Table 1: 3DES throughput using the quad-Hifn 7751 card, by number of threads and buffer size.

Number of threads | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes | 16384 bytes
1  | 3.06 Mbps | 11.45 Mbps | 33.15 Mbps | 59.49 Mbps  | 79.19 Mbps  | 80.75 Mbps
2  | 5.53 Mbps | 18.40 Mbps | 56.07 Mbps | 111.60 Mbps | 154.18 Mbps | 160.02 Mbps
3  | 6.44 Mbps | 23.25 Mbps | 71.31 Mbps | 152.28 Mbps | 229.60 Mbps | 238.24 Mbps
4  | 6.83 Mbps | 25.77 Mbps | 80.91 Mbps | 182.65 Mbps | 292.15 Mbps | 299.33 Mbps
32 | 7.37 Mbps | 27.51 Mbps | 94.05 Mbps | 249.17 Mbps | 313.79 Mbps | 320.19 Mbps
Table 2: 3DES throughput using four 5820 cards on the 64-bit/66 MHz PCI bus.

Number of threads | 16 bytes | 64 bytes | 256 bytes | 1024 bytes | 8192 bytes | 16384 bytes
1  | 5.42 Mbps | 18.88 Mbps | 61.94 Mbps  | 151.95 Mbps | 300.88 Mbps | 254.79 Mbps
32 | 9.91 Mbps | 37.01 Mbps | 120.71 Mbps | 410.27 Mbps | 758.85 Mbps | 801.81 Mbps
Finally, we wish to determine how well the OCF can load-balance crypto requests when multiple accelerators are available, and what aggregate throughput can be achieved in that scenario. We use a custom-made card by Avaya that contains four Hifn 7751 chips, which appear as separate devices behind a PCI bridge resident on the card. We use multiple threads that issue 3DES encryption requests and vary the buffer size across runs. The results are shown in Table 1. Performance peaks with 32 threads and 16 KB buffers at 320 Mbps, which is over 96% of the maximum rated throughput of four Hifn 7751 chips. The card was installed on the 64-bit/66 MHz PCI bus, but because the chip is a 32-bit/33 MHz device, the maximum bus transfer rate is 1.056 Gbps. At our peak rate we use over 640 Mbps of the bus: 320 Mbps of data in each direction (to and from the card), plus the transfer-initialization commands, descriptor-ring probing, and so on, thus utilizing over 60% of the PCI bus. Notice that because the card uses a PCI bridge, a 2-cycle latency is added to each PCI transaction.
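For reference, the bus arithmetic above can be reproduced directly. The following standalone C snippet (ours, not part of the benchmark harness) recomputes the 32-bit/33 MHz bus capacity and the share consumed at the 320 Mbps peak:

    #include <stdio.h>

    int
    main(void)
    {
        /* 32-bit/33 MHz PCI: 32 bits transferred per clock at 33 MHz. */
        double bus_mbps  = 32.0 * 33.0;        /* 1056 Mbps = 1.056 Gbps */
        double peak_mbps = 320.0;              /* Table 1: 32 threads, 16 KB buffers */
        double data_mbps = 2.0 * peak_mbps;    /* each buffer crosses the bus twice */

        printf("bus capacity   : %.3f Gbps\n", bus_mbps / 1000.0);
        printf("data traffic   : %.0f Mbps\n", data_mbps);
        printf("bus utilization: %.1f%% before command/descriptor traffic\n",
            100.0 * data_mbps / bus_mbps);
        return 0;
    }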
The card was installed on the 64-bit/66 MHz bus because the system's 32-bit/33 MHz bus exhibited surprisingly poor performance, probably due to contention from the many other system components that reside on that bus: since the machine operates as it normally would while the test runs, the scheduler is active and two clock interrupts are received, at 100 and 128 Hz respectively. Other devices also generate their own interrupts.
Another possible cause is an artifact of the i386 spl protection mechanism: a regular spl subsystem disables interrupts from a given class of devices when the corresponding splX() call is invoked. For instance, calling splbio() blocks interrupts from all devices in the ``bio'' class. On the i386, the registers used for interrupt blocking (found on the programmable interrupt controller, or PIC) are located on the 8 MHz ISA bus, and it is the PIC (rather than the APIC) that OpenBSD uses for interrupt management.
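For reference, the idiom in question looks roughly like the following kernel-code fragment; splbio() and splx() are the real interface, while the surrounding context is only illustrative:

        int s;

        s = splbio();   /* block ``bio''-class interrupts; in a straightforward
                           implementation this masks the class at the PIC, i.e.
                           an access to the ISA bus */
        /* ... manipulate state shared with the device's interrupt handler ... */
        splx(s);        /* restore the previous interrupt priority level */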
Worse yet, some operations on this device require a 1 usec delay before taking effect. To partially mitigate this extremely high overhead, the i386 kernel interrupt model instead points the vectors of blocked interrupt routines at a single-depth queuing function that performs the actual interrupt blocking at reception time; when the spl is lowered again, the original interrupt handler is called. However, the 8 MHz ISA bus must still be accessed, which further reduces the bandwidth available on the PCI bus. One small-buffer benchmark generated over 62,000 interrupts/sec; we believe that the spl optimization breaks down under such load.
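The deferred-blocking scheme can be illustrated with the following self-contained sketch; all names (spl_mask, pending_mask, pic_mask(), and so on) are ours, and the printf() calls merely stand in for the real PIC accesses and handler invocations:

    #include <stdio.h>

    #define NIRQ 16

    static unsigned int spl_mask;       /* sources blocked at the current level */
    static unsigned int pending_mask;   /* deferred (single-depth) interrupts */
    static void (*handler[NIRQ])(void);

    static void
    pic_mask(int irq, int on)
    {
        /* Stand-in for the slow access to the PIC registers on the ISA bus. */
        printf("PIC: %s irq %d\n", on ? "mask" : "unmask", irq);
    }

    static void
    interrupt(int irq)
    {
        if (spl_mask & (1U << irq)) {
            /* Blocked: queue it (single depth) and only now mask it at the PIC. */
            pending_mask |= 1U << irq;
            pic_mask(irq, 1);
            return;
        }
        handler[irq]();                 /* not blocked: service it immediately */
    }

    static void
    splx(unsigned int new_mask)
    {
        spl_mask = new_mask;
        /* Replay any deferred interrupts that the new level no longer blocks. */
        for (int irq = 0; irq < NIRQ; irq++) {
            unsigned int bit = 1U << irq;
            if ((pending_mask & bit) && !(spl_mask & bit)) {
                pending_mask &= ~bit;
                pic_mask(irq, 0);       /* another ISA-bus access */
                handler[irq]();         /* the original handler runs now */
            }
        }
    }

    static void
    bio_handler(void)
    {
        printf("bio interrupt serviced\n");
    }

    int
    main(void)
    {
        handler[11] = bio_handler;
        spl_mask = 1U << 11;    /* analogous to splbio(): block the bio class */
        interrupt(11);          /* arrives while blocked: deferred, PIC masked */
        splx(0);                /* lower the spl: the handler finally runs */
        return 0;
    }

Even with the optimization, the two pic_mask() calls in the deferred path correspond to ISA-bus accesses, which is why heavy interrupt load still hurts.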
Using four 5820 cards on a 64-bit/66 MHz PCI bus allows us to achieve even higher throughput, as shown in Table 2. We show only the 1- and 32-thread tests; the remaining measurements followed a curve similar to that of the quad-7751. Performance peaked at over 800 Mbps of crypto throughput. By the same analysis as before, we are using in excess of 1.6 Gbps of the fast PCI bus, which has a throughput of 4.22 Gbps, for slightly over 38% bus utilization. As we mentioned in Section 5.1, the vendor rates this card at 310 Mbps, so the maximum theoretically attainable rate would be 1.24 Gbps; we thus achieve 64.5% utilization of the four cards. A rough sampling of CPU activity during these large-block benchmarks on both cards showed around 10,000 interrupts/second, which is substantial for a PC.
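The card-utilization figure follows directly from the vendor rating; a small standalone check (using the 310 Mbps per-card rating from Section 5.1 and the rounded 800 Mbps peak):

    #include <stdio.h>

    int
    main(void)
    {
        double rated_mbps = 310.0;              /* per-card vendor rating (Section 5.1) */
        double peak_mbps  = 800.0;              /* rounded peak from Table 2 */
        double aggregate  = 4.0 * rated_mbps;   /* four cards */

        printf("aggregate rated : %.2f Gbps\n", aggregate / 1000.0);
        printf("card utilization: %.1f%%\n", 100.0 * peak_mbps / aggregate);
        return 0;
    }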
Investigating further, we determined that all four 5820 cards were sharing irq 11. Thus it is possible that the culprit, at least for the small buffer sizes, is the spl optimization mentioned above: the vmstat utility shows anywhere from 50,000 to 60,000 interrupts per second when processing buffers of 16 to 1024 bytes. Furthermore, because of a quirk in the handling of shared irqs, some cards experience slightly worse interrupt-service latency: shared irq handlers are kept in a linked list, and when multiple cards raise the interrupt at the same time, the list is traversed from the beginning for each interrupt raised, with each irq handler polling its card to determine whether that card issued the interrupt. However, fixing this quirk or moving the cards to different irqs did not significantly improve throughput.
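The shared-irq behaviour corresponds roughly to the dispatch loop below; struct intrhand and the chained handlers exist in the i386 interrupt code, but the fields and the loop shown here are a simplified sketch:

    struct intrhand {
        int               (*ih_fun)(void *);   /* polls its device; nonzero if it interrupted */
        void               *ih_arg;            /* e.g. the softc of one 5820 card */
        struct intrhand    *ih_next;
    };

    /* One chain per irq line; all four 5820s end up on the irq 11 chain. */
    static struct intrhand *intrchain[16];

    static void
    irq_dispatch(int irq)
    {
        struct intrhand *ih;

        /*
         * The chain is walked from its head on every interrupt, and each
         * handler polls its own card to see whether that card raised the
         * line, so cards late in the list see slightly worse latency.
         */
        for (ih = intrchain[irq]; ih != NULL; ih = ih->ih_next)
            (*ih->ih_fun)(ih->ih_arg);
    }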
When we use 8192-byte buffers, the interrupt count drops to 12,000, which the system can handle. In each of these cases, the system spends approximately 65% of its time inside the kernel; most of this cost can be attributed to data copying. However, as we move to larger buffer sizes, the system spends 89% of its time in the kernel and only 1.9% in user applications, for the case of 16 KB buffers. The number of interrupts in this case is only 5,600, which the system can easily handle. The problem here is the considerable data copyin/copyout between the kernel and the applications; aggravating the situation, no other thread can execute while such a copy is in progress, causing a ``convoy'' effect: while the kernel is copying a 16 KB buffer to the application buffer, interrupts arrive that cause more completed requests to be placed on the crypto thread's ``completed'' queue. The system will not let the applications run again before all completed requests are handled, which causes yet more data copying. Thus, the queue almost drains before the applications are able to issue requests again and refill it. We intend to investigate this phenomenon further.
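Schematically, the completion path behaves like the loop below; the names splcrypto(), dequeue_completed(), and copyout_result() are placeholders rather than actual OCF functions:

    for (;;) {
        int s = splcrypto();                /* briefly block the card interrupts */
        struct cryptop *crp = dequeue_completed();
        splx(s);
        if (crp == NULL)
            break;                          /* queue drained: applications may run */
        copyout_result(crp);                /* the 16 KB copy to the user buffer; */
                                            /* new completions queue up meanwhile */
    }

Because interrupts keep appending new completions while copyout_result() runs, the queue nearly drains before the loop exits and the applications can run and refill it.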
Fundamentally, the data copyin/copyout limitation is inherent in the memory subsystem, whose write bandwidth we measured to be approximately 2.4 Gbps. Using the crypto cards, we in fact perform three memory-write operations for each data buffer: one copyin to the kernel, one DMA from the card to main memory, and one copyout to the application. Notice that data DMA'ed in from the card is not resident in the CPU cache, as all such data is considered ``suspect'' for caching purposes. In addition, there is an equal number of memory reads (the copyin source, the DMA from main memory to the card, and the copyout source). Each of these transfers represents an aggregate of 800 Mbps. When we ran the same test with three 5820 cards, performance slightly improved to 841.7 Mbps for 16 KB buffers, achieving over 90% utilization of the three cards. In this case the memory subsystem is still saturated, but the cards can more easily obtain a PCI-bus grant and perform their DMA.
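The bandwidth budget above can be checked with a few lines of arithmetic (a standalone sketch of the accounting, not measurement code):

    #include <stdio.h>

    int
    main(void)
    {
        double crypto_mbps = 800.0;     /* aggregate crypto throughput */
        double write_bw    = 2400.0;    /* measured memory write bandwidth (Mbps) */
        /* Writes per buffer: copyin, DMA from the card, copyout.              */
        /* Reads per buffer : copyin source, DMA to the card, copyout source.  */
        double writes_mbps = 3.0 * crypto_mbps;
        double reads_mbps  = 3.0 * crypto_mbps;

        printf("memory writes: %.1f Gbps of the %.1f Gbps available\n",
            writes_mbps / 1000.0, write_bw / 1000.0);
        printf("memory reads : %.1f Gbps\n", reads_mbps / 1000.0);
        return 0;
    }

The write side alone matches the measured 2.4 Gbps, which is why the memory subsystem, rather than the cards or the bus, is the bottleneck here.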