By replacing our exact memory residency check with a cheaper heuristic, we gain performance, but introduce blocking into the sendfile() system call. New PerSleepInfo measurements of the blocking behavior of sendfile() are shown in Table 5.
The resource label ``sfbufa'' indicates that the kernel has exhausted the sendfile buffers used to map filesystem pages into kernel virtual memory. We confirm that increasing the number of buffers at boot time can mitigate this problem in our test. However, based on the results of previous copy-avoidance systems (31,17), we opt instead to implement recycling of kernel virtual address buffers. With this change, multiple requests to the same file do not cause multiple mappings, which eliminates the associated virtual memory and physical map (pmap) operations. Caching these mappings may temporarily use more wired memory than not caching them, but the reduction in overhead and address space consumption outweighs this drawback.
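The recycling idea can be illustrated with a small user-level model. This is only a sketch under stated assumptions, not the kernel code: the names `map_buf`, `buf_acquire`, `buf_release`, the pool size `NBUF`, and the `map_ops` counter (standing in for pmap work) are all hypothetical. A lookup that hits a cached mapping performs no new mapping work; a miss recycles the least recently used unreferenced buffer.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 4                       /* hypothetical pool size */

struct map_buf {
    int           valid;             /* slot currently holds a mapping */
    int           file_id;           /* hypothetical file identifier */
    long          page;              /* page index within the file */
    int           refs;              /* active users of this mapping */
    unsigned long last_use;          /* logical clock, used for LRU recycling */
};

static struct map_buf pool[NBUF];
static unsigned long  clock_tick;
static unsigned long  map_ops;       /* counts simulated mapping (pmap) operations */

/* Return a buffer that maps (file_id, page), reusing a cached mapping when
 * one exists. Returns NULL when every buffer is referenced; a kernel would
 * sleep at this point instead. */
struct map_buf *buf_acquire(int file_id, long page)
{
    struct map_buf *victim = NULL;

    for (size_t i = 0; i < NBUF; i++) {
        struct map_buf *b = &pool[i];
        if (b->valid && b->file_id == file_id && b->page == page) {
            b->refs++;
            b->last_use = ++clock_tick;
            return b;                /* hit: no new mapping is created */
        }
        /* Track the least recently used unreferenced slot. Never-used
         * slots have last_use == 0, so they are chosen first. */
        if (b->refs == 0 && (victim == NULL || b->last_use < victim->last_use))
            victim = b;
    }
    if (victim == NULL)
        return NULL;                 /* pool exhausted */

    victim->valid = 1;               /* miss: recycle the slot and map anew */
    victim->file_id = file_id;
    victim->page = page;
    victim->refs = 1;
    victim->last_use = ++clock_tick;
    map_ops++;
    return victim;
}

/* Drop one reference. The mapping stays cached for later reuse. */
void buf_release(struct map_buf *b)
{
    assert(b != NULL && b->refs > 0);
    b->refs--;
}
```

In this model, two concurrent requests for the same page share one buffer and increment `map_ops` only once, and a released mapping remains available until its slot is recycled, which is the source of the extra wired memory noted above.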
The other two resource labels, ``getblk'' and ``biord'', are related to disk access initiated within sendfile() when the requested pages are not in memory. Even though the socket being used is nonblocking, that behavior applies only to network buffer usage. We introduce a new flag to sendfile() so that it returns a different errno value if disk blocking would occur. This change allows us to achieve the same effect as we had with mincore(), but with much less CPU overhead. We could optionally have the read helper process send data directly back to the client on a filesystem cache miss, but we have not implemented this optimization.
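A minimal sketch of how a server might act on such a return value follows. The text does not name the flag or the errno value, so the function takes the disk-blocking errno as a parameter; `classify_send` and the `next_step` values are hypothetical names. Later FreeBSD releases document an SF_NODISKIO flag that reports EBUSY, but treating that as the change described here is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Hypothetical outcomes for one send attempt in an event-driven server. */
enum next_step {
    SENT,               /* some or all of the data was queued */
    WAIT_FOR_SOCKET,    /* socket buffer full: wait for writability */
    ASK_READ_HELPER,    /* pages not in memory: hand off to a helper */
    FAILED              /* any other error: close the connection */
};

/* Classify the result of a sendfile()-like call.
 * rc and err are the call's return value and the errno it left behind.
 * disk_errno is the value the kernel uses to signal that the call would
 * have blocked on disk; it must differ from the socket would-block codes. */
enum next_step classify_send(ssize_t rc, int err, int disk_errno)
{
    assert(disk_errno != EAGAIN && disk_errno != EWOULDBLOCK);

    if (rc >= 0)
        return SENT;
    if (err == disk_errno)
        return ASK_READ_HELPER;
    if (err == EAGAIN || err == EWOULDBLOCK)
        return WAIT_FOR_SOCKET;
    return FAILED;
}
```

The point of the distinct errno is visible in the two middle cases: without it, the server cannot tell a full socket buffer from a filesystem cache miss, and would need a separate residency check such as mincore() before each call.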
However, even with blocking eliminated, performance barely changes when using sendfile() instead of writev(); we find that the problem stems from the handling of small writes. HTTP responses consist of a small header followed by file data. The writev() code aggregates the header and the first portion of the body data into one packet, which benefits small file transfers. In SpecWeb99, 35% of all static requests are for files 1KB or smaller.
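The aggregation that writev() provides can be sketched as follows. The function name `send_response` is hypothetical, and the sketch assumes the body is already in memory; it passes the header and body to the kernel in a single call, so the network stack can place both in one packet when they fit.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Submit an HTTP header and the start of the body with one system call.
 * Returns the number of bytes accepted, or -1 with errno set. A short
 * count means the caller must retry with the remaining bytes. */
ssize_t send_response(int fd, const char *hdr, const void *body, size_t body_len)
{
    struct iovec iov[2];

    assert(hdr != NULL);

    iov[0].iov_base = (void *)hdr;        /* writev() does not modify the data */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = body_len;

    return writev(fd, iov, 2);
}
```

Issuing a write() for the header followed by a separate call for the body would give the stack two independent submissions, which is the small-write behavior described above.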
The FreeBSD sendfile() call includes parameters specifying headers and trailers to be sent with the data, whereas the Linux implementation does not. Linux instead introduces a new socket option, TCP_CORK, to delay transmission until full packets can be assembled. While FreeBSD's ``monolithic'' approach provides enough information to avoid sending a separate header, its implementation uses a kernel version of writev() for the header, thus generating an extra packet. We improve this implementation by creating an mbuf chain using the header and body data before sending it to lower levels of the network stack. This change generates fewer packets, improving performance and reducing network latency. Results of these changes on a microbenchmark are shown in Figure 7. With the sendfile() changes, we are able to achieve a SpecWeb99 score of 820, with a dataset size of 2.7GB.
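The packet savings can be estimated with a simplified model. This sketch assumes that each submission to the network stack is segmented independently and ignores Nagle's algorithm, TCP options, and retransmissions; the function names and the 1460-byte segment size used in the note below are illustrative assumptions, not measurements from this work.

```c
#include <assert.h>
#include <stddef.h>

/* Number of segments needed to carry `bytes` at a given maximum
 * segment size (integer ceiling division). */
static size_t segments(size_t bytes, size_t mss)
{
    assert(mss > 0);
    return (bytes + mss - 1) / mss;
}

/* Header and body submitted separately: each is segmented on its own. */
size_t packets_separate(size_t hdr, size_t body, size_t mss)
{
    return segments(hdr, mss) + segments(body, mss);
}

/* Header and body joined in one chain before segmentation. */
size_t packets_chained(size_t hdr, size_t body, size_t mss)
{
    return segments(hdr + body, mss);
}
```

Under these assumptions, a 250-byte header with a 1024-byte body needs two packets when sent separately and one when chained, while a 4096-byte body needs four versus three. The relative saving is largest for the small files that make up a large share of the static requests.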