5 Related Work

 

Improving communications performance has long been a popular research topic. Previous work has focused on protocols, protocol and infrastructure implementations, and the underlying network device software.

The VMTP protocol [Cheriton & Williamson 1989] attempted to provide a general-purpose protocol optimized for small packets and request-response traffic. It performed quite well for the hardware it was implemented on, but never became widely established; VMTP's design target was request-response and bulk transfer traffic, rather than the byte stream and datagram models provided by the TCP/IP protocol suite. In contrast, Fast Sockets provides the same models as TCP/IP and maintains application compatibility.

Other work [Watson & Mamrak 1987][Clark et al. 1989] argued that protocol implementations, rather than protocol designs, were to blame for poor performance, and that efficient implementations of general-purpose protocols could do as well as or better than special-purpose protocols for most applications. The measurements made in [Kay & Pasquale 1993] lend credence to these arguments; they found that memory operations and operating system overheads played a dominant role in the cost of large packets. For small packets, however, protocol costs were significant, accounting for up to 33% of processing time for single-byte messages.

The concept of reducing infrastructure costs was explored further in the x-kernel [Peterson 1993][Hutchinson & Peterson 1991], an operating system designed for high-performance communications. The original, stand-alone version of the x-kernel performed significantly better at communication tasks than did BSD Unix on the same hardware (Sun 3s), using similar implementations of the communications protocols. Later work [Druschel & Peterson 1993][Druschel et al. 1993][Pagels et al. 1994][Druschel et al. 1994] focused on hardware design issues relating to network communication and on software techniques to exploit hardware features. Key contributions from this work were the concepts of application device channels (ADCs), which provide protected user-level access to a network device, and fbufs, which provide a mechanism for rapid transfer of data from the network subsystem to the user application. While Active Messages provides the equivalent of an ADC for Fast Sockets, fbufs are not needed, as receive posting allows data to be transferred directly into the user application.
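
To make the receive-posting idea concrete, the following is a minimal, single-process sketch in C; the names (post_receive, deliver, posted_recv) are entirely hypothetical and are not part of the Fast Sockets or Active Messages interfaces. The application hands the transport a destination buffer before the matching data arrives, so incoming fragments are deposited directly into application memory rather than into an intermediate buffer such as an fbuf or socket buffer.

/*
 * Minimal single-process sketch of receive posting: the receiver supplies
 * a destination buffer before the data arrives, so arriving bytes land
 * directly in application memory with no bounce buffer and no second copy.
 * All names here are hypothetical, for illustration only.
 */
#include <stdio.h>
#include <string.h>

struct posted_recv {
    void   *base;   /* application buffer supplied in advance */
    size_t  len;    /* capacity of that buffer */
    size_t  filled; /* bytes deposited so far */
};

static struct posted_recv pending;   /* one outstanding post, for brevity */

/* Application posts its own buffer before the matching data arrives. */
static void post_receive(void *buf, size_t len)
{
    pending.base = buf;
    pending.len = len;
    pending.filled = 0;
}

/* "Network" side: deposit an arriving fragment straight into the posted
 * buffer instead of an intermediate system buffer. */
static void deliver(const void *frag, size_t len)
{
    size_t room = pending.len - pending.filled;
    size_t n = len < room ? len : room;
    memcpy((char *)pending.base + pending.filled, frag, n);
    pending.filled += n;
}

int main(void)
{
    char msg[32];

    post_receive(msg, sizeof msg);   /* receive posting */
    deliver("hello, ", 7);           /* fragments land in msg directly */
    deliver("world", 6);             /* includes the terminating NUL */

    printf("received %zu bytes: %s\n", pending.filled, msg);
    return 0;
}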

More recently, a zero-copy TCP stack for Solaris [Chu 1996] aggressively exploited hardware and operating system features such as direct memory access (DMA), page remapping, and copy-on-write pages to improve communications performance. To take full advantage of the zero-copy stack, user applications had to use page-aligned buffers and transfer sizes larger than a page. Because of these limitations, the designers focused on improving realized throughput rather than small-message latency, which Fast Sockets addresses. The resulting system achieved 32 MB/s throughput on a similar network, but with a slower processor. This throughput was achieved for large transfers (16 Kbytes), not the small packets that make up the majority of network traffic. The work also required a thorough understanding of the Solaris virtual memory subsystem and changes to the operating system kernel; Fast Sockets is an entirely user-level solution.
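
The buffer discipline such a page-remapping path imposes on applications can be sketched with standard POSIX calls. The function below is illustrative only: the socket descriptor is assumed to exist, and whether the underlying stack actually remaps pages is outside the application's control; the point is the page-aligned, whole-page transfers the zero-copy stack required before it could avoid copying.

/*
 * Sketch of the application-side buffer discipline for a page-remapping
 * zero-copy send path: page-aligned buffers and transfers that span whole
 * pages.  Standard POSIX calls only; "sock" is assumed to be a connected
 * stream socket set up elsewhere.
 */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

ssize_t send_page_aligned(int sock, size_t npages)
{
    long    page = sysconf(_SC_PAGESIZE);
    size_t  len  = (size_t)page * npages;   /* whole pages only */
    void   *buf;
    ssize_t sent;

    /* Page-aligned allocation; ordinary malloc() gives no such guarantee. */
    if (posix_memalign(&buf, (size_t)page, len) != 0)
        return -1;

    memset(buf, 'x', len);          /* stand-in for real payload */
    sent = write(sock, buf, len);   /* eligible for the zero-copy path */

    free(buf);
    return sent;
}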

An alternative to reducing internal operating system costs is to bypass the operating system altogether and move protocol processing into either a user-level library (like Fast Sockets) or a separate user-level server. Mach 3.0 used the latter approach [Forin et al. 1991], which yielded poor networking performance [Maeda & Bershad 1992]. Both [Maeda & Bershad 1993a] and [Thekkath et al. 1993] explored building TCP into a user-level library linked with existing applications. Both systems, however, attempted only to match in-kernel performance, rather than to better it. Further, both systems used in-kernel facilities for message transmission, limiting the possible performance improvement. Edwards and Muir [Edwards & Muir 1995] built an entirely user-level solution, but used a TCP stack that had been written for the HP/UX kernel. Their solution replicated the organization of the kernel at user level and performed worse than the in-kernel TCP stack.

[Kay & Pasquale 1993] showed that interfacing to the network card itself was a major cost for small packets. Recent work has focused on reducing this portion of the protocol cost, and on utilizing the message coprocessors that are appearing on high-performance network controllers such as Myrinet [Seitz 1994]. Active Messages [von Eicken et al. 1992] is the base upon which Fast Sockets is built and is discussed above. Illinois Fast Messages [Pakin et al. 1995] provided an interface similar to that of previous versions of Active Messages, but did not allow processes to share the network. Remote Queues [Brewer et al. 1995] provided low-overhead communications similar to that of Active Messages, but separated the arrival of messages from the invocation of handlers.
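
The distinction between these interfaces can be made concrete with a toy, single-address-space C sketch; the names and types below are hypothetical and are not the actual Active Messages, Fast Messages, or Remote Queues interfaces. It contrasts invoking the handler named by a message at arrival time with merely enqueuing arriving messages and running handlers only when the receiver drains the queue.

/*
 * Toy illustration of two receive disciplines.  Active Messages style:
 * the handler named by a message runs when the message arrives.  Remote
 * Queues style: arrival only enqueues the message; handler invocation is
 * a separate step under the receiver's control.  Hypothetical names only.
 */
#include <stdio.h>

typedef void (*handler_fn)(int arg);

struct msg { handler_fn handler; int arg; };

/* Active Messages style: invoke the named handler at arrival time. */
static void am_arrive(struct msg m)
{
    m.handler(m.arg);
}

/* Remote Queues style: arrival only enqueues; invocation comes later. */
#define QCAP 16
static struct msg queue[QCAP];
static int qlen;

static void rq_arrive(struct msg m)
{
    if (qlen < QCAP)
        queue[qlen++] = m;
}

static void rq_drain(void)        /* receiver decides when handlers run */
{
    for (int i = 0; i < qlen; i++)
        queue[i].handler(queue[i].arg);
    qlen = 0;
}

static void print_handler(int arg)
{
    printf("handler invoked with %d\n", arg);
}

int main(void)
{
    struct msg m = { print_handler, 42 };

    am_arrive(m);   /* handler runs immediately, at "arrival" */

    rq_arrive(m);   /* nothing runs yet ... */
    rq_drain();     /* ... handler runs only when the queue is drained */
    return 0;
}

Decoupling arrival from handler execution in this way gives the receiver control over when handler code runs, at the cost of an extra queueing step.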

The SHRIMP project implemented a stream-sockets layer that uses many of the same techniques as Fast Sockets [Damianakis et al. 1996]. SHRIMP supports communication via shared memory and the execution of handler functions on data arrival. The SHRIMP network had hardware latencies (4-9 microseconds one-way) much lower than those of the Fast Sockets Myrinet, but its maximum bandwidth (22 MB/s) was also lower than that of the Myrinet [Felten et al. 1996]. It used a custom-designed network interface for its memory-mapped communication model. The interface provided in-order, reliable delivery, which allowed for extremely low overheads (7 microseconds over the hardware latency); Fast Sockets incurs substantial overhead to ensure in-order delivery. The realized bandwidth of SHRIMP sockets was about half the raw capacity of the network because the SHRIMP communication model used sender-based memory management, forcing data transfers to indirect through the socket buffer. The SHRIMP model also dealt poorly with non-word-aligned data, which required extra programming complexity to work around; Fast Sockets handles unaligned data transparently, without extra data structures or other complications in the data path.

U-Net [von Eicken et al. 1995] is a user-level network interface developed at Cornell. It virtualized the network interface, allowing multiple processes to share it. U-Net emphasized improving the implementation of existing communications protocols, whereas Fast Sockets uses a new protocol intended only for local-area use. A version of TCP (U-Net TCP) was implemented for U-Net; this protocol stack provided the full functionality of the standard TCP stack. U-Net TCP was modified for better performance; it succeeded in delivering the full bandwidth of the underlying network, but still imposed more than 100 microseconds of packet-processing overhead relative to the raw U-Net interface.


