Proceedings of the USENIX Annual Technical Conference, January 6-10, 1997, Anaheim, California, USA

Pp. 257–274 of the Proceedings

1 Introduction

The development and deployment of high-performance local-area networks such as ATM [de Prycker 1993], Myrinet [Seitz 1994], and switched high-speed Ethernet have the potential to dramatically improve communication performance for network applications. These networks are capable of microsecond latencies and bandwidths of hundreds of megabits per second; their switched fabrics eliminate the contention seen on shared-bus networks such as traditional Ethernet. Unfortunately, this raw network capacity goes unused because of the overhead imposed by current network communication software.

Most of these networks run the TCP/IP protocol suite [Postel 1981a][Postel 1981c][Postel 1981b]. TCP/IP is the default protocol suite for Internet traffic, and provides inter-operability among a wide variety of computing platforms and network technologies. The TCP protocol provides the abstraction of a reliable, ordered byte stream. The UDP protocol provides an unreliable, unordered datagram service. Many other application-level protocols (such as the FTP file transfer protocol, the Sun Network File System, and the X Window System) are built upon these two basic protocols.

TCP/IP's observed performance has not scaled to the ability of modern network hardware, however. While TCP is capable of sustained bandwidth close to the rated maximum of modern networks, actual bandwidth is very much implementation-dependent. The round-trip latency of commercial TCP implementations is hundreds of microseconds higher than the minimum possible on these networks. Implementations of the simpler UDP protocol, which lack the reliability and ordering mechanisms of TCP, perform little better [von Eicken et al. 1995][Kay & Pasquale 1993][Keeton et al. 1995]. This poor performance is due to the high per-packet processing costs (processing overhead) of the protocol implementations. In local-area environments, where on-the-wire times are small, these processing costs dominate small-packet round-trip latencies.

Local-area traffic patterns exacerbate the problems posed by high processing costs. Most LAN traffic consists of small packets [Claffy et al. 1992][Caceres et al. 1991][Gusella 1990][Cheriton & Williamson 1987][Amer et al. 1987][Feldmeier 1986][Schoch & Hupp 1980][Kleinrock & Naylor 1974]. Small packets are the rule even in applications considered bandwidth-intensive: 95% of all packets in an NFS trace collected at the Berkeley Computer Science Division carried fewer than 192 bytes of user data [Dahlin et al. 1994]; the mean packet size in the trace was 382 bytes. Processing overhead is the dominant transport cost for packets this small, limiting NFS performance on a high-bandwidth network. The same holds for other applications. Most application-level protocols in local-area use today (X11, NFS, FTP, etc.) operate in a request-response, or client-server, manner: a client machine sends a small request message to a server and awaits a response from the server. In the request-response model, processing overhead usually cannot be hidden through packet pipelining or through overlapping communication and computation, making round-trip latency a critical factor in protocol performance.
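The argument above can be made concrete with a simple cost model. The following C sketch is illustrative only: the function names are hypothetical, the model ignores propagation and switch delay, and any link speed or per-packet overhead passed to it is an assumption rather than a measurement from this paper.

```c
#include <stddef.h>

/* Illustrative cost model; all names and input values are assumptions. */

/* Time on the wire in microseconds for `bytes` of data on a link of
 * `mbit_per_s`. One Mbit/s is one bit per microsecond, so
 * bits / (Mbit/s) yields microseconds. */
double wire_time_us(size_t bytes, double mbit_per_s)
{
    return (double)bytes * 8.0 / mbit_per_s;
}

/* One-way time for a single packet: software overhead on the sender,
 * time on the wire, and software overhead on the receiver. */
double one_way_us(size_t bytes, double mbit_per_s, double overhead_us)
{
    return 2.0 * overhead_us + wire_time_us(bytes, mbit_per_s);
}

/* Round-trip time for one request-response exchange. The response cannot
 * start until the request has been processed, so nothing overlaps. */
double round_trip_us(size_t req_bytes, size_t resp_bytes,
                     double mbit_per_s, double overhead_us)
{
    return one_way_us(req_bytes, mbit_per_s, overhead_us)
         + one_way_us(resp_bytes, mbit_per_s, overhead_us);
}

/* Fraction of the round trip spent in per-packet software overhead
 * (four overhead terms: send and receive on each of two packets). */
double overhead_fraction(size_t req_bytes, size_t resp_bytes,
                         double mbit_per_s, double overhead_us)
{
    double total = round_trip_us(req_bytes, resp_bytes,
                                 mbit_per_s, overhead_us);
    return 4.0 * overhead_us / total;
}
```

As an example under assumed values, a 192-byte packet on a 155 Mbit/s link spends about 10 microseconds on the wire. With an assumed 100 microseconds of processing per packet on each host, `overhead_fraction(192, 192, 155.0, 100.0)` reports that roughly 95% of the round trip is software overhead.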

Traditionally, there are several methods of attacking the processing overhead problem: changing the application programming interface (API), changing the underlying network protocol, changing the implementation of the protocol, or some combination of these approaches. Changing the API modifies the code used by applications to access communications functionality. While this approach may yield better performance for new applications, legacy applications must be re-implemented to gain any benefit. Changing the communications protocol changes the "on-the-wire" format of data and the actions taken during a communications exchange (for example, modifying the TCP packet format). A new or modified protocol may improve communications performance, but at the price of incompatibility: applications communicating via the new protocol are unable to share data directly with applications using the old protocol. Changing the protocol implementation rewrites the software that implements a particular protocol; packet formats and protocol actions do not change, but the code that performs these actions does. While this approach provides full compatibility with existing protocols, fundamental limitations of the protocol design may limit the performance gain.

Recent systems, such as Active Messages [von Eicken et al. 1992], Remote Queues [Brewer et al. 1995], and native U-Net [von Eicken et al. 1995], have used the first two methods; they implement new protocols and new programming interfaces to obtain improved local-area network performance. The protocols and interfaces are lightweight and provide programming abstractions that are similar to the underlying hardware. All of these systems realize latencies and throughput close to the physical limits of the network. However, none of them offer compatibility with existing applications.

Other work has tried to improve performance by re-implementing TCP. Recent work includes zero-copy TCP for Solaris [Chu 1996] and a TCP implementation for the U-Net interface [von Eicken et al. 1995]. These implementations can inter-operate with other TCP/IP implementations and improve throughput and latency relative to standard TCP/IP stacks. Both implementations can realize the full bandwidth of the network for large packets. However, both systems have round-trip latencies considerably higher than the raw network.

This paper presents our solution to the overhead problem: a new communications protocol and implementation for local-area networks that exports the Berkeley Sockets API, uses a low-overhead protocol for local-area use, and reverts to standard protocols for wide-area communication. The Sockets API is a widely used programming interface that treats network connections as files; application programs read and write network connections exactly as they read and write files. The Fast Sockets protocol has been designed and implemented to obtain a low-overhead data transmission/reception path. Should a Fast Sockets program attempt to connect with a program outside the local-area network, or to a non-Fast Sockets program, the software transparently reverts to standard TCP/IP sockets. These features enable high-performance communication through relinking existing application programs.
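To illustrate the file-like nature of the Sockets API described above, the following minimal C sketch sends and receives a message using the same `read` and `write` calls that are used for files. It uses only standard POSIX calls, not Fast Sockets itself; the function name `echo_over_socketpair` is hypothetical, and a local `socketpair` stands in for a network connection so that the example is self-contained.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `msg` into one end of a connected socket pair and read it back
 * from the other end, using the same calls as file I/O. Returns the
 * number of bytes received, or -1 on error. `out` is NUL-terminated on
 * success. Hypothetical helper for illustration only. */
ssize_t echo_over_socketpair(const char *msg, char *out, size_t out_len)
{
    int fds[2];
    size_t len = strlen(msg);
    ssize_t sent;
    ssize_t got = -1;

    /* A connected pair of stream sockets; a real client would obtain
     * its descriptor from socket() and connect() instead. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    /* write() on a socket descriptor, exactly as on a file descriptor. */
    sent = write(fds[0], msg, len);

    if (sent == (ssize_t)len && out_len > 0) {
        /* read() on the peer descriptor; leave room for the terminator. */
        got = read(fds[1], out, out_len - 1);
        if (got >= 0)
            out[got] = '\0';
    }

    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Because the application sees only file descriptors and generic I/O calls, a library that exports the same interface can substitute a different transport underneath without changes to application source code.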

Fast Sockets achieves its performance through a number of strategies. It uses a lightweight protocol and efficient buffer management to minimize bookkeeping costs. The communication protocol and its programming interface are integrated to eliminate module-crossing costs. Fast Sockets eliminates copies within the protocol stack by using knowledge of packet memory destinations. Additionally, Fast Sockets was implemented without modifications to the operating system kernel.

A major portion of Fast Sockets' performance is due to receive posting, a technique of utilizing information from the API about packet destinations to minimize copies. This paper describes the use of receive posting in a high-performance communications stack. It also describes the design and implementation of a low-overhead protocol for local-area communication.
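The idea behind receive posting can be sketched as follows. This is a simplified model based only on the description above, not the Fast Sockets implementation: the `endpoint` structure, the function names, and the fixed-size staging buffer are all hypothetical. If the application has already posted a destination buffer when a packet arrives, the data is copied once, directly into that buffer; otherwise it is staged and copied a second time when the receive is eventually issued.

```c
#include <stddef.h>
#include <string.h>

#define STAGE_MAX 256  /* assumed staging capacity, for illustration */

/* Hypothetical endpoint state; not taken from Fast Sockets. */
struct endpoint {
    char    *posted_buf;       /* user buffer of a pending receive, or NULL */
    size_t   posted_len;       /* capacity of posted_buf */
    size_t   delivered;        /* bytes placed directly into posted_buf */
    char     stage[STAGE_MAX]; /* staging area for unexpected data */
    size_t   stage_off;        /* offset of first unread staged byte */
    size_t   staged;           /* number of unread staged bytes */
    unsigned copies;           /* data copies performed so far */
};

void ep_init(struct endpoint *ep)
{
    memset(ep, 0, sizeof *ep);
}

/* Application side: issue a receive. If data is already staged, copy it
 * out now (the second copy) and return the byte count. Otherwise record
 * the destination so a later arrival can use it, and return 0. */
size_t ep_recv_post(struct endpoint *ep, char *buf, size_t len)
{
    if (ep->staged > 0) {
        size_t n = ep->staged < len ? ep->staged : len;
        memcpy(buf, ep->stage + ep->stage_off, n);
        ep->copies++;
        ep->stage_off += n;
        ep->staged -= n;
        if (ep->staged == 0)
            ep->stage_off = 0;
        return n;
    }
    ep->posted_buf = buf;
    ep->posted_len = len;
    ep->delivered = 0;
    return 0;
}

/* Network side: a packet arrives. Returns 1 if it was copied straight
 * into a posted buffer, 0 if it was staged, -1 if staging is full. */
int ep_packet_arrive(struct endpoint *ep, const char *data, size_t len)
{
    if (ep->posted_buf != NULL && len <= ep->posted_len) {
        memcpy(ep->posted_buf, data, len);  /* the only copy on this path */
        ep->copies++;
        ep->delivered = len;
        ep->posted_buf = NULL;
        ep->posted_len = 0;
        return 1;
    }
    if (ep->stage_off + ep->staged + len > STAGE_MAX)
        return -1;
    memcpy(ep->stage + ep->stage_off + ep->staged, data, len);
    ep->copies++;
    ep->staged += len;
    return 0;
}
```

In this model, calling `ep_recv_post` before `ep_packet_arrive` delivers the data with one copy, while the reverse order costs two. The saving comes entirely from knowing the packet's final destination at arrival time.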

The rest of the paper is organized as follows. Section 2 describes problems of current TCP/IP and Sockets implementations and how these problems affect communication performance. Section 3 describes how the Fast Sockets design attempts to overcome these problems. Section 4 describes the performance of the resulting system, and Section 5 compares Fast Sockets to other attempts to improve communication performance. Section 6 presents our conclusions and directions for future work in this area.

This paper was originally published in the USENIX Annual Technical Conference, January 6-10, 1997, Anaheim, California, USA