While TCP/IP can achieve good throughput on currently deployed networks, its round-trip latency is usually poor. Further, observed bandwidth and round-trip latencies on next-generation network technologies such as Myrinet and ATM do not begin to approach the raw capabilities of these networks [Keeton et al. 1995]. In this section, we describe a number of features and problems of commercial TCP implementations, and how these features affect communication performance.
TCP/IP was originally designed, and is usually implemented, for wide-area networks. While TCP/IP is usable on a local-area network, it is not optimized for this domain. For example, TCP uses an in-packet checksum for end-to-end reliability, despite the presence of per-packet CRC's in most modern network hardware. But computing this checksum is expensive, creating a bottleneck in packet processing. IP uses header fields such as `Time-To-Live' which are only relevant in a wide-area environment. IP also supports internetwork routing and in-flight packet fragmentation and reassembly, features which are not useful in a local-area environment. The TCP/IP model assumes communication between autonomous machines that cooperate only minimally. However, machines on a local-area network frequently share a common administrative service, a common file system, and a common user base. It should be possible to extend this commonality and cooperation into the network communication software.
Standard implementations of the Sockets interface and the TCP/IP protocol suite separate the protocol and interface stack into multiple layers. The Sockets interface is usually the topmost layer, sitting above the protocol. The protocol layer may contain sub-layers: for example, the TCP protocol code sits above the IP protocol code. Below the protocol layer is the interface layer, which communicates with the network hardware. The interface layer usually has two portions, the network programming interface, which prepares outgoing data packets, and the network device driver, which transfers data to and from the network interface card (NIC).
This multi-layer organization enables protocol stacks to be built from many combinations of protocols, programming interfaces, and network devices, but this flexibility comes at the price of performance. Layer transitions can be costly in time and programming effort. Each layer may use a different abstraction for data storage and transfer, requiring data transformation at every layer boundary. Layering also restricts information transfer. Hidden implementation details of each layer can cause large, unforeseen impacts on performance [Crowcroft et al. 1992][Clark 1982]. Mechanisms have been proposed to overcome these difficulties [Clark & Tennenhouse 1990], but existing work has focused on message throughput, rather than protocol latency [Abbott & Peterson 1993]. Also, the number of programming interfaces and protocols is small: there are two programming interfaces (Berkeley Sockets and the System V Transport Layer Interface) and only a few data transfer protocols (TCP/IP and UDP/IP) in widespread usage. This paucity of distinct layer combinations means that the generality of the multi-layer organization is wasted. Reducing the number of layers traversed in the communications stack should reduce or eliminate these layering costs for the common case of data transfer.
Current TCP/IP implementations use a complicated memory management mechanism. This system exists for a number of reasons. First, a multi-layered protocol stack means packet headers are added (or removed) as the packet moves downward (or upward) through the stack. This should be done easily and efficiently, without excessive copying. Second, buffer memory inside the operating system kernel is a scarce resource; it must be managed in a space-efficient fashion. This is especially true for older systems with limited physical memory.
To meet these two requirements, mechanisms such as the Berkeley Unix mbuf have been used. An mbuf can directly hold a small amount of data, and mbufs can be chained to manage larger data sets. Chaining makes adding and removing packet headers easy. The mbuf abstraction is not cheap, however: 15% of the processing time for small TCP packets is consumed by mbuf management [Kay & Pasquale 1993]. Additionally, to take advantage of the mbuf abstraction, user data must be copied into and out of mbufs, which consumes even more time in the data transfer critical path. This copying means that nearly one-quarter of the small-packet processing time in a commercial TCP/IP stack is spent on memory management issues. Reducing the overhead of memory management is therefore critical to improving communications performance.