The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 28-31, 1996.

Effects of Buffering Semantics on I/O Performance

José Carlos Brustoloni and Peter Steenkiste
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
jcb@cs.cmu.edu, prs@cs.cmu.edu

Abstract

We present a novel taxonomy that characterizes in a structured way the software and hardware tradeoffs for I/O data passing between applications and operating system. This work contributes new techniques, input-disabled pageout, transient output copy-on-write, and input alignment, that are used for copy avoidance in an optimized buffering semantics, emulated copy. Emulated copy offers the same API and integrity guarantees as those of copy semantics and, therefore, can transparently replace it. We implemented an I/O framework, Genie, that allows applications to select any semantics in the taxonomy. Using Genie for communication between PCs and AlphaStations over an ATM network at 155 Mbps, we found that all non-copy semantics performed similarly, and that only copy semantics had distinctly inferior performance. We analyzed end-to-end latencies in terms of the costs of primitive data passing operations and modeled how those costs scale with CPU, memory, and network speeds. The analysis suggests that current trends tend to intensify the observed performance clustering. The main conclusion is that existing I/O interfaces with copy semantics, such as that of Unix, can be transparently converted to emulated copy semantics and thus achieve performance comparable to the best obtainable with any semantics in the taxonomy.

This research is sponsored in part by the Advanced Research Projects Agency under contract number DABT63-93-C-0054 and in part by the Advanced Research Projects Agency under contract number F19628-92-C-0116.

1. Introduction

Most workstation operating systems continue to use I/O buffering schemes with copy semantics, similar to that of Unix [15]. In such schemes, the system inputs or outputs data only through system buffers. On an output call, the system copies data from application buffers to system buffers and, conversely, on input completion, the system copies data from system buffers to application buffers. The relative cost of the memory accesses necessary for such copying has increased dramatically since the 1970's, when Unix was introduced. Access times for DRAMs, the almost universal option for workstation main memory, have been improving by roughly only 50% each decade [12]. In contrast, CPU performance has been improving from 50 to 100% per year [12], and local area network (LAN) point-to-point bandwidth, as shown in Table 1, has been increasing by roughly an order of magnitude each decade. Today, LAN bandwidth sometimes actually exceeds main memory bandwidth.
LAN        | Year introduced | Bandwidth (Mbps)
-----------+-----------------+------------------
Token ring | 1972            | 1, 4, or 16
Ethernet   | 1976            | 3 or 10
FDDI       | 1987            | 100
ATM        | 1989            | 155, 622, or 2488
HIPPI      | 1992            | 800 or 1600

Table 1: Approximate year of introduction and point-to-point bandwidth of several popular LANs

Data passing between applications and operating system, therefore, has become a major bottleneck in workstation I/O performance [17]. Emerging I/O-intensive applications, such as multimedia, parallel file systems, and supercomputing on clusters of workstations, among others, demand elimination of this bottleneck.

Several previous works have proposed improving data passing efficiency by making I/O operations have move [6] or share [10,2] semantics. Move semantics avoids data copying by using typically much cheaper virtual memory (VM) manipulations. On output with move semantics, the system removes the region containing the application data from the application address space. The application buffer becomes the system output buffer, and its pages carry the data without copying, being simply unmapped. Conversely, on input with move semantics, the system inputs data into a system buffer and, after input completion, maps it to a freshly allocated region in the application address space. The system buffer becomes the application input buffer and its pages carry the data, again, without copying, being simply mapped. Although efficient, these manipulations also imply that applications cannot access output data after output nor determine the location or layout of input data. This restrictive application programming interface (API) may have prevented move semantics from gaining widespread use.

Share semantics eliminates data copying by performing I/O in-place, that is, directly to or from application buffers, without distinct intermediate system buffers. Application buffers are simply promoted to double as system buffers during I/O. This promotion may require VM manipulations such as wiring the buffer to guarantee that it remains in physical memory. Share semantics can use the same API as copy semantics, but offers lower integrity guarantees on application I/O buffers and may require special hardware support.

We present a novel taxonomy of I/O data passing semantics that extends previous works so as to permit a clear characterization of associated software and hardware tradeoffs. The taxonomy identifies a fourth basic semantics for I/O data passing, weak move, and associates with each basic semantics the possible optimizations. This work contributes two new techniques for safety and correctness of in-place I/O: I/O-deferred page deallocation and input-disabled copy-on-write. We also introduce four new optimizations: input-disabled pageout, region hiding, transient output copy-on-write (TCOW), and input alignment. Input-disabled pageout improves the performance of in-place I/O by making it unnecessary to wire the application buffer during I/O, in the traditional sense of removing its pages from lists where the pageout daemon might find them. Region hiding avoids region allocation and removal costs in the emulated move semantics. TCOW and input alignment eliminate data copying in an optimized semantics, emulated copy, that offers the same API and integrity guarantees as those of copy semantics and thus can, unlike other semantics in the taxonomy, replace copy semantics without requiring any changes in existing applications. We implemented a new I/O framework, Genie, that allows applications to select any semantics in the taxonomy.
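To make the contrast with copying concrete, the sketch below shows, in C, how output and input with move semantics reduce to page-table manipulation: pages change owner, but no data is copied. The types and helper names (sysbuf, vm_unmap_region(), vm_map_pages(), and so on) are hypothetical stand-ins for a kernel's VM primitives, not the actual V or Genie interfaces.

    /* Minimal sketch of data passing with move semantics.  All types and
     * helpers below are hypothetical, not the actual V or Genie interfaces. */
    #include <stddef.h>

    struct proc;                          /* an application address space */
    struct vm_page;
    struct sysbuf {                       /* a system buffer: a set of page frames */
        struct vm_page **pages;
        size_t           len;
    };

    struct sysbuf *sysbuf_from_region(struct proc *p, void *addr, size_t len);
    void  vm_unmap_region(struct proc *p, void *addr, size_t len);
    void *vm_alloc_region(struct proc *p, size_t len);
    void  vm_map_pages(struct proc *p, void *addr, struct vm_page **pages,
                       size_t len);

    /* Output: the application's pages become the system output buffer. */
    struct sysbuf *output_move(struct proc *p, void *addr, size_t len)
    {
        struct sysbuf *buf = sysbuf_from_region(p, addr, len); /* record frames */
        vm_unmap_region(p, addr, len);  /* region disappears from the application */
        return buf;                     /* pages carry the data; nothing is copied */
    }

    /* Input: the system buffer's pages become the application input buffer. */
    void *input_move(struct proc *p, struct sysbuf *buf)
    {
        void *addr = vm_alloc_region(p, buf->len);    /* fresh virtual range */
        vm_map_pages(p, addr, buf->pages, buf->len);  /* map frames; no copy */
        return addr;   /* the system tells the application where the data is */
    }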
Using Genie for communication between PCs and AlphaStations over an ATM network at 155 Mbps, we found, rather surprisingly, that all semantics other than copy performed quite similarly. Only copy semantics had distinctly inferior performance. We analyzed these results in terms of the costs of primitive data passing operations required by each semantics and modeled how those costs scale with CPU, memory, and network speeds. The analysis suggests that, if current trends are maintained, performance differences between semantics other than copy will further decrease, and the performance gap between copy and the other semantics will widen. The main conclusion of this work is that good I/O performance does not require radical redesign of the I/O API. Existing interfaces with copy semantics can be transparently converted to emulated copy semantics and thus achieve performance that is close to the best that could be obtained with any semantics in the taxonomy. The rest of this paper is organized as follows. Section 2 presents our taxonomy. Sections 3, 4, and 5 describe techniques for in-place, emulated move, and emulated copy I/O. Section 6 describes the implementation of each data passing semantics in terms of primitive data passing operations. Section 7 shows our experimental results, and Section 8 analyzes them. Section 9 discusses related work, and Section 10 presents our conclusions. 2. Taxonomy of data passing semantics Data can be passed between applications and operating system according to different semantics. We classify buffering semantics, as shown in Figure 1, in three dimensions: buffer allocation scheme, guaranteed integrity, and level of optimization. The following subsections discuss each dimension in turn. Figure 1: Taxonomy of data passing semantics 2.1 Buffer allocation scheme In application-allocated buffering the application determines the location of its input buffers on input and retains access to its output buffers after output. This is the case of the copy semantics of Unix I/O [15]. In system-allocated buffering, on the contrary, the system automatically allocates application input buffers on input and deallocates application output buffers on output, and applications cannot or should not access their output buffers after output. System-allocated buffering is typified by the move semantics of V [6]. Application-allocated and system-allocated buffering require different APIs. The main difference is on input: In the application-allocated API, the application passes the location of its input buffers to the system, while in the system-allocated API, the system returns to the application the location of the newly allocated application input buffers. The system-allocated API also includes calls to allocate or deallocate I/O buffers. Applications with balanced amounts of input and output may be able to avoid explicit buffer allocation and deallocation by using buffers implicitly allocated in input operations for subsequent output operations. System-allocated I/O buffers can be implemented as regions that are marked moved in when accessible by the application. Regions that are not system-allocated are marked unmovable. Output with system-allocated semantics is only allowed on moved in regions because the resulting deallocation might open inconsistent gaps in unmovable regions, such as the heap or the stack. 
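To illustrate the difference in interfaces concretely, the declarations below sketch how the two APIs might look; the names and signatures are hypothetical, not Genie's actual interface. Under application-allocated buffering the caller supplies the buffer, whereas under system-allocated buffering the system returns the buffer it chose.

    #include <stddef.h>
    #include <sys/types.h>

    /* Application-allocated API (e.g. copy or share semantics): the caller
     * chooses the location and layout of its input and output buffers. */
    ssize_t output_app(int channel, const void *buf, size_t len);
    ssize_t input_app(int channel, void *buf, size_t len);

    /* System-allocated API (e.g. move or weak move semantics): the system
     * picks the input buffer and returns its location; output consumes a
     * previously moved-in (or explicitly allocated) buffer. */
    ssize_t output_sys(int channel, void *buf, size_t len);  /* buf is deallocated */
    ssize_t input_sys(int channel, void **buf_out, size_t *len_out);

    /* Explicit allocation and deallocation of movable I/O buffers, for
     * applications whose input and output volumes are not balanced. */
    void *iobuf_alloc(size_t len);
    void  iobuf_free(void *buf, size_t len);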
Compared to application-allocated buffering, system-allocated buffering imposes fewer constraints on the system and thus may be more easily optimized, but whether it results in better end-to-end performance depends on the application. Applications that require access to output buffers after output or that are sensitive to the layout of data, e.g. those using data structures such as arrays, may not be able to use system-allocated semantics without copying between system-allocated I/O buffers and application data structures. This user-level copying may defeat performance advantages of system-allocated semantics.

2.2 Guaranteed integrity

For strong-integrity buffering the system guarantees that it will output the data contained in the application output buffer at the time of output invocation, unaffected by subsequent overwriting by the application, and that the application will not observe application input buffers in incomplete or erroneous states during input or after failed input operations. For weak-integrity buffering the system makes no such guarantees. Strong integrity is typified by copy and move semantics, both of which guarantee integrity by physically inputting or outputting data only through system buffers inaccessible by the application. Weak integrity allows I/O to be performed in-place, directly into or from buffers mapped to the application. The application can access these buffers during I/O and, consequently, can corrupt data being output or observe input data in inconsistent states. Weak-integrity buffering can be application-allocated, which we call share semantics, or system-allocated, which we call weak move semantics. In weak move semantics, output buffers remain mapped to the application after output, but the application should not access them because the system may reuse them for subsequent input and their contents until then are indeterminate. This reuse can be implemented by what we call region caching: The system marks the region weakly moved out and enqueues it in the corresponding list, per address space, where the system can find it for later reuse. On a subsequent input, the system dequeues the region and marks it again moved in. Genie uses region caching for weak move as well as the optimized emulated weak move semantics. A similar caching technique is used in the cached volatile fbuf scheme [10].

2.3 Level of optimization

Each of the four basic semantics can be optimized, achieving compatible behavior with normally better performance. Some optimizations may depend on special conditions, and inclusion of this dimension in the taxonomy makes those conditions visible to programmers. Contrary to the other two dimensions, which each have two discrete points, the optimization dimension admits a spectrum of possibilities, including many different from those presented here. We call Genie's optimized semantics emulated because Genie only uses optimizations that are transparent to applications and do not require changes in programs written for the corresponding basic semantics. The rest of this subsection summarizes previously proposed optimizations, and the following three sections describe novel optimizations used in Genie. Output with copy semantics can be performed in-place if the region containing the output data is made copy-on-write (COW) [18]. The system removes write permissions from the application's mapping of the pages in the region. Attempts by the application to overwrite a page in this region will cause a VM fault.
The system recovers from this fault by copying the data in the page to a new page and mapping this new page to the same virtual address in the application address space, with writing enabled. The application cannot overwrite output pages, which guarantees integrity, and copying only occurs if the application does attempt overwriting. A simpler alternative to COW, reported to have better performance [1], is to remove output page write permissions and mark output pages busy during I/O. Attempts by the application to overwrite an output page then cause the application to fault and stall until output completes. Input with copy semantics can be optimized, if the application buffer is page-aligned and of length multiple of a page size, by swapping pages between system and application buffers. Swapping unmaps application buffer pages and then maps system buffer pages to the same virtual addresses. This optimization is present in SGI and HP systems. I/O with share and weak move semantics can be optimized by requiring application I/O buffers to be located in special non-pageable areas [2,3], which eliminates the need to wire the buffers during I/O. 3. Safe and efficient in-place I/O Previous proposals have often restricted in-place I/O to special regions, such as exposed buffer areas [2] or fbuf regions [10]. In this section we describe how Genie makes in-place I/O safe and efficient regardless of data location. 3.1 I/O-deferred page deallocation Wiring is sufficient to guarantee that a system buffer will remain in physical memory during I/O because only the pageout daemon might asynchronously deallocate system buffer pages. In the case of application buffers, however, wiring is insufficient because other events may also cause application memory deallocation. These events include normal or abnormal termination of the application and explicit memory deallocation by the (possibly malicious) application. If deallocated pages still have pending I/O when they are reused for a different process, there may be corruption of output data or of the other process's memory. Genie makes in-place I/O safe by keeping counts of input and output references to each physical page in current input and output operations. Genie integrates in what we call page referencing the activities of preparing the descriptor with the physical addresses of an I/O request, verifying access rights, and updating reference counts. Genie changes the system page deallocation routine to refrain from placing pages with nonzero input or output count in the list of free pages, whence they might be allocated to other processes. After completing an I/O operation, Genie unreferences pages by updating their reference counts. If a given page no longer has any input or output references, Genie verifies whether the page is still allocated to a memory object; if not, Genie assumes that the page was deallocated during I/O, and enqueues it in the list of free pages for reutilization. We call this scheme I/O-deferred page deallocation and use it for all in-place I/O. 3.2 Input-disabled pageout Genie modifies the pageout daemon to refrain from paging out pages with nonzero input reference count. Pending input would modify these pages after pageout, making paged out data inconsistent. Moreover, the application that invoked the input is likely to access these pages after input, making them poor candidates for pageout. Genie allows the daemon to page out pages with zero input count normally, regardless of output reference count. 
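A minimal sketch of these bookkeeping rules follows; the field and function names are illustrative assumptions, not the actual Genie data structures. Each physical page carries input and output reference counts, deallocation is deferred while either count is nonzero, and the pageout daemon skips pages with pending input.

    #include <stdbool.h>

    /* Hypothetical per-page state for I/O-deferred page deallocation and
     * input-disabled pageout; not the actual NetBSD/Genie structures. */
    struct vm_page {
        int  in_count;    /* references by pending input operations  */
        int  out_count;   /* references by pending output operations */
        bool allocated;   /* still belongs to some memory object?    */
    };

    void page_free(struct vm_page *pg);   /* put the page on the free list */

    /* Called when an I/O operation that referenced the page completes. */
    void page_unreference(struct vm_page *pg, bool is_input)
    {
        if (is_input) pg->in_count--; else pg->out_count--;
        /* I/O-deferred deallocation: a page deallocated during I/O is freed
         * only once no input or output references remain. */
        if (pg->in_count == 0 && pg->out_count == 0 && !pg->allocated)
            page_free(pg);
    }

    /* Input-disabled pageout: the pageout daemon skips pages with pending
     * input; pending output alone does not disqualify a page. */
    bool pageout_candidate(const struct vm_page *pg)
    {
        return pg->in_count == 0;
    }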
This optimization, input-disabled pageout, adds no overhead to page referencing and makes wiring unnecessary in Genie's emulated semantics, both on input and on output. Note that input-disabled pageout allows application data to be arbitrarily located and does not reduce the memory available to the rest of the system, unlike previously proposed optimizations requiring special non-pageable buffer areas [2,3]. 3.3 Input-disabled COW COW is frequently used to optimize interprocess communication or memory inheritance with copy semantics [18]. However, it may not implement copy semantics correctly if there is a pending in-place input operation in the region. Indeed, if the input is by DMA, the input will modify memory without generating any write faults, even though the pages are mapped read-only to the applications. Consequently, changes may be observed by processes other than the one that issued the input, and COW in this case actually implements share rather than copy semantics. Genie guarantees correctness in this case by also monitoring the total number of input references to pages of each memory object in current input operations. Updating this count is integrated with page referencing and unreferencing. Genie modifies the system COW routine to perform a physical copy rather than COW if any of the region's backing memory objects has nonzero input count. The reverse case, when a region is COW before in-place input, does not require special handling, because input page referencing verifies write access rights, which will automatically fault-in a private, writable copy of the data. 4. Move emulation with region hiding Genie uses region hiding to implement a new semantics, emulated move, that is compatible with move semantics but performs I/O in-place. On output with emulated move semantics, Genie removes read and write permissions to pages in the output region, marks the region moved out, and enqueues it for later reuse, as in region caching. Genie modifies the system VM fault routine to recover from faults only in unmovable or moved in regions. Attempts by the application to access the region after output will therefore cause unrecoverable VM faults, as would be the case if the region had actually been removed, but the region and its pages remain allocated to the application address space. On a subsequent input Genie reuses the region, marks it moved in, and reinstates page read and write permissions. 5. Optimizations for copy emulation 5.1 TCOW Conventional COW can be too expensive for output copy avoidance [1]. Page referencing, however, allows Genie to implement a specialized, page-level form of COW, TCOW, that is highly efficient for this purpose. On output with emulated copy semantics, Genie simply references and removes write permissions to the application output pages. Attempts by the application to overwrite output pages will cause VM faults. Genie modifies VM fault processing as follows, in the case of write faults in regions for which the application has write permissions, when the page is found in the top memory object backing the region [18]: If the output count of the page is nonzero, the system recovers from the fault by copying the contents of the page to a new page, swapping pages in the memory object, and mapping the new page to the application buffer with writing enabled; if the output count is zero by the time the write fault occurs, the system recovers by simply reenabling writing on the page (no copying). 
If the page is found but not in the top object, the fault is a conventional COW fault and is handled normally. Note that TCOW adds to page referencing only the cost of removing page write permissions, which is arguably the minimum necessary overhead for strong-integrity, safe in-place output. TCOW differs from conventional COW in two ways. First, TCOW is transient --- COW is conventionally long-term, while TCOW only lasts during output, which is when it is actually useful. Second, TCOW operates at page level instead of at region level. This allows TCOW to prevent the proliferation of regions on output and reduce the number of VM data structures that it needs to manipulate. TCOW performs as well as the busy-marking scheme [2,3] with the added benefit of not stalling applications that overwrite output buffers during output. 5.2 Input alignment On input with emulated copy semantics, Genie inputs data into system buffers that start at the same page offsets and have the same lengths as the corresponding application buffers. Consequently, Genie can swap pages between system and application buffers even if application buffers are not page-aligned or have lengths that are not multiple of the page size. This scheme, system input alignment, goes against the traditional practice of allocating system buffers without regard to the alignment and length of application buffers. As illustrated in Figure 2, lack of alignment makes it impossible to swap pages. For situations where the system is unable to align its buffers, Genie offers application input alignment, where the application aligns its buffers to system buffers. Genie includes an interface through which applications can query I/O modules (e.g. a protocol stack) about the preferred alignment and length of input buffers. The preferred alignment may be nonzero and the preferred length may be not equal to a multiple of the page size because, for example, of unstripped packet headers and network maximum transmission units. Genie uses what we call reverse copyout to pass data in partially filled pages in aligned buffers. If data in the system page is shorter than the reverse copyout threshold, Genie simply copies it out, as shown for item 1 in Figure 2. If it is longer, however, Genie first completes the rest of the system page with corresponding data from the respective application page and then swaps the pages, as shown for items 3 and 4 in Figure 2. The threshold is set just above half the page size so as to minimize data copying. Figure 2: Input alignment 6. Data passing implementation in Genie This section describes, in terms of primitive data passing operations, how Genie implements each semantics for data passing between user-level applications and I/O modules (such as protocol stacks and drivers) in the kernel. The following subsections discuss output and input in turn. We will use these detailed descriptions to analyze experimental results in the next two sections. Both on output and on input, Genie takes advantage of the fact that copy semantics often is very efficient for short data. If data is shorter than configurable thresholds, Genie automatically converts emulated copy or emulated share semantics into copy semantics. 6.1 Output Output data passing involves two processing stages: prepare, when the application invokes the output operation, and dispose, when output completes and the system is ready to return control to the application. 
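Before listing the individual operations, the sketch below recaps the TCOW write-fault recovery of Section 5.1 in code form. The helper and type names are illustrative assumptions, not the actual NetBSD/Genie routines, and the sketch applies only when the faulting page is found in the top memory object of a region the application may write.

    /* Hypothetical sketch of TCOW write-fault recovery (Section 5.1). */
    struct proc;
    struct vm_object;
    struct vm_page { int out_count; /* pending output references */ };

    struct vm_page *page_alloc(void);
    void page_copy_contents(struct vm_page *from, struct vm_page *to);
    void object_replace_page(struct vm_object *obj, struct vm_page *old_pg,
                             struct vm_page *new_pg);
    void map_page(struct proc *p, unsigned long va, struct vm_page *pg,
                  int writable);

    void tcow_write_fault(struct proc *p, struct vm_object *top,
                          struct vm_page *pg, unsigned long va)
    {
        if (pg->out_count > 0) {
            /* Output still pending: give the application a private copy and
             * leave the original page untouched to carry the output data. */
            struct vm_page *copy = page_alloc();
            page_copy_contents(pg, copy);
            object_replace_page(top, pg, copy);   /* swap pages in the object   */
            map_page(p, va, copy, 1);             /* map writable; app resumes  */
        } else {
            /* Output already completed by fault time: simply re-enable writing
             * on the page; no copy is made. */
            map_page(p, va, pg, 1);
        }
    }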
The operations used for output are summarized in Table 2, where ``read-only'' means ``remove write permissions'', and ``invalidate'' means ``remove all access permissions'' in the corresponding VM page table entries.

Semantics      | Prepare                        | Dispose
---------------+--------------------------------+------------------------------------------
Copy           | Allocate system buffer.        | Deallocate system buffer.
               | Copyin output data.            |
Emulated copy  | Reference application pages.   | Unreference application pages.
               | Read-only application pages.   |
Share          | Reference application pages.   | Unwire region.
               | Wire region.                   | Unreference application pages.
Emulated share | Reference application pages.   | Unreference application pages.
Move           | Reference application pages.   | Unwire region.
               | Wire region.                   | Unreference application pages.
               | Mark region moving out.        | Remove region.
               | Invalidate application pages.  |
Emulated move  | Reference application pages.   | Unreference application pages.
               | Mark region moving out.        | Mark region moved out and enqueue.
               | Invalidate application pages.  |
Weak move      | Reference application pages.   | Unwire region.
               | Wire region.                   | Unreference application pages.
               | Mark region moving out.        | Mark region weakly moved out and enqueue.
Emulated       | Reference application pages.   | Unreference application pages.
weak move      | Mark region moving out.        | Mark region weakly moved out and enqueue.

Table 2: Operations for data output from application to kernel

Note that Genie does not remove a system-allocated region until dispose time in order to guarantee that the corresponding virtual addresses will not be reassigned during I/O, thus allowing graceful recovery in case of error.

6.2 Input

Input data passing involves three processing stages: prepare, when the application invokes the input operation; ready, when the device needs buffering; and dispose, when the input operation is complete and the system is ready to return control to the application. Input processing must match the type of input buffering used by the device controller. Genie distinguishes three types of device input buffering: early demultiplexed, pooled in-host, and outboard, as described below.

6.2.1 Input with early demultiplexing

With early demultiplexed input buffering, buffers reside in host main memory and are specified by multiple (pointer, length) pairs. The device controller keeps separate input buffer lists per input request or connection and inputs data directly into a buffer from the appropriate list. Early demultiplexing enables in-place or system-aligned buffering if the application informs the system about the location of its buffers before physical input. This condition is trivially met when physical input is synchronous to application requests. In data communication, however, input may occur before it is solicited by the application, and location information must be provided either by the receiver, using a preposted, possibly asynchronous input operation, or by the sender, using a field in the packet header [5,20] or implicitly by the connection used [16]. If location information is not available to the system, copy avoidance is still possible by application-aligned or system-allocated buffering. Table 3 summarizes the operations used for preposted input with early demultiplexing.

Semantics      | Prepare                           | Ready            | Dispose
---------------+-----------------------------------+------------------+-------------------------------------
Copy           |                                   | Allocate system  | Copyout input data.
               |                                   | buffer.          | Deallocate system buffer.
Emulated copy  |                                   | Allocate aligned | Swap pages.
               |                                   | buffer.          | Deallocate aligned buffer.
Share          | Reference application pages.      |                  | Unwire region.
               | Wire region.                      |                  | Unreference application pages.
Emulated share | Reference application pages.      |                  | Unreference application pages.
Move           |                                   | Allocate system  | Create region. Zero-complete
               |                                   | buffer.          | system pages and fill region.
               |                                   |                  | Map region and mark moved in.
Emulated move  | Dequeue moved out region, mark    |                  | Check region, unreference
               | region moving in, and reference   |                  | application pages, reinstate page
               | application pages.                |                  | accesses, and mark region moved in.
Weak move      | Dequeue weakly moved out region,  |                  | Check region. Unwire region.
               | mark region moving in, and        |                  | Unreference application pages
               | reference application pages.      |                  | and mark region moved in.
               | Wire region.                      |                  |
Emulated       | Dequeue weakly moved out region,  |                  | Check region, unreference
weak move      | mark region moving in, and        |                  | application pages, and mark
               | reference application pages.      |                  | region moved in.

Table 3: Operations for data input with early demultiplexing

Note that, to maintain protection, move semantics has to clear the unused portions of a system buffer before mapping it to the application. If the semantics is emulated move, weak move, or emulated weak move and at prepare time no suitable cached region can be found in the appropriate queue, Genie allocates a new region and marks it moving in. For the same three semantics, Genie checks, at dispose time, that the cached region prepared and used for input is still present in the application address space. If it was removed (advertently or not) by the application, Genie maps the corresponding pages to a new region, guaranteeing that the location information returned to the application correctly points to the input data.

6.2.2 Input with pooled buffering

With pooled in-host buffering, the device controller allocates input buffers from a pool of fixed-size buffers (commonly pages) in host main memory without considering the corresponding input request or connection. This is still the most popular of the input buffering schemes, although early demultiplexing is becoming more common with the advent of ATM networks. Pooled buffering does not allow in-place or system-aligned buffering, and copy avoidance is possible only with application-aligned or system-allocated buffering. Genie always prepares application input buffers according to Table 3, enabling early demultiplexing. Actual input may use pooled buffering, however, either because the device does not support early demultiplexing or because the application did not inform the system of the location of its input buffers before physical input. For pooled buffering, the ready-time and dispose-time operations are those shown in Table 4. At ready time the I/O module inputs data into overlay buffers allocated from a private pool of pages in main memory. At dispose time, Genie passes data from overlay buffers to application buffers and deallocates the overlay buffers, returning them to the respective I/O module pool for reuse.

Semantics      | Ready                     | Dispose
---------------+---------------------------+--------------------------------------------
Copy           | Allocate overlay buffer.  | Copyout input data.
               | Overlay buffer.           | Deallocate overlay buffer.
Emulated copy  | Allocate overlay buffer.  | If aligned, swap pages, else copy out.
               | Overlay buffer.           | Deallocate overlay buffer.
Share          | Allocate overlay buffer.  | Unwire region. Unreference application
               | Overlay buffer.           | pages. If aligned, swap pages, else
               |                           | copy out. Deallocate overlay buffer.
Emulated share | Allocate overlay buffer.  | Unreference application pages. If aligned,
               | Overlay buffer.           | swap pages, else copy out.
               |                           | Deallocate overlay buffer.
Move           | Allocate overlay buffer.  | Create region. Zero-complete overlay
               | Overlay buffer.           | pages, fill region and refill overlay
               |                           | buffer. Map region and mark moved in.
               |                           | Deallocate overlay buffer.
Emulated move, | Allocate overlay buffer.  | Check region. Unreference application
emulated       | Overlay buffer.           | pages. Swap pages. Mark region moved in.
weak move      |                           | Deallocate overlay buffer.
Weak move      | Allocate overlay buffer.  | Check region. Unwire region.
               | Overlay buffer.           | Unreference application pages.
               |                           | Swap pages. Mark region moved in.
               |                           | Deallocate overlay buffer.

Table 4: Ready and dispose-time operations for input with pooled buffering

Note that in the case of move semantics, Genie maps overlay pages to the application and therefore needs to refill the overlay buffer with the same number of newly-allocated pages to avoid pool depletion.

6.2.3 Input with outboard buffering

With outboard buffering, the device controller allocates input buffers from a pool in outboard memory. Outboard buffers can be transferred directly into application buffers after successful input completion, providing strong integrity regardless of the semantics of application buffers and even if the system was not previously informed of the location of application buffers. However, outboard buffering can also add complexity and cost to the controller. In the context of network adapters, early demultiplexed and pooled in-host buffering are examples of ``cut-through'' architectures, while outboard buffering has a ``store-and-forward'' architecture that typically imposes higher latency. If the device uses outboard buffers for input, Genie alters the operations in Table 3 as follows: For all semantics other than emulated copy, at ready time, after the operations in the table, Genie starts DMA into host memory; at dispose time, after the operations in the table, it deallocates the outboard buffer. For emulated copy semantics, no buffer is allocated at ready time, and at dispose time, the system references the application pages, DMAs data from the outboard buffer to the application buffer, unreferences the application pages, and deallocates the outboard buffer. Consequently, with outboard buffering, emulated copy is implemented much as emulated share semantics.

7. Experimental comparison of data passing semantics

This section reports end-to-end latencies and CPU utilization for datagram communication measured using the various buffering semantics and early demultiplexed or pooled input buffering. We ran our experiments on computers of the types shown in Table 5, connected by the Credit Net ATM network [14] at OC-3 (155 Mbps) rates. Unless otherwise noted, all figures and tables in this and the following sections refer to the Micron P166 PCs. Results for the other platforms were similar and can be found in [3]. The Credit Net network adapter transfers data between main memory and the physical medium by burst-mode DMA over the PCI I/O bus. We used the NetBSD 1.1 operating system augmented with an implementation of the Genie interface, through which applications accessed the network. We measured latencies by capturing the value of the CPU on-chip cycle counter at appropriate points in the code. All measurements were made on warm caches and are the averages of five runs. We report primarily latencies rather than throughput to simplify analysis.
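For reference, the cycle-counter timing used for these measurements might look like the sketch below on the Pentium platforms (the AlphaStation uses its own processor cycle counter); this is an illustrative assumption of the instrumentation style, not the actual measurement code.

    #include <stdint.h>

    /* Read the Pentium time-stamp counter (rdtsc): low 32 bits in EAX,
     * high 32 bits in EDX. */
    static inline uint64_t read_cycles(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Example use: bracket the code of interest and convert cycles to
     * microseconds with the known clock rate (166 MHz on the Micron P166).
     *
     *     uint64_t t0 = read_cycles();
     *     ... operation being measured ...
     *     uint64_t t1 = read_cycles();
     *     double usec = (double)(t1 - t0) / 166.0;
     */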
              | Micron P166       | Gateway P5-90     | DEC AlphaStation 255/233
--------------+-------------------+-------------------+-------------------------
CPU           | Pentium 166 MHz   | Pentium 90 MHz    | 21064A 233 MHz
Integer rating| 4.52              | < 2.88            | < 3.48
L1-cache      | 8 KBI + 8 KBD     | 8 KBI + 8 KBD     | 16 KBI + 16 KBD
              | 3560 Mbps         | 1910 Mbps         | 2860 Mbps
L2-cache      | 256 KB, 486 Mbps  | 256 KB, 244 Mbps  | 1 MB, 1366 Mbps
Memory        | 32 MB, 4 KB page  | 32 MB, 4 KB page  | 64 MB, 8 KB page
              | 351 Mbps          | 146 Mbps          | 350 Mbps

Table 5: Characteristics of the computers used in the experiments. The integer rating used for the Micron P166 is the listed SPECint95 of the Dell XPS 166. The rating taken as upper bound for the Gateway P5-90 is the listed SPECint95 of the Dell XPS 90, which has a bigger and faster L2-cache. The rating taken as upper bound for the AlphaStation is its listed SPECint_base95 because the version of NetBSD used on it could not be compiled with optimizations. The cache and memory bandwidths listed are the peak values we observed using a bcopy benchmark at user level.

Figure 3: End-to-end latency with early demultiplexing. Semantics other than copy have similar performances.

Figure 3 shows latencies using early demultiplexing. (In all figures in this section, we list the semantics in the legend in the same order as the respective curves. For fine discrimination between curves that appear cluttered in the figure, please refer to Table 7.) We measured latencies for increasing datagram lengths equal to a multiple of the page size, up to 60 KB, the largest such multiple allowed by ATM AAL5. For these datagrams, copy semantics gave much higher latency than did any of the other semantics, which pass data using VM manipulations instead of copying. The differences between semantics other than copy were small. The most striking difference was that between copy and emulated copy semantics, which offer the same API and integrity guarantees. Using TCOW and input alignment, emulated copy reduced latencies for 60 KB datagrams by 37%. For all data lengths, emulated copy also gave lower latencies than those of move and share semantics. This advantage is due to the use of input-disabled pageout instead of region wiring in emulated copy semantics. Emulated move semantics gave slightly lower latencies than those of emulated copy semantics because it simply invalidates and reinstates page table entries instead of fully swapping pages, which also requires updating the respective memory object. Latency was still lower, but only slightly, with the emulated weak-integrity semantics, which do not require page table updates. The equivalent throughput for single 60 KB datagrams was 78 Mbps for copy, 121 Mbps for move, 124 Mbps for share, emulated copy, and weak move, 126 Mbps for emulated move, 128 Mbps for emulated weak move, and 129 Mbps for emulated share semantics.

Figure 4: CPU utilization. Copy semantics leaves much less CPU time available for applications.

We instrumented the idle loop in the operating system scheduler to measure CPU idle time while performing the experiment of Figure 3. We show the results in Figure 4. CPU utilization was much higher for copy semantics than for any other semantics. For 60 KB datagrams, the utilizations were 26% for copy, 12% for move, weak move, and share, 10% for emulated copy and emulated move, 9% for emulated weak move, and 8% for emulated share semantics.

Figure 5: End-to-end latency for short datagrams with early demultiplexing. Using reverse copyout, emulated copy avoids copying more than about half a page.

Figure 5 shows short datagram latencies with early demultiplexing.
Move semantics gave by far the highest latency for short datagrams because it has to zero the part of the page not occupied by data. Emulated move semantics gave much lower latencies because it performs I/O in place, using region hiding, and therefore does not need to zero the remainder of the page. Copy semantics gave the lowest (145 usec) but also the most rapidly rising latency because of the high incremental cost of copying. We set output thresholds so that Genie automatically converted to copy semantics output of data shorter than 1666 bytes with emulated copy or 280 bytes with emulated share semantics. We set the reverse copyout threshold at 2178 bytes. (Performance is only moderately sensitive to these settings; we empirically determined these values to give good results.) With these settings, emulated copy semantics had about the same latency as that of copy semantics for data up to half page long; above that, reverse copyout and swapping significantly reduced the latency of emulated copy relative to that of copy semantics. Emulated share had, for all data lengths, the lowest latency, because its data passing overhead consists solely of page referencing. The difference between latencies with emulated copy and emulated share semantics was maximal at half page size: 325 vs. 254 usec. Weak move and emulated weak move semantics gave slightly higher latencies than those of share and emulated share, respectively, because of region caching costs avoided in application-allocated semantics. The slightly higher latency of emulated move semantics relative to that of emulated weak move semantics is due to region hiding. The higher latencies of share and weak move semantics relative to their emulated counterparts are due to region wiring and unwiring, which cost about 35 usec for the first page and become unnecessary in the emulated semantics because of the input-disabled pageout optimization. Figure 6: End-to-end latency with application-aligned pooled input buffering. If there is alignment, non-copy application-allocated semantics give performances similar to those of system-allocated semantics. Figure 6 shows latencies with pooled input buffering and application-aligned application buffers. Copy and emulated copy have latencies only very slightly higher than the respective latencies with early demultiplexing, corresponding to the same operations plus buffer overlay overhead. Share, move, and weak move semantics have higher latencies than those of emulated copy because of region wiring and unwiring. All other semantics have latencies very close to that of emulated copy. For single 60 KB datagrams, the equivalent throughput is 77 Mbps for copy, 120 Mbps for share, move, and weak move, 123 Mbps for emulated move, emulated copy and emulated weak move, and 124 Mbps for emulated share semantics. Figure 7: End-to-end latency with unaligned pooled input buffering. Without alignment, application-allocated semantics require copying at the receiver. Figure 7 shows latencies with pooled input buffering and unaligned application buffers. In this case, emulated copy, share, and emulated share semantics require data copying on input, whereas the other semantics are unaffected. This figure clearly shows the impact of data copying, splitting the semantics into a group with no copies (system-allocated semantics), another with two copies (copy semantics, with one copy at the sender and another at the receiver), and the remaining group, between the other two, with one copy. 
For single 60 KB datagrams, the equivalent throughput is 77 Mbps for copy semantics, around 92 Mbps for the other application-allocated semantics, and 121 Mbps for the system-allocated semantics. Figure 7 may give the impression that system-allocated semantics are intrinsically more efficient than application-allocated semantics if pooled buffering is used. However, if an application is insensitive to data layout enough to use system-allocated semantics, then in principle that application can also align its buffers to system buffers, and then emulated copy, emulated share, and system-allocated semantics give very similar performance, as shown in Figure 6. If, on the contrary, an application is sensitive to data layout, it would require application-level copies between system-allocated I/O buffers and application data structures. The total number of copies (possibly one copy on output, if the application needs to retain access to the same or other data on the same pages, plus one copy on input) is then at best the same as if emulated copy or emulated share semantics were used (one copy on input only). In those cases, system-allocated semantics may actually give worse end-to-end performance than application-allocated semantics. We do not show results with outboard buffering because of limitations in the hardware used. However, we expect that, compared to early demultiplexing, the staging of data at an outboard buffer will add an equal amount of latency to all semantics except emulated copy. Because of the special way the latter is handled, we expect it to have performance even closer to that of emulated share semantics. 8. Analysis This section analyzes empirical end-to-end latencies in terms of the costs of primitive data passing operations and models how those costs scale with CPU, memory, and network speeds. We use this model to discuss the sensitivity of our results. End-to-end latencies can be broken down into the sum of a base latency and data passing latencies at the sender and receiver. The base latency captures end-to-end costs that are independent of the particular buffering semantics or input buffering scheme used, such as crossing the application-kernel boundary and incurring driver, device, network, and interrupt latencies. Data passing latencies, on the contrary, depend on the semantics and input buffering scheme used. We take the base latency to be equal to the end-to-end latency of emulated share semantics with early demultiplexing, reduced by the costs of referencing and unreferencing application I/O buffers. Only prepare time data passing operations at the sender contribute to end-to-end latency, because dispose-time operations overlap with network latencies and latencies at the receiver. Conversely, with early demultiplexing, prepare and ready time operations at the receiver overlap with latencies at the sender and in the network, and only the dispose time operations at the receiver contribute to end-to-end latency. With pooled or outboard buffering, only ready time and dispose time operations at the receiver contribute to end-to-end latency. We directly measured the latencies of primitive data passing operations by instrumenting the Genie code. We added instructions at appropriate points in the code to record the value of the CPU on-chip cycle counter. We recorded the time intervals for each operation and datagram length when performing the experiments reported in Figures 3, 6, and 7, taking the averages of five runs. 
We then performed a least-squares linear fit on each operation latency versus datagram length. We obtained excellent correlation except in cases of constant or very small latencies. We averaged the fitted equations of each operation latency over the semantics and input buffering schemes where the operation is used. We show the results in Table 6. Note that copyin cost less than copyout because our experiments were on warm caches. On output (copyin), data can be read from the cache, while on input (copyout) it has to be read from memory. The copyin cost is actually nonlinear because the L1-cache has much higher bandwidth than that of the L2-cache. This causes a negative y-intercept in the corresponding linear fit.

Operation          | Latency         | Operation                     | Latency
-------------------+-----------------+-------------------------------+-----------------
Base               | 0.0598 B + 130  | Swap                          | 0.00163 B + 15
Copyin             | 0.0180 B - 3    | Copyout                       | 0.0220 B + 15
Reference          | 0.000363 B + 5  | Unreference                   | 0.000100 B + 2
Wire               | 0.00141 B + 18  | Unwire                        | 0.000237 B + 10
Read only          | 0.000367 B + 2  | Region create                 | 24
Invalidate         | 0.000373 B + 2  | Region fill                   | 0.000398 B + 9
Region mark out    | 3               | Region fill & overlay refill  | 0.000716 B + 11
Overlay allocate   | 7               | Region map                    | 0.000474 B + 6
Overlay            | 7               | Region check, unreference,    | 0.000507 B + 11
Overlay deallocate | 0.000344 B + 12 |   reinstate, mark in          |
Region check       | 5               | Region check, unreference,    | 0.000194 B + 6
Region mark in     | 1               |   mark in                     |

Table 6: Costs of primitive data passing operations, in usec. B is the data length in bytes.

Taking the values from Table 6, we added, for each semantics, the base latency, the costs of the respective output prepare-time operations indicated in Table 2, and the costs of the respective input dispose-time operations indicated in Table 3, obtaining an estimate of the respective end-to-end latency with early demultiplexing. We show these estimates in Table 7, along with the least-squares linear fit of the actual end-to-end latencies from Figure 3. Adding base latency, the costs of output prepare-time operations, and the costs of input ready-time and dispose-time operations indicated in Table 4, for each semantics, we obtained estimates of the respective end-to-end latencies with pooled buffering and application-aligned or unaligned application buffers. We show these estimates in Table 7, along with the least-squares linear fit of the actual end-to-end latencies from Figures 6 and 7. The good fit between estimated and actual latencies suggests that our breakdown model is accurate for the datagram lengths considered, which are multiples of the page size (additional terms would increase accuracy for intermediate lengths, but also make the model more complicated).
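As a worked example, computed here for illustration from the values in Table 6: for emulated copy with early demultiplexing, adding the base latency (0.0598 B + 130) to the prepare-time reference (0.000363 B + 5) and read-only (0.000367 B + 2) costs of Table 2 and the dispose-time swap cost (0.00163 B + 15) of Table 3 gives roughly 0.0622 B + 152 usec (the aligned-buffer deallocation cost is not separately tabulated and is omitted here), in close agreement with the 0.0621 B + 153 entry for emulated copy in Table 7.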
Semantics      |   | Early demultiplexing | Appl.-aligned pooled | Unaligned pooled
---------------+---+----------------------+----------------------+------------------
Copy           | E | 0.0997 B + 141       | 0.100 B + 166        | 0.100 B + 166
               | A | 0.0998 B + 125       | 0.101 B + 139        | 0.101 B + 144
Emulated copy  | E | 0.0621 B + 153       | 0.0625 B + 178       | 0.0828 B + 177
               | A | 0.0622 B + 150       | 0.0622 B + 175       | 0.0848 B + 195
Share          | E | 0.0619 B + 165       | 0.0637 B + 204       | 0.0841 B + 203
               | A | 0.0621 B + 162       | 0.0638 B + 197       | 0.0846 B + 219
Emulated share | E | 0.0602 B + 137       | 0.0621 B + 175       | 0.0825 B + 175
               | A | 0.0600 B + 137       | 0.0619 B + 167       | 0.0824 B + 178
Move           | E | 0.0628 B + 197       | 0.0634 B + 224       | 0.0634 B + 224
               | A | 0.0626 B + 202       | 0.0631 B + 234       | 0.0631 B + 234
Emulated move  | E | 0.0610 B + 151       | 0.0625 B + 185       | 0.0625 B + 185
               | A | 0.0609 B + 150       | 0.0623 B + 183       | 0.0623 B + 183
Weak move      | E | 0.0620 B + 173       | 0.0637 B + 212       | 0.0637 B + 212
               | A | 0.0615 B + 170       | 0.0633 B + 206       | 0.0633 B + 206
Emulated       | E | 0.0603 B + 144       | 0.0621 B + 183       | 0.0621 B + 183
weak move      | A | 0.0602 B + 143       | 0.0619 B + 184       | 0.0619 B + 184

Table 7: Estimated (E) and actual (A) end-to-end latencies, in usec. B is the data length in bytes.

Using the breakdown model, the end-to-end latency when sender and receiver use different semantics can be expected to be equal to the sum of the base latency plus sender-side latencies of the semantics used by the sender plus receiver-side latencies of the semantics used by the receiver.

The breakdown model can be extended into a scaling model that takes into account CPU, memory, and network speeds. To a first approximation:

(1) The multiplicative factor of the base latency is network-dominated and equal to the inverse of the net network transmission rate, subject to adapter and I/O bus bandwidth limitations;

(2) The fixed term of the base latency is equal to the sum of I/O bus, device, and network latencies, plus a term corresponding to fixed operating system overhead, which scales inversely to CPU speed;

(3) The copyout multiplicative factor is memory-dominated and equal to the inverse of the main memory copy bandwidth, and the associated fixed term can be ignored;

(4) The copyin multiplicative factor is cache-dominated and may vary between the inverse of the copy bandwidth of the L2-cache and the inverse of the copy bandwidth of main memory, depending on data and cache sizes and cache associativity and locality. The fixed term can be ignored;

(5) All other parameters are CPU-dominated and scale according to CPU speed, as estimated by an appropriate integer benchmark, such as SPECint95. (The cost of page table updates may scale otherwise between processors of different architecture, causing the cost of the read-only, invalidate, swap, region map, and reinstate operations to diverge from this model. Page table updates are particularly costly in multiprocessors, where the semantics that do not use these operations -- weak (with early demultiplexing) and copy -- may have some advantage.)

We verified (1), (3), and (4) in each platform [3]. Table 8 shows the verification of (3), (4), and (5) across platforms. Agreement between estimated and actual scaling was quite good for the Gateway P5-90, which has the same architecture as the base case. In the AlphaStation, CPU-dominated ratios had geometric means consistent with the model but variances that were much higher than those of the Gateway P5-90. This could be expected, given that the AlphaStation has a substantially different architecture.
                  |               Gateway P5-90
Type of Parameter | Estimated      | GM   | Min  | Max
------------------+----------------+------+------+------
Memory-dominated  | 2.40           | 2.43 | 2.43 | 2.43
Cache-dominated   | > 1.44, < 3.33 | 2.46 | 2.46 | 2.46
CPU-dominated     |                |      |      |
  mult. factor    | > 1.57         | 1.79 | 1.58 | 1.92
  fixed term      | > 1.57         | 1.83 | 1.53 | 2.59

                  |           AlphaStation 255/233
Type of Parameter | Estimated      | GM   | Min  | Max
------------------+----------------+------+------+------
Memory-dominated  | 1.00           | 0.83 | 0.83 | 0.83
Cache-dominated   | > 0.26, < 1.39 | 0.54 | 0.54 | 0.54
CPU-dominated     |                |      |      |
  mult. factor    | > 1.30         | 1.64 | 0.75 | 3.77
  fixed term      | > 1.30         | 1.54 | 0.47 | 3.74

Table 8: Scaling of data passing costs relative to the Micron P166. ``GM'', ``Min'', and ``Max'' are the geometric mean, minimum, and maximum values of the ratios of parameters of each given type.

Extrapolating based on the scaling model, if CPU speeds continue to increase faster than transmission rates, as is the current trend, the performance differences between semantics other than copy will tend to decrease, and if CPU speeds continue to increase faster than main memory bandwidth, the performance difference between copy and other semantics will increase. At OC-12 (622 Mbps) rates, the scaling model predicts that the end-to-end throughput for single 60 KB datagrams with early demultiplexing on the Micron P166 PCs will be close to 140 Mbps with copy semantics, 404 Mbps with emulated copy semantics, 463 Mbps with emulated share semantics, or 380 Mbps with move semantics, giving emulated copy almost three times better performance than that of copy semantics.

9. Related work

Previous work in I/O buffering has typically focused on optimizations using a particular semantics, device, or application. We are not aware of other direct, controlled comparisons covering as broad a range of buffering options as we present here. The technique discussed here for emulated copy input with outboard buffering is a generalization of that used in [19,9]. However, those works also use outboard buffering for copy elimination on output. We, on the contrary, avoid copying on output by simple VM manipulations, which may simplify device interface design. Staging output through an outboard buffer may still be advantageous, however, if, for example, there is hardware to compute the TCP checksum (which goes in the packet header) while data is being DMAed from application buffer to outboard buffer (which was the case in [19]).

Fbufs [10] are system-allocated but are optimized with mixed semantics. Cached fbuf output has semantics similar to emulated copy, but requires wiring and leaves buffers read-only until explicit deallocation because there is no COW scheme. Cached volatile fbuf output has semantics similar to share. Cached and cached volatile fbuf input have semantics similar to weak move but use read-only buffers that must be deallocated explicitly.

We assumed the use of DMA for I/O, which is necessary for high throughput in most current systems. Programmed I/O may make it unnecessary to wire application buffers, reducing latency, but it is subject to page faults, which could cause deadlock if the faulted process is itself a (possibly user-level) pager or is invoked by a pager. In our implementation, input-disabled pageout eliminates wiring costs safely and cheaply. Programmed I/O may have the advantage of simplifying the implementation of early demultiplexing. There have been proposals to reduce the penalty of copying by integrating it with other data touching operations, such as TCP checksumming [7].
Integration of checksumming on input has semantic implications: If checksumming is integrated with the copy from device or system buffer to application buffer, and the checksum is wrong, the application buffer will be overwritten with faulty data, and the semantics is actually weak and not copy. If a system buffer is involved (i.e., not programmed I/O between device and application buffers), in our implementation, at least for long data, it costs less to pass the data by VM manipulation and then read it for checksumming than to read and write (one-step checksum and copy) the data [4]. The work reported here concerns I/O data path (buffering) optimizations, but we would like to point out recent work on I/O control-path optimizations, such as bypassing the operating system (OS) [8, 11, 21] and separating control from data transfer [22,16]. In terms of the analysis from the previous section, bypassing the OS reduces the constant term of the base latency (i.e., the latency for short data). Given that the wiring of application buffers requires OS intervention, OS bypassing may imply some form of either non-pageable buffer areas or application-level programmed I/O between application buffers and device controller. 10. Conclusions We introduced a new taxonomy that characterizes in a structured way a very broad and general range of options for I/O data passing between applications and operating systems. We described the implementation of data passing in Genie, a new I/O framework that allows applications to select any data passing semantics in the taxonomy. Using an implementation of Genie on NetBSD, we made direct, controlled evaluations and comparisons of the different buffering semantics. We identified the fundamental similarities and differences between the various semantics, assessed the impact of different architectural support in device controllers, and investigated how performance scales with CPU and memory speeds. The experiments demonstrated significant performance gains due to TCOW, input alignment, region hiding, and input-disabled pageout. The results indicate that large I/O performance improvements are possible in many Unix-derived operating systems by using emulated copy instead of copy semantics. This substitution does not require changes in applications because both semantics give the same integrity guarantees and can offer the same API. At the I/O rates tested, some additional performance improvement is possible by using emulated share semantics, particularly at short data lengths. The scaling model predicts that larger relative improvements would occur when interfacing with faster devices or in shared-memory multiprocessors. Although emulated share semantics can offer the same API as copy semantics, the lower integrity guarantees may require changes in applications. We expect that few or no changes would be necessary in multi-programmed or distributed applications that already lock and checkpoint data. Contrary to what might have been expected, we did not find system-allocated semantics to have any consistent advantage relative to application-allocated semantics. In fact, we found that application-aligned, application-allocated buffers differ very little in functionality or performance from system-allocated buffers. 
Because system-allocated semantics may require either substantial rework of pre-existing applications written for copy semantics or application-level copies that negate performance improvements, we believe that emulated copy and emulated share semantics offer more practical and general approaches for I/O performance improvement.

Acknowledgements

Prashant Chandra and Todd Mummert helped with experimental set-up. Anonymous referees and the OSDI paper ``shepherd'', Willy Zwaenepoel, provided many helpful suggestions.

References

1) J. Barrera III. ``A fast Mach network IPC implementation'', in Proc. USENIX Mach Symp., USENIX, Nov. 1991, pp. 1-11.
2) J. Brustoloni. ``Exposed Buffering and Sub-Datagram Flow Control for ATM LANs'', in Proc. 19th Conf. Local Computer Networks, IEEE, Oct. 1994, pp. 324-334.
3) J. Brustoloni and P. Steenkiste. ``Application-Allocated I/O Buffering with System-Allocated Performance''. Tech. Rep. CMU-CS-96-169, Carnegie Mellon Univ., Pittsburgh, PA, 1996.
4) J. Brustoloni and P. Steenkiste. ``Copy Emulation in Checksummed, Multiple-Packet Communication''. Submitted for publication.
5) G. Buzzard, D. Jacobson, M. Mackey, S. Marovich and J. Wilkes. ``An implementation of the Hamlyn sender-managed interface architecture'', in Proc. OSDI'96, USENIX, Oct. 1996.
6) D. Cheriton. ``The V distributed system'', in Comm. ACM, 31(3):314-333, March 1988.
7) D. Clark and D. Tennenhouse. ``Architectural considerations for a new generation of protocols'', in Proc. SIGCOMM'90, ACM, Sept. 1990, pp. 200-208.
8) E. Cooper, P. Steenkiste, R. Sansom and B. Zill. ``Protocol Implementation on the Nectar Communication Processor'', in Proc. SIGCOMM'90, ACM, Sept. 1990, pp. 135-143.
9) C. Dalton, G. Watson, D. Banks, C. Calamvokis, A. Edwards and J. Lumley. ``Afterburner'', in IEEE Network, July 1993, pp. 36-43.
10) P. Druschel and L. Peterson. ``Fbufs: A High-Bandwidth Cross-Domain Transfer Facility'', in Proc. 14th SOSP, ACM, Dec. 1993, pp. 189-202.
11) P. Druschel, L. Peterson and B. Davie. ``Experience with a High-Speed Network Adaptor: A Software Perspective'', in Proc. SIGCOMM'94, ACM, Aug. 1994, pp. 2-13.
12) J. Hennessy and D. Patterson. ``Computer Architecture: A Quantitative Approach''. Morgan Kaufmann Pub., San Mateo, CA, 1990.
13) K. Kleinpaste, P. Steenkiste and B. Zill. ``Software Support for Outboard Buffering and Checksumming'', in Proc. SIGCOMM'95, ACM, Aug. 1995, pp. 87-98.
14) C. Kosak, D. Eckhardt, T. Mummert, P. Steenkiste and A. Fischer. ``Buffer Management and Flow Control in the Credit Net ATM Host Interface'', in Proc. 20th Conf. Local Computer Networks, IEEE, Oct. 1995, pp. 370-378.
15) S. Leffler, M. McKusick, M. Karels and J. Quarterman. ``The Design and Implementation of the 4.3BSD UNIX Operating System'', Addison-Wesley Pub. Co., Reading, MA, 1989.
16) T. Mummert, C. Kosak, P. Steenkiste and A. Fischer. ``Fine Grain Parallel Communication on General Purpose LANs'', in Proc. 10th Intl. Conf. Supercomputing, ACM, 1996, pp. 341-349.
17) J. Ousterhout. ``Why aren't operating systems getting faster as fast as hardware?'', in Proc. Summer 1990 Conf., USENIX, June 1990, pp. 247-256.
18) R. Rashid, A. Tevanian, M. Young, D. Golub, R. Baron, D. Black, W. Bolosky and J. Chew. ``Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures'', in Proc. 2nd ASPLOS, ACM, Oct. 1987, pp. 31-39.
19) P. Steenkiste, B. Zill, H. Kung, S. Schlick, J. Hughes, R. Kowalski and J. Mullaney. ``A Host Interface Architecture for High-Speed Networks'', in Proc. 4th IFIP Conf. High Performance Networks, IFIP, Dec. 1992, pp. A3 1-16.
20) T. Stricker, J. Stichnoth, D. O'Hallaron and S. Hinrichs. ``Decoupling Synchronization and Data Transfer in Message Passing Systems of Parallel Computers'', in Proc. Intl. Conf. Supercomputing, ACM, July 1995, pp. 1-10.
21) T. von Eicken, A. Basu, V. Buch and W. Vogels. ``U-Net: A User-Level Network Interface for Parallel and Distributed Computing'', in Proc. 15th SOSP, ACM, Dec. 1995, pp. 40-53.
22) C. Thekkath, H. Levy and E. Lazowska. ``Separating Data and Control Transfer in Distributed Operating Systems'', in Proc. 6th ASPLOS, ACM, Oct. 1994, pp. 2-11.