From the previous analysis, we know that a 1K packet has to be sent every 14 µs to achieve maximum hardware utilization. However, we had to design the system for an even higher demand: with optimal PCI-bus DMA hardware (1 word per 30 ns cycle), a packet had to be transmitted every 8 µs (Footnote 3). Obviously, the hit-server software must be constructed carefully so as not to delay packet transmission. The following paragraphs discuss the methods we used to achieve these goals.
Early evaluation is a technique for reducing the latency and improving the throughput of operations. If an operation is requested several times on the same object, it needs to be executed only once. If an operation can be executed at a place or at a time where free resources are available, its costs are hidden.
Object precompilation ensures that only negligible computations or data transformations are required by the hit-server for delivering a cached object to a client. When loading the object into the hit-server cache, the miss-server partitions it into network packets, generates the appropriate header information and computes a client-independent digest for each packet. On sending a packet, only the destination address, sequence number and sometimes a message-authentication code have to be generated.
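To make the idea concrete, the following C sketch shows what a precompiled packet slot and the miss-server's precompilation pass could look like. The struct layout, the 32-byte header and 16-byte digest sizes, and all names are our assumptions, not the hit-server's actual data structures; since the digest algorithm is not specified, a trivial placeholder stands in for it.

/* A minimal sketch of a precompiled packet and the miss-server's
   precompilation pass. Layout, sizes, and names are assumptions. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PKT_DATA 1024                  /* 1K packet granularity */

struct precompiled_pkt {
    uint8_t header[32];                /* precomputed client-independent header */
    uint8_t digest[16];                /* precomputed client-independent digest */
    uint8_t data[PKT_DATA];            /* DMA-ready payload in the object cache */
};

/* Placeholder digest (stand-in for the real, unspecified algorithm). */
static void compute_digest(uint8_t out[16], const uint8_t *buf, size_t n)
{
    memset(out, 0, 16);
    for (size_t i = 0; i < n; i++)
        out[i % 16] ^= buf[i];
}

/* Run once by the miss-server when loading an object into the hit-server
   cache: partition into 1K packets, prebuild headers, precompute digests. */
void precompile_object(const uint8_t *obj, size_t len, struct precompiled_pkt *pkt)
{
    for (size_t off = 0; off < len; off += PKT_DATA, pkt++) {
        size_t n = len - off < PKT_DATA ? len - off : PKT_DATA;
        memcpy(pkt->data, obj + off, n);
        memset(pkt->header, 0, sizeof pkt->header);  /* static header fields */
        compute_digest(pkt->digest, pkt->data, n);
    }
}

With this organization, only the per-send header fields (destination address, sequence number) and, when required, the message-authentication code remain to be filled in on the critical path.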
To meet our security requirements, any delivered data has to be secured by a client-specific message-authentication code which is also unique in time to prevent replay attacks. By using a client-specific secret key, the authentication code is calculated from the precompiled client-independent digest and a nonce.
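A sketch of the per-send authentication step follows, under the assumption of a keyed construction over the precomputed digest and a per-packet nonce; the concrete MAC algorithm is not specified in the text, so a toy mixing function stands in for it, and all names are ours.

/* Sketch of the per-send MAC; the mixing function is a toy stand-in
   for the real, unspecified algorithm. */
#include <stdint.h>

struct client {
    uint8_t  key[16];       /* client-specific secret key */
    uint64_t nonce;         /* advanced per packet: makes each MAC unique in time */
};

/* Only the 16-byte precompiled digest and the nonce enter the MAC, so the
   1K payload itself need not be read again at send time. */
static uint64_t mac_packet(struct client *c, const uint8_t digest[16])
{
    uint64_t m = c->nonce++;                    /* bind MAC to this transmission */
    for (int i = 0; i < 16; i++)                /* toy stand-in for a real MAC */
        m = (m ^ digest[i] ^ c->key[i]) * 0x100000001b3ULL;
    return m;
}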
Many research experiments show that avoiding unnecessary data copies improves performance (e.g., Fbufs [6], U-Net [19]). Since we use precompiled packets in the object cache, we can always use unbuffered transmission for delivery. With put operations, the arriving 1K packets are linked, not copied, into the new version of the object (see also Section 2.3).
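The put path might then look like the following sketch, where an object version is represented as a table of chunk pointers; this representation and the names are our assumptions.

/* Zero-copy put handling: the receive buffer that already holds the arrived
   1K packet is linked into the new object version; no payload is copied. */
#include <stdint.h>
#include <stddef.h>

#define MAX_CHUNKS 64

struct object_version {
    uint8_t *chunk[MAX_CHUNKS];   /* each entry points at one 1K buffer */
};

void put_chunk(struct object_version *v, size_t idx, uint8_t *rx_buf)
{
    v->chunk[idx] = rx_buf;       /* link, don't copy (cf. Section 2.3) */
}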
The default putget operation is implemented by the hit-server in a single address space. Since we want to enforce security requirements on active objects, these objects execute their specialized operations in per-object address spaces.
Remember that the object granularity for put updates is 1K, while hardware pages are 4K. Therefore, once a 4K region of an object with a non-default putget operation is no longer physically contiguous, the corresponding page must be removed from the object's address space. Upon a page fault on this page, the corresponding 1K chunks are physically copied into a fresh 4K page frame, which is then mapped into the object's address space. Object-specific putgets often do not read the entire object data themselves but simply specify to the hit-server core which parts should be transmitted. In these cases, the above-mentioned lazy-copying technique avoids copying received put data packets even for non-default objects.
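A sketch of such a lazy-copy fault handler, assuming the 1K chunks and 4K pages stated above; the three extern helpers are hypothetical stand-ins for the hit-server core's chunk lookup, frame allocator, and mapping primitive.

/* Lazy copying on page fault: gather the four (possibly discontiguous)
   1K chunks of the faulting 4K region into a fresh frame and map it. */
#include <stdint.h>
#include <string.h>

#define CHUNK 1024
#define PAGE  4096
#define CHUNKS_PER_PAGE (PAGE / CHUNK)

extern uint8_t *object_chunk(unsigned obj, unsigned chunk_idx);      /* assumed */
extern uint8_t *alloc_page_frame(void);                              /* assumed */
extern void map_page(unsigned obj, uintptr_t vaddr, uint8_t *frame); /* assumed */

void lazy_copy_fault(unsigned obj, uintptr_t fault_vaddr)
{
    unsigned page  = fault_vaddr / PAGE;
    uint8_t *frame = alloc_page_frame();

    for (unsigned i = 0; i < CHUNKS_PER_PAGE; i++)
        memcpy(frame + i * CHUNK,
               object_chunk(obj, page * CHUNKS_PER_PAGE + i), CHUNK);

    map_page(obj, page * (uintptr_t)PAGE, frame);
}

Note that this copy happens at most once per page and only when an object-specific putget actually touches the data, so it stays off the critical transmission path.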
The µ-kernel can be configured to support up to 65,000 address spaces. The space costs per address space are low for small objects, about 22 bytes for an object of 16K. Nevertheless, the maximum number of address spaces is only 6.5% of the maximum number of objects. Currently, we do not yet know whether this is sufficient in practice to preallocate and preconstruct an address space for every object that uses non-default get and put operations. Otherwise, the hit-server would have to multiplex address spaces for active objects.
In the hit-server case, copying data in main memory is not only ``in principle avoidable'' but belongs to the class of the most expensive operations. Section 3.1 showed that the memory bus is the time-critical bottleneck; therefore, processor accesses to main memory have to be minimized. Since the L1 and L2 caches use a write-back strategy, the processor's reads and writes are uncritical as long as they hit in the hardware cache and do not touch the memory bus.
Fortunately, early-evaluation techniques, in particular object precompilation, prevent unnecessary memory-to-cache copies. Since we use a precalculated digest, the default get operation has no need to read the object data.
The code segments of the µ-kernel and the hit-server core are small enough that their frequently used parts fit completely even into the L1 cache, so the hit-server's memory bus is not burdened with handling instruction misses. For data, the situation is different:
Therefore, in total, a default get on an nK object requires at least 11 + ⌈9n/4⌉ cache-line transfers between the L2 cache and main memory.
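For concreteness (our arithmetic, applying the formula above): a default get on a 16K object (n = 16) needs at least 11 + ⌈9·16/4⌉ = 11 + 36 = 47 cache-line transfers.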
Footnote 3: From the above discussion, we know that transmitting a 1024-byte packet and a 32-byte header requires (1+1) [step 1] + (1+1) [step 5] + (256+8) [step 6] = 271 word transfers. So in the ideal situation, 1 word per 30 ns through the PCI bus and no delay by the memory bus, 271 × 30 ns = 8.13 µs.