Memory-to-memory NICs store virtual-to-physical address translations and access rights for all user and kernel memory regions directly addressable and accessible by the NIC. Figure 2 shows a system combining both CPU and NIC memory management hardware: main CPUs use their on-chip translation lookaside buffer (TLB) to translate virtual to physical addresses. A typical TLB entry includes the physical page number along with bits such as V and ACC, signifying whether the page translation is valid and what the access rights to the page are. A miss on a TLB lookup requires a page table lookup in main memory. NICs on the PCI (or other I/O) bus have their own translation and protection table (TPT) [5]. Each TPT entry includes bits enabling RDMA Read or Write operations on the page (i.e., the W bit in the diagram), the physical page number, and a Ptag value identifying the process that owns the page (or the kernel). Whereas the TLB is a high-speed associative memory, the TPT is usually implemented as a DRAM module on the NIC board. To accelerate TPT lookups, remote memory access requests carry a Handle index that helps the NIC locate the right TPT entry.
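As a concrete illustration, a TPT entry with the fields just described might be laid out as in the following sketch. The field names and widths here are assumptions for illustration, not the actual NIC format.

```c
#include <stdint.h>

/*
 * Hypothetical layout of a translation and protection table (TPT)
 * entry.  Field names and widths are illustrative assumptions, not
 * the actual NIC format.
 */
struct tpt_entry {
    uint32_t valid : 1;   /* entry holds a live translation */
    uint32_t write : 1;   /* W bit: permit RDMA Write as well as Read */
    uint32_t ptag  : 10;  /* process (or kernel) owning the page */
    uint32_t pfn   : 20;  /* physical page number */
};

/*
 * A remote memory access request carries a Handle that indexes
 * directly into the TPT, avoiding an associative search over the
 * DRAM-resident table.
 */
static inline struct tpt_entry *
tpt_lookup(struct tpt_entry *tpt, uint32_t handle)
{
    return &tpt[handle];
}
```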
This section focuses on operating system support for integrating NIC memory management units into the FreeBSD VM system, and on the main benefits of this integration.
In FreeBSD (and other systems deriving from 4.4BSD), a physical page is represented by a vm_page structure and an address space by a vm_map structure. A page may be mapped into one or more vm_maps, with each mapping represented by a pv_entry structure. Figure 2 shows a vm_page with two associated pv_entry structures in main memory.
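For readers unfamiliar with these structures, the simplified C sketch below shows how pv_entry structures link a physical page to its mappings. The fields are abridged for illustration and do not match the full FreeBSD definitions.

```c
#include <stdint.h>

struct pmap;                        /* per-address-space physical map */

/* One mapping of a physical page into an address space. */
struct pv_entry {
    struct pmap     *pv_pmap;       /* address space of this mapping */
    uintptr_t        pv_va;         /* virtual address of the mapping */
    struct pv_entry *pv_next;       /* next mapping of the same page */
    int              pv_flags;      /* mapping flags (e.g., PG_RDMA, below) */
};

/* A physical page and the list of all its mappings. */
struct vm_page {
    uint64_t         phys_addr;     /* physical address of the page */
    struct pv_entry *pv_list;       /* all mappings of this page */
};
```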
In our design, the VM system maintains the NIC MMU via the OS-NIC interface. The NIC accesses VM structures via the NIC-OS interface.
OS-NIC Interface. The OS interacts with the NIC to add, remove, and modify the VM mappings stored in its TPT. A mapping of a VM page expected to be used in RDMA must be registered with the NIC. Registering the mapping with the NIC happens in pmap, right after the CPU mapping in a vm_map is established. The NIC exports low-level VM primitives (Table 3) that pmap uses to add, remove, and modify TPT entries; one possible C form of this interface is sketched after the table. NIC mappings may later be deregistered (when the original mapping is removed or invalidated), or have their protection changed.
To keep track of the VM mappings that have been registered with the NIC, we add a PG_RDMA bit to the pv_entry structure, set whenever a pv_entry has a NIC as well as a CPU mapping. In Figure 2, the pv_entry with the PG_RDMA bit set has both a CPU and a NIC mapping. In all pmap operations on VM pages, the pmap module interacts with the NIC only if the PG_RDMA flag is set on the pv_entry.
Table 3: Low-level VM primitives exported by the NIC.

Function      | Description
tpt_init()    | Initialize the TPT
tpt_enter()   | Insert a mapping
tpt_remove()  | Remove a mapping
tpt_protect() | Change the protection of a mapping
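Table 3 gives only the names and purposes of these primitives; the parameter and return types in the following sketch are illustrative assumptions.

```c
#include <stdint.h>

typedef uint64_t vm_paddr_t;    /* physical address (assumed width) */
typedef uint32_t tpt_handle_t;  /* Handle index into the TPT */

/*
 * Hypothetical prototypes for the Table 3 primitives.  Parameter
 * and return types are assumptions for illustration.
 */
int          tpt_init(void);                      /* initialize the TPT */
tpt_handle_t tpt_enter(vm_paddr_t pa,             /* insert a mapping, */
                       uint32_t ptag, int prot);  /* returning its Handle */
void         tpt_remove(tpt_handle_t h);          /* remove a mapping */
void         tpt_protect(tpt_handle_t h,          /* change access rights */
                         int prot);
```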
Higher-level code can trigger registration of a virtual memory region with the NIC by propagating appropriate flags from higher- to lower-level interfaces and eventually to pmap. For example, the DAFS server sets an IO_RDMA bit in the ioflags parameter of the vnode interface (Table 2) when it plans to use a buffer for RDMA. This eventually translates into a VM_RDMA flag in the pmap_enter() interface, which results in the mapping being registered with the NIC, as sketched below.
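A minimal sketch of this path, assuming illustrative flag values and helper names (pmap_insert_cpu_mapping() and pmap_ptag() stand in for the real pmap internals):

```c
#include <stdint.h>

#define VM_RDMA 0x0001   /* assumed pmap_enter() flag value */
#define PG_RDMA 0x0002   /* pv_entry also has a NIC mapping */

struct pmap;
struct vm_page  { uint64_t phys_addr; };
struct pv_entry { int pv_flags; uint32_t pv_handle; };

/* Assumed helpers standing in for real pmap and NIC internals. */
struct pv_entry *pmap_insert_cpu_mapping(struct pmap *, uintptr_t,
                                         struct vm_page *, int);
uint32_t         pmap_ptag(struct pmap *);
uint32_t         tpt_enter(uint64_t pa, uint32_t ptag, int prot);

/*
 * Sketch of pmap_enter() handling a VM_RDMA request: once the CPU
 * mapping is installed, the same translation is registered with the
 * NIC, and the pv_entry is tagged so that later pmap operations
 * keep the TPT consistent.
 */
void
pmap_enter_sketch(struct pmap *pmap, uintptr_t va, struct vm_page *m,
                  int prot, int flags)
{
    struct pv_entry *pv;

    pv = pmap_insert_cpu_mapping(pmap, va, m, prot);
    if (flags & VM_RDMA) {
        pv->pv_handle = tpt_enter(m->phys_addr, pmap_ptag(pmap), prot);
        pv->pv_flags |= PG_RDMA;
    }
}
```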
A problem with invalidating pv_entry mappings that have also been registered with the NIC is that the NIC invalidations may need to be delayed for as long as RDMA transfers using the mappings are in progress. pmap_remove_all() is complicated by this fact: for atomicity, it has to try to remove the pv_entry structures with NIC mappings first, and it may eventually fail if the NIC invalidations cannot complete within a reasonable amount of time, as in the sketch below.
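One way to structure this constraint; the two-pass shape, the timeout bound, and the helper names are assumptions for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

struct vm_page;
struct pv_entry;

/* Assumed helpers for illustration. */
struct pv_entry *vm_page_first_pv(struct vm_page *);
struct pv_entry *pv_next(struct pv_entry *);
bool pv_has_nic_mapping(struct pv_entry *);          /* PG_RDMA set? */
bool tpt_try_invalidate(struct pv_entry *, int ms);  /* fails while RDMA is in flight */
void pv_remove_cpu_mapping(struct pv_entry *);

/*
 * Sketch of the ordering constraint on pmap_remove_all(): NIC
 * invalidations, which can fail while RDMA transfers are in flight,
 * are attempted first; no mapping is torn down unless every NIC
 * invalidation succeeds, preserving atomicity.
 */
int
pmap_remove_all_sketch(struct vm_page *m)
{
    struct pv_entry *pv;

    /* Pass 1: NIC mappings; may fail and abort the whole operation. */
    for (pv = vm_page_first_pv(m); pv != NULL; pv = pv_next(pv))
        if (pv_has_nic_mapping(pv) &&
            !tpt_try_invalidate(pv, 100 /* ms, assumed bound */))
            return -1;   /* caller may retry later */

    /* Pass 2: CPU mappings; cannot fail. */
    for (pv = vm_page_first_pv(m); pv != NULL; pv = pv_next(pv))
        pv_remove_cpu_mapping(pv);
    return 0;
}
```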
Another problem is with VM system policies that are often based on architectural assumptions that do not hold for NICs. For example, the FreeBSD VM system unmaps pages from process page tables when moving them from the active to the inactive or cached state. This is because the VM system is willing to take a reasonable number of reactivation faults to determine how active a page actually is, based on the assumption that reactivation faults are relatively inexpensive [7].
NIC reactivation faults are significantly more expensive than CPU faults due to the lack of integration between the NIC and the host memory system. To reduce that cost, it would make sense to apply the deactivation policy only to CPU mappings, leaving NIC mappings intact for as long as the VM pages are memory resident. However, full integration of the NIC memory management unit into the VM system argues for applying this policy equally to NIC page accesses.
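The cost-motivated alternative might look like the following sketch, with illustrative helper names; as just noted, full integration argues against this asymmetry.

```c
#include <stddef.h>

struct vm_page;
struct pv_entry;

/* Assumed helpers for illustration. */
struct pv_entry *vm_page_first_pv(struct vm_page *);
struct pv_entry *pv_next(struct pv_entry *);
void pv_remove_cpu_mapping(struct pv_entry *);

/*
 * Sketch of the cost-motivated deactivation policy: only CPU
 * mappings are unmapped (a CPU reactivation fault is cheap), while
 * any NIC mapping stays in the TPT for as long as the page remains
 * memory resident, avoiding the far costlier NIC fault.
 */
void
vm_page_deactivate_sketch(struct vm_page *m)
{
    struct pv_entry *pv;

    for (pv = vm_page_first_pv(m); pv != NULL; pv = pv_next(pv)) {
        pv_remove_cpu_mapping(pv);
        /* The NIC mapping, if any, is deliberately left intact. */
    }
}
```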
NIC-OS Interface. The NIC initiates interaction with the VM system on the following occasions, using the interface of Table 4.