Vnode operation | Description
VOP_ACCESS | Check access permission |
VOP_BMAP | Map block number |
VOP_BREAD | Read a block |
VOP_BRELSE | Release a block buffer |
VOP_CLOSE | Mark file closed |
VOP_CREATE | Create a file |
VOP_FSYNC | Flush dirty blocks of a file |
VOP_GETATTR | Return file attributes |
VOP_INACTIVE | Mark vnode inactive |
VOP_IOCTL | Do I/O control operation |
VOP_LINK | Link to a file |
VOP_LOOKUP | Lookup file name |
VOP_MKDIR | Create a directory |
VOP_OPEN | Mark file open |
VOP_RDWR | Read or write a file |
VOP_REMOVE | Remove a file |
VOP_READLINK | Read symbolic link |
VOP_RENAME | Rename a file |
VOP_READDIR | Read directory entries |
VOP_RMDIR | Remove directory |
VOP_STRATEGY | Read/write fs blocks |
VOP_SYMLINK | Create symbolic link |
VOP_SELECT | Do select |
VOP_SETATTR | Set file attributes |
VOP_GETPAGES | Read and map pages in VM |
VOP_PUTPAGES | Write mapped pages to disk |
Existing vnode I/O interfaces are all synchronous. VOP_READ and VOP_WRITE take a struct uio buffer description as an argument and have copy semantics. VOP_GETPAGES and VOP_PUTPAGES are zero-copy interfaces that transfer data directly between the VM cache and the disk. VM pages returned by VOP_GETPAGES must be explicitly wired in physical memory before they can be used for device I/O. An interface for staging I/O should be designed to return buffers in a locked state. We believe that a vnode interface modeled after the low-level buffer cache interface, with new support for asynchronous operation, naturally fits the requirements of a DAFS server as outlined earlier. Such an asynchronous interface is easier to implement than asynchronous versions of VOP_GETPAGES and VOP_PUTPAGES, while being functionally equivalent to them given FreeBSD's unified VM and buffer cache.
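For reference, the existing copy-based path can be driven as in the following sketch, where a kernel caller describes its destination buffer with an iovec/uio pair and VOP_READ fills it from the buffer cache. Field names follow current FreeBSD (e.g. uio_td rather than the 4.x-era uio_procp), and vnode locking and error handling are elided.

/*
 * Sketch only: a synchronous, copy-semantics read through VOP_READ.
 * The destination kernel buffer is described by an iovec/uio pair and
 * the filesystem copies data out of the buffer cache into it.
 */
#include <sys/param.h>
#include <sys/proc.h>
#include <sys/uio.h>
#include <sys/vnode.h>

static int
copy_read(struct vnode *vp, void *buf, size_t len, off_t off,
    struct ucred *cred, struct thread *td)
{
    struct iovec iov;
    struct uio uio;

    iov.iov_base = buf;
    iov.iov_len = len;
    uio.uio_iov = &iov;
    uio.uio_iovcnt = 1;
    uio.uio_offset = off;
    uio.uio_resid = len;
    uio.uio_segflg = UIO_SYSSPACE;      /* destination is a kernel buffer */
    uio.uio_rw = UIO_READ;
    uio.uio_td = td;

    /* Blocks until the requested bytes have been copied into buf. */
    return (VOP_READ(vp, &uio, 0 /* ioflag */, cred));
}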
Central to this new interface (summarized in Table 2) is a VOP_AREAD call, which issues disk read requests and returns without blocking. VOP_AREAD internally uses a new aread() buffer cache interface (described below) integrated with the kqueue mechanism. It takes as one of its arguments an asynchronous I/O control block (kaiocb) used to track the progress of the request.
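One possible shape for such a control block, and for the proposed operation's prototype, is sketched below; the field names are illustrative placeholders rather than a definitive layout.

/*
 * Illustrative only: one possible layout of the proposed kaiocb
 * asynchronous I/O control block.  All field names are placeholders.
 */
struct kaiocb {
    struct vnode  *kaio_vp;      /* vnode being read */
    daddr_t        kaio_lblkno;  /* logical block of the request */
    int            kaio_bcount;  /* size of the request in bytes */
    struct buf    *kaio_bp;      /* buffer, valid once the I/O completes */
    struct klist   kaio_klist;   /* knotes to notify via EVFILT_KAIO */
    int            kaio_error;   /* completion status */
};

/* Proposed vnode operation: start the read and return without blocking. */
int VOP_AREAD(struct vnode *vp, struct kaiocb *cb);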
aread(struct vnode *vp, struct kaiocb *cb)
{
    derive block I/O request from cb;
    bp = getblk(vp, block request);
    if (bp not found in the buffer cache) {
        register kevent using EVFILT_KAIO;
        register kaio_biodone handler with bp;
        VOP_STRATEGY(vp, bp);
    }
}

On completion of a request issued by aread(), the data is in bp, which is held locked, and kaio_biodone() is called to deliver the event:
kaio_biodone(struct buf *bp)
{
    get kaiocb from bp;
    deliver event to knote in klist of kaiocb;
}

To unlock buffers and update filesystem state where necessary, VOP_BRELSE is used. Local filesystems would implement the interface of Table 2 in order to be exported efficiently by a DAFS server. In the absence of this or another suitable interface, a local filesystem can still be exported by a DAFS server using the existing interfaces, albeit with higher overhead, mainly due to multithreading.
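Put together, a more concrete (though still illustrative) rendering of aread() and kaio_biodone() in buffer cache terms might look as follows. getblk(), B_CACHE, BIO_READ, b_iodone and VOP_STRATEGY are existing interfaces (signatures shown as in current FreeBSD); the kaiocb linkage through b_fsprivate1 and the klist-based event delivery are assumptions made for the sketch.

/*
 * Illustrative rendering of the aread()/kaio_biodone() pseudocode.
 * Block mapping (VOP_BMAP) and page busying are elided.
 */
static void kaio_biodone(struct buf *bp);

static int
aread(struct vnode *vp, struct kaiocb *cb)
{
    struct buf *bp;

    /* Look up or create the buffer; it is returned locked. */
    bp = getblk(vp, cb->kaio_lblkno, cb->kaio_bcount, 0, 0, 0);
    cb->kaio_bp = bp;
    bp->b_fsprivate1 = cb;          /* assumed: stash the control block */

    if ((bp->b_flags & B_CACHE) != 0) {
        /* Cache hit: the data is already valid, deliver the event now. */
        kaio_biodone(bp);
        return (0);
    }

    /* Cache miss: arrange the completion callback, then start the I/O.
     * The caller is assumed to have registered its EVFILT_KAIO kevent
     * on cb->kaio_klist beforehand. */
    bp->b_iocmd = BIO_READ;
    bp->b_iodone = kaio_biodone;    /* called from biodone() */
    return (VOP_STRATEGY(vp, bp));
}

static void
kaio_biodone(struct buf *bp)
{
    struct kaiocb *cb = bp->b_fsprivate1;   /* assumed linkage */

    cb->kaio_error = bp->b_error;           /* simplified error propagation */
    /* Deliver the EVFILT_KAIO event (4.x-style KNOTE; newer kernels
     * would use a knlist and KNOTE_UNLOCKED). */
    KNOTE(&cb->kaio_klist, 0);
}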
Network event delivery can be integrated with that of disk I/O through the kevent mechanism, as described earlier. Each time an RDMA descriptor is issued, a kevent is registered using the EVFILT_RDMA filter and recorded in the completion group (CG) structure. Completion group handlers need to deal with kqueue event delivery:
send_event(CG *cq, Transport *vi)
{
    deliver event to knote in klist of CG;
}

The DAFS server is notified of new events by periodically polling the kqueue. Alternatively, a common handler is invoked each time a network or disk event occurs.

We illustrate the use of the proposed vnode interface to the buffer cache by breaking down and describing the steps in read and write operations implemented by a DAFS server. For comparison with existing interfaces, we describe the same steps implemented by NFS. Without loss of generality we assume an FFS underlying filesystem at the server.
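Before turning to those walkthroughs, the polling alternative can be rendered, purely for illustration, as follows; it assumes the proposed EVFILT_KAIO and EVFILT_RDMA filter constants and the per-request handlers exist, and it is shown in userspace style for brevity (the in-kernel server would use the equivalent in-kernel kqueue interfaces).

#include <sys/types.h>
#include <sys/event.h>

struct kaiocb;                              /* proposed control block (see above) */
void dafs_disk_done(struct kaiocb *cb);     /* hypothetical completion handlers */
void dafs_net_done(void *desc);

void
dafs_event_loop(int kq)
{
    struct kevent ev[64];
    int i, n;

    for (;;) {
        /* Block until disk or network completions are pending. */
        n = kevent(kq, NULL, 0, ev, 64, NULL);
        if (n == -1)
            break;
        for (i = 0; i < n; i++) {
            switch (ev[i].filter) {
            case EVFILT_KAIO:       /* buffer cache read completed */
                dafs_disk_done((struct kaiocb *)ev[i].udata);
                break;
            case EVFILT_RDMA:       /* RDMA descriptor completed */
                dafs_net_done(ev[i].udata);
                break;
            }
        }
    }
}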
Read. With DAFS, a client (direct) read request carries the remote memory addresses of the client buffers. The DAFS server issues a VOP_AREAD to read and lock all necessary file blocks in the buffer cache. VOP_AREAD starts the disk operations and returns without blocking, after registering with kqueue. Once the pages are resident and locked and the server has been notified via kqueue, it issues RDMA Write operations to client memory for all requested file blocks. When the transfers are done, the server calls VOP_BRELSE to unlock the file's buffer cache blocks.
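The server-side orchestration of this read path might be structured as in the following sketch; struct dafs_req, rdma_write() and the VOP_BRELSE calling convention are placeholders for illustration, not actual interfaces.

/* Hypothetical per-request state kept by the DAFS server. */
struct dafs_req {
    struct vnode   *r_vp;           /* file being accessed */
    struct kaiocb   r_cb;           /* asynchronous read control block */
    void           *r_conn;         /* transport handle (placeholder) */
    uint64_t        r_client_addr;  /* remote buffer address from the request */
    int             r_stable;       /* stable (synchronous) write requested */
};

/* Placeholder for posting an RDMA Write descriptor on the transport. */
void rdma_write(void *conn, uint64_t remote_addr, void *local, long len);

static void
dafs_read_start(struct dafs_req *req)
{
    /* Locks the block in the buffer cache; returns without blocking. */
    VOP_AREAD(req->r_vp, &req->r_cb);
}

/* Runs when the EVFILT_KAIO completion event for req is delivered. */
static void
dafs_read_push(struct dafs_req *req)
{
    struct buf *bp = req->r_cb.kaio_bp;

    /* RDMA Write: buffer cache block -> client memory, zero copy. */
    rdma_write(req->r_conn, req->r_client_addr, bp->b_data, bp->b_bcount);
}

/* Runs when the RDMA Write completion (EVFILT_RDMA) is delivered. */
static void
dafs_read_finish(struct dafs_req *req)
{
    VOP_BRELSE(req->r_vp, req->r_cb.kaio_bp);   /* unlock the block (assumed signature) */
}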
With NFS, on a client read operation the server issues a VOP_READ to the underlying filesystem with a uio parameter pointing to a gather/scatter list of mbufs that will eventually form the response to the read request RPC. In the FFS implementation of VOP_READ, and without applying any optimizations, a loop reads and locks file blocks into the buffer cache using bread(), subsequently copying the data into the mbufs pointed to by uio. For page-aligned, page-sized buffers, page-flipping techniques can be applied to save the copy into the mbufs.
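That loop is essentially the following simplified sketch, where bread(), uiomove() and brelse() are the existing buffer cache interfaces; the function name is ours, and end-of-file handling, read-ahead and the surrounding FFS details are omitted.

/*
 * Simplified copy loop of an FFS-style VOP_READ.
 */
static int
read_copy_loop(struct vnode *vp, struct uio *uio, struct ucred *cred, long bsize)
{
    struct buf *bp;
    int error = 0;

    while (uio->uio_resid > 0 && error == 0) {
        daddr_t lbn = uio->uio_offset / bsize;          /* logical block */
        long on = uio->uio_offset % bsize;              /* offset within it */
        int n = (int)MIN(bsize - on, uio->uio_resid);   /* bytes this pass */

        /* Read and lock the block in the buffer cache. */
        error = bread(vp, lbn, bsize, cred, &bp);
        if (error)
            break;

        /* Copy from the buffer cache into the mbufs described by uio. */
        error = uiomove((char *)bp->b_data + on, n, uio);
        brelse(bp);                                     /* unlock and release */
    }
    return (error);
}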
Write. With DAFS, a client (direct) write request carries only the client memory addresses of the data buffers. The DAFS server uses VOP_AREAD to read and lock all necessary file blocks in the buffer cache. When the pages are resident and locked, it issues RDMA Read requests to fetch the data from the client buffers directly into the buffer cache blocks. When the transfer is done, the server uses one of VOP_BWRITE, VOP_BDWRITE, or VOP_BAWRITE, depending on whether this is a stable write request or not, to issue the disk write I/O and unlock the buffers. Additionally, a metadata update is effected if requested.
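A sketch of the corresponding write-path orchestration, reusing the hypothetical struct dafs_req from the read sketch and a placeholder rdma_read(), might be as follows; the commit step is shown with the underlying buffer cache calls bwrite()/bdwrite(), which the proposed VOP_BWRITE/VOP_BDWRITE/VOP_BAWRITE operations would wrap.

/* Placeholder for posting an RDMA Read descriptor on the transport. */
void rdma_read(void *conn, void *local, uint64_t remote_addr, long len);

/* Runs when the EVFILT_KAIO completion for the write request arrives:
 * the file block is resident and locked, so fetch the client data. */
static void
dafs_write_fetch(struct dafs_req *req)
{
    struct buf *bp = req->r_cb.kaio_bp;

    /* RDMA Read: client memory -> buffer cache block, zero copy. */
    rdma_read(req->r_conn, bp->b_data, req->r_client_addr, bp->b_bcount);
}

/* Runs when the RDMA Read completion (EVFILT_RDMA) is delivered. */
static void
dafs_write_commit(struct dafs_req *req)
{
    struct buf *bp = req->r_cb.kaio_bp;

    if (req->r_stable)
        bwrite(bp);     /* synchronous write; releases the buffer when done */
    else
        bdwrite(bp);    /* delayed write; marks the buffer dirty and releases it */
}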
With NFS, a client write operation carries the data to be written inline with the RPC request. The NFS server prepares a uio with a gather/scatter list of all the data mbufs and calls VOP_WRITE. Apart from the uio parameter that describes the transfer, an ioflags parameter is passed to indicate whether the write to disk should happen synchronously. With NFS version 2, all writes and metadata updates are synchronous; NFS versions 3 and 4 allow asynchronous writes. In the FFS implementation of VOP_WRITE, a loop reads and locks the file blocks to be written into the buffer cache using bread(), copies into them the data described by uio, and then issues the write with bwrite() (synchronous), bdwrite() (delayed), or bawrite() (asynchronous), depending on whether this is a stable write request (as indicated by ioflags). Finally, a metadata update is effected if requested.
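In simplified form, and glossing over block allocation for writes that extend the file, that loop might look as follows; the function name is ours, while bread(), uiomove(), bwrite(), bdwrite() and IO_SYNC are existing interfaces.

/*
 * Simplified write loop of an FFS-style VOP_WRITE.
 */
static int
write_copy_loop(struct vnode *vp, struct uio *uio, int ioflag,
    struct ucred *cred, long bsize)
{
    struct buf *bp;
    int error = 0;

    while (uio->uio_resid > 0 && error == 0) {
        daddr_t lbn = uio->uio_offset / bsize;          /* logical block */
        long on = uio->uio_offset % bsize;              /* offset within it */
        int n = (int)MIN(bsize - on, uio->uio_resid);   /* bytes this pass */

        /* Read and lock the block so a partial write preserves the rest of it. */
        error = bread(vp, lbn, bsize, cred, &bp);
        if (error)
            break;

        /* Copy the RPC's data (described by uio) into the buffer cache block. */
        error = uiomove((char *)bp->b_data + on, n, uio);
        if (error) {
            brelse(bp);
            break;
        }

        if (ioflag & IO_SYNC)
            bwrite(bp);     /* stable write: synchronous, releases bp */
        else
            bdwrite(bp);    /* delayed write (bawrite() would start it asynchronously) */
    }
    return (error);
}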
An interesting consequence of the DAFS server implementing file writes with RDMA Read (instead of client-initiated RPC or RDMA Write) is that the server pulls data from client memory no faster than dirty buffers can be written to disk. This bandwidth-matching capability becomes very important in multi-gigabit networks, where network bandwidth is often greater than disk bandwidth.