Next: Status, Performance, and Related Up: Linux process internal environment Previous: Clone


Files, file descriptors, and name space

Typical implementations of UNIX APIs keep client-specific (application-specific) state for I/O and IPC elements (files, sockets, pipes, etc.) in the kernel. For files, for example, the kernel not only manages attributes such as ownership and file length but also keeps per-client state such as the file position. Clients identify the target of an operation by passing a file descriptor (fd), an index into a per-process table kept by the kernel. Most operations on such elements may need to block; when they do, the blocking also takes place in the kernel.

K42 takes a different approach. The client-specific information for open files, sockets, and pipes resides in the application's address space: K42 stores the state of a stateful interface in the object implementing that interface, and that object is placed in the application's address space. When blocking is necessary, the thread blocks until the server notifies it that the operation can proceed. As a consequence, kernel resources (kernel thread stack, process state, register state, etc.) are not tied up. This avoids the difficulties encountered in M-on-N scheduling models. In some scenarios, operations can execute entirely in the application's address space, thereby achieving significant performance gains.

Operations such as socket(), pipe(), creat(), and open() create a new IO/IPC element and return a file descriptor for it. K42's implementation of these system calls proceeds in three steps: (1) the client contacts the name-space server (or MountPointServer) to discover the appropriate server for that resource; (2) the client initiates an interaction with that server, which identifies the server object representing the resource (creating one, if necessary), checks credentials, and returns to the client a handle for the server object, including capabilities indicating that the client is allowed to invoke the server object directly; and (3) the client creates an object to represent the client side of the open file, socket, or pipe, storing in it the handle to the server object. A file descriptor is associated with the newly created client object, and this mapping is stored in a file descriptor array kept by the application.
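The three steps can be sketched in a few lines of Python. This is only an illustrative model, not K42 code: the class names (`FileServer`, `Handle`, `ClientFile`, `Process`), the `rights` set standing in for capabilities, and the prefix-keyed mount table are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ServerObject:
    """Server-side object representing one resource (hypothetical)."""
    path: str
    data: bytearray = field(default_factory=bytearray)


@dataclass(frozen=True)
class Handle:
    """Handle returned to the client; `rights` stands in for capabilities."""
    target: ServerObject
    rights: frozenset


class FileServer:
    """Hypothetical file-system server owning the server objects."""

    def __init__(self, allowed_users):
        self.allowed_users = set(allowed_users)
        self.objects = {}

    def open(self, user, rel_path):
        # Check credentials, then find the server object or create it.
        if user not in self.allowed_users:
            raise PermissionError(user)
        obj = self.objects.setdefault(rel_path, ServerObject(rel_path))
        return Handle(obj, frozenset({"read", "write"}))


class ClientFile:
    """Client-side object: holds the handle and the per-client state."""

    def __init__(self, handle):
        self.handle = handle
        self.position = 0


class Process:
    """Models one application with its own descriptor array."""

    def __init__(self, user, mount_table):
        self.user = user
        self.mount_table = mount_table  # assumed shape: prefix -> server
        self.fd_table = []              # kept by the application itself

    def open(self, path):
        # (1) Find the server responsible for this part of the name space.
        prefix = max((p for p in self.mount_table if path.startswith(p)),
                     key=len)
        server = self.mount_table[prefix]
        # (2) Ask that server for a handle to the server object.
        handle = server.open(self.user, path[len(prefix):])
        # (3) Create the client object and bind it to a new descriptor.
        self.fd_table.append(ClientFile(handle))
        return len(self.fd_table) - 1
```

In this model two opens of the same path yield two descriptors and two client objects (each with its own position) that share one server object.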

Subsequent operations on the file descriptor are delegated to the client object. The client object implementation usually invokes the corresponding operation on the server object representing the resource and updates its own local information. In some cases, the client object is able to carry out the operation completely. For example, for a small file with a single client, the file position, the file data, and all stat information (including file length) can be managed in the client. The information is propagated to the server when the file is closed, becomes too large to be reasonably cached in the client, or is accessed by additional clients.
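A minimal sketch of this caching behaviour follows, assuming a hypothetical size threshold (`CACHE_LIMIT`) and hypothetical class names. It models two of the three propagation triggers (close and growth past the threshold); access by additional clients is not modeled.

```python
class ServerFile:
    """Stand-in for the server object holding the authoritative state."""

    def __init__(self):
        self.data = bytearray()
        self.flushes = 0  # counts propagations, for illustration only

    def store(self, data):
        self.data = bytearray(data)
        self.flushes += 1


class CachingClientFile:
    CACHE_LIMIT = 4096  # hypothetical "too large to cache" threshold

    def __init__(self, server):
        self.server = server
        self.position = 0
        self.cache = bytearray(server.data)
        self.local = True  # True while the client handles everything itself

    def write(self, payload):
        end = self.position + len(payload)
        self.cache[self.position:end] = payload
        self.position = end
        # Once the file outgrows the cache, every write goes to the server.
        if not self.local or len(self.cache) > self.CACHE_LIMIT:
            self.local = False
            self.server.store(self.cache)

    def length(self):
        # Stat-like information answered without contacting the server.
        return len(self.cache)

    def close(self):
        # Propagate locally managed state when the file is closed.
        if self.local:
            self.server.store(self.cache)
```

A small write followed by `close()` reaches the server exactly once, while a write that pushes the file past the threshold is propagated immediately.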

K42's synchronous I/O model is similar to other asynchronous models in the sense that threads do not block in the kernel or servers. To illustrate how blocking and unblocking occur in the application space even for synchronous requests, we describe our implementation of sockets. All socket interfaces in K42 are similar to the Unix ones, except that an extra parameter passes back to the client information about the state of the object (for example, whether it is readable or writable, or whether there is an exceptional condition on it). The client uses this state information to decide whether it should block. The server makes an asynchronous upcall to notify the client of state changes (e.g., data becoming available). This scheme allows us to implement select and/or poll purely in user space. Also, asynchronous notifications are piggybacked on synchronous responses to client invocations. Our implementation is also able to detect when a client is ignoring notifications (for example, a forked child that inherited the file descriptor but is not interested in the socket), and to stop sending them.
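The following sketch models the socket scheme with Python threads. It is an illustration under assumptions, not the K42 interface: the flag values, the `listeners` callback list standing in for upcalls, and all class names are invented for the example, and the code is not hardened for several concurrent readers.

```python
import threading
from collections import deque

READABLE, WRITABLE = 1, 2  # hypothetical state flags


class ServerSocket:
    """Stand-in for the server object behind a socket."""

    def __init__(self):
        self.queue = deque()
        self.listeners = []  # callbacks standing in for upcalls

    def state(self):
        return WRITABLE | (READABLE if self.queue else 0)

    def recv(self):
        # The second return value models the extra state out-parameter.
        data = self.queue.popleft() if self.queue else None
        return data, self.state()

    def deliver(self, data):
        self.queue.append(data)
        for notify in self.listeners:
            notify(self.state())  # notify clients of the state change


class ClientSocket:
    """Client object: caches the state and blocks in the application."""

    def __init__(self, server):
        self.server = server
        self.state = server.state()
        self.cond = threading.Condition()
        server.listeners.append(self._on_state_change)

    def _on_state_change(self, state):
        with self.cond:
            self.state = state
            self.cond.notify_all()

    def recv(self, timeout=None):
        with self.cond:
            # Wait here, not in the server, until the socket is readable.
            if not self.cond.wait_for(lambda: self.state & READABLE, timeout):
                return None
            # The reply carries the new state along with the data.
            data, self.state = self.server.recv()
            return data


def poll(sockets):
    """User-space poll: consults only the client-side cached state."""
    return [s for s in sockets if s.state & READABLE]
```

Because each client object already holds the latest known state, `poll` here never contacts the server.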

Many interfaces, such as signals, select and poll, and cursor management in files, require synchronization. Linux provides this synchronization in the kernel. K42 provides it either in the application space (if that application is the only user of the resource) or in the server object managing that resource. In fact, we have an interesting example of the hot-swapping mechanism[24] being used to switch between these two implementations when the requests for a given resource change from coming from a single application to originating from multiple applications.
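As a rough illustration of switching implementations, the sketch below swaps a file cursor from an application-local implementation to a server-backed one. It shows only the general idea of transferring state and redirecting later calls; it is not the hot-swapping mechanism of [24], and every name in it is hypothetical.

```python
class LocalCursor:
    """File position kept and synchronized inside one application."""

    def __init__(self, position=0):
        self.position = position

    def advance(self, n):
        self.position += n
        return self.position


class ServerCursor:
    """File position kept in server-side state shared by all clients."""

    def __init__(self, shared):
        self.shared = shared  # dict standing in for the server object

    def advance(self, n):
        self.shared["position"] += n  # would be a call to the server
        return self.shared["position"]


class SwappableCursor:
    """Forwards calls to whichever implementation is currently installed."""

    def __init__(self):
        self.impl = LocalCursor()

    def advance(self, n):
        return self.impl.advance(n)

    def share(self, shared):
        # Swap: move the state to the server, then redirect later calls.
        if isinstance(self.impl, LocalCursor):
            shared["position"] = self.impl.position
            self.impl = ServerCursor(shared)
```

After `share()` is called, a second client constructed over the same shared state observes and updates the same position.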

K42's support for namespace traversal involved in pathname-based operations (similar to Welch and Ousterhout[25]) has performance and scalability advantages over the usual pathname lookup schemes due to its fine-grained locking and ability to resolve in the application address space the parts of the pathname that identify mount points. A MountPointServer stores the association between parts of the name space and specific file-system servers. The information available in this server is cached in the application's address space during its initialization phase. The MountPointServer publishes the version number for its up-to-date information. An application can use these version numbers to check efficiently if it has out-of-date information.

In the uncommon case that the information is not up to date, the application requests the current information from the MountPointServer. Once the mounting information has been used to resolve part of the pathname and identify the corresponding file-system server, the client contacts that file-system server and passes it the arguments of the operation to be performed and the unresolved part of the pathname.
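The version check and the client-side resolution of mount points can be sketched as follows. The data layout (a dictionary from mount prefix to server), the `snapshot` method, and the `refreshes` counter are assumptions for illustration; in this single-process model, reading `server.version` stands in for reading the published version number.

```python
class MountPointServer:
    """Holds the authoritative mount table and a version number."""

    def __init__(self):
        self.version = 0
        self.mounts = {}  # assumed shape: mount prefix -> file-system server

    def mount(self, prefix, fs_server):
        self.mounts[prefix] = fs_server
        self.version += 1

    def snapshot(self):
        return self.version, dict(self.mounts)


class MountCache:
    """Per-application copy of the mount table, filled at initialization."""

    def __init__(self, server):
        self.server = server
        self.version, self.mounts = server.snapshot()
        self.refreshes = 0  # for illustration only

    def resolve(self, path):
        # Cheap staleness check: compare version numbers only.
        if self.version != self.server.version:
            self.version, self.mounts = self.server.snapshot()
            self.refreshes += 1
        components = path.strip("/").split("/")
        # Try the longest mount prefix first, then shorter ones.
        for n in range(len(components), -1, -1):
            prefix = "/" + "/".join(components[:n])
            if prefix in self.mounts:
                # Return the server and the still-unresolved remainder.
                return self.mounts[prefix], "/".join(components[n:])
        raise FileNotFoundError(path)
```

The returned pair corresponds to the file-system server to contact and the unresolved part of the pathname that is passed along with the operation.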

The file-system independent layer in K42 implements caching of directory and file entries recently resolved. Each file-system instance has its own caching data structure. Fine-grained locking is used when manipulating this data structure, avoiding well-known scalability bottlenecks in name resolution.
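One way to picture fine-grained locking in such a cache is a tree in which each cached directory carries its own lock, so lookups in unrelated directories do not contend. The sketch below is an assumption about structure, not a description of K42's data structure; `backend_lookup` is a hypothetical slow-path callable that returns a `DirNode` for a directory and any other value for a file.

```python
import threading


class DirNode:
    """One cached directory; it carries its own lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.children = {}  # name -> DirNode (directory) or file entry


class NameCache:
    """One instance per file-system instance."""

    def __init__(self, backend_lookup):
        self.root = DirNode()
        self.backend_lookup = backend_lookup  # hypothetical slow path

    def lookup(self, path):
        node = self.root
        walked = []
        for name in path.strip("/").split("/"):
            walked.append(name)
            with node.lock:  # only this one directory is locked
                child = node.children.get(name)
                if child is None:
                    # Cache miss: resolve once, then remember the entry.
                    child = self.backend_lookup("/".join(walked))
                    node.children[name] = child
            node = child
        return node
```

A repeated lookup of the same path is served entirely from the cache, and a lookup of a sibling entry reuses the cached parent directories.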


2003-04-08