Although safe runtime kernel extension has been addressed in the literature, such functionality is not generally available in commonly used operating systems. Several classes of solutions have been proposed:
Programming Language Techniques
In the SPIN operating system,
the safety of kernel extensions is based on the properties of the
Modula-3 type-safe programming language and a trusted
compiler [4]. Because SPIN's kernel extensions rely on
relatively heavyweight external compile, link, and execute facilities,
their creation costs must be amortized over extended and frequent use. As a
result, SPIN extensions are best suited to long-lived functionality.
The Open Kernel Environment (OKE) [5] employs a variation of the same idea, replacing the type-safe Modula-3 with Cyclone, an `elastic', customizable version of C, and integrating trust management with the compiler.
In contrast to these schemes, kernel plugins are designed to be lightweight, agile, and easy to adapt on-the-fly. Plugin creation, invocation, and removal overheads are very low and do not involve execution of external compilers or linkers. Furthermore, our facility implements both preemption and isolation and thus does not need to trust any binaries outside the kernel.
Proof-Carrying Code
Proof-carrying code [18] is a
mechanism for safety verification that requires a `safety
proof' to be attached to each piece of code, certifying its adherence to
a predefined `safety policy'. The proof can be validated quickly,
without cryptography or external references. Despite these
desirable properties, proof-carrying code has three drawbacks.
The first and foremost is that generating a comprehensive safety policy for non-trivial code is very hard, because the policy must cover all obvious and implied rules and invariants of the execution environment. Furthermore, there is no way to guarantee the completeness of the policy itself. Second, the method scales poorly because the safety proof grows quickly relative to the code it certifies. As an example, a trivial function summing two numbers under a basic safety policy is reported to have 60 bytes of code and 430 bytes of safety proof [18], a proof roughly seven times the size of the code. Finally, no automatic proof generators exist.
Kernel plugins provide an alternative: an engineering solution that achieves native code performance and safety without the burden of a proof or the restriction to a type-safe language.
Software Fault Isolation
SFI approaches [26] rely on
rewriting the machine code of extensions so that memory accesses and
jump targets are checked and instrumented, thereby restricting them to
the scope of the extension's protection domain. Only after such sandboxing is an extension allowed to execute. Program interpretation
is a related approach in which extensions are executed by a trusted
interpreter that enforces safety.
Typical examples of such extensible kernels are VINO [21], which relies on SFI, and packet filters like the Berkeley Packet Filter [16], which implements an interpreted `little language' for custom in-kernel packet filtering rules. The primary problem with these approaches is that safety comes at the price of non-trivial performance degradation, which makes them less appealing for high-performance applications. Type-safe language extensions are reported to run 10% to 150% slower than regular C code, and SFI can be as much as 220% slower [8]. In comparison, kernel plugins do not incur per-instruction execution overheads. Plugin code generation is a one-time cost that is significantly smaller than that of the compilation-based alternatives and is amortized over the lifetime of the plugin.
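To make the source of SFI's per-access overhead concrete, the following is a minimal sketch of address sandboxing in the spirit of the rewriting described above. It is an illustration under stated assumptions, not the mechanism of VINO or of any particular SFI system: the sandbox base and size, and the function names `sfi_sandbox_addr` and `sfi_addr_ok`, are hypothetical. An SFI rewriter would insert the equivalent of one of these operations before every store and indirect jump in the extension.

```c
#include <stdint.h>

/* Hypothetical sandbox layout (an assumption for illustration):
 * a 64 KiB region whose base address is aligned to its size. */
#define SANDBOX_BASE 0x40000000u
#define SANDBOX_SIZE 0x00010000u          /* must be a power of two */
#define SANDBOX_MASK (SANDBOX_SIZE - 1u)

/* "Sandboxing" variant: force any address into the region by
 * overwriting its upper bits with the sandbox base. The access is
 * never rejected, but it cannot escape the region. */
uint32_t sfi_sandbox_addr(uint32_t addr)
{
    return SANDBOX_BASE | (addr & SANDBOX_MASK);
}

/* "Checking" variant: report whether the address already lies in
 * the region, so that the caller can fault the extension instead
 * of silently redirecting the access. */
int sfi_addr_ok(uint32_t addr)
{
    return (addr & ~SANDBOX_MASK) == SANDBOX_BASE;
}
```

With this layout, `sfi_sandbox_addr(0xC0001234u)` redirects the out-of-region address to `0x40001234`, while `sfi_addr_ok(0xC0001234u)` returns 0. Either variant adds instructions to every instrumented access, which is the per-instruction cost that a hardware-enforced scheme avoids.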
Hardware Fault Isolation
HFI relies on hardware-provided
memory management features to enforce the isolation between the kernel
and extensions. This is the same method that traditional operating
systems use to isolate their kernels from user-space applications. It
also forms the basis for most `virtualization' and `isolation'
systems, which can be viewed as very coarse-grain extension
mechanisms. Notable examples include the VMware [25] and
Virtual PC [9] virtual machines, as well as the library
operating systems supported by Exokernel [12], the Denali
isolation kernel [28], and Xen [3], a new VM
monitor that defines an abstract VM to which kernels are then ported,
reportedly achieving close to native performance.
Palladium [8] also uses hardware features to achieve extension isolation, but at a somewhat finer granularity and without striving to provide a complete virtualization environment. It limits its scope to untrusted kernel modules, and uses segmentation and privilege-checking hardware to ensure that they cannot interfere with the kernel proper. While Palladium's strategy results in better performance than virtual machines, it still restricts system adaptation to relatively coarse-grained kernel modules, and limits the dynamic use of such extensions because it requires off-line module compilation.
Kernel Plugins
Like some of the above approaches, we choose to
employ a hardware-based scheme, exploiting the x86 architecture's
segmentation hardware and unused privilege rings to provide isolation.
Specifically, the x86 hardware provides four privilege levels, or
`rings'. Typical operating systems use ring 0 (most privileged) and
ring 3 (least privileged) for kernel and user modes,
respectively. Kernel plugins utilize one of the unused privilege
rings. Thus, memory protection and control-flow restrictions are
enforced entirely in hardware, causing no discernible performance
degradation. This is a popular isolation approach employed by all x86
virtual machine projects of which we are aware, as well as the
implementation of intra-address space protection in Palladium.
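The checks that the segmentation hardware applies on each access can be modeled in a few lines. The sketch below is a simplified software model of the x86 data-segment rules, not code from the plugin facility: it ignores paging, expand-down segments, and conforming code segments. The `struct segment` layout, the function names, and the choice of ring 1 for plugins are assumptions made for illustration, since the text says only that one of the unused rings is used.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    RING_KERNEL = 0,
    RING_PLUGIN = 1,   /* assumption: the text does not name the ring */
    RING_USER   = 3
};

/* Simplified model of an expand-up data segment descriptor. */
struct segment {
    uint32_t base;     /* linear base address */
    uint32_t limit;    /* last valid byte offset within the segment */
    unsigned dpl;      /* descriptor privilege level (0..3) */
};

/* Privilege check on loading a data-segment selector: the less
 * privileged of CPL and RPL must be at least as privileged as the
 * descriptor's DPL, i.e. max(CPL, RPL) <= DPL. */
bool seg_load_allowed(unsigned cpl, unsigned rpl, const struct segment *seg)
{
    unsigned effective = cpl > rpl ? cpl : rpl;
    return effective <= seg->dpl;
}

/* Limit check on each access: all `size` bytes starting at `offset`
 * must fall within [0, limit]. Written to avoid 32-bit overflow. */
bool seg_access_in_bounds(const struct segment *seg,
                          uint32_t offset, uint32_t size)
{
    if (size == 0 || offset > seg->limit)
        return false;
    return size - 1u <= seg->limit - offset;
}
```

In this model, code running at `RING_PLUGIN` cannot load a kernel data segment whose DPL is `RING_KERNEL`, and any access past a plugin segment's limit fails the bounds check. On real hardware both conditions raise a protection fault without any instructions being added to the plugin's code.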
Unlike VMware and Virtual PC style VMs, however, we do not strive to provide the illusion of a dedicated machine. Instead, we define a streamlined, lightweight execution environment that is better suited to a plugin's purpose of customizing existing services rather than deploying new ones. Unlike Exokernel, Denali, and Xen, we do not modify host architectural assumptions and require no porting or reimplementation of host-kernel subsystems that do not need to be extensible. Finally, unlike Palladium, we strive to achieve finer granularity and enable online adaptation at runtime while keeping setup overheads low. Experimental results presented in this paper demonstrate that kernel plugins incur no additional runtime cost per instruction. We also show that the overheads of protected control transfers to and from plugins are both small and predictable.