BSDCon '03 Paper   
[BSDCon '03 Technical Program]
Running BSD Kernels as User Processes by Partial Emulation and Rewriting of Machine InstructionsHideki Eiraku
AbstractA user-level operating system (OS) can be implemented as a regular user process on top of another host operating system. Conventional user-level OSes, such as User Mode Linux, view the underlying host operating system as a specific hardware architecture. Therefore, the implementation of a user-level OS often requires porting of an existing kernel to a new hardware architecture. This paper proposes a new implementation method of user-level OSes by using partial emulation of hardware and static rewriting of machine instructions. In this method, privileged instructions and their related non-privileged instructions in a native operating system are statically translated into subroutine calls that perform emulation. The translated instructions of the user-level OS are executed by both the real CPU and the partial emulator. This method does not require detailed knowledge about kernel internals. By using the proposed method, NetBSD and FreeBSD are executed as user processes on NetBSD and Linux.
1 IntroductionRunning multiple operating systems (OSes) simultaneously over a single hardware platform has recently become a popular system structuring approach that offers a number of benefits [SVL01] [Pap00] [Dik00]. First, application programs written for different operating systems, such as Unix and Windows, can be simultaneously executed on a single computer. Second, several versions of a single operating system, such as MacOS9 and MacOSX, can co-exist on the same platform. Other benefits include virtual hosting and easier system management and maintenance [HH79] [SVL01] [Pap00] [Dik00]. There are two prominent approaches to running multiple operating systems over a single hardware platform: Virtual machines [LDG+03] [Law03] and user-level operating systems [Dik00] [AAS94] [Tad92]. Virtual machines provide isolated execution environments for multiple operating system kernels, which can run over the native hardware. A user-level operating system is an operating system that runs as a regular user process on another host operating system. Conventional user-level OSes view the underlying host operating system as a specific hardware architecture. Therefore, the implementation of a user-level OS often requires porting of an existing kernel to a new hardware architecture. For example, User Mode Linux (UML) [Dik00], which is a user-level OS that runs on Linux and provides another Linux system image, adds a new architecture called um based on the i386 architecture. In general, such porting involves significant implementation effort, and requires detailed knowledge about the kernel and the base and new architectures. In porting of User Mode Linux, the size of the new um-dependent part is 33,000 lines while the size of the base i386-dependent part is 40,000 lines. In this paper, we propose a new implementation method of user-level OSes with partial emulation of hardware and rewriting of machine instructions at compile time. The key idea is to enable the execution of most instructions by the real CPU with the exception of privileged instructions, hardware interrupts, and the interaction with some peripheral devices, which are emulated. We call this type of emulator a partial emulator or a lightweight virtual machine (LVM) because such a program does not have to emulate typical instructions, such as load, store, and arithmetic operations. In contrast, we refer to an emulator that executes all instructions as a full emulator. In our implementation method, we emulate all privileged instructions. In addition, we emulate some non-privileged instructions that are tightly related with the privileged instructions. It is easy to detect execution of privileged instructions because the real CPU throws privilege violation exceptions. However, it is not trivial to detect execution of such non-privileged instructions. To solve this problem, we use static rewriting of machine instructions at compile-time in two ways. One way is to insert an illegal instruction before each non-privileged instruction to be detected. Another way is to replace privileged instructions and related non-privileged instructions with subroutine calls for emulation. The translated instructions of a user-level OS are executed by both the real CPU and the partial emulator. By using our proposed method, we can generate a user-level OS based on a native OS without detailed knowledge about user-level OS internals. Furthermore, we can catch up the evolution of the base native OS easily. One disadvantage of our method is that we require source code of the user-level OS. By using the proposed method, NetBSD and FreeBSD kernels are executed as user processes on NetBSD and Linux. Our user-level NetBSD on Linux is faster than NetBSD on Bochs [LDG+03], a full PC emulator, by a factor of 10. However, our user-level NetBSD is slower than NetBSD on VMware and User Mode Linux on Linux. From the experiments results, we show that the main sources of slowdown are the emulation of memory mapping hardware and the redirections of system calls and page faults. The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 describes the emulation of privileged and non-privileged instructions, the redirections of system calls and page faults, and the emulation of memory mapping hardware. Section 4 shows modifications of the NetBSD kernel for hosting our partial emulator. Section 5 describes modifications of the NetBSD kernel and the FreeBSD kernel for running as user processes. Section 6 shows the performance of the user-level NetBSD. Section 7 shows future directions, and Section 8 concludes the paper.
|
In the insertion method, we insert an illegal instruction for each non-privileged instruction to be detected. The following is an example of insertion:
Before: str %eax After: .byte 0x8e, 0xc8 str %eax
Since the byte sequence "0x8e, 0xc8" works as an illegal instruction in IA-32, we can detect the execution of the str instruction in the parent process of the user-level OS through the signal facility and the process trace facility. The parent process is the partial emulator.
Figure 2 shows execution of a privileged instruction or an inserted illegal instruction by the partial emulator. The partial emulator is a parent process of the user-level OS, and it is usually waiting for the child process being stopped with the wait() system call. When the user-level OS executes a privileged instruction or an inserted illegal instruction, the partial emulator is notified as if the child process received a signal (SIGILL). Next, the partial emulator reads the registers including the program counter with the ptrace() system call, and fetches the instruction pointed by the program counter. If the instruction is a privileged instruction, it is handled by the partial emulator. If the instruction is an inserted illegal instruction, the partial emulator fetches and handles the next instruction. After that, the partial emulator adjust the program counter by setting the registers with the ptrace() system call. Finally, the partial emulator continues the execution with the ptrace() system call, and is going to wait for the child process being stopped again. In summary, the partial emulator has to issue four system calls for each execution of a privileged or non-privileged instruction to be detected.
We have implemented a partial emulator for IA-32 based on the proposed method. The partial emulator consists of 1,500 lines of C code and 30 lines of assembly language code. Our partial emulator is much smaller than Bochs that consists of 50,000 lines of C++ code.
We have also implemented an assembler preprocessor for rewriting machine instructions. For IA-32, this preprocessor rewrites the following instructions:
In the insertion method, the partial emulator has to issue four system calls for each execution of a privileged or non-privileged instruction to be detected, as described in Section 3.1.1. We eliminate this overhead by rewriting the privileged and non-privileged instructions with subroutine calls that emulate these instructions. An example of rewriting follows:
Before: mov %eax, %cr3 After: call mov_eax_cr3The subroutine mov_eax_cr3 performs emulation of the instruction. We also perform inlining for simple instructions. Our newer partial emulator consists of 800 lines of C code in the different address space, and 250 lines of C code and 300 lines of assembly language code in the same address space as a user-level OS.
When a user process on a user-level OS issues a system call or causes a page fault, the host OS should not interpret the event by itself. Instead, the host OS should notify the user-level OS of the event (a system call or page fault). If the host OS provides a redirection mechanism, this is an ideal mechanism for a user-level OS (Figure 3 (a)).
The Mach microkernel [GDFR90] includes a redirection mechanism of system calls for executing Unix binaries. When a user process (task) of Unix issues a system call, the Mach microkernel sends back a message to the same task (Figure 3 (b)). The user task includes the support module that receives the message from the microkernel, and performs an RPC to the remote Unix server. The Mach microkernel has a similar mechanism for handing page faults. This mechanism is called an external pager [GDFR90].
To implement the system call redirection, we use the process trace facility of Linux at first. In Linux, if a parent (or tracing) process specifies PTRACE_SYSCALL to the system call ptrace(), the child (or traced) process continues execution as for PTRACE_CONT1, but it will stop on entry or exit of the system call.
When the child process is stopped, the parent process can examine and modify the CPU registers and arguments or results of the system call. Solaris and other Unix System V also have a similar facility through the /proc filesystem.
The procedure for redirecting system calls is similar to that for handing privileged and non-privileged instructions described in Section 3.1.1. When a user process of a user-level OS issues a system call, the process of the host OS is stopped on entry of the system call (Figure 3 (c)). The partial emulator changes the system call number with an illegal one and continues the execution. Next, the host OS tries to execute the body of the system call in a regular way. However, the host OS cannot execute it because the system call number is wrong. Therefore, the host OS sets an error value and notifies the partial emulator of exiting of the system call. Next, the partial emulator prepares an interrupt frame on the user-level OS. Finally, the partial emulator changes the program counter and switches to the user-level OS.
Page faults (SIGSEGV) are handled in a similar way as system calls. A difference is that the partial emulator has to send a signal to get values of some control registers when the host OS is Linux. In the partial emulator described in Section 3.1.2, the signals (SIGSEGV) are handled by the partial emulator code in the user-level OS. Therefore, we can reduce the overhead of context switches between the user-level OS and the partial emulator.
In IA-32, operations of MMU (Memory Management Unit) are performed thorough the register cr3 (Control Register 3). IA-32 uses two-level page tables. The register cr3 holds the physical address of the top level page table called Page Directory [Int97].
To emulate MMU and build address spaces for the user-level OS and its user processes, we use the system calls mmap() and munmap(). When the content of the register cr3 is changed, the partial emulator first compares each entry of the new page table with that of the previous page table. If a new page table entry no longer has a page, the partial emulator unmaps the corresponding page with the system call munmap(). If a new page table entry has a new page, the partial emulator maps the page with the system call mmap(). Otherwise, the partial emulator does nothing.
To emulate MMU with the system calls mmap() and munmap(), we have to solve the following problem in the insertion method (Section 3.1.1). These system calls should be invoked by the process of the user-level OS. These system calls manipulate the address space of the issuing process itself, and the parent or tracing process cannot manipulate the child or traced process.
To solve this problem, we embedded a support module in the address space of the user-level OS. Figure 4 shows the address space of the user-level OS. The lower regions of the address space are for user processes of the user-level OS (Text, Data, and Stack). The upper end is used by the host OS, and cannot be used by the user-level OS. Below the host OS region, there is a region for the user-level OS. We allocate a space for the mmap/munnap module between the host OS region and the user-level OS region.
When the partial emulator detects the manipulation of MMU (setting a value to the register cr3), the partial emulator compares the old and the new page table. Based on the differences between the old and the new page table, the partial emulator makes an issuing plan of the system calls mmap() and munmap(), and stores the plan to the region of the emulator code and data in the address space of the user-level OS (Figure 4). After that, the partial emulator changes the program counter of the child process (the process executing the user-level OS) to the code of the partial emulator in the address space of user-level OS, and switches to the user-level OS. In the user-level OS, the emulator code issues the system calls mmap() and munmap() according to the plan. The partial emulator (the parent process) does not intercept these system calls and allows passing them to the host OS. Finally, the code executes a special instruction (int $3) to switch to the partial emulator. The partial emulator changes the program counter to the next instruction of the MMU operation, and continues execution.
We have described the procedure for the insertion method (Section 3.1.1). In the rewriting method (Section 3.1.2), MMU emulation is performed by the subroutines in the same address space as the user-level OS. The partial emulator in the separated address space does nothing about MMU emulation.
In both procedures, comparing the old and the new page table and issuing the system calls mmap() and munmap() are heavy tasks, and they are big sources of performance degradation. We will show the experimental results about MMU emulation in Section 6.
The current partial emulator provides minimum peripheral devices: the keyboard for input, the video RAM for console output, and the timer for periodic interrupt.
For persistent storage, we have developed a small device driver that is derived from the memory disk driver of NetBSD. The memory disk of NetBSD is a block device, and it does not use interrupt for I/O. Instead, the memory disk reads and writes a memory reason in the kernel. We have changed these operations with the partial emulator calls (Section 3.1.1) or the system calls for the host OS (Section 3.1.2).
At first, we settled our goal to implement our partial emulator to run NetBSD on Linux. We chose NetBSD because it supports many hardware architectures, and each architecture-specific part seems small. We chose Linux because Linux already had User Mode Linux, so it was obvious that Linux has enough facilities for running a user-level OS. However, porting the i386-dependent part of NetBSD to the Linux architecture was not easy for us. Therefore, we have developed the techniques with partial emulation and rewriting of machine instructions.
After we had succeeded in running NetBSD on Linux, we started running NetBSD on NetBSD. We found that the unmodified NetBSD does not provide enough facilities to implement our partial emulator. The ktrace facility of NetBSD can be used to record events of entering and exiting of system calls. However, the ktrace facility does not allow stopping traced processes and changing registers and memory.
To run NetBSD on NetBSD, we decided to add a new facility to the host NetBSD kernel. The basic task is to introduce the PTRACE_SYSCALL facility of Linux to NetBSD. In the following subsections, we will show our modifications to NetBSD 1.6.1 in detail.
We have modified the following three files in the core part of NetBSD (the architecture-independent part of NetBSD):
We have added the following flag to the struct proc in proc.h:
#define P_SYSTRACED 0x800000 /* System call is traced. */
We have added the following request for the system call ptrace() in the file ptrace.h:
#define PT_SYSCALL 12 /* Continue and stop at the next (return from) syscall. */
The flag and the request are used by the function sys_ptrace() of sys_process.c in the core part and the function syscall_plain() of syscall.c in the i386 architecture-specific part.
To the function sys_ptrace() in sys_process.c, we have added some lines for the request PT_SYSCALL next to the request PT_CONTINUE. The essential difference between PT_SYSCALL and PT_CONTINUE follows:
if (SCARG(uap, req) == PT_SYSCALL) { SET(t->p_flag, P_SYSTRACED); } else { CLR(t->p_flag, P_SYSTRACED); }
If the request for the system call ptrace() is PT_SYSCALL, we set the flag P_SYSTRACED to the traced process. Otherwise, we clear the flag.
In addition to the above modification, we have defined a new function as follows:
process_systrace(p) struct proc *p;This function is called from the entry point and the exit point of the system call when the flag P_SYSTRACED is set. This function stops the current process (the traced process) as if it would receive the signal SIGTRAP. If the tracing process is sleeping typically by issuing the system call wait(), the current process wakes up the tracing process.
We have modified the following three files in the i386-specific part of NetBSD:
We have inserted the invocation of the function process_systrace() to the function syscall_plain() in the file syscall.c, as shown in Figure 5. This function fetches the system call number in the register eax and calls the body of the system call by consulting the jump table in the p->p_emul->e_systent. Before getting the system call number from the register eax, we check if the flag P_SYSTRACED is set. If it is set, we call the function process_systrace() which is described in Section 4.1. The same function is also called at the end of the function before returning to the user mode.
Moreover, we have changed the files process_machdep.c and reg.h, and extended struct reg for the PT_GETREGS request of the system call ptrace(). In addition to regular registers for debugging, we need other values in control registers and the trap frame. For example, the register cr2 in PCB (Process Control Block) and the trap number are needed by the partial emulator. In Linux, we used a signal facility to get these values. If we set the program counter to an illegal address, the process receives a signal. At this time, the values that are needed by the partial emulator are pushed on the stack. In NetBSD, we did not use a signal facility. Instead, we extended struct reg in the request PT_GETREGS of the system call ptrace().
void syscall_plain(frame) struct trapframe frame; { ... p = curproc; if (ISSET(p->p_flag, P_SYSTRACED)) { CLR(p->p_flag, P_SYSTRACED); process_systrace(p); } code = frame.tf_eax; callp = p->p_emul->e_sysent; ... code &= (SYS_NSYSENT - 1); callp += code; ... error = (*callp->sy_call)(p, args, rval); ... if (ISSET(p->p_flag, P_SYSTRACED)) { CLR(p->p_flag, P_SYSTRACED); process_systrace(p); } userret(p); } |
In the implementation of a user-level OS, the final goal is to generate the user-level OS from the corresponding native OS for the bare hardware automatically. However, we had to slightly modify the native NetBSD and FreeBSD. In this section, we show the modifications to NetBSD and FreeBSD for running them as user-level OSes.
We ran NetBSD 1.5.2 as a user process by changing 6 constants to adjust the address space and removing device drivers from the configuration file. The base address of NetBSD is changed from 0xc0000000 to 0xa000000 because the memory region after 0xc0000000 is occupied by the host operating system kernel.
Note that this modification is achieved without detailed knowledge about the NetBSD kernel. This is the significant difference from conventional user-level OSes, such as User Mode Linux. User Mode Linux requires adding a new architecture called um. The code size under the um directory is 33,000 lines, and this is comparable with the code size of the native i386 architecture (44,000 lines). This porting may cause a maintenance problem. When the native i386 architecture gets a new facility, the um architecture has to catch up the facility manually. In contrast, the core of our user-level NetBSD is automatically generated from the native i386 NetBSD. Therefore, we can follow the evolution of native i386 NetBSD more easily.
We have also executed the FreeBSD 4.7 kernel as a user process. In addition to address constants, we have changed the places that call BIOS. We have simply commented out the places and replaced with the code that returns parameters, such as the size of memory and the type of CPU. Since we did not have BIOS code, changing the FreeBSD kernel was much easier than implementing BIOS. Furthermore, changing the kernel reduces the effort to implement the partial emulator. Calling BIOS requires emulation of the virtual 8086 mode of Pentium. Our partial emulator does not have that facility. If we have BIOS code and a more powerful emulator, we do not have to modify these places.
We made experiments to measure the performance of our user-level OS (NetBSD 1.5.2). In this section, we show the results of microbenchmarks and an application benchmark.
All experiments were performed on a PC with a Pentium III 1GHz and 512M bytes of main memory. The host operating system for our partial emulator is Debian 3.0 with the Linux kernel 2.4.20 2.
As microbenchmarks, we use the following user programs:
In these experiments, we measured the peak performance. In each execution, the task is repeated from 100 to 10,000,000 times. The number of trials is determined to be high enough to reach several tents of milliseconds to several seconds. The execution times were obtained using the system call gettimeofday() for the host OS, and divided by the number of iterations.
The result is shown in Table 1. In Table 1, ``PE-'' sands for our partial emulator. ``Insert'' means the insertion method (Section 3.1.1), and ``Rewrite'' means the rewriting method (Section 3.1.1), respectively. For reference, we include the results of the following operating systems and environments:
For the program loop, both of our partial emulators (``PE-Insert'' and ``PE-Rewrite'') produced almost same performance as the physical PC because the measured part of the benchmark program is executed by the real CPU directly. By the same reason, our user-level NetBSD is faster than NetBSD on Bochs by a factor of 100.
In the cases of getpid, pipesw and fork, our partial emulator is slower than the physical machine by a factor of 100 and VMware by a factor of 10. This slowdown is cased by overheads of the system call redirection (Section 3.2) and the MMU emulation (Section 3.3). We got performance improvement by a factor of 2.8 to 5.9 between ``PE-Insert'' and ``PE-Rewrite''.
We cannot simply compare our partial emulator with User Mode Linux and Plex86 because the user-level OSes are different. As shown in Table 1, NetBSD/Physical is slower than Linux/Physical in those microbenchmarks. If we ignore the difference of user-level OSes, NetBSD on our partial emulator of ``PE-Rewrite'' is faster than User Mode Linux in the cases of getpid and fork. Our partial emulators are slower than Plex86 because Plex86 uses an efficient hardware mechanism called PVI, as described in Section 2.
program | ||||
OS/Environment | loop | getpid | pipesw | fork |
(n sec) | (u sec) | (u sec) | (m sec) | |
NetBSD/PE-Insert/Linux | 2.04 | 136 | 2880 | 89.9 |
NetBSD/PE-Rewrite/Linux | 2.00 | 23.0 | 1030 | 19.0 |
NetBSD/Physical | 1.99 | 0.360 | 19.8 | 0.380 |
NetBSD/Bochs/NetBSD | 288 | 68.0 | 1600 | 34.9 |
NetBSD/VMware/Linux | 2.01 | 3.53 | 83.7 | 2.550 |
Linux/Physical | 1.99 | 0.299 | 5.53 | 0.114 |
User Mode Linux/Linux | 1.99 | 44.1 | 665 | 31.7 |
Linux/Plex86/Linux | 1.99 | 20.0 | 346 | 1.76 |
We ran the make command for compiling the GNU patch command (Version 2.5.4), and measured the execution times. The source code of the patch command consists of 15 C files and 17 headers. Total length of those files is 9200 lines or 244 k bytes 4. The result is shown in Table 2.
NetBSDs on our partial emulators were faster than NetBSD on Bochs by a factor of 10. However, they were slower than NetBSD on the physical PC and VMware by a factor of 15 and 4, respectively. The ratios are smaller than the results of the microbenchmarks getpid, pipesw, and fork because this application benchmark includes a CPU workload. Our user-level NetBSD is slower than User Mode Linux and Linux/Plex86 because of the MMU emulation overhead.
Our projects began on June 2002, and we have many tasks to be accomplished. Those tasks are classified into two categories:
The first priority task on functionality is to add a networking facility. We are implementing a pseudo serial device for communicating between a user-level OS and a host OS. This serial device can be used for passing PPP packets. We also have a plan to implement an Ethernet-like device.
We are also interested in a function to access host file systems, as similar to the host file system of User Mode Linux. A straightforward implementation method is to insert a module to the VFS layer while we have to implement the module for each host OS. By using the networking facility, we can access host file systems through the NFS protocol and the SMB protocol.
As shown in Section 6, the main sources of overhead are the MMU emulation and the system call/page fault redirection.
To enhance the MMU emulation, we can cache address spaces as user processes of the host OS. As similar to the hardware context table of SPARC [SPA92], we can preserve and reuse user processes as page tables. In other words, the partial emulator forks when the MMU register of the page table gets a new value. If we cache page tables, we have to discard unused page tables or user processes of the host OS. If some LRU algorithm works well, we do not have to modify the user-level OS. Otherwise, we should modify the user-level OS to invalidate the cache on termination of its user processes.
We are also studying to introduce a new kernel level abstraction called a virtual page table. With this facility, a user-level OS can manipulate its page tables by issuing a new system call for the host OS. Unlike regular system calls, this system call for virtual page tables should depend on underlying hardware because we can preserve the structure and semantics of the base native OS. If we use a different facility, such as the external pager of the Mach microkernel, we have to change the base native OS more.
In the current implementation, the partial emulator does not handle the segment facility of IA-32 completely because most operating systems including NetBSD do not make use of the segment facility. The partial emulator interprets only address translation and write protection in the two-level page table. The partial emulator does not interpret other bits, such as Accessed and User/Supervisor for performance. Therefore, a user-level OS cannot perform page replacement efficiently. Moreover, the partial emulator does not change the memory protection on switching from the kernel mode to the user mode for performance reason. In IA-32, most operating systems including NetBSD do not change the MMU setting when the context is transferred from the user mode to the kernel mode or vice versa. Therefore, user processes can access the kernel memory. We would like to add a protection facility of the kernel memory after implementing the caching facility.
In this paper, we have proposed an implementation method of user-level operating systems based on partial emulation of PC hardware and rewriting of machine instructions at compile time. Unlike conventional methods, user-level operating systems are generated from the native operating systems. Therefore, no detailed knowledge about the native operating systems is needed to implement the user-level operating system. Based on the proposed method, we have executed NetBSD and FreeBSD kernels as user processes on Linux and NetBSD with small changes from the corresponding native systems.
The partial emulator on Linux can be used for running NetBSD and FreeBSD applications on Linux. We have nested operating system environments for NetBSD, so we can use several versions of NetBSD co-exist on the same platform.
We hope that our partial emulator will be one of the most popular tools for nested BSD operating systems. For developers, nested operating systems will be an essential facility to experiment with new kernels or new release while keeping the base environment safely. We are working on adding a networking facility to our partial emulator. With the networking facility, we can execute Internet servers on user-level operating systems, and we can couple user-level operating systems with the host operating system more tightly.
https://plex86.sourceforge.net/
.
https://bochs.sourceforge.net/
.
This document was generated using the LaTeX2HTML translator Version 2K.1beta (1.47)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -link 1 -toc_depth 1 -local_icons -numbered_footnotes -auto_prefix -address 'BSDCon'03 2003/09/08-12' -image_type png -images -show_section_numbers -nonavigation shinjo.tex
The translation was initiated by Yasushi SHINJO on 2003-07-08
This paper was originally published in the
Proceedings of BSDCon '03,
September 812, 2003,
San Mateo, CA, USA
Last changed: 21 Aug. 2003 ch |
|