Nick L. Petroni, Jr.
William A. Arbaugh
Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742, USA
Purdue University, West Lafayette, IN 47907, USA
Recent advances in defensive technologies, such as external kernel integrity monitors [17,37,13,29] and code attestation/execution verification architectures [18,34,33], have demonstrated their ability to detect the kinds of tampering historically performed by rootkits. Unfortunately, rootkit technology has already moved to a more sophisticated level. While these defensive technologies have focused on the relatively straightforward task of detecting tampering in static and unchanging regions of kernel text and data structures--typical targets of the previous generation of rootkits--the new rootkit generation has evolved to more sophisticated tampering behavior that targets dynamic parts of the kernel. Seeking to avoid detection and subsequent removal from the system, clever intruders can hide their processes from legitimate administrators by modifying links in the Linux and Windows XP/2000 kernels' process tables. Because the state of the process table changes continuously during kernel runtime, identifying these modified links is difficult for the current generation of kernel integrity monitoring tools that focus only on static data. Although this targeting of dynamic data was not entirely unanticipated by researchers [37,13], there has yet to be a general approach for dealing with this threat.
In response to a continually advancing threat, we introduce an architecture for the runtime detection of semantic integrity violations in objects dynamically allocated in the kernel heap or in static objects that change depending upon the kernel state. This new approach is the first to address the issue of dynamic kernel data in a comprehensive way. In order to be effective against the latest rootkit technology, defensive mechanisms must consider both static and dynamic kernel data, as changes in either can lead to the compromise of the whole. We believe our approach provides an excellent complement to state of the art binary integrity systems.
Our approach is characterized by the following properties:
Specification-based. The previous generation's detection methods, which can be characterized by calculating hashes of static kernel data and text and comparing the results to known-good values, are not applicable to the continuously changing dynamic data structures now being targeted by rootkits.
Instead of characterizing a correct state using hashes, our architecture relies upon an expert to describe the correct operation of the system via an abstract model of low-level data structures and the relationships between them. This model is a simplified description of security-relevant data structures and how they interoperate. Additionally, part of the specification is a set of constraints that must hold at runtime in order for the system to remain correct with regard to the semantic integrity of the kernel.
Automatic. The architecture includes a compiler that
automatically translates the
high-level specification language into low-level
machine code to perform the checks. This automation allows experts to
maximize the use of their time writing the specification and
verifying its correctness, rather than writing low-level code.
Independent. Our architecture does not depend upon the correctness of the monitored kernel in order to detect that something is wrong. Instead, our approach relies on a trustworthy monitor that has direct access to kernel memory on the protected system.
Monitor agnostic. While our prototype implementation utilizes a PCI-based kernel monitor similar to Copilot as the low-level mechanism for accessing system memory, our architecture allows for the use of any monitor with access to kernel memory that can also provide isolation. Other possibilities include
software-based systems such as Pioneer or a
virtual machine introspection
approach. The focus of this
work is on the type of checks performed, not the mechanism
used to perform them. As such, our architecture is general enough to support different types of
monitors, both software- and hardware-based.
Extensible response. The architecture is designed to
allow specification writers to decide how the system should react
to the violation of a particular constraint. At a minimum, most cases will require administrator notification.
Currently, notification is the only response we have implemented. However, the architecture clearly admits other responses, particularly given the amount of forensic information available to our monitor.
We have demonstrated the feasibility of our approach by writing sample specifications for two different kernel subsystems in the Linux 2.6 kernel: the process (task) accounting system and the SELinux mandatory access control (MAC) system's access vector cache (AVC). We have tested the system's effectiveness at detecting real-world attacks on dynamic kernel data in each subsystem, including a publicly available rootkit for the Linux kernel. Our results show that low-level code based on our initial specifications successfully detects the example attacks, which include data-only process hiding and modifications of SELinux access control results directly in memory.
This section describes two examples of how intruders might, after gaining full administrative control of a GNU/Linux system, modify some of the kernel's dynamic data structures to their advantage. In the first example, an intruder removes tasks from the Linux kernel's all-tasks list in order to hide them from the system's legitimate administrators. In the second example, an intruder modifies an entry in the Linux kernel's SELinux access vector cache to temporarily elevate their privileges and disable auditing without making visible changes to the SELinux policy configuration. Note that neither of these examples expose flaws in the Linux kernel or its SELinux security module. These examples represent the potential acts of an intruder who has already gained full control of the system--perhaps by exploiting the trust or carelessness of the system's human operators in a manner entirely outside the scope of the system's technological safeguards.
Rootkits have evolved beyond the historical methods of hiding processes, which included modifying the text of the ps program to lie to legitimate administrators or causing the kernel itself to lie by replacing the normally-static values of kernel text or function pointers, such as the system call vector or jump tables in the /proc filesystem, with the addresses of malicious functions. Even the most sophisticated threats became easy to detect by monitors that could compare the modified values against a known-good value--after all, in a healthy system, these values should never change.
Unfortunately, attackers do not need to modify any kernel code to hide processes within a running kernel. In fact, they do not need to rely on manipulating the control flow of the kernel at all. Instead, adversaries have found techniques to hide their processes even from correct, unmodified kernel code. By directly manipulating the underlying data structures used for process accounting, an attacker can quickly and effectively remove any desired process from the view of standard, unmodified administrator tools. While the process remains hidden for accounting purposes, it continues to execute as normal and will remain unaffected from the perspective of the scheduler. To understand how this state is achieved, we provide a brief overview of Linux 2.6 process management.
The primary data structure for process management in the Linux kernel is the task_struct structure. All threads are represented by a task_struct instance within the kernel. A single-threaded process will therefore be represented internally by exactly one task_struct. Since scheduling occurs on a per-thread basis, a multi-threaded process is simply a set of task_struct objects that share certain resources, such as memory regions and open files, as well as a few other properties, including a common process identifier (PID), the unique number given to each running process on the system.
In a correctly-running system, all task_struct objects are connected in a complex set of linked lists that represent various groupings relevant to that task at a particular time. For accounting purposes, all tasks are members of a single doubly-linked list, identified by the task_struct.tasks member. This list, which we refer to as the all-tasks list, ensures that any kernel function needing access to all tasks can easily traverse the list and be sure to encounter each task exactly once. The head of the task list is the swapper process (PID 0), identified by the static symbol init_task. In order to support efficient lookup based on PID, the kernel also maintains a hash table that is keyed by PID and whose members are hash-list nodes located in the task_struct.pid structure. Only one thread per matching PID hash is a member of the hash table; the rest are linked in a list via the task_struct.pid member. Other list memberships include parent/child and sibling relationships and a set of scheduler-related lists discussed next.
Scheduling in the Linux kernel is also governed by a set of lists. Each task exists in exactly one state. For example, a task may be actively running on the processor, waiting to be run on the processor, waiting for some other event to occur (such as I/O), or waiting to be cleaned up by a parent process. Depending on the state of a task, that task will be a member of at least one scheduling list somewhere in the kernel. At any given time, a typical active task will either be a member of one of the many wait queues spread throughout the kernel or a member of a per-processor run queue. Tasks cannot be on both a wait queue and a run queue at the same time.
Primed with this knowledge of the internals of Linux process management, we now describe the trivial technique by which an attacker can gain the ultimate stealth for a running process. Figure 1 depicts the primary step of the attack: removing the process from the doubly-linked all-tasks list (indicated by the solid line between tasks). Since this list is used for all process accounting functions, such as the readdir() call in the /proc filesystem, removal from this list provides all of the stealth needed by an adversary. For an attacker who has already gained access to kernel memory, making this modification is as simple as modifying two pointers per hidden process. As a secondary step to the attack, adversaries might also choose to remove their processes from the PID hash table (not pictured) in order to prevent the receipt of unwanted signals.
As shown in Figure 1, a task not present in the all-tasks list can continue to function because the set of lists used for scheduling is disjoint from the set used for accounting. The dashed line shows the relationship between objects relevant to a particular processor's run queue, including tasks that are waiting to be run (or are currently running) on that processor. Even though the second depicted task is no longer present in the all-tasks list, it continues to be scheduled by the kernel. Two simple changes to dynamic data therefore result in perfect stealth for the attacker, without any modifications to static data or kernel text.
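The attack just described can be sketched in ordinary userspace C. The structures below are toy stand-ins for the kernel's list_head and task_struct (only the fields needed here), not the real definitions; the point is that hiding is two pointer writes and accounting traversals then never see the victim:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's list_head and task_struct. */
struct list_head { struct list_head *next, *prev; };

struct task_struct {
    int pid;
    struct list_head tasks;   /* linkage into the all-tasks list */
};

/* Head of the circular all-tasks list: the swapper task (PID 0). */
static struct task_struct init_task = {
    .pid = 0,
    .tasks = { &init_task.tasks, &init_task.tasks },
};

/* Link a task into the all-tasks list, as the kernel's list_add() would. */
static void task_list_add(struct task_struct *t)
{
    struct list_head *head = &init_task.tasks;
    t->tasks.next = head->next;
    t->tasks.prev = head;
    head->next->prev = &t->tasks;
    head->next = &t->tasks;
}

/* The attack's primary step: two pointer writes splice the victim out
 * of the accounting list. Its run-queue linkage (not modeled here) is
 * untouched, so the kernel continues to schedule it. */
static void hide_task(struct task_struct *t)
{
    t->tasks.prev->next = t->tasks.next;
    t->tasks.next->prev = t->tasks.prev;
    t->tasks.next = t->tasks.prev = &t->tasks;  /* keep victim self-consistent */
}

/* What accounting code (e.g., the /proc readdir path) effectively does:
 * walk the circular list once, visiting each linked task exactly once. */
static int count_tasks(void)
{
    int n = 1;  /* count init_task itself */
    for (struct list_head *p = init_task.tasks.next;
         p != &init_task.tasks; p = p->next)
        n++;
    return n;
}
```

After hide_task(), count_tasks() no longer reflects the hidden task, even though the object still exists and remains schedulable.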
The SELinux access vector cache provides a good example of this kind of capability and represents a potential target for an adversary seeking privilege escalation. This section describes the structure and purpose of the AVC and how an adversary might tamper with its state. Section 4 describes an experiment that demonstrates such tampering and the effectiveness of a prototype monitor for detecting this tampering.
SELinux is a security module for Linux kernels that implements a combination of Type Enforcement and Role-based mandatory access control, now included in some popular GNU/Linux distributions. During runtime, SELinux is responsible for enforcing numerous rules governing the behavior of processes. For example, one rule might state that the DHCP client daemon can only write to those system configuration files needed to configure the network and the Domain Name Service, but no others. By enforcing this rule, SELinux can limit the damage that a misbehaving DHCP client daemon might cause to the system's configuration files should it be compromised by an adversary (perhaps due to a buffer overflow or other flaw).
To enforce its rules, SELinux must make numerous decisions during runtime such as ``Does the SELinux configuration permit this process to write this file?'' or ``Does it permit process A to execute program B?'' Answering these questions involves some overhead, so SELinux includes a component called the access vector cache to save these answers. Whenever possible, SELinux rapidly retrieves answers from the AVC, resorting to the slower method of consulting the policy configuration only on AVC misses.
On our experimental system, the AVC is configured to begin evicting least frequently used entries after reaching a threshold of 512 entries. Our single-user system never loaded the AVC much beyond half of this threshold--although it was occasionally busy performing builds, these builds tended to pose the same small number of access control questions again and again. However, one could imagine a more complex multi-user system that might cause particular AVC entries to appear and disappear over time. Installations that permit SELinux configuration changes during runtime might also see AVC entries evicted due to revocation of privileges.
SELinux divides all resources on a system (such as processes and files) into distinct classes and gives each class a numeric Security Identifier or ``SID.'' It expresses its mandatory access rules in terms of what processes with a particular SID may and may not do to resources with another SID. Consequently, at a somewhat simplified abstract level, AVC entries take the form of tuples pairing a source SID, a target SID, and an object class with the allowed, decided, audit-allow, and audit-deny vectors for that pairing.
It is conceivable that adversaries who have already gained administrative control over a system might wish to modify the SELinux configuration to give their processes elevated privileges. Certainly, they could accomplish this most directly by modifying the SELinux configuration files, but such modifications would be easily detected by filesystem integrity monitors like Tripwire. Alternately, they might modify the in-kernel data structures representing the SELinux configuration--the same data structures SELinux consults to service an AVC miss. However, these data structures change infrequently, when administrators decide to modify their SELinux configuration during runtime. Consequently, any tampering might be discovered by a traditional kernel integrity monitor that performs hashing or makes comparisons with correct, known-good values.
The state of the AVC, on the other hand, is dynamic and difficult to predict at system configuration time. Entries come and go with the changing behavior of processes. An adversary might insert a new AVC entry or modify an old one to effectively add a new rule to the SELinux configuration. Such an entry might add extra allowed and decided field bits to grant additional privileges, or remove existing audit-allow and audit-deny field bits to turn off troublesome logging. Such an entry would override the proper in-memory and on-disk SELinux configuration for as long as it remained in the cache. On a single-user installation like our experimental system, it would face little danger of eviction. On a busier system, frequent use might keep it cached for as long as needed.
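The tampering just described can be illustrated with a sketch. The structure below is a toy shaped after the fields named in the text, not the real SELinux avc_node/av_decision layout, and FILE__WRITE is a hypothetical permission bit chosen for illustration:

```c
#include <stdint.h>

/* Toy AVC entry with the fields discussed in the text. */
struct avc_entry {
    uint32_t ssid, tsid, tclass;       /* source SID, target SID, class */
    uint32_t allowed, decided;         /* cached decision vectors */
    uint32_t auditallow, auditdeny;    /* cached audit vectors */
};

#define FILE__WRITE 0x4u  /* hypothetical permission bit */

/* The in-memory tampering described above: grant an extra permission
 * and silence logging by flipping bits directly in the cached entry. */
static void tamper_entry(struct avc_entry *e)
{
    e->allowed   |= FILE__WRITE;   /* permission now granted... */
    e->decided   |= FILE__WRITE;   /* ...and marked as already decided */
    e->auditdeny &= ~FILE__WRITE;  /* no audit record for this permission */
}
```

As long as the modified entry stays cached, these three bit operations override the proper on-disk and in-memory policy without touching either.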
As shown in Figure 2, the first four of these are runtime components that work together to assess the integrity of a running kernel based on the input specification. The specification compiler is an offline component used only at the time of system setup or when specification updates are required. The primary logic of the monitor is driven by the constraint verifier, which iterates through all constraints to verify each in order. To facilitate the verification of each constraint, the verifier requests a consistent subset of the model from the model builder, which either has the information readily available or uses the low-level monitor to re-build that portion of the model. If a constraint passes, the verifier simply continues to the next. Failed constraints cause the verifier to dispatch a response mechanism according to the specification.
We now describe several aspects of the system in more detail, focusing primarily on the requirements for each component.
It should be noted that model specifications correspond to a particular version (or versions) of the kernel. Therefore, as updates are made to kernel subsystems, so must the specification be updated. However, once a specification is written for a given kernel version, it can be shared and used at any deployed location. Furthermore, the specification compiler takes into account site-specific kernel configuration and symbol information to allow more widespread use of the specification. Finally, the relationships described in the specification will not change frequently and, even when they do change, will rarely change significantly enough to invalidate the entire specification. Tools for automating and improving the specification process are an area for future work.
Similar to the model specification, the constraints that must hold for a system may change when kernel developers make changes. However, like model specifications, constraints can be distributed for use at any deployment where a given model is valid.
For a system that is externally analyzing a running kernel, the design of the model builder is non-trivial due to the complications of constantly changing data within the kernel. The assumptions that can be made by the model builder are closely tied to the properties of the low-level monitor. However, assuming a monitor that is running asynchronously relative to the protected kernel, the following is a minimal set of design considerations for the model builder and specification compiler components:
We begin our discussion by describing our specification language, an adaptation of that presented by Demsky and Rinard, in the context of our Linux process accounting example.
Low-level Structure Definition: The structure definition language provides C-like constructs for describing the layout of objects in memory. Demsky and Rinard provide a few additions to the normal C language syntax. First, fields may be marked ``reserved,'' indicating that they exist but are not used. Second, array lengths may be variable and determined at runtime through expression evaluation. Third, a form of structure ``inheritance'' is provided for notational simplicity whereby structures can be defined based on other structures and then expanded with additional fields. We found no need to change the structure definition language syntax developed by Demsky and Rinard. However, it was necessary to adapt the language's semantics in two important ways because of the ``external'' nature of our monitor.
First, named structure instances, which are also declared in the structure definition language, cannot be resolved because our monitor is not part of the normal software linking process. Instead, we must use an external source for locating variables. Our current implementation allows the user to provide these locations manually or to have them extracted automatically from a Linux System.map symbol table file. The second semantic modification necessary for the structure definition language is the handling of pointer values, which are not ``local'' to our monitor. Instead, pointers must be treated as foreign addresses accessed through the monitor's memory access mechanism.
Figure 3(a) contains
our specification of the Linux kernel's process accounting data
structures written in the structure definition language.
Figure 3(b) contains the result of a manual translation
from this specification into the corresponding C declarations that
will become part of the monitoring code. Note the use of the host_addr_t type to represent host addresses after byte-order conversion
on the monitor. As described above, the appropriate value for the LINUX_SYMBOL_init_task constant (and other required symbols) is
automatically extracted from the Linux System.map symbol table file by our configuration tool.
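The pointer-handling semantics just described can be sketched as follows. Here host_addr_t, monitor_read(), and the 32-bit protected host are assumptions for illustration; monitor_read() stands in for the monitor's real memory-access mechanism (e.g., DMA reads over the PCI bus) and is backed here by a fake memory image so the sketch is self-contained:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Kernel addresses are foreign to the monitor: they are never
 * dereferenced locally, only passed to the access mechanism. */
typedef uint64_t host_addr_t;

/* Stand-in for host physical memory, for demonstration only. */
static uint8_t fake_kernel_mem[64];

/* Hypothetical low-level accessor: copy len bytes of kernel memory at
 * addr into the monitor-local buffer out. A real implementation would
 * issue reads over the PCI bus or another isolated channel. */
static int monitor_read(host_addr_t addr, void *out, size_t len)
{
    memcpy(out, fake_kernel_mem + addr, len);
    return 0;
}

/* Fetch a pointer-sized field from the host. On a 32-bit protected
 * host the field is 4 bytes; byte-order conversion would be applied
 * here if the monitor's endianness differed from the host's. */
static host_addr_t read_pointer(host_addr_t field_addr)
{
    uint32_t raw = 0;
    monitor_read(field_addr, &raw, sizeof raw);
    return (host_addr_t)raw;
}
```

Traversing a kernel list from the monitor then becomes a chain of read_pointer() calls on foreign addresses rather than ordinary C pointer dereferences.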
Model Space Definition: The second language, shown in
Figure 3(c) for our process accounting example, defines
a group of sets or relations (there are no relations in our first example)
that exist in the model. There are two sets
in our specification: one corresponding to all processes in the
all-tasks list (the AllTasks set) and one corresponding to all
processes in the run queue (the RunningTasks set). Both are of type Task in the model. We made no modifications to this simple
language, as all of our example specifications were able to be
expressed in the context of sets and relations. The model space
definition language provided by Demsky and Rinard also provides
support for set partitions and subsets.
Model Building Rules: Thus far we have discussed languages for describing the low-level format and organization of data in kernel memory and for declaring the types of high-level entities we will use in our model. The model building rules bridge the gap between the two by identifying which low-level objects should be used within the abstract model. Each rule quantifies over a set of low-level objects and pairs a guard with an inclusion: for every object the rule reaches, if the guard evaluates to true, the inclusion adds the corresponding entity to a named set or relation of the model.
Figure 3(d) shows the model rules for our process
accounting example. The first rule indicates that a circular list
starting (and ending) at init_task.tasks.next will be
processed. The keyword true in the guard indicates that all
members of this list should be subject to the inclusion. The inclusion
itself uses our container() expression to locate the Task
that contains the list pointer and to include that Task in
AllTasks. The second rule is very simple; it creates a singleton set RunningTasks with the current task running on the run queue.
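The container() expression used by the first rule behaves like the kernel's own container_of() macro: given a pointer to an embedded list field, it subtracts the field's offset within the enclosing structure to recover the object itself. A minimal sketch with a toy Task:

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
struct task { int pid; struct list_head tasks; };

/* Recover the enclosing object from a pointer to one of its members,
 * as the kernel's container_of() does: cast to char* and subtract the
 * member's offset within the containing type. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
```

This is what allows a rule walking a list of embedded list_head nodes to include the surrounding Task objects, rather than the bare link fields, in the AllTasks set.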
Constraints: The final part of the specification defines the set of constraints under which the model is to be evaluated. In its basic form, a rule in Demsky and Rinard's constraint language is a sequence of quantifiers over the model's sets followed by a logical predicate that must hold for every binding of the quantified variables.
Figure 3(e) shows the single constraint rule for our hidden process example. The rule states that if any process is ever seen running on the processor that is not in the all-tasks list, we have a security problem and need to alert the administrator. This example describes a relatively simple method of detecting hidden processes. In order to detect a hidden process, the monitor must catch the process while it has the host CPU--a probabilistic strategy that is likely to require the taking of many snapshots of the host's state over time before the hidden process's luck runs out. A more deterministic approach might be to compare the population of the kernel's numerous wait and run queues with the population of the all-tasks list. In order to be eligible for scheduling, a process must be on one of these wait or run queues; a process on a wait or run queue but not in the all-tasks list is hiding. This strategy would require a more complex model specification.
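The queue-based strategy just mentioned amounts to a subset check. The following sketch (toy structures and a flattened queue representation, not the kernel's real layout) checks that every task observed on a run or wait queue also appears on the all-tasks list:

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };
struct task { int pid; struct list_head tasks; };

/* Walk the circular all-tasks list once, looking for t's linkage. */
static bool on_all_tasks_list(struct list_head *head, struct task *t)
{
    for (struct list_head *p = head->next; p != head; p = p->next)
        if (p == &t->tasks)
            return true;
    return false;
}

/* The constraint: every task seen on a run (or wait) queue must also
 * be in the accounting list; anything else is a hidden process. */
static bool constraint_holds(struct list_head *all_tasks,
                             struct task **queued, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!on_all_tasks_list(all_tasks, queued[i]))
            return false;  /* violation: notify the administrator */
    return true;
}
```

Because membership on a scheduling queue is a precondition for execution, this check is deterministic rather than probabilistic: a data-only hidden process that is still runnable must fail it.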
There are four primary entities in our SELinux specification: the security identifier table (of type SIDTab), the access vector cache (an AVCCache), the Type Enforcement access vector table (an AVTab), and its counterpart the Type Enforcement conditional access vector table (also an AVTab). The model definition rules first create a set of SIDs by walking through the SID table and then, similarly, create a set of all AVC nodes from the AVC. The third and fourth rules are used to create mappings between the AVC nodes and their source and target SIDs. Rules five and six look up each AVC node in the full Type Enforcement policy for both conditional and non-conditional access vector tables. The final two model definition rules create a mapping between AVC nodes and their corresponding entries in the Type Enforcement access vector tables. The single constraint rule simply walks through all AVC nodes and checks that the allowed field matches the combined (bitwise OR) value of the two corresponding Type Enforcement access vector entries for that AVC node. As with the last example, an administrator is notified if the data structures are found to be inconsistent.
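The single constraint rule can be sketched directly. The structure below is a toy (the real avc_node and avtab entries carry more state); the check itself mirrors the rule as stated, comparing the cached allowed vector against the bitwise OR of the matching unconditional and conditional access vector table entries:

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy cached decision: source SID, target SID, class, allowed vector. */
struct avc_node { uint32_t ssid, tsid, tclass, allowed; };

/* The constraint: the cached allowed vector must equal the bitwise OR
 * of the corresponding unconditional and conditional Type Enforcement
 * access vector table entries for this (ssid, tsid, tclass) triple. */
static bool avc_node_consistent(const struct avc_node *n,
                                uint32_t avtab_allowed,
                                uint32_t cond_avtab_allowed)
{
    return n->allowed == (avtab_allowed | cond_avtab_allowed);
}
```

A cache entry whose allowed bits were inflated in memory, as in the attack of Section 2.2, fails this equality even though the policy tables themselves are untouched.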
We have tested our code against an attacking loadable kernel module that modifies the permissions for a particular AVC entry. A rootkit might make such a modification to temporarily elevate the privileges of one or more processes in a manner that could not be detected by an integrity monitor that observed only static data structures. Our specification successfully detects the attack against our Fedora Core 4 system configured with the default SELinux ``targeted'' policy operating in ``enforcing'' mode.
Currently, there are two methods for identifying data properties and writing their corresponding specifications: (1) analyzing and generalizing from known threats and (2) deriving data properties and specifications from a high-level English-language security policy. In the analysis of known threats, the goal is to classify the techniques used by adversaries in previous attacks and to generalize from them. The result is the identification of a set of data invariants that may be violated by future attacks. Of course, this approach permits the possibility that new attacks may avoid detection by exploiting only those details of the kernel abstracted out of the specification, leading to an interminable "arms race" between attackers and specification-writers. Nevertheless, this approach is still better than the traditional signature-based virus-scanning approach in that each specification has the potential to detect an entire class of similar attacks, rather than only a single instance.
It may be possible to avoid such an arms race by using an alternate approach: deriving specifications from a high-level English-language security policy rather than from an analysis of known attacks. In this approach, an analyst might begin with a policy such as "no runnable processes shall be hidden" or "my reference monitor enforces my particular mandatory access control policy" and then examine the kernel source to determine which data structures have relevant properties and what those properties should be in order for the high-level policy to hold. The analyst's task is similar to constructing a formal argument for correctness, except that the end result is a configuration for a runtime monitor.
Section 4 presents two examples of the types of
specifications one might obtain as a result of the methodologies just
described. Using these techniques, we have identified three
classes of attacks against dynamic kernel data. While it is likely
there are other classes of attacks, we believe the three identified
thus far provide evidence of the potential success of our
approach. The following are the attack classes we have identified:
Data hiding attacks. This class of attacks was
demonstrated in Section 2.1 with the Linux process
hiding example. The distinguishing characteristic of this class is
the removal of objects from data structures used by
important kernel subsystems for accounting and reporting. Writing
specifications capable of detecting these attacks requires
identifying data structures that are used by kernel resource
reporting procedures such as system calls and, in the case of Linux,
the /proc filesystem.
Capability/access control modification attacks. One
of the fundamental goals of kernel attackers is to provide their
processes with privileges and access to resources. To this end,
process capabilities in the form of tokens, flags, and
descriptors are likely targets of an attacker with kernel memory
access. In addition to the SELinux AVC example, described in
Section 2.2, we have identified user/group
identifiers, scheduler parameters (e.g., nice value), and POSIX
capabilities as potential targets. We are actively writing
specifications to protect this data.
Control flow modification attacks. One popular
technique for gaining control of kernel functionality is
the modification of function pointers in dynamic data structures
such as those associated with the virtual filesystem (VFS) and /proc filesystem. As demonstrated by popular rootkits like adore-ng, manipulating these pointers provides attackers with a
``hook'' to execute their inserted code. While previous generations of
kernel integrity monitors have demonstrated effective detection of
hooks placed in static data (e.g., the system call table), dynamic
function pointers have remained an elusive target. We are actively
writing a large number of simple specification rules to test the
validity of kernel pointers throughout dynamic data. Additionally,
we intend to investigate the use of automated tools to make this
process easier and more complete.
Unlike misuse detection systems, our specification-based approach allows for the identification and detection of classes of attacks without a priori knowledge of particular instances of threats.
The types of checks performed by these systems are not incorrect or without value. Rather, they provide a foundation that our approach aims to extend, broadening the set of kernel attacks detectable from such platforms.
Traditional attestation systems verify only binary properties of static code and data. In such systems, the only runtime benefit provided is the detection of illegal modifications that utilize well-documented transitions or interfaces where a measurement has explicitly been inserted before the malicious software was loaded. Unfortunately, attackers are frequently not limited to using only these interfaces .
Haldar et al. have proposed a system known as ``semantic remote attestation'' in an attempt to extend the types of information the verifying party can learn about the attesting system. Their approach is to use a language-based trusted virtual machine that allows the measurement agent to perform detailed analysis of the application rather than simple binary checksums. The basic principle is that language-based analysis can provide much more semantic information about the properties of an application. Their approach does not extend to semantic properties of the kernel and, since their VM runs on top of a standard kernel, there is a requirement for traditional attestation to bootstrap the system.
Verifiable code execution is a stronger property than attestation, whereby a verifier can guarantee that a particular piece of code actually runs on a target platform. This contrasts with traditional attestation, where only the loading of a particular piece of software can be guaranteed; once that software is loaded, however, it could theoretically be compromised by an advanced adversary. With verifiable code execution, such a modification should not be possible without detection by the verifier. Both hardware-based [5,35] and, more recently, software-based systems have been proposed.
Verifiable code execution is a promising direction for ensuring that the correct code is run on a potentially untrusted platform. As shown by Seshadri et al., such a system could be used as the foundation for a kernel integrity monitor. We therefore view verifiable code execution as a potential monitor extension for our architecture.
In similar work, Nentwich and others have developed xlinkit, a tool that detects inconsistencies between distributed versions of collaboratively-developed documents structured in XML. It does so based on consistency constraints written manually in a specification language based on first-order logic and XPath expressions. These constraints deal with XML tags and values, such as ``every item in this container should have a unique name value.'' In later work, they describe a tool which analyzes these constraints and generates a set of repair actions. Actions for the above example might include deleting or renaming items with non-unique names. Human intervention is required to prune repair actions from the list and to pick the most appropriate action from the list at repair time.
We have designed the semantic integrity architecture to be easily extended to other monitor platforms. Two of the most promising such platforms include virtual machine monitors [13,12] and software-based monitors achieved via verifiable code execution. These systems provide the possibility for unique extensions such as the inclusion of register state into specifications and the benefit of added assurance without the need for extra hardware. It is our intention to extend our work to at least one such software-based monitor.
A second avenue of work we intend to pursue is that of additional response vectors. Having an independent monitor with access to system memory and a system for easily interpreting that memory can provide a huge amount of leverage for advanced response. Perhaps the most significant potential for work is the advancement of automated runtime memory forensics.
Finally, as with all security systems, having a good policy is very important for the success of the system. Our current architecture requires experts with advanced knowledge of kernel internals to write and verify specifications. Developing tools to help automate the process, including a number of kernel static analysis tools, could significantly improve this process. We intend to investigate techniques for analyzing kernel properties automatically, both statically and at runtime.
We have introduced a novel and general architecture for defining and monitoring semantic integrity constraints--functionality required to defeat the latest generation of kernel-tampering rootkit technology. For our initial prototype implementation, we have adapted Demsky and Rinard's specification languages for implementing internal monitors for application data structures to the task of implementing external monitors for operating system kernel data structures. This adaptation required adding features to their specification languages to overcome a number of issues not present in the original application problem domain, including managing memory transfer overhead and providing for flexible responses to detected compromises.
Our general architecture is applicable to a variety of low-level monitoring technologies, including external hardware monitors, software-based monitors, and virtual machine introspection. We believe our approach is the first to address the issue of monitoring the integrity of dynamic kernel data in a comprehensive way, and we believe it will provide an excellent complement to present state-of-the-art binary integrity systems.