LISA '06 Paper
Windows XP Kernel Crash Analysis
Archana Ganapathi, Viji Ganapathi, and David Patterson
- University of California, Berkeley
Pp. 149-159 of the Proceedings of LISA '06:
20th Large Installation System Administration Conference (Washington, DC:
USENIX Association, December 3-8, 2006).
Abstract
PC users have started viewing crashes as a fact of life rather
than a problem. To improve operating system dependability, systems
designers and programmers must analyze and understand failure data. In
this paper, we analyze Windows XP kernel crash data collected from a
population of volunteers who contribute to the Berkeley Open
Infrastructure for Network Computing (BOINC) project. We found that OS
crashes are predominantly caused by poorly-written device driver code.
Users as well as product developers will benefit from understanding
the crash behaviors elaborated in this paper.
Introduction
Personal Computer (PC) reliability has become a rapidly growing
concern both for computer users as well as product developers.
Personal computers running the Microsoft Windows operating system are
often considered overly complex and difficult to manage. As modern
operating systems serve as a confluence of a variety of hardware and
software components, it is difficult to pinpoint unreliable
components.
Such unconstrained flexibility allows complex, unanticipated, and
unsafe interactions that result in an unstable environment often
frustrating the user. To troubleshoot recurring problems, it is
beneficial to data-mine, analyze and document every interaction for
erroneous behaviors. Such failure data provides insight into how
computer systems behave under varied hardware and software
configurations.
To improve dependability, systems designers and programmers must
understand operating system failure data. In this paper, we analyze
crash data from a small number of Windows machines. We collected our
data from a population of volunteers who contribute to the Berkeley
Open Infrastructure for Network Computing (BOINC) project. As our
analysis is based on a small amount of data (with a self-selection
bias due to the nature of BOINC), we acknowledge that our results do
not represent the entire PC population. Nonetheless, the data reveals
several useful results for PC users as well as researchers and product
developers.
Most Windows users have experienced at least one ``bluescreen''
during the lifetime of their machine. A sophisticated PC user will
accept Windows crashes as a fact and attempt to cope with them.
However, a novice user will be terrified by the implications of a
crash and will continue to be preoccupied with the thought of causing
severe damage to the computer. Analyzing failure data can help users
gauge the dependability of various products and understand the source
of their crashes.
From a research perspective, the motivation behind failure data-mining
is manifold. First, it reveals the dominant failure cause of
popular computer systems. In particular, it identifies products that
cause the most user frustration, thus facilitating our efforts to
build stable, resilient systems. Furthermore, it enables product
evaluation and development of benchmarks that rank product quality.
These benchmarks can influence design prototypes for reliable systems.
Within an organization, analyzing failure data can improve quality
of service. Often, corporations collect failure data to evaluate
causes of downtime. In addition, they perform cost-benefit analysis to
improve service availability. Some companies extend their analyses to
client sites by gathering failure data at deployment locations.
For example, Microsoft Corporation collects crash data for their
Windows operating system as well as applications used by their
customers. Unfortunately, due to legal concerns, corporations such as
Microsoft will usually not share their data with academic research
groups. Companies do not wish to reveal their internal
vulnerabilities, nor can they share third party products' potential
weaknesses. In addition, many companies disable the reporting feature
after viewing proprietary data in the report. While abundant failure
data is generated on a daily basis, very little is readily sharable
with the research community.
The remainder of this paper describes our data collection and
analysis methodology, including: related work in the areas of system
dependability and failure data analysis, background information about
Windows crash data and the data collection process, crash data
analysis and results, a discussion of the merits of potential
extensions to our work, and a conclusion.
Related Work
Jim Gray's work [Gra86, Gra90] serves a model for most
contemporary failure analysis work. Gray did not perform root cause
analysis but rather Outage Cause that considers the last in the fault
chain. In 1989, he found that the major source of outages was
software, contributing about 55%, far outrunning its immediate
successor, system operations, which contributed 15%. This observation
led him to blame software for almost every failure. In an earlier
study [G05, GP05], we analyzed Windows application crashes to
understand causal relationships in the user-level. Departing from
Gray's outage cause analysis, in our study we perform root cause
analysis under the assumption that the first crash in a sequence of
crashes is responsible for all subsequent crashes within that event
chain.
The past two decades have produced several studies in root-cause
analysis for operating systems (OS) ranging from Guardian OS and
Tandem Non-Stop UX OS to VAX/VMS and Windows NT [Gra90, Kal98, LI95,
SK+00, SK+02, TI92, TI+95]. In server environments, Tandem computers,
VAX clusters as well as several operating systems and file servers
have been examined for software defects by several researchers. Lee
and Iyer focussed on software faults in the Tandem GUARDIAN operating
system [LI95], Tang and Iyer considered two VAX clusters running the
VAX/VMS operating system [TI92], and Sullivan and Chillarege examined
software defects in MVS, DB2, and IMS [SC91]. Murphy and Gent also
focussed on system crashes in VAX systems over an extended period,
almost a decade [MG95]. They concluded that system management was
responsible for over 50% of failures with software trailing at 20%
followed by hardware that is responsible for about 10% of failures.
While examining NFS data availability in Network Appliance's
NetApp filers, Lancaster and Rowe attributed power failures and
software failures as the largest contributors to downtime; operator
failure contributions were negligible [LR01]. Thakur and Iyer examined
failures in a network of 69 SunOS workstations [TI96]. They divided
problem root causes into network, non-disk and disk-related machine
problems. Kalyanakrishnam, et al. perused six months of event logs
from a LAN comprising of Windows NT workstations that delivered emails
[KK+99]. Using a state machine model of detailed system failure states
to describe failure timelines on a single node, they concluded that
most automatic system reboot problems are software-related; the
average downtime is two hours. Similarly, Xu, et al. considered
Windows NT event log entries related to system reboots for a network
of workstations that were used for enterprise infrastructure, allowing
operators to annotate event logs to indicate the reason for reboot
[XK+99].
In this progression, our study of Windows' crash data gauges the
evolution of PC reliability. Koopman, et al. test operating systems
against the POSIX specification [KD00]. Our study is complementary to
this work as we consider actual crash data that leads to OS
unreliability.
Recently, in Windows XP Machines, Murphy deduced that display
drivers were a dominant crash cause and memory is the most frequently
failing hardware component [Mur04]. We extend this work by studying
actual crash instances experienced by users rather than injecting
artificial faults as performed by fuzz testing [FM00]. Our study of
crash data differs from error log analysis performed by Kalakech, et
al. [KK+04]; we determine the cause of crashes in addition to time and
frequency.
Several researchers have provided insights on benchmarking and
failure data analysis [BC+02, BS97, OB+02, WM+02]. Wilson, et al.
suggest evaluating the relationship between failures and service
availability [WM+02]. Among other metrics, when evaluating
dependability, system stability is a key concern. Ganapathi, et al.
examine Windows XP registry problems and their effect on system
stability [GW+04]. Levendel suggests using the catastrophic nature of
failures to evaluate system stability [Lev89]. Brown, et al. provide a
practical perspective on system dependability by incorporating users'
experience in benchmarks [BC+02, BS97]. In our study of crashes, we
consider these factors when evaluating various applications.
Overview of Crashes and Crashdumps
A crash is an event caused by a problem in the operating system
(OS) or application (app) requiring OS or app restart. App crashes
occur at user level and typically involve restarting the crashing
application. An OS crash occurs at kernel-level, and is usually caused
by memory corruption, bad drivers or faulty system-level routines. OS
crashes are more frustrating than application crashes as they require
the user to kill and restart the Windows Explorer process at a
minimum, more commonly forcing a full machine reboot. While there are
a handful of crashes due to memory corruption and other common systems
problems, a majority of these OS crashes are caused by device drivers.
These drivers are related to various components such as display
monitors, network and video cards.
Upon each OS crash or bluescreen generated by the operating
system, Windows XP collects failure data as a minidump. Users have
three different options for the amount of information that is
collected upon a crash. We use the default (and smallest) option of
collecting small dumps, which are only 64K in size. These small
minidumps contain a partial snapshot of the computer's state at the
time of crash. They include a list of loaded drivers, the names and
timestamps of binaries that were loaded in the computer's memory at
the time of crash, the processor context for the stopped process, and
process information and kernel context for the stopped process and
thread as well as a brief stack trace. We do not collect personal data
files for our study. However, portions of such data may be resident in
memory at the time of crash and will consequently appear in our crash
dumps. To disable personal data inadvertently being sent, crash
reporting may be disabled or the user can choose not to send a
particular crash report.
When an OS crash occurs, typically the entire machine must be
rebooted. Any relevant information that can be captured before the
reboot is saved in a .dmp file in the
%windir%\Minidump directory. These minidumps are
uniquely named with the date of the crash and a serial number to
eliminate conflicting names for multiple crashes on the same day.
Overview of BOINC Crash Collector
Berkeley Open Infrastructure for Network Computing (BOINC) is a
platform for pooling computer resources from volunteers to collect
data and run distributed computations [And03]. A popular example of an
application using this platform is SETI@home, which aggregates
computing power to `search for extraterrestrial intelligence.' BOINC
provides services to send and receive data from its users via the HTTP
protocol using XML formatted files. It allows application writers to
run and maintain a server that can communicate with numerous client
machines through a specified Applications Programmer Interface (API).
Each subscribed user's machine, when idle, is used to run BOINC
applications. Project groups can create project web sites with
registration services for users to subscribe and facilitate a project.
The web site can also display statistics for contributing users.
Taking advantage of these efforts, we have created a data
collection application to run on this platform. BOINC provides a good
opportunity to collect and aggregate data from users outside our
department while addressing privacy concerns. BOINC anonymizes user
information while allowing us to correlate data from the same user. We
have written tools to read minidumps from users' machines and send the
data to our BOINC server. The drawback of this mechanism is that we
can only collect crash dumps that are stored in known locations on the
user's computer, consequently excluding application crash dumps that
are stored in unknown app-specific locations. Furthermore, configuring
the BOINC server is a tedious and meticulous task. We must also
monitor the number of work units we allot for the BOINC projects; if
there are not enough work units, the application will not run on
client machines.
An attractive aspect of using BOINC is that we can add more
features to our application as and when necessary. We can also provide
users with personalized feedback pages, consequently rewarding the
users with an incentive for sharing data. However, we must verify the
integrity of each crashdump we receive from the users; users often
create files in the crashdump directory to inflate their crash
contribution ranking.
We use a combination of Microsoft's analysis tools and custom-written
scripts to parse, filter and analyze the crash data. Received
crash dumps are parsed using Microsoft's ``Debugging Tools for
Windows'' (WinDbg), publicly available at
https://www.microsoft.com/whdc/devtools/debugging/default.mspx. We
retrieve debugging symbols from Microsoft's publicly available symbol
server
(https://www.microsoft.com/whdc/devtools/debugging/symbolpkg.mspx). Parsing crash dumps using WinDbg reveals the module that caused the
crash as well as the proximate cause of the crash via an error code of
the crashing routine. The drawback of this approach is that we rely on
the completeness and accuracy of Microsoft's symbols. For legal
reasons, Microsoft does not make third party debugging symbols
available, especially those related to antivirus and firewall
software.
We have conducted experiments and noted that 10% of crashdumps
parsed with publicly available debugging symbols have different
analysis results as compared to results when parsed with Microsoft's
internal symbols. Microsoft-written components such as ntoskrnl take
the blame for several third party and antivirus/firewall-related
crashes.
Once crash dumps are parsed by WinDbg, the importance of filtering
data is evident. When a computer crashes, the application or entire
machine is rendered unstable for some time during which a subsequent
crash is likely to occur. Specifically, if a particular piece of
hardware is broken, or part of memory is corrupt, repeated use is
likely to reproduce the error. It is inaccurate to double-count
subsequent crashes that occur within the same instability window. To
avoid clustering unrelated events while capturing all related crash
events, we cluster individual crash events from the same machine based
on temporal proximity of the events. The data that is collected can be
used to gather a variety of statistics. We can provide insight to the
IT team about the dominant cause of crashes in the organization and
how to increase product reliability. We can also use crash behavior to
track any potential vulnerability as frequent crashes may be a result
of malware on the machine. In the long run, we may be able to develop
a list of safe and unsafe hardware and software configurations and
installation combinations that result in crashes.
Understanding Crash Data
To study a broad population of Windows users, we studied data from
public-resource computing volunteers. Numerous people enthusiastically
contribute data to projects on BOINC rather than corporations as they
favor a research cause. Additionally, users appreciate incentive
either through statistics that compares their machine to an average
BOINC user's machine, or through recognition as pioneering
contributors to the project.
Currently, we have about 3500 BOINC users signed up to our
project. Over the last year, we have received 2528 OS crashes from 617
of these users; several users experienced (and reported) multiple OS
crashes while a majority of them reported zero or one crash. Users
reporting no crashes most likely do not actively run the BOINC client
on their machine.
According to results shown in Figure 1 most users experienced
(submitted) only one crash; however, several users suffered multiple
OS crashes. One user appears to have experienced over 200 OS crashes
over the last year! The number is staggering considering that this
data is for kernel-level crashes. Perhaps the user's user-mode crash
counts are as bad, if not worse, considering there is more opportunity
for variability in user-mode components.
Figure 1: A histogram of the number of crashes
experienced by users over the last year. One data point was omitted
from the graph for clarity (443 users experienced only 1 crash
each).
First we analyze each crash as a unique entity to determine
statistics on what components cause the Windows OS to crash often.
Then, to understand how crashes on the same machine relate to each
other, we carefully examined machines that experienced more than 5
kernel crashes within a 24 hour time period. In several cases, we
observed the same crash occurring repeatedly (i.e., same fault in same
module). There were also scenarios with crashes in various components
interleaved with one another. We examine user behavior, temporal
patterns and device driver software reliability to understand these
crashes.
A Human Perspective
The human user plays a huge role in the wear and tear of a
computer. User-interaction is among the most difficult patterns to
quantify. We extracted three distinct user-scenarios from examining
crash sequences from our data:
-
Case 1: The user retries the same action repeatedly, and consequently
experiences the same crash multiple times. He believes the repetition
will eventually resolve the problem (which may be true over a long
period of time). In this scenario, the user's model of how things work
is incomplete. He does not understand the complex dependencies within
the system.
-
Case 2: There is some underlying problem at a lower level that is
causing various different crashes. For example, if the user has
hardware problems, he is likely to have many more crashes in random
components. In this case, the user is simply flustered with all the
crashes and fixing each driver involved in each crash still will not
resolve his problem; he will have to fix the root cause.
-
Case 3: The user knows what the problem is and simply does not see an
incentive to fixing it. For example, he might be using an old version
of a driver for which an update is available. There are three
conceivable explanations for not updating the crashing driver: a) fear
of breaking other working components, b) laziness, and c) fear of
getting caught with an illegal copy of software.
A Temporal Perspective
There are factors beyond end user behavior that demonstrate inter-crash
relationships. Figure 2 shows a distribution of the uptime
between a machine reboot and a crash event. We observe that 25% of
crashes occur within 30 minutes of rebooting a machine. 75% of crashes
occur within a day of rebooting a machine. Perhaps shorter system
uptime intervals indicate the trend of several consecutive related
crashes.
Figure 2: A cumulative frequency graph of system uptime
between reboot and crash events. The dotted line extrapolates what the
CFG would look like if Microsoft wrote all the drivers while the
dashed line suggests what the CFG would look like if Microsoft wrote
none of the drivers that crashed.
Upon analyzing crash sequences on various machines, we observed
various distinct temporal indicators of crash cause:
-
<5 minute uptime: A crash that occurs within 5 minutes of rebooting
a computer is most indicative of a boot-time crash. The crash is not
likely to have been caused by a user action. These crashes are the
most frustrating as there is very little the user can do between the
time of reboot and the time of crash. The user may gain insight on
such crashes by examining the boot log.
-
5 minutes-1 hour uptime: These crashes are more likely to be caused by
a specific sequence of events initiated by the user (e.g., accessing a
particular file from a corrupt disk segment). They could be attributed
to software problems, hardware problems or memory corruption.
-
Regular interval between crashes: Several users experienced crashes
regularly at a particular time of day. Such crashes may be attributed
to a periodic process resembling a cron job or an antivirus scan.
-
Context-based: Various crashes are triggered by a logically preceding
event. For example, every time a virus scanner runs, we may observe a
failed disk access. In such scenarios, we cannot use exact time as an
indicator.
-
Random: Many crash sequences on users' machines did not fit in any of
the above profiles. Several consecutive seemingly unrelated crashes
could suggest a hardware problem and/or memory corruption.
Temporal crash patterns are useful in narrowing down a machine's
potential root cause problems. However, the underlying responsibility
of causing the crash lies in the longevity and reliability of the
hardware and software on the machine.
A Device Driver Reliability Perspective
Device drivers are a major contributor of kernel-level crashes. A
device driver is a kernel-mode module that communicates operating
system requests to the device and vice versa. These drivers are
inherently complex in nature and consequently difficult to write.
Among many reasons for device driver complexity are that these drivers
deal with asynchronous events. Since they interact heavily with the
operating system, the code must follow kernel programming etiquette
(which is difficult to master and follow). Furthermore, once device
drivers are written, they are exceedingly difficult to debug as the
typical device driver failure is a combination of an OS event and a
device problem, and thus very difficult to reproduce (see [SM+04] for
a detailed description of device driver problems).
Figure 3 is largely based on the OS Crash Type field in analyzed
crash reports. This field reveals graphics driver faults, common
system faults (such as memory/pool corruption and hardware faults) and
Application faults. However, there were many instances where the OS
Crash Type was not provided (or defaulted to ``Driver Fault'') for
legal reasons. In the absence of details revealed by the analysis
tools, we crawled the web to derive the type of each driver that
caused a crash. Where we were unable to determine the driver type (for
example, when the documentation was not in English), we defaulted to
``unknown.''
OS CRASH TYPE | NUMBER OF CRASHES
| OS Core | 726
| Microsoft | 488
| Unknown | 238
| Graphics Drivers | 495
| Intel | 287
| ATI Technologies | 97
| Nvidia | 67
| Other | 44
| Application Drivers | 482
| Intel | 89
| Microsoft | 64
| Symantec | 58
| McAfee | 55
| Zone Labs | 55
| Unknown | 13
| Other | 148
| Networking | 338
| Unknown | 194
| Microsoft | 51
| Conexant | 17
| Other | 76
| Common System Fault (Hardware and Software Memory
Corruption) | 136
| Audio | 130
| Avance Logic | 44
| C-Media | 33
| Microsoft | 16
| Other | 37
| Storage | 106
| Microsoft | 82
| Other | 24
| Other | 95
| Unknown | 20
| Figure 3: Number of OS crashes
of each type based on 2528 crashes received from BOINC users. (We
would need many more samples before it would be safe generalizing
these results to a larger user community.) This table also shows
the top few crash-causing driver writers in each category.
Figure 4 shows that a handful of organizations contribute a
significant number of crash-causing drivers to our data. Drivers
written by seven organizations (Microsoft, Intel, ATI Technologies,
Nvidia, Symantec, Zone Labs and McAfee) contributed 75% of all crashes
in our data set. This trend suggests that crashes caused by poorly-written
and/or commonly used drivers can be reduced significantly by
approaching these top seven companies. On the other hand, the graph
has a heavy tail, indicating that it would be extremely difficult to
eliminate the remaining 25% of crashes as they are caused by drivers
written by several different organizations.
Figure 4: Cumulative Frequency Graph of organizations
responsible for crash-causing drivers in our data. This graph does not
account for driver popularity. 113 companies are represented in this
graph.
Subsequently, we study the image (i.e., .exe, .SYS, or .dll file)
that caused these crashes and identify the organization that
contributed the crash-causing code see Figure 5.
The top contender in Figure 5 is ialmdev5.dll, the Intel graphics
driver. Recently, graphics drivers have become notorious for causing
crashes and ialmdev5.dll is perhaps one of the more commonly used
drivers in this category due to the popularity of Intel processors.
Image Name/ Crash Cause | Image Description
| Num Crashes | % Crashes | % Running Total
| Ialmdev5.DLL | Intel graphics driver | 275 | 11% | 11%
| ntoskrnl.exe | NT kernel and system | 187 | 8% | 19%
| CAPI20.SYS | ISDN modem driver | 182 | 7% | 26%
| Win32k.sys | multi user win32 driver | 114 | 5% | 31%
| IdeChnDr.sys | Intel Application Accelerator driver | 89 | 4% | 35%
| ntkrnlmp.exe | Multi-processor version of NT kernel and system | 87 | 4% | 39%
| vsdatant.sys | TrueVector Device Driver | 51 | 2% | 41%
| GDFSHK.SYS | McAfee Privacy Service File Guardian | 48 | 2% | 43%
| V7.SYS | IBM V7 Driver for Windows NT/2000 | 45 | 2% | 45%
| ALCXWDM.SYS | Windows WDM driver for Realtek AC'97 | 44 | 2% | 47%
| Figure 5:
Top 10 OS Crash-causing Images based on 2528 crashes received from
BOINC users. (We would need many more samples before it would be safe
generalizing these results to a larger user community.) A description
of the crash-causing image is provided in addition to the percentage
of crashes caused by each image.
The second highest contender in Figure 5 is ntoskrnl.exe, which
constitutes the bare-bones Windows NT operating system kernel code. It
is not surprising that this executable is responsible for a number of
driver crashes because it interacts with every other operating system
component and is thus the single most critical component that can
never be perfect enough. Furthermore, other systems code might
generate bad input parameters to the ntoskrnl functions that cause
exceptions; ntoskrnl bears the blame for the resulting crash as it
generated the exception. Also, as mentioned earlier, many
antivirus/firewall-related crashes may have been mis-categorized,
blaming ntoskrnl due to third party privacy concerns (hence the
significantly high percentage of crashes attributed to Microsoft in
Figure 3).
Other crash causing images range from I/O drivers to multimedia
drivers. It is difficult to debug or even analyze these crashes
further as we do not have the code and/or symbols for these drivers.
With the increasing need for numerous devices accompanying the PC,
it does not scale for the operating system developers to account for
and write device driver code for each device; consequently, device
drivers are written by device manufacturers, who are potentially
inexperienced in kernel programming. Perhaps such lack of expertise is
the most impacting cause for driver-related OS crashes.
We also observed numerous OS crashes caused by memory corruption.
Memory corruption-related crashes can often be attributed to hardware
problems introduced by the type of memory used (e.g., non-ECC memory).
In the event that the memory corruption was due to software, the
problem cannot be tracked down to a single image.
To further understand driver crashes, we studied the type of fault
that resulted in the crash. Figure 6 lists the number of crashes that
were caused by the various fault types. These fault types are reported
by Microsoft's analysis tools when analyzing each OS crash dump.
Driver Fault Type | Num Crashes
| IRQL NOT LESS OR EQUAL | 657
| THREAD STUCK IN DEVICE DRIVER | 327
| PAGE FAULT IN NONPAGED AREA | 323
| KERNEL MODE EXCEPTION NOT HANDLED | 305
| UNEXPECTED KERNEL MODE TRAP | 78
| BAD POOL CALLER | 74
| SYSTEM THREAD EXCEPTION NOT HANDLED | 73
| PFN LIST CORRUPT | 53
| DRIVER CORRUPTED EXPOOL | 38
| MACHINE CHECK EXCEPTION | 37
|
Figure 6: Top 10 crash generating driver fault
types.
While many of these fault types are straightforward to understand
from the name, many others are abbreviations of the event they
describe. Below, we enumerate each fault type and its significance
(based on the descriptions provided in the parsed crash dumps):
-
IRQL NOT LESS OR EQUAL - An attempt was made to access
a pageable (or completely invalid) address at an interrupt request
level (IRQL) that is too high. The driver is most likely using an
improper address.[Note 1]
-
THREAD STUCK IN DEVICE DRIVER - The device driver is
spinning in an infinite loop, most likely waiting for hardware to
become idle. This usually indicates problem with the hardware itself
or with the device driver programming the hardware incorrectly.
-
PAGE FAULT IN NONPAGED AREA - Invalid system memory
was referenced, for example, due to a bad pointer.
-
KERNEL MODE EXCEPTION NOT HANDLED - The exception
address pinpoints the driver/ function that caused the problem.
However, the particular exception thrown by the driver/function was
not handled.
-
UNEXPECTED KERNEL MODE TRAP - A trap occurred in
kernel mode, either because the kernel is not allowed to have/catch
(bound trap) the trap or because a double fault occurred.
-
BAD POOL CALLER - The current thread is making a bad
pool request. Typically this is at a bad IRQL level or double freeing
the same allocation, etc.
-
SYSTEM THREAD EXCEPTION NOT HANDLED - This fault type
is similar to an unhandled kernel mode exception.
-
PFN LIST CORRUPT - Typically caused by drivers passing
bad memory descriptor lists.
-
DRIVER CORRUPTED EXPOOL - An attempt was made to
access a pageable (or completely invalid) address at an interrupt
request level (IRQL) that is too high. This fault is caused by drivers
that have corrupted the system pool.
-
MACHINE CHECK EXCEPTION - A fatal machine- check
exception occurred (due to hardware).
Studying these fault types reveals various programming errors that
impact system behavior and what OS problems to tackle with caution.
However, this information is more useful to the software developer
than the end user. From a user's perspective, the most useful piece of
information is ``what can I fix on my machine?''
There are three distinct trends we observed on machines with
multiple crashes:
-
The same driver causes most crashes: This scenario is very simple to
resolve. Most likely, the crash-causing driver is an old version,
which has newer, more stable version available. There were other cases
where a newly downloaded driver caused various crashes as a result of
its incompatibility with other components installed on the machine. In
both situations, updating or rolling back the driver's version will
reduce crashes on the machine.
-
Related drivers cause most crashes: Two drivers are considered related
if they communicate with the same device or pertain to the same
component. In this scenario if different, yet related, drivers cause
the machine's crashes, then perhaps the common underlying component or
device is at fault and needs attention.
-
Unrelated drivers cause the crashes: This scenario is the most
difficult to comprehend. First, we understand what the drivers have in
common - whether they perform similar actions or function calls, have
similar resource requirements (e.g., requiring network connectivity),
or access the same objects.
In the above scenarios, it is useful to understand inter-driver
dependencies. We would also benefit from understanding the stability
of specific driver versions and how diverse their install base is.
Discussion
Windows users have started viewing crashes as a fact of life
rather than a problem. We have the single most valuable resource to
design a system that helps users cope with crashes better - crash
data. Microsoft's Online Crash Analysis provides users with feedback
on each of their submitted crashes. However, many users suffer from
multiple crashes and individual per-crash analysis is not enough to
identify the optimal solution to the root problem. There is a strong
need to use historical data for each machine and use heuristics to
determine the best fix for that machine.
The human, temporal and device-driver reliability perspectives
shed light on potential root causes for crashing behavior. There are
numerous other factors we can include to refine root cause analysis.
It would be very beneficial to scrape portions of the machine's event
log when analyzing crashes. We can look for significant events
preceding each crash (e.g., Driver installed/removed, process started
up, etc.), pinpointing likely sources of the machine's behavior.
It is also useful to collect various machine health metrics such
as frequency of prophylactic reboot and frequency of virus scans. Such
metrics will help us evaluate the relative healthiness of a machine
(compared to the entire user population) and customize analysis
responses on a per-machine basis. Ideally we would want our data
analysis system to have a built-in feedback loop (as seen in Figure 7)
so we can continuously adapt and improve our analysis engine. This
framework is useful for performing accurate post-mortem analysis.
Figure 7: Customer-centric kernel crash analysis
framework.
It is equally important to understand the manifestation of such
problems on each machine. It is important to characterize inter-component
interactions and model failure propagation patterns. Such
analysis will help improve inter-component isolation, reducing their
crash-likeliness. While post-mortem analysis and debugging helps cure
problems, it is also critical to prevent problems at their source. As
an industry, we must work towards determining the characteristics of
software that dictate software dependability.
Conclusion
Our crash-data related study, despite the small quantity of
Windows XP data analyzed, has contributed several observations. The
most notable reality is that the Windows operating system is not
responsible for a majority of PC crashes in our data set. Poorly-written
device drivers contribute most of the crashes in our data. It
is evident that targeting a few companies to improve their driver
quality will effectively eliminate 75% of our crashes. However, the
remaining 25% of crashes are extremely difficult to eliminate due to
the large number of organizations contributing the driver code.
Users can alleviate computer frustration by better usage
discipline and avoiding unsafe applications and drivers. With
additional data collection and mining, we hope to make stronger claims
about applications and also extract safe product design and usage
methodology that apply universally to all operating systems.
Eventually, this research can gauge product as well as usage
evolution.
Studying failure data is as important to the computing industry as
it is to consumers. Product dependability evaluations help evolve the
industry by reducing quality differential between various products.
Once product reliability data is publicized, users will use such
information to guide their purchasing decisions and usage patterns.
Product developers will react defensively and resulting competition
will improve quality control.
In the future, we hope to refine analysis engine and automate many
of the background queries for each driver. We would like to improve
our understanding of the dependencies between analysis categories such
as the temporal and device driver perspectives. We also plan to
investigate the relationship of various objects involved at the time
of crash. Lastly, we would like to obtain more environmental metrics
and draft more rules for analysis, and extend this work to other
Operating Systems.
Author Biographies
Archana Ganapathi is a graduate student at the University of
California at Berkeley. She completed her Masters degree in 2005 and
is currently pursuing her Ph.D .in Computer Science. Her primary
research interests include operating system dependability and system
management.
Viji Ganapathi is an undergraduate at the University of California
at Berkeley. She will complete her Computer Science Bachelors degree
in December, 2006.
David A. Patterson has been Professor of Computer Science at the
University of California, Berkeley since 1977, after receiving his
A.B., M.S., and Ph.D. from UCLA. He is one of the pioneers of both
RISC and RAID, both of which are widely used. He co-authored five
books, including two on computer architecture with John L. Hennessy.
They have been popular in graduate and undergraduate courses since
1990. Past chair of the Computer Science Department at U.C. Berkeley
and the Computing Research Association, he was elected President of
the Association for Computing Machinery (ACM) for 2004 to 2006 and
served on the Information Technology Advisory Committee for the U.S.
President (PITAC) from 2003 to 2005.
His work was recognized by education and research awards from ACM
and IEEE and by election to the National Academy of Engineering. In
2005 he shared Japan's Computer & Communication award with Hennessy
and was named to the Silicon Valley Engineering Hall of Fame. In 2006
he was elected to the American Academy of Arts and Sciences and the
National Academy of Sciences and he received the Distinguished Service
Award from the Computing Research Association.
Bibliography
[And03] Anderson, D., ``Public Computing: Reconnecting People to
Science,'' The Conference on Shared Knowledge and the Web,
Residencia de Estudiantes, Madrid, Spain, Nov., 2003.
[BS+02] Broadwell, P., N. Sastry and J. Traupman, ``FIG: A
Prototype Tool for Online Verification of Recovery Mechanisms,''
Workshop on Self-Healing, Adaptive and self-MANaged Systems
(SHAMAN), New York, NY, June, 2002.
[BC+02] Brown, A., L. Chung, and D. Patterson, ``Including the
Human Factor in Dependability Benchmarks,'' Proc. 2002 DSN Workshop
on Dependability Benchmarking, Washington, D.C., June, 2002.
[BS97] Brown, A. and M. Seltzer, ``Operating System Benchmarking
in the Wake of Lmbench: A Case Study of the Performance of NetBSD on
the Intel x86 Architecture,'' Proc. 1997 ACM SIGMETRICS Conference
on the Measurement and Modeling of Computer Systems, Seattle, WA,
June, 1997.
[FM00] Forrester, J. and B. Miller, ``An Empirical Study of the
Robustness of Windows NT Applications Using Random Testing,'' Proc.
4th USENIX Windows System Symposium, Seattle, WA, Aug., 2000.
[G05] Ganapathi, A. ``Why Does Windows Crash?'' UC Berkeley
Technical Report UCB//CSD-05-1393, May, 2005.
[GW+04] Ganapathi, A., Y. Wang, N. Lao and J. Wen, ``Why PCs are
Fragile and What We Can Do About It: A Study of Windows Registry
Problems,'' Proc. International Conference on Dependable Systems
and Networks (DSN-2004), Florence, Italy, June, 2004.
[GP05] Ganapathi, A. and D. Patterson, ``Crash Data Collection: A
Windows Case Study,'' To Appear in Proc. International Conference
on Dependable Systems and Networks (DSN-2005), Yokohama, Japan,
June, 2005.
[Gra86] Gray, J., ``Why Do Computers Stop and What Can Be Done
About It?'' Symp. on Reliability in Distributed Software and
Database Systems, pp. 3-12, 1986.
[Gra90] Gray, J., ``A census of Tandem system availability
between 1985 and 1990,'' Tandem Computers Technical Report
90.1, 1990.
[GS04] Gray, J. and A. Szalay, ``Where the rubber meets the
sky:bridging the gap between databases and science,'' Microsoft
Research TR-2004-110, 2004.
[KK+04] Kalakech, A., K. Kanoun, Y. Crouzet and J. Arlat,
``Benchmarking the dependability of Windows NT4, 2000 and XP,''
Proc. International Conference on Dependable Systems and Networks
(DSN-2004), Florence, Italy, June, 2004.
[Kal98] Kalyanakrishnam, M., ``Analysis of Failures in Windows NT
Systems,'' Masters Thesis, Technical report CRHC 98-08,
University of Illinois at Urbana-Champaign, 1998.
[KK+99] Kalyanakrishnam, M., Z. Kalbarczyk, and R. Iyer,
``Failure data analysis of a LAN of Windows NT based computers,''
Proceedings of the 18th IEEE Symposium on Reliable Distributed
Systems, 1999.
[KD+05] King, S., G. Dunlap and P. Chen, ``Debugging operating
systems with time-traveling virtual machines,'' Proceedings of the
2005 Annual USENIX Technical Conference, April, 2005.
[KD00] Koopman, P., and J. DeVale, ``The Exception Handling
Effectiveness of POSIX Operating Systems,'' IEEE Transactions on
Software Engineering, Vol. 26, Num. 9, pp. 837-848, Sept., 2000.
[LR01] Lancaster, L. and A. Rowe, ``Measuring real-world data
availability,'' Proceedings of LISA 2001, 2001.
[LI95] Lee, I. and R. Iyer, ``Software Dependability in the
Tandem GUARDIAN Operating System,'' IEEE Transactions on Software
Engineering, Vol. 21, Num. 5, pp. 455-467, May, 1995.
[Lev89] Levendel, Y., ``Defects and Reliability Analysis of Large
Software Systems: Field Experience,'' Digest 19th Fault-Tolerant
Computing Symposium, pp. 238-243, June, 1989.
[MR+04] Maniatis, P., M. Roussopoulos, T. Giuli, D. S. H.
Rosenthal, and M. Baker, ``The lockss peer-to-peer digital
preservation system,'' ACM Transactions on Computer Systems
(TOCS), 2004.
[Mur04] Murphy, B., ``Automating Software Failure Reporting,''
ACM Queue, Vol. 2, Num. 8, Nov., 2004.
[MG95] Murphy, B. and T. Gent, ``Measuring system and software
reliability using an automated data collection process,'' Quality
and Reliability Engineering International, Vol. 11, 1995.
[OB+02] Oppenheimer, D., A. Brown, J. Traupman, P. Broadwell, and
D. Patterson, ``Practical issues in dependability benchmarking,''
Workshop on Evaluating and Architecting System dependabilitY (EASY
'02), San Jose, CA, Oct., 2002.
[SS72] Schroeder, M. and J. Saltzer, ``A Hardware Architecture
for Implementing Protection Rings,'' Communications of the ACM,
Vol. 15, Num. 3, pp. 157-170, March, 1972.
[SK+00] Shelton, C., P. Koopman, K. DeVale, ``Robustness Testing
of the Microsoft Win32 API,'' Proc. International Conference on
Dependable Systems and Networks (DSN-2000), New York, June, 2000.
[SK+02] Simache, C., M. Kaaniche, A. Saidane, ``Event log based
dependability analysis of Windows NT and 2K systems,'' Proc. 2002
Pacific Rim International Symposium on Dependable Computing
(PRDC'02), pp. 311-315, Tsukuba, Japan, Dec., 2002.
[SC91] Sullivan, M. and R. Chillarege, ``Software defects and
their impact on system availability - a study of field failures in
operating systems,'' Proceedings of the 21st International
Symposium on Fault-Tolerant Computing, 1991.
[SM+04] Swift, M., Muthukaruppan, B. Bershad, and H. Levy,
``Recovering Device Drivers,'' Proceedings of the 6th ACM/USENIX
Symposium on Operating Systems Design and Implementation, San
Francisco, CA, Dec., 2004.
[TI92] Tang, D. and R. Iyer, ``Analysis of the VAX/VMS Error Logs
in Multicomputer Environments - A Case Study of Software
Dependability,'' International Symposium on Software Reliability
Engineering, Research Triangle Park, North Carolina, Oct., 1992.
[TI96] Thakur, A. and R. Iyer, ``Analyze-NOW-an environment for
collection and analysis of failures in a network of workstations,''
IEEE Transactions on Reliability, Vol. R46, Num. 4, 1996.
[TI+95] Thakur, A., R. Iyer, L. Young and I. Lee, ``Analysis of
Failures in the Tandem NonStop-UX Operating System,'' International
Symposium on Software Reliability Engineering, Oct., 1995.
[WL+93] Wahbe, R., S. Lucco, T. Anderson, and S. Graham,
``Efficient Software-Based Fault Isolation,'' Proc. Fourteenth ACM
Symposium on Operating Systems Principles (SOSP), pp. 203-216,
December, 1993.
[WC+01] Welsh, M., D. Culler and E. Brewer, ``SEDA, an
Architecture for well-conditioned scalable Internet Services,''
18th Symposium on Operating System Principles, Chateau Lake
Louise, Canada, October, 2001.
[WM+02] Wilson, D., B. Murphy and L. Spainhower, ``Progress on
Defining Standardized Classes for Comparing the Dependability of
Computer Systems,'' Proc. DSN 2002 Workshop on Dependability
Benchmarking, Washington, D.C., June, 2002.
[XK+99] Xu, J., Z. Kalbarczyk and R. Iyer, ``Networked Windows NT
system field failure data analysis,'' Proceedings of the 1999
Pacific Rim International Symposium on Dependable Computing, 1999.
Footnotes:
Note 1: The interrupt request
level is the hardware priority level at which a given kernel-mode
routine runs, masking off interrupts with an equivalent or lower
IRQL on the processor. A routine can be preempted by an interrupt
with a higher IRQL.
|