LISA '06 Paper
LiveOps: Systems Management as a Service
Chad Verbowski - Microsoft Research
Juhan Lee and Xiaogang Liu - Microsoft MSN
Roussi Roussev - Florida Institute of Technology
Yi-Min Wang - Microsoft Research
Pp. 187-203 of the Proceedings of LISA '06:
20th Large Installation System Administration Conference (Washington, DC:
USENIX Association, December 3-8, 2006).
Abstract
Existing management systems do not detect the most time-consuming
and technically difficult anomalies administrators encounter.
Oppenheimer [25] found that 33% of outages were caused by human error
and that 76% of the time taken to resolve an outage was spent by
humans determining what change was needed. Defining anomaly detection
rules is challenging and often cannot be shared across organizations.
It requires a deep combined knowledge of the software, workload,
system configuration, and tuning parameters specific to the workload
and overall distributed application topology.
We present LiveOps, a scalable systems and security management
service based on auditing the interactions between applications and
the persistent state they use [33]. This approach simplifies
identifying security vulnerabilities, performs compliance auditing,
enables forensic investigations, detects patching problems, optimizes
troubleshooting, and detects malware/intrusions. The service enables
knowledge sharing across organizations and administrative boundaries
and allows for seamless integration between analysis results from
disparate management products that build on it. Our configuration-free
agent collects all read and write access to registry entries, files,
binaries, and process creation. The agent's streaming lossless
compression creates log files of only 20 MB per day containing an
average of 45 million events. The scalable LiveOps back-end service
can analyze 1000 machine days of logs in 30 minutes. LiveOps agents
have been deployed on 1149 machines from home systems to corporate
desktops, including 381 production MSN servers across 11 sites.
Introduction
MSN system administrators spend a third of their time managing
system configuration despite the use of state-of-the-art systems
management solutions. This not only reduces the number of systems that
a single administrator can effectively manage, but is a significant
liability in overall system reliability.
Several examples of configuration errors impacting system
reliability are described in [33], where at one MSN site, 70% of
persistent errors were found to be persistent state (PS) related, and
28% of support calls at a large software company's help desk were
configuration related. Furthermore, a lack of understanding about the
impact configuration changes have on critical applications can delay
the deployment of critical security patches [2, 28, 34], and has been
found to be responsible for 76% of the time taken to resolve
datacenter outages [25].
Existing systems management solutions are difficult to implement,
expensive to deploy, and are largely ineffective at addressing the
most costly systems and security management problems. Furthermore, for
scalability and performance reasons they do not track all changes, or
the interactions between applications, users and PS. Administrators
must manually apply their experience and application knowledge to
filter the subset of PS changes recorded by these systems to identify
only the PS changes impacting their applications.
Defining anomaly detection rules based on application generated
events [31, 40] is challenging and often cannot be shared across
organizations. It requires a deep combined knowledge of the software,
workload, system configuration, and tuning parameters that are
specific to the workload and overall distributed application topology.
Software developers and systems management solution vendors are
restricted to creating rules that are based on the available
application events, and that are locally tunable or independent of
workload, system configuration, and distributed application topology.
Administrators are left with the challenge of determining the
appropriate tuning parameters for existing rules and are required to
identify and codify the bulk of rules needed to manage their
environments. Sharing rules between organizations is further
complicated by variations in tuning parameters and rule definition
languages.
We present LiveOps, a scalable systems and security management
service based on auditing the interactions between applications and
persistent state they use [33]. This approach simplifies identifying
security vulnerabilities, performs compliance auditing, enables
forensic investigations, detects patching problems, optimizes
troubleshooting, and detects malware/intrusions.
The LiveOps architecture consists of agents running on the managed
systems that report logs to a back-end service which processes the
data and provides an extensible interface for generating web reports,
alerts, or integrating with other management products. Our
configuration-free agent collects all read and write access to
registry entries, files, binary loads, and process creation. The
agent's streaming lossless compression creates log files of only 20 MB
per day containing an average of 45 million events. The scalable
LiveOps back-end service can analyze 1000 machine days of logs in 30
minutes.
The key contributions of LiveOps are, first, scalable and complete
configuration monitoring with comprehensive anomaly detection, through
cross-machine and cross-time baseline analysis within an organization
and globally across all machines reporting to the service, enabling
self-tuning of rules with respect to workload and topology. Second,
LiveOps' ability to manage 4000+ monitored machines via a single
back-end server enables it to be run as a service supporting multiple
organizations.
As a service, LiveOps provides immediate benefit to all
subscribers when new rules are developed. Using a service rather than
a locally installed management system eliminates the cost of
maintaining a systems management solution, enables IT departments to
draw on the expertise of other administrators, and potentially alerts
the original developers of a problematic application. When rare and
difficult issues are encountered, the service provides the specific
management details that enable collaboration with experts that can
help resolve the issue. Furthermore, after an issue is resolved the
relevant centralized data can be annotated to benefit other users
experiencing a similar problem.
We present our experiences and analysis of the alerts and reports
collected from running LiveOps in MSN. LiveOps agents have been
deployed on 1149 machines, from home systems to corporate desktops,
including 381 production MSN servers across 11 sites. In the next
section, we describe the motivation for creating LiveOps and the
related work comparing it with existing management approaches. We
subsequently describe the LiveOps architecture, present the
management scenarios that LiveOps addresses, provide sample reports,
and summarize the results obtained from analyzing the reports from MSN
production servers. We then present a data analysis of the PS used in
a production environment, and an analysis of the feasibility of
classifying them. Finally, we conclude.
Motivation and Related Work
The creation of LiveOps was motivated by the discovery that
infrequently occurring unique configuration issues have a large
reliability impact, require the most administrator time, and are the
most technically challenging to resolve. Traditional management
products [21] are largely ineffective at managing these problems
because they rely on the monitored applications to log events [31, 40]
when they are malfunctioning.
It is unrealistic to expect application developers to have correctly
anticipated all possible problems that may occur within the
application itself, from its integration with other applications, its
interactions with the OS, and its interactions with distributed
systems such as databases, firewalls, web servers, and network-related
services.
Events logged by applications are best used for automating responses
to frequently occurring problems that are well understood, but seldom
provide sufficient insight for administrators to enable quick
resolution of issues unanticipated by the original application
developers. The following example of a problem at a large MSN site
illustrates how understanding the impact of individual PS changes on
an application enables quick problem resolution:
An
administrator was assigned to resolve an intermittent web page issue
with a large online site. During the course of solving the problem, a
critical configuration file was inadvertently deleted. After the
intermittency issue was resolved, the site appeared fixed and the
issue was closed. After 18 hours it was discovered that the site was
experiencing a partial outage. Investigation of the outage involved
teams of engineers, and lasted for 27 more hours, before the
originally modified configuration file was discovered as the root
cause. With a record of what PS had been changed on each machine, the
investigators could have quickly and easily identified the
configuration file as a root cause candidate.
The next example illustrates how failing to verify the complete
set of files and settings modified after a patch installation can
prevent diagnosis of configuration problems:
A patch was
deployed on servers across a large MSN site. Later it was found that
the site was experiencing a partial outage after the installation,
where some users experienced poor performance or received an error
message after refreshing their web page two out of five times. Since
the site load balances incoming requests over a large pool of servers,
it was difficult to determine which servers were causing the problem.
It took two engineering teams approximately 72 hours to determine that
the root cause was that one of the servers had only received partial
settings during the recent patch install.
The LiveOps approach is to manage configuration from the OS
perspective of interactions between running processes and PS. This
differs from asserting correctness constraints on subsets of
configuration for the system as a whole, as done with CFEngine [5],
because using interactions enables consideration of the subset of PS
that is used by each application. Overall, this reduces the volume of
PS to consider, because less than 15% of non-temporary registry
entries and files are typically in daily use [32].
We believe this is more effective than the traditional approach of
analyzing application logs for the following reasons:
-
It is a non-participatory model, meaning that all changes made by
applications are tracked regardless of the APIs they use or logs they
create.
-
Only a few well-defined event types from a single source are required
for monitoring all applications, as opposed to multiple event logs
containing application-specific events.
-
If the specific root cause of a problem is not obvious, the
information provided narrows the set of potential root cause
candidates, making problems easier to solve.
-
When the root cause of the problem is found, future incarnations of
the problem can be avoided by knowing which process, run by whom and
at what specific time, made the breaking change.
Another approach to configuration management is to eliminate the
need to identify the root cause of problems by following a non-traditional
approach of designing applications with the expectation
that they will fail [4], as opposed to attempting to eliminate all
possible errors. This is achieved by making them quick to install and
restart, and by creating software probes that remotely monitor the
application to detect failures and restart the application when
failures are detected. While this approach minimizes the time spent by
operators investigating problems, the lack of root cause understanding
means that future occurrences of the problem cannot be avoided.
The implicit assumptions are that problems will not simultaneously
affect large numbers of machines, and that problems affecting a
machine will be relatively infrequent. These may not be valid
assumptions if DDoS attacks are made on the infrastructure.
Furthermore, this approach requires that all failures be detectable by
the probes.
LiveOps provides a critical missing component of the software
configuration management cycle (shown in Figure 1) by detecting all
system changes, verifying approved requests, and alerting upon unknown
modifications. Closing this loop is becoming increasingly important in
order to identify unwanted changes, malware for example, and to
fulfill auditing requirements [27, 29].
Figure 1: Typical Change Management Process showing how
LiveOps closes the loop.
Managing servers and desktops through a management service has
several advantages over individual companies or departments
maintaining a local management infrastructure. These are:
-
Current and historic information about the managed systems remains
available and accessible despite local outages and system failures,
and it provides a quick and easy method of sharing detailed
information with support professionals and other problem domain
experts without local network or system access.
-
The service enables correlation of system activities across large
numbers of managed hosts to identify known good and known bad values.
-
The service enables application and OS developers to receive details
on how their software is being configured and used, which can help
identify problems and drive future product enhancements.
-
The service provides a centralized, globally available, and integrated
system for application developers, administrators, and support
professionals to add knowledge on specific PS, regarding optimal
values, and new analysis rules to identify and resolve problems so
future instances will not require detailed investigation.
-
The service becomes a management platform for new reports so problem
analysis techniques can be added to the service's back-end servers
without requiring software or configuration updates to the agents
running on managed systems.
Architecture
In this section, we present an overview of the LiveOps system
architecture. Figure 2 describes the system architecture and data flow
of:
-
Our low-level server or desktop agent that logs all registry, file,
binary load, and process creation interactions, compresses the trace
events into log files, and uploads them for analysis [33]
-
The LiveOps back-end service that processes and archives the uploaded
log files
-
An extensible web interface that supports retrieving reports,
programmatic data access, and integrating with existing notification
mechanisms. Our implementation does not require any changes to the
core operating system, nor any application-specific changes or
configuration.
Figure 2: LiveOps System Architecture and data
flow.
Figure 3 describes the processing flow of the back-end LiveOps
servers, detailing the analysis performed by the daemons on newly
received logs. Anomaly detection rules, such as checking for policy
violations or correlating an observed change with a planned change
work order, are written in standard C/C++/C# code and compiled into a
dynamically loadable Query Module (QM). At startup the daemons load
the QMs, passing each QM a reference to the newly uploaded logs.
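To make the plug-in contract concrete, the following is a minimal
sketch of what a QM might look like in C++. The names (IQueryModule,
LogBatch, OnLogsUploaded, CreateQueryModule) are our illustration; the
paper only states that QMs are compiled C/C++/C# modules loaded
dynamically by the daemons.

    // Hypothetical sketch of the Query Module (QM) plug-in contract.
    // All names here are illustrative, not the actual LiveOps API.
    #include <string>
    #include <vector>

    struct LogBatch {                 // reference to newly uploaded logs
        std::vector<std::string> logFilePaths;
        std::string              machineName;
    };

    class IQueryModule {
    public:
        virtual ~IQueryModule() {}
        // Called once at daemon startup, e.g., to open a CMDB
        // connection or load change-policy definitions.
        virtual bool Initialize() = 0;
        // Called for each batch of newly uploaded logs; the QM scans
        // the events and reports anomalies through the common alert API.
        virtual void OnLogsUploaded(const LogBatch& batch) = 0;
    };

    // Exported factory the daemon could resolve after loading the module.
    extern "C" IQueryModule* CreateQueryModule();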
Figure 3: LiveOps analysis framework.
Internally, a QM may integrate with external IT data sources, such
as change management databases (CMDB) or change policy definitions.
The QM then detects anomalies such as identifying patterns within the
logs, correlating log activities with CMDB work items, or using change
policy definitions to identify policy violations. All QMs write alerts
to a common API that stores them in the Alert Database. An Alert
Notification daemon reads the alerts and sends notifications through
Instant Messenger, email, and RSS feeds. A QM may also maintain an
internal database, such as the baseline database used to store a
history of application interaction with PS. The baseline QM uses its
database to detect deviations and log them as alerts. Once log files
have been processed by all QMs they are moved into a central archive.
Administrators can create ad-hoc queries against all archived data
using the web service that exposes the LiveOps query API. This web
service is used by the LiveOps web server to generate HTML reports,
and can also be used by other products to integrate with LiveOps data.
In addition to providing access to the information, LiveOps also
provides contextual information about each alert and the settings it
describes. Three stages of annotations (automatic, pre-defined
context mapping, and expert), as described in Figure 4, are applied to
the alerts and individual file and setting entries as they are read
from the web interface.
Figure 4: LiveOps annotation of
information.
For example, let's consider the annotations added for an alert
that describes a new application installation on a monitored machine.
The first level provides statistical information such as how many
other machines have this application installed, and the most common
versions of files and values of settings associated with it. The
second level of annotation compares the binary hash of the installed
files with a database that indexes company name, application name, and
version information based on binary hashes. The final level of
annotation adds any comments or rating information provided by other
users of the system.
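As an illustration of the second annotation level, the following
sketch looks up product metadata by binary hash. The structure and
function names are ours, not the LiveOps schema.

    #include <map>
    #include <optional>
    #include <string>

    // Hypothetical product record keyed by a binary's hash, mirroring
    // the second annotation level described above.
    struct ProductInfo {
        std::string company;
        std::string product;
        std::string version;
    };

    // Illustrative lookup: given the hash of an installed file, return
    // company/product/version metadata if the binary is known.
    std::optional<ProductInfo>
    AnnotateBinary(const std::map<std::string, ProductInfo>& index,
                   const std::string& binaryHash)
    {
        auto it = index.find(binaryHash);
        if (it == index.end())
            return std::nullopt;   // unknown binary: left for expert review
        return it->second;
    }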
If this installation was known to be spyware or malware, the final
annotations could include support articles that describe how to deal
with the installation. The quality of this information grows as more
machines report their data to the service, and as more users provide
comments and ratings on the alerts and individual file and settings
entries. It also gives application vendors valuable insight into how
consumers configure and use their products, along with insight into
the real-world problems their products have.
LiveOps Scenarios
LiveOps was created around the management philosophy that
administrators must understand the PS of a running system, all changes
made to the PS, and the scope of PS interactions for correctly running
processes. This understanding is required in all of the administrative
management scenarios described in the following subsections. Each section
contains a description of the LiveOps reports, how they are generated,
and a summary of results from running LiveOps on production servers at
MSN.
The section on ``Understanding Changes'' describes how LiveOps
improves the change management processes used by most IT organizations
by showing how it can be used to:
-
Verify that changes are correctly made and to discover unauthorized
changes
-
Identify unapproved processes
-
Discover the frequency and impact of changes affecting critical
systems and applications
-
Identify sensitive information that is being copied to removable
devices and network locations
-
Analyze and account for the PS that are often left behind by software
installation and removal programs
The section on ``Managing Abnormal System Activity'' shows how
LiveOps can significantly reduce troubleshooting time by:
-
Detecting known problems
-
Providing contextual information about which PS processes and users
were accessing
-
Identifying processes that have become stale from changes to PS
``Enforcing Best Practices'' describes how LiveOps can be used to
improve the best practices followed by administrators by identifying:
-
When users are logging in to systems, potentially unnecessarily
-
Daemons running under local user or domain user accounts instead of
the system account
Scenario: Understanding Changes
The typical change management process followed by many IT
organizations is:
-
Identify change to be made and create a work order
-
Verify correctness of change in test environment with test workloads
-
Deploy the change to a subset of servers and monitor for problems
caused by real-world workloads
-
Broadly deploy the change and close the work order
The three main challenges with implementing this process are:
-
Defining the specific changes expected
-
Auditing the changes as they are made to the systems
-
Identifying the impact of the change to understand the applications
that must be tested and verified
Without the LiveOps `Change Management' reports defined in this
section, scheduled changes are often not understood and therefore
their impact is not well known. For example, work orders may define a
change as applying a SQL Server Service Pack 1 update, rather than
listing the specific files and settings being modified. Consequently,
understanding if the change has been correctly and completely applied
to all required servers is often based on unreliable and incomplete
methodologies, such as:
-
Sampling each affected server for one of the known files in the change
-
Relying on administrators' notes about what changes were made and
when
-
Examining logs that may not fully describe the change
Oftentimes the installation programs for patches and applications
do not maintain an accurate record of the PS being created or
changed on the system, and consequently their uninstall programs can
leave PS behind, or even corrupt the remaining applications on the
system. In a later section, we present a case study of software
installation and removal that demonstrates the significance of this
problem.
Furthermore, current auditing techniques can only detect changes
that are recorded by the affected applications and system logs, or
those which are manually reported by administrators.
For example, if an administrator makes multiple configuration changes
in an attempt to troubleshoot a problem and forgets to roll back one
of them, this change will not be recorded and may cause a future
problem on the server.
In addition to auditing changes as part of the change management
process, there is a growing need to audit all interactions on systems.
Several regulatory bodies require maintaining strict audit logs of
changes made to systems [27, 29]. Also, new consumer protection laws
enforced in Japan and being drafted in other countries, now make
businesses liable for Personally Identifiable Information (PII) that
is leaked or stolen from their companies [13, 14]. In addition, the
growing trend to outsource business functions that relate to
Intellectual Property (IP), such as the source code used to develop
software, creates a need to audit when IP and PII are accessed,
modified, or copied.
To improve the change management process and facilitate regulatory
auditing, LiveOps provides reports that describe critical changes;
document unauthorized applications and change impact analysis; and
list sensitive data copied to network and removable devices. The
remainder of this section describes these reports in more detail and
summarizes our results from generating them in MSN datacenters.
Report: Critical Changes
LiveOps defines critical changes as:
-
Unexpected program execution
-
Modifications made to PS used by the operating system and
line-of-business (LOB) applications
The remainder of this section defines the alerts used to
generate critical change reports, and describes the distribution of
alerts detected for a one-month sample of LiveOps data.
Datacenter reviewers of the report can manually associate the
processes needing approval with work order systems. LiveOps can be
used to verify configuration changes, potentially in an automated way,
by correlating planned changes with the process name, user name, and
time of the changes made. LiveOps marks alerts for processes requiring
approval that run during `Lockdown' periods by integrating with the
lockdown schedule maintained externally to LiveOps by operations
managers.
Changes to PS used by the OS and LOB applications must be
authorized and controlled to avoid reliability and availability
problems. Changes made to management applications on these servers are
also important because they control the insight administrators have
into the performance, reliability, and change management behavior of
the server. Incorrect changes to management applications may reduce or
eliminate visibility, and may expose the system to attackers.
While it is important to understand the critical changes affecting
systems, not all writes to PS are interesting to administrators. To
identify the critical changes from all writes to PS, the following
nine prioritized classifications are used to label each entry:
-
Problem: Indicates that a known problem results from the
existence or removal of this PS.
-
Install: PS added as part of an installation or upgrade.
-
Setting: Changes made to configuration PS.
-
Content: Web pages, images, and user data.
-
Management Change: Installation, patching, or configuration
changes made to the management applications running on the system.
-
Unauthorized: Installation of prohibited applications, or
configuration changes to prohibited values.
-
User Activity: PS modified as a result of users logging in or
running window applications.
-
Noise: Temporary or cached PS.
-
Unknown: Unclassified PS.
The classification process matches the substrings contained in
classification rules against the PS name contained in each change
event. Matches to classification substrings with higher priority take
precedence over those with lower priority. For example, a PS name
matching both `User Activity' and `Install' classification substrings
will be labeled as `Install.' From examining 28 days of change
activity from 34 systems, we identified 1 to 20 substring rules for
each classification. Administrators can update the classification
rules as needed; however, Table 1 shows that the initial rules cover
all except 0.2% (39 k/16.7 M) of the PS observed.
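A minimal sketch of this priority-based substring classification,
with illustrative names of our own choosing, might look as follows:

    #include <string>
    #include <vector>

    // Hypothetical classification rule: a substring matched against
    // the PS name, plus the priority of the label it assigns (lower
    // value = higher priority, so `Problem' beats `Install', etc.).
    struct Rule {
        std::string substring;
        int         priority;     // 0 = Problem ... 8 = Unknown
        std::string label;
    };

    // Label a PS change event with the highest-priority matching rule;
    // anything that matches nothing falls through to `Unknown'.
    std::string Classify(const std::vector<Rule>& rules,
                         const std::string& psName)
    {
        const Rule* best = nullptr;
        for (const Rule& r : rules) {
            if (psName.find(r.substring) == std::string::npos)
                continue;
            if (!best || r.priority < best->priority)
                best = &r;
        }
        return best ? best->label : "Unknown";
    }

Applying Classify to each change event yields the state classification
labels of Table 1, with unmatched names falling through to `Unknown.'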
                     |      State Grouping    |      Process Grouping of State Changes
State Classification |        All |  Distinct |     Daily |    Daily |  Monthly | Avg. Per Machine
                     |            |           | Instances | Distinct | Distinct |   Daily Distinct
Problem              |          0 |         0 |         0 |        0 |        0 |                0
Install              |    104,149 |    16,947 |       810 |      155 |       69 |                4
Configuration        |    176,300 |     3,340 |       399 |       86 |       22 |                3
Content              | 16,261,721 | 1,593,100 |     9,513 |       50 |        1 |                3
Management Change    |     57,145 |       864 |       234 |       79 |       11 |                2
Unauthorized         |      4,206 |       634 |        14 |        9 |        3 |                2
User Activity        |      1,221 |       189 |        96 |       33 |        4 |                3
Noise                |    104,109 |     1,727 |       909 |       66 |       14 |                2
Unknown              |     39,715 |     2,613 |       534 |       60 |       13 |                3
TOTAL                | 16,748,566 | 1,619,414 |    12,509 |      538 |      137 |               22

Table 1: Critical changes for one month of production server logs
across 34 machines, showing individual changes and changes grouped by
process.
Table 1 shows a summary of the changes observed across one month
of traces from 34 production servers. It shows that 16.7 M individual
changes were made to 1.6 M distinct PS entries, meaning that on
average each PS was changed 10 times. Considering an overall average
of 300 k PS entries per machine, this means that only 16% (1.6 M
distinct state grouped changes / (300 k PS * 34 machines)) of the PS
was modified during this period.
Although 1.6 M changes are too many for an administrator to review,
we can significantly reduce the items to review if we assume that each
process instance is updating a set of related PS. For example, an
installation program will add thousands of Registry entries and
hundreds of files during the installation of a single application.
Table 1 shows that there is an O(10^2) reduction in changes if daily
change process instances are considered instead of distinct state.
This can be further reduced to O(10^3) by considering only changes
made by daily distinct processes; however, comparing monthly distinct
with daily distinct, we reduce the number of changes to consider by
only half. For example, in MSN it is common for the same patch
installer, KB-123.exe, to be run on all 34 systems, which can be
presented as one distinct kb-123.exe entry, providing the
administrator with the ability to `drill in' to see the affected
machines and individual PS changes. If we consider the average daily
distinct changes for each machine, rather than the overall distinct
changes, we find that there are only O(10^1) changes to review per
machine each day.
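The process grouping behind these reductions can be illustrated by
collapsing raw change events into distinct (day, process) pairs; the
event fields shown are a simplification of the real log schema:

    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    // One change event, trimmed to the fields used for grouping.
    struct ChangeEvent {
        std::string day;          // e.g., "2006-02-28"
        std::string processName;  // e.g., "kb-123.exe"
        std::string psName;       // registry entry or file path
    };

    // Collapse individual PS changes into the per-day process
    // instances an administrator would actually review, as in
    // Table 1's process grouping.
    size_t DailyDistinctProcesses(const std::vector<ChangeEvent>& events)
    {
        std::set<std::pair<std::string, std::string>> dayProcess;
        for (const ChangeEvent& e : events)
            dayProcess.insert({e.day, e.processName});
        return dayProcess.size(); // orders of magnitude below raw count
    }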
The LiveOps lockdown reports show configuration changes such as
installations, patching, changes to LOB, OS, or management
applications during periods when changes are prohibited on systems.
Lockdown reports from five properties were analyzed for nine lockdown
ranges over a seven-month period. Figure 5 shows that all properties
had lockdown violations during at least one period; two properties had
violations in eight of nine lockdown periods. Violations ranged from
minor issues such as running diagnostics on systems, to more severe
changes like installing service packs and new applications. We expect
that identifying and attributing lockdown violations to specific
administrators will be a strong deterrent.
Figure 5: A marker for each property with at least one
violation during a Lockdown period.
Report: Unauthorized Applications
Only approved processes should be running on datacenter systems.
Similar to systems such as TripWire [16], LiveOps uses a predetermined
list of approved and unapproved applications to identify unauthorized
processes on servers. The list contains the following details for
each of the approximately 200 distinct process names observed on
production servers, and 1300 distinct process names observed across
all systems including desktop and home machines:
-
Approved - Marks the process as approved (or not) for running
on production systems, or requires a work order before running it.
-
Type - Indicates if this process is a command line tool,
daemon, or interactive window application. Reports will use this
classification to determine if daemons are not running under local
system credentials, if window applications are being remotely
launched, or if local logins are made when administrative activities
could be done remotely.
-
Category - Indicates the intended use of the process:
-
LOB: Created specifically for a business need, such as
the infrastructure for a web site.
-
OS: Processes that run as part of the OS as required
components, such as lsass.exe.
-
Desktop: A corporate desktop application like an email
client. Applications that are used in server environments but designed
for home or desktop use may not meet the stringent security
requirements of a server environment.
-
Dev: A developer tool such as a compiler.
-
Home: For home entertainment, such as a game.
-
Mgmt: Management products and tools used to manage
systems, such as agents or event log tools that ship with the OS.
Instances of diagnostic tool processes can indicate that server
problems may exist, and instances of configuration tools indicate that
prior authorization should be granted.
-
Function - Describes what this process can potentially do on
the system. This classification is used by reports that highlight
users running processes that could make changes to the system's PS.
-
Setup: Can be used to install or remove applications.
-
Patch: Updates existing applications.
-
Config: Can be used to modify the configuration of the
system.
-
Diag: Used for retrieving diagnostic information from
the system. Examples are ping and traceroute.
-
Viewer: Displays read-only data, from files or other
sources. Tools like `winver.exe' fall into this category.
-
WriteData: Indicates the process can create data (e.g.,
log files, as the LiveOps agent does) or modify existing data (as
notepad.exe does).
-
``Fun'': Has no business value; used only for
entertainment. Examples are games and DVD playing applications.
-
System: Required for operation of the system. Some
examples are lsass.exe and csrss.exe.
-
Runtime: The process hosts other applications. Examples
are scripting environments such as Perl, or surrogates like
dllhost.exe, svchost.exe, and java.exe.
-
Malware: Malicious or unwanted software that should
never be run.
-
Product - Name of the product.
-
Manufacturer - Details about the Manufacturer.
-
Description - Description of the product describing its key
functionality.
Processes seen for the first time are by default not approved, and
therefore show up in the report for classification by administrators.
Additionally, processes that are known to be used for diagnostics or
implementing configuration changes to the system are alerted on, and
marked as needing approval.
Analyzing a sample report from 126 production servers over a one-month
period, we found that 37 systems (29%) ran an overall total of 18
distinct unauthorized processes. The breakdown of these is: three were
associated with automatic updates to a runtime environment, seven were
desktop applications and data manipulation tools, and eight could not
be identified by administrators or security experts in the
organization. Another 76 systems (60%) ran 17 distinct processes that
required approval because they can be used for diagnosis or
configuration changes.
Report: Change Impact Analysis
Each time a new binary is used on a system, this alert shows the
application that installed the binary file and the user context that
installed it. LiveOps catches binary installations regardless of
whether the installing application participates with RPM, MSI, or
PackageAdd. Figure 6 shows the daily alerts generated by iexplorer.exe
(web browser) which was used twice to download and install
applications. The first time it downloaded and installed
msnsearchtoolbarsetup_en-us. exe (MSN Search Toolbar) and the second
time it downloaded and installed winamp52_full_emusic-7plus. exe
(Winamp Media Player). It shows that Winamp created several binaries
on the system, including emusic-7plus.exe (EMusic), which later
installed even more binaries.
Figure 6: LiveOps detecting a web browser
(iexplorer.exe) downloading and installing two new
applications.
This report is useful to administrators who may not have expected
EMusic to be installed, and might at a later time wonder where, when,
and how the EMusic binaries got on their machine. Figure 7 presents a
`drill in' of the detailed process tree at the time of the
installations. It clearly shows that the web browser launched the
Winamp installation, which in turn launched the EMusic installation.
Figure 7: Drill in of the process tree at the time of
the installation.
Identifying the impact that changing a PS entry has on a system's
running applications is useful for test verification planning, and for
troubleshooting. Knowing the specific PS affected by a change and the
affected applications enables administrators to prioritize testing the
specific features of the applications controlled by the updated PS.
Administrators and support professionals can also significantly reduce
the scope of potential root cause PS by knowing which recently
modified PS is used by the broken application.
LiveOps identifies the dependencies of each process running on a
system by tracking the PS read and modified by each observed process.
This list of dependencies, called a manifest, can then be intersected
with each change on the system to identify the impacted processes [8,
30]. If the contents of patches or applications are known before they
are installed, their impact on existing applications can be predicted
by intersecting their contents with the manifests for each observed
process. The Application Compatibility Toolkit version 5.0 [19]
contains a feature called the Update Impact Analyzer, which uses
portions of the LiveOps technology to provide this functionality,
enabling administrators to prioritize testing and migration of
applications and patches.
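A minimal sketch of this manifest intersection, under the simplifying
assumption that manifests and change sets are plain string sets, might
look as follows:

    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    // Illustrative impact analysis: intersect the PS entries touched
    // by a change (or listed in a pending patch) with each process's
    // manifest of PS dependencies; a non-empty intersection marks the
    // process as impacted.
    std::vector<std::string> ImpactedProcesses(
        const std::set<std::string>& changedPs,
        const std::vector<std::pair<std::string,
                                    std::set<std::string>>>& manifests)
    {
        std::vector<std::string> impacted;
        for (const auto& [processName, dependencies] : manifests) {
            for (const std::string& ps : dependencies) {
                if (changedPs.count(ps)) {    // dependency was modified
                    impacted.push_back(processName);
                    break;
                }
            }
        }
        return impacted;
    }

If the contents of a pending patch are supplied as changedPs, the same
intersection predicts the patch's impact before it is installed.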
Table 2 presents a summary of daily changes affecting the five
categories of processes observed across 34 production servers over a
one-month period. Surprisingly, we see that each day there are several
changes made that impact LOB and OS processes, and that each machine
on average only receives about half (90 of 163) of the distinct global
installations impacting OS applications. This means that the 34
machines are not all receiving the same set of installation changes,
whereas the two install changes affecting the desktop category of
processes are uniformly applied to all systems. Overall, we see a trend
of uniformly applied PS changes to desktop, developer, and home
applications. However, less than half of the LOB, management
application configuration, and OS configuration changes are
consistently applied to all machines. Although we expect systems from
all five MSN sites to run the same management processes and to configure
them identically, these results show that there are in fact differences.
                                      Categories of Distinct Impacted Processes
Average Daily PS Change Type        | Desktop | Developer |    LOB | Management | OS
Install       (Across all Machines) |       2 |       182 |     56 |        168 | 25
Install       (Per Machine)         |       2 |       124 |     23 |         95 |  5
Configuration (Across all Machines) |       0 |         2 |      4 |         27 |  9
Configuration (Per Machine)         |       0 |         2 |      2 |         10 |  1
Content       (Across all Machines) |       0 |         8 | 37,063 |         46 |  9
Content       (Per Machine)         |       0 |         8 | 18,112 |         22 |  6
Management    (Across all Machines) |       0 |         0 |      1 |         59 |  5
Management    (Per Machine)         |       0 |         0 |      1 |         59 |  6

Table 2: Impact analysis showing the average daily global and per
machine changes impacting each of the five categories of processes
observed running across the 34 production servers during a one-month
period.
Report: Sensitive Data
Protecting a company's intellectual property (IP) is of paramount
importance because IP is often a critical asset responsible for its
competitive advantages. Examples are source code, strategic business
plans, and product specifications. Similarly, many customers provide
personally identifiable information (PII) to corporations as a part of
transacting business. Examples include credit card and social security
numbers, contact information, and medical records. Personal
Information Protection Laws enacted in 2003 in Japan [13, 14] make
companies liable for up to 300,000 yen or six months of jail time for
each person affected by a PII leak. This liability, coupled with an
increase in malware and hacking [26], means that corporations need to
ensure data is protected.
Further demonstrating the significance of this problem, the CSI/FBI
report [11] asserts that 59% of corporate security abuse is caused
by employees [15]. All of the security experts we surveyed anonymously on
this point thought the number may in fact be closer to 80-90%. Protecting
IP and PII from internal employees is challenging because many of them
will have legitimate business needs to access the data, and
administrators will have implicit access to the data by virtue of
managing and backing up information on the server. This means network
isolation, firewalls, and
access control lists (ACLs) will not be effective at solving this problem.
LiveOps addresses this issue by providing an audit report on all files
copied to remote network shares and removable devices, for each machine
running its agent. The files being copied are then categorized according to
their content using a set of configurable classification filters. An alert is
then generated for each process instance, containing the user context of the
process, the process name, and a list of the sensitive files being made
remote.
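One plausible shape for such a classification filter, using regular
expressions over file names as a stand-in for the real configurable
content filters, is sketched below:

    #include <regex>
    #include <string>
    #include <vector>

    // Hypothetical content filter: a category name plus a pattern
    // applied to the file's name; the actual filters are configurable
    // and inspect content as well.
    struct ContentFilter {
        std::string category;     // e.g., "SourceCode", "InternalDoc"
        std::regex  pattern;
    };

    // Classify a file observed being copied off the machine.
    std::string ClassifyCopiedFile(
        const std::vector<ContentFilter>& filters,
        const std::string& fileName)
    {
        for (const ContentFilter& f : filters)
            if (std::regex_search(fileName, f.pattern))
                return f.category;
        return "Unclassified";
    }

For example, a filter such as {"SourceCode", std::regex("\\.(c|cpp|cs|h)$")}
would flag source files copied to a network share.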
From examining 383 machine days of logs from 35 datacenter servers and
corporate desktops, we found 36 instances of internal documents, six instances
of source code, and three instances of applications being copied to network
and removable devices. The reports also identified hundreds of corporate IT
tools copying logs off of these machines.
Case Study: Tracking Software Ownership
Intuitively, we may think that managing the PS of a system should be easy
because executable files and configurations are infrequently created or
modified, and then only by well-known software installers. Below, we analyze
our traces and find that installations actually occur quite frequently.
Additionally, while
most software installers provide manifests, listing the PS entries owned by
the installed application to facilitate its clean un-installation, these
manifests are often incomplete or incorrect [12]. In the very next section we
analyze PS entries across 70 machines and find that many entries cannot be
accounted for via the machines' static manifests. To better understand the
origins of these orphaned PS entries, we describe point experiments with
installing and uninstalling three popular applications.
How Often Is Software Installed?
To identify a software installation we examined our PS traces to identify
the creation or modification of files that were later loaded by a process as
an executable file. We found that on average 20% of all machine days had at
least one installation. However this varied significantly across each
environment. 15% of Home and Lab machine days and 30% of desktop machine days
had at least one install. Server environments had a wide variance in the
frequency of software installations, ranging from 7%-80% of machine days
having at least one install. This reflects the variation in change management
policy for each Internet service.
While we might have expected centralized administration of corporate
desktops, or the Windows auto-update service, to cause synchronized updates,
this does not appear to be the case. Overall, we observed that most software
installations, even in the server environment, occur in an unpredictable
manner.
Table 3 describes the distribution of observed installations across
install types. We see that processes which update themselves (Self Update),
predominantly anti-virus applications, account for a significant portion of
installs. Also, enterprise software distribution applications and Windows
auto-update account for the majority of software installs in most
environments. As expected, servers have a large portion of scripted installs
from administrators manually rolling out upgrades and in-house applications.
Installations caused by users running install programs are infrequent. It is
interesting to see that, by tracking the processes that created each binary,
our analysis was able to identify binary files created and used on developer
machines as `installed' by the developer tools.
        | User Setup | Script | Auto Update | Self Update | Developer
Home    |         7% |     7% |         51% |         35% |        0%
Desktop |         5% |     8% |         58% |         22% |        7%
Lab     |         2% |     5% |         61% |         32% |        0%
Server  |         1% |    17% |         38% |         44% |        0%

Table 3: Distribution of installations by program type for all
observed installations.
Static Software Ownership Manifests
Unfortunately, a statically declared manifest is not always complete.
Today's manifests are not always correctly specified, nor do they account for
PS created post-installation, such as user preference settings, log files,
etc. This means that during software upgrades or removal, entries can become
orphaned on the machine. Furthermore, installation or removal of software can
fail or be interrupted, which often leaves registry entries and files in an
inconsistent state. Over time, the orphaned files and registry entries accumulate
on a machine causing a buildup of unused entries that can lead to system
problems.[Note 1] Because of this phenomenon, common
advice is to reinstall your OS and all applications occasionally to return the
system to a known state.
To quantify the significance of this problem, we wrote a tool to extract
PS ownership manifests from within the Windows OS. The tool identifies the
components installed on the system by analyzing the OS installation
configuration files, enumerating the list of programs that have registered
with the Windows `Add/Remove programs' component, the Windows Installer
database (WI), the OS configuration for launching applications when a file of
a given extension is run, and manifests of patch contents as described in the
Microsoft Security Baseline Analyzer
tool [20]. We then enumerate all files and registry entries on a machine and
compare them with these manifests to identify unaccounted-for entries.
Entries that are descendants of entries referenced in the manifests are
treated as implicitly associated with those manifests as well. Finally, we
filter the remaining list to exclude well known user data files by their
extension, and remove entries from well known temporary folders. We define the
remaining subset as leaked entries on the system. Table 4 contains the results
of running this tool across eight desktop, 20 server, and 42 lab machines. It
shows that 31-70% of files and 38-53% of registry entries could not be
accounted for.
                    | Manifest | Implicit | Data  | Temp | Unknown
File     | Desktop  |    18.5% |    21.0% | 20.6% | 8.3% |   31.6%
         | Server   |     4.7% |    52.6% |  2.4% | 3.4% |   36.9%
         | Lab      |    13.2% |     5.8% |  9.5% | 1.4% |   70.1%
Registry | Desktop  |    28.2% |    32.2% |   N/A |  N/A |   39.6%
         | Server   |    10.5% |    36.4% |   N/A |  N/A |   53.1%
         | Lab      |    30.3% |    31.7% |   N/A |  N/A |   38.0%

Table 4: Average file and registry entries that are specified in
manifests, implicitly in manifests, user data, or temp entries. We do
not have heuristics to recognize data and temporary registry entries.
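A simplified sketch of the leak computation performed by the tool
described above, with the heuristics reduced to caller-supplied
predicates, might look as follows:

    #include <set>
    #include <string>
    #include <vector>

    // Illustrative leak detection: everything on the machine that is
    // not explained by a manifest (explicit or implicit descendant),
    // a user data extension, or a temp location is considered leaked.
    std::vector<std::string> LeakedEntries(
        const std::set<std::string>& allEntries,    // full scan
        const std::set<std::string>& manifestOwned, // explicit + implicit
        bool (*isUserData)(const std::string&),     // extension heuristic
        bool (*isTempLocation)(const std::string&)) // temp-folder heuristic
    {
        std::vector<std::string> leaked;
        for (const std::string& entry : allEntries) {
            if (manifestOwned.count(entry)) continue;
            if (isUserData(entry) || isTempLocation(entry)) continue;
            leaked.push_back(entry);                // unaccounted for
        }
        return leaked;
    }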
To further understand the prevalence of software leaks, we measured the
leakage of three popular commercial applications. By running our data
collector while installing each application, using it for a short period, and
then uninstalling it, we were able to measure the net increase in file and
registry entries on the machine.
The first application was the game `Doom3,' which left nine files and 418
registry entries. The second was the common corporate desktop application
suite Microsoft Office, which left no files, but 1490 registry entries.
Additionally, it left 129 registry entries for each user that logged into the
system and used the program while it was installed. The third example was the
enterprise database application Microsoft SQL Server Yukon edition, which
leaked 57 files and six registry entries.
Scenario: Managing Abnormal System Activity
Administrators are often required to reactively manage systems that do not
operate as expected because of:
-
Hardware problems
-
PS problems such as incorrect configuration or mismatched binary versions
-
Programming logic related issues such as memory corruptions and crashes.
The process for debugging hardware and programming logic related
problems is relatively straightforward because there are many tools to aid in
the analysis and the correct behavior is known.
For example, solutions often exist for hardware diagnostics, such as
correctness tests performed in software [41], and often devices like hard
drives
are capable of reporting when they are about to fail [1, 17]. Similarly
many tools exist for software debugging such as code analysis tools that are
run during the software development process [22], specialized software
debuggers [10, 18] and crash analysis software programs for diagnosing
application problems when they happen [3, 39].
Configuration related problems on the other hand, are very difficult to
debug because tools do not typically exist to analyze the 200,000 settings
[36], and 100,000 files [7, 32] that exist on a typical system. Furthermore,
the knowledge of which files and settings the application under investigation
depends on is typically manually learned by operators as they gain experience
with applications. The task of debugging application configuration is further
complicated by the complexity caused by frequent updates to configuration, and
the customization of system environmental variables and settings.
Several strategies have been proposed for minimizing the potential for
configuration problems at the expense of either application functionality or
availability. One approach is to statically link libraries with executables to
reduce the overall number of executables on the system and thereby minimize
the potential for mismatched binary versions. However, doing this makes it
more difficult to patch libraries when security vulnerabilities are discovered
because all applications using the library need to be rebuilt and reinstalled.
To improve the troubleshooting and forensic investigation of problems,
LiveOps provides: reports that describe stale PS, identifying applications
that are using versions of PS older than those that exist on disk; the
ability to generate ad-hoc forensic reports for specific time ranges,
filtered by process, user, and file and setting interactions; and reports of
instances of known problems. The remainder of this section describes these
reports in more detail and summarizes our results from generating them in MSN
datacenters.
Report: Stale Processes
When applications read PS into memory there is the potential for the
persisted version of the PS to be updated by another process, causing it to
use an old `stale' copy of the PS. An example of this is applying a security
patch to a critical file such as the tcpip.sys network driver in Windows,
which updates the file on disk. If the system is not rebooted after the patch
has been applied, the old copy of the file is used in memory and therefore the
system is still vulnerable to the security exploit. Similarly if an
application reads its configuration settings at startup and does not monitor
for changes that happen while it is running, any new changes made while the
application is running will not take effect. During installations and
upgrades, it is expected that some PS will become stale while changes are
made; however, the length of time applications remain stale should be small.
We define any process that has not re-read updated PS within five minutes as
a stale process.
LiveOps has the unique ability to detect when binary, file, or registry PS
read by a process has become stale due to external changes. A report on stale
processes is created by correlating all writes to PS with the PS loaded by
currently running processes. The LiveOps report contains an entry for each
stale process, and enables the administrator to `drill in' to each entry to
examine the individual stale PS, showing the time it was modified and the
process and user context that made the change.
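A minimal sketch of this correlation, using illustrative types and the
five-minute threshold defined above, might look as follows:

    #include <map>
    #include <string>
    #include <vector>

    using Time = long long;               // seconds since epoch, illustrative
    const Time kStaleThreshold = 5 * 60;  // the five-minute window above

    struct LoadedPs {                     // one PS entry held by a process
        std::string psName;
        Time        lastReadTime;         // when the process last read it
    };

    // A process is stale if some PS it loaded was written after the
    // process last read it, and it has not re-read the PS within five
    // minutes of the write.
    bool IsStaleProcess(const std::vector<LoadedPs>& loaded,
                        const std::map<std::string, Time>& lastWriteTimes,
                        Time now)
    {
        for (const LoadedPs& p : loaded) {
            auto it = lastWriteTimes.find(p.psName);
            if (it == lastWriteTimes.end()) continue;  // never modified
            Time written = it->second;
            if (written > p.lastReadTime && now - written > kStaleThreshold)
                return true;                           // stale copy in memory
        }
        return false;
    }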
Figure 8 presents a section of a LiveOps stale report that shows a
specific user running Windows Update (update.exe) on 2/28/2006 on three
production servers. It shows the exact time that the tcpip.sys driver was
modified on each of the machines, and the time when the new version of the
driver was reloaded into memory. We can see that the systems were updated
between 4:35 am and 5:05 am, but only the first one reloaded the driver six
hours later at 10:49 am. This means that the other two machines were still
exposed to the security vulnerability caused by the old tcpip.sys binary more
than 24 hours later when this report was viewed. Without this LiveOps report
the server could have been vulnerable for several days because reboots of
servers happen infrequently.
Figure 8:
LiveOps detected that the updated tcpip.sys binary was not reloaded on the
bottom two machines, therefore they are still vulnerable.
We examined several examples of stale processes found from a sample set of
34 machines over a one-month period. Several OS, Management, and LOB
processes were stale for more than 20 hours following software binary updates
and management configuration changes. In cases where administrators are
troubleshooting problems with these applications, knowing that they are stale
could significantly reduce troubleshooting time, because examination of the
newly changed state could mislead administrators to believe that the new
values were being used.
The observed stale state included:
-
Cached domain server information which could affect the performance of
authentication operations
-
Management configuration which could affect the availability of monitoring
information or the ability of administrators to connect and diagnose the
system
-
Updated environment variables that could affect LOB applications that required
specific path settings to locate needed binaries and data files
-
Web site content and configuration, which may be expected to have an immediate
effect once modified on the system, or could break other websites due to
partially available new content
Query Interface: Forensics and Troubleshooting
The forensic query interface is useful when administrators want to know
who, when, or how changes are being made [23, 37, 38]. For example, if malware
is detected on the system by anti-malware tools, the forensics interface can
be used to identify when it arrived and which user context and process was
used to create the malicious files to identify the root cause of the issue and
prevent future occurrences.
LiveOps has successfully identified performance problems in several
applications that were unnecessarily reading registry entries hundreds of
times per second. One instance was found in an LOB application running on a
web server where the registry entry
\HKLM\SOFTWARE\Microsoft\Cryptography\Defaults\Provider\Microsoft Strong Cryptographic Provider,
and all of its values, were read 240 times per second. Another case was found
in a commercial
management product agent that was deployed on a production server, where it
was continuously reading all registry entries and values under the
\HKLM\System\CurrentControlSet\Services\ key to detect
changes in the parameters of daemons installed on the system. The developers
of both products were notified and told that they could make their
applications vastly more efficient by using the RegNotifyChangeKeyValue
function to register for registry changes rather than polling for them.
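For reference, the notification-based pattern suggested to these
vendors looks roughly as follows; this sketch uses the documented
Win32 registry APIs, with the watched key chosen to match the second
example above:

    #include <windows.h>
    #include <stdio.h>

    // Block on a registry change event instead of polling the key
    // hundreds of times per second.
    int main()
    {
        HKEY hKey;
        if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                          "SYSTEM\\CurrentControlSet\\Services",
                          0, KEY_NOTIFY, &hKey) != ERROR_SUCCESS)
            return 1;

        HANDLE hEvent = CreateEventA(NULL, FALSE, FALSE, NULL);
        for (;;) {
            // Request a signal on name or value changes in the subtree.
            if (RegNotifyChangeKeyValue(hKey, TRUE,
                    REG_NOTIFY_CHANGE_NAME | REG_NOTIFY_CHANGE_LAST_SET,
                    hEvent, TRUE) != ERROR_SUCCESS)
                break;
            WaitForSingleObject(hEvent, INFINITE); // sleep until a change
            printf("Services key changed; re-read daemon parameters.\n");
        }
        CloseHandle(hEvent);
        RegCloseKey(hKey);
        return 0;
    }

The loop re-arms the notification after each wake-up, which is
required because a single RegNotifyChangeKeyValue call reports at most
one change.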
LiveOps can also be used in an ad-hoc manner when troubleshooting to
identify if access violations reading files or registry entries are causing
applications to fail, or even for investigating intrusions from hackers where
the investigator wants to know which processes were used during the attack,
and potentially which sensitive data files were compromised.
Report: Detecting Known Problems
The root cause of configuration problems identified by administrators,
support professionals, or technically adept end-users can be used to define
assertions about the correctness of configuration on any system. The LiveOps
known problem report is generated by applying codified configuration
assertions, called rules, to the registry entry names and values collected
from each managed system. The report contains a row for each machine and
configuration entry that matches a rule. Administrators can review the rule
description that indicates why the entry is suspicious, and view the data
contained in the problematic entry.
While several registry scanning products exist, LiveOps is different in
that it does not need to run on each managed system to scan the entries
because it can instead scan the logs that are centrally uploaded. LiveOps
results are also more relevant because the logs contain the registry entries
and values that are currently being used by the applications so there are less
false positives from entries that are not actually used. Also, user perceived
application problems can be correlated with PS usage because the logs contain
the times an application used the problematic registry entry. The logs will
also contain the exact time, process, and user context that changed the
registry entry to the problematic value.
The LiveOps known problem rule set contains approximately 100 assertions
created from reviewing the Strider troubleshooter cases [36], PC fragility
cases [9], and from critical registry entries identified by MSN administrators
such as the setting that controls the operating system paging file. This
system can be further enhanced to take advantage of PeerPressure [35]
comparisons to identify problematic configurations based on the configuration
of similar machines.
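One plausible shape for such a codified assertion, shown here for the
paging-file case with illustrative structure names, might look as
follows:

    #include <optional>
    #include <string>

    // Hypothetical shape of a known-problem rule; the real rule set
    // holds roughly 100 assertions applied to the collected logs.
    struct RegistryEntry {
        std::string name;
        std::optional<std::string> value;  // empty if value absent/null
    };

    struct KnownProblemRule {
        std::string nameSubstring;   // which entries the rule covers
        std::string description;     // shown to the administrator
        bool (*isProblem)(const RegistryEntry&);
    };

    // Assertion mirroring the paging-file issue discussed below: a
    // null page file setting crashes the machine when memory is
    // exhausted.
    bool NullPageFile(const RegistryEntry& e)
    {
        return !e.value || e.value->empty();
    }

    const KnownProblemRule kPageFileRule = {
        "Control\\Session Manager\\Memory Management\\PagingFiles",
        "Page file not configured; system will crash under memory pressure.",
        &NullPageFile
    };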
Reviewing the known problem report for a sample one-month period from 350
desktop, home, and server machines revealed several issues. Most interesting
were the 35 servers that had their page file setting configured to null, which
caused the system to crash when physical memory was exhausted. Reviewing the
problem incidents from the time when these changes were made revealed that
several of these servers had in fact crashed as a result. The reports
several of these servers had in fact crashed from this incident. The reports
showed that it was the svchost.exe process configured to support remote
registry calls that was used to make the problematic changes, probably from an
incorrectly written management script.
Another interesting problem was found on a laptop where the rule for the
GSM audio codec configuration detected a missing value. This codec is part of
the default Windows installation for playing .wav files. Attempting to play a
.wav file on the machine confirmed that it was an actual problem. Resetting
the value resolved the issue.
Several of the remaining detected problems related to rules covering user
preference changes that end users may make unintentionally and not know how to
undo. For these entries, LiveOps marks the alerts as warnings rather than
errors; if a user of a managed system complains about an application problem,
these entries help quickly resolve related issues.
Scenario: Enforcing Best Practices
Datacenter managers typically define best practices for managing systems
to reduce the potential for security vulnerabilities, and to minimize the
potential for introducing reliability and availability issues. However, it is
a never-ending challenge to audit whether these practices are being followed.
As an organization grows the number of its datacenters around the world, it
becomes more difficult to educate administrators on the best practices. Growth
also means that more administrators will be interacting with the systems,
increasing the potential for best-practice violations. With more people
interacting with the systems, it is hard to identify which administrators may
have inadvertently introduced a problem, which in turn makes it difficult to
re-educate them on the best practices and prevent problems from recurring.
LiveOps reports have been created to analyze best practices for minimizing
exposure to potential security vulnerabilities and for minimizing potential
system instabilities. The initial best-practice reports are the Logins report,
which details the users that have been logging in to each server, and the
Non-System Daemons report, which lists daemons running with local or domain
user credentials.
Report: Logins
Avoiding unnecessary logins to production servers reduces the potential
for hackers to obtain credentials that can be used to connect to other systems
in a network. Logging in to a server creates a primary user context, which can
be used to connect to remote systems one hop away. If tools are instead run
remotely against the system, a secondary security context is used, which is
valid only on the remote machine. Furthermore, a best practice for avoiding
problems is to minimize the number of changes and interactions on managed
systems.
A sample of LiveOps login reports for 34 production systems over a one-month
period shows several examples of logins. Most notably, screen savers were
found running 28 times across nine systems, detected as instances of the
scrnsave.scr process. This implies that primary credentials were left on the
systems for extended periods of time. The report also showed that each system
was logged into at least twice, with one system having 11 logins.
Report: Non-System Daemons
Windows systems joined to an Active Directory domain have machine accounts
associated with them, which can be used as the user context for running
daemons. Machine accounts have the same properties as user accounts and can be
added to access control lists (ACLs) if needed. It is best practice to run all
daemons using the machine account [6], because this account typically has no
permissions on remote machines and does not require storing user credentials
(login names and passwords) locally to run the daemons. Locally stored
credentials can potentially be obtained by hackers who infiltrate the system
and then be used to connect to other systems.
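Outside of LiveOps, a rough stand-alone spot check of the same property can be
made by parsing service configuration from the standard sc.exe tool; in this
minimal sketch, the set of accepted built-in accounts is an assumption:

    # Minimal sketch: flag Windows services (daemons) configured to run
    # under an ordinary user account rather than a built-in machine
    # context. LiveOps derives this from its logs; this approximation
    # parses `sc qc` output instead.
    import re
    import subprocess

    BUILTIN_ACCOUNTS = {"localsystem",
                        r"nt authority\localservice",
                        r"nt authority\networkservice"}

    def service_names():
        out = subprocess.run(["sc", "query", "type=", "service", "state=", "all"],
                             capture_output=True, text=True).stdout
        return re.findall(r"SERVICE_NAME:\s*(\S+)", out)

    def non_system_daemons():
        for name in service_names():
            qc = subprocess.run(["sc", "qc", name],
                                capture_output=True, text=True).stdout
            m = re.search(r"SERVICE_START_NAME\s*:\s*(.+)", qc)
            if m and m.group(1).strip().lower() not in BUILTIN_ACCOUNTS:
                yield name, m.group(1).strip()

    for name, account in non_system_daemons():
        print(f"{name} runs as {account}")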
The LiveOps non-system daemon report for 34 machines over a one-month
period identified several daemons that were not running with the machine
account security context. Six distinct processes were identified across all
systems: five were management product agents, and one was a line-of-business
(LOB) process. Each of the 34 systems had at least one daemon in violation,
and a few machines had up to four.
Feasibility of Labeling All Changes
The ability to understand all changes made across large numbers of systems
is possible only if the rate of new PS is small enough for humans to
reasonably comprehend. To evaluate this, we determined the first time LiveOps
observed each process name and PS entry across all managed systems. Using this
information we determined:
- The steady-state daily rate of new items observed over an extended period
- The learning period, which is the number of days taken to observe the
majority of items
The learning period determines how long LiveOps must run before it will
reliably generate reports without false positives caused by observing PS for
the first time. The steady-state rate determines how many new PS items must be
classified on an ongoing basis for LiveOps to accurately label and understand
all observable PS. If labeling is not periodically maintained, LiveOps will
still generate all reports correctly; however, over time some reports may
contain unknown entries.
To help prevent this, LiveOps provides contextual information that makes it
easy to label new PS, and can integrate with other sources of descriptive PS
information as described earlier. This includes contextual information about
the processes using the PS, integration with existing software libraries [24]
and work order systems, as well as correlation with previously labeled LiveOps
PS. Additionally, automatic inheritance-based labeling strategies can be
employed, such as labeling binaries used by only one process with that
process's label.
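For example, a minimal sketch of that inheritance rule follows; the uses pairs
and process labels are assumed inputs derived from LiveOps logs:

    # Minimal sketch of inheritance-based labeling: a binary loaded by
    # exactly one labeled process inherits that process's label.
    from collections import defaultdict

    def inherit_labels(uses, process_labels):
        """uses: iterable of (process, binary); returns binary -> label."""
        loaders = defaultdict(set)
        for process, binary in uses:
            loaders[binary].add(process)
        return {binary: process_labels[next(iter(procs))]
                for binary, procs in loaders.items()
                if len(procs) == 1 and next(iter(procs)) in process_labels}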
Figure 9 shows the distribution of newly seen processes from all monitored
systems over the 39 days since the start of LiveOps data collection. We can
see that the learning period is one day, at which time 1000 distinct processes
had been observed. The steady-state rate for new processes is typically zero,
unless software updates or installations occur. During software updates, many
short-lived processes are created from files with random temporary names that
do the installation work. There are two solutions for automatically labeling
these processes (both are sketched in the code following the list):
- Using the checksum and size of the process .exe effectively canonicalizes
the names of the temporary files into distinct entries across all machines
- Substring rules can be created to eliminate process names created within
known temporary folders.
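Both strategies can be combined in a small canonicalization step; in this
sketch the temp-path substrings are illustrative, and pe_checksum is the
executable-header checksum discussed with Figure 10 below:

    # Minimal sketch combining both strategies: identify a process by the
    # (checksum, size) of its .exe when its name lies under a known temp
    # folder, otherwise by name. The substring rules are illustrative.
    import os

    TEMP_SUBSTRINGS = ("\\temp\\", "\\tmp\\", "\\windows\\installer\\")

    def canonical_process_id(exe_path, pe_checksum):
        name = exe_path.lower()
        if any(s in name for s in TEMP_SUBSTRINGS):
            # Random temporary name: identify the process by content.
            return ("pe", pe_checksum, os.path.getsize(exe_path))
        return ("name", os.path.basename(name))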
Figure 9: Growth of new processes since the start of LiveOps data collection.
Figure 10 presents the distribution of newly seen binaries, that is, any
.dll, .exe, or other executable file loaded by any of the monitored machines.
Distinct entries are determined using the checksum stored in the executable
header and the size of the binary file; these are the same properties kernel
debuggers use to identify the correct version of symbols to load when
debugging a process. We can see that the learning period lasts one day, during
which 1400 binaries are first observed; the system then reaches a steady state
in which new binaries are observed only when new installations or patches are
applied.
Figure 10: Growth of new binaries since the start of LiveOps data collection.
The growth of new files and new registry entries since the start of LiveOps
data collection is shown in Figures 11 and 12, respectively. For files and
registry entries, each day there are a large number of newly generated
temporary and cached files with random names. These entries are filtered out
of the daily counts using simple substring rules that look for known temporary
paths in the names. We can see from both graphs that the number of newly
observed entries per day is O(10^1) across all systems, unless new
installations or configurations are made, such as those on the 14th and 23rd
days. This is a reasonable amount for humans to evaluate manually.
Figure 11: Growth of new files since the start of LiveOps data collection,
filtering temp entries.
Figure 12: Growth of new registry entries since the start of LiveOps data
collection, filtering temp entries.
Conclusion
We built LiveOps, a scalable systems management service with comprehensive,
low-overhead agents, which identifies security vulnerabilities, performs
compliance auditing, enables forensic investigations, detects patching
problems, optimizes troubleshooting, and detects malware and intrusions. The
service provides a platform for knowledge sharing across organizations and
administrative boundaries, and allows for seamless integration between
analysis results from the disparate management products that build on it.
Our analysis of reports from deploying LiveOps in MSN demonstrates how it
fulfills a critical need in the configuration management cycle by detecting
all system changes, enabling verification of approved requests, and alerting
on unknown modifications. Closing this loop is becoming increasingly important
for all datacenters, both for identifying unwanted changes and for fulfilling
auditing requirements. These results illustrate the benefits and feasibility
of managing configuration from the OS-level perspective of interactions
between running processes and PS.
LiveOps empowers administrators with new visibility into the relationships
between applications and the PS they use, enabling them to manage their
systems, and the applications that run on them, more knowledgeably,
efficiently, and accurately.
Author Biographies
Chad Verbowski is an Architect at Microsoft Research. His early academic
research on network management translated into a job designing network and
systems management infrastructure for MFS Datanet. After surviving the
WorldCom takeover, Chad worked at Cisco before joining a management-focused
software startup as employee number 5. He eventually arrived at Microsoft,
where he worked on headless support in Windows 2000, then ran the core
development team for the first release of Microsoft Operations Manager before
finding his niche at Microsoft Research. At MSR, Chad cofounded the
Cybersecurity and Systems Management research group, where he focuses on his
area of interest: reducing complexity in software.
Juhan Lee is an MSN Architect focused on improving the reliability,
scalability, and operation of MSN's Internet service platforms by bringing
next-generation Microsoft products and research technology into MSN
datacenters. Juhan joined Microsoft in February 1996 to lead Windows 2000
Datacenter development, core components of Systems Management Server (SMS),
and Microsoft Operations Manager 2000. Prior to Microsoft, Juhan worked on
IBM's CICS/DB2 and OS/2 product lines in Research Triangle Park, and directed
distributed middleware and Lotus workflow engine R&D at UNUM Corporation that
was ultimately licensed by large software corporations. Juhan enjoys all
things electronic and works on home electronics in his free time. He studied
Electrical Engineering at North Carolina State University, and interned at
Nortel and IBM as both a power engineer and a software developer before
choosing a software career by joining IBM in 1988.
Xiaogang Liu is a Software Development Engineer who joined Microsoft in
2005 and has been working on automatic diagnosis and discovery of system
anomalies since then. Before that, he was the development lead of an automatic
computer test and configuration project, which was deployed to two of the top
three PC vendors in China and vastly increased their output. His first project
in the software industry was a Chinese-English dictionary on the Windows
platform, where he was responsible initially for word/phrase lookup under the
mouse pointer and later for the main UI. He likes reading books and playing
badminton in his spare time.
Yi-Min Wang is a Principal Researcher at Microsoft Research, Redmond,
where he manages the Cybersecurity and Systems Management Group and leads the
Strider project. Yi-Min received his B.S. degree from National Taiwan
University in 1986. He received his Ph.D. in Electrical and Computer
Engineering from University of Illinois at Urbana-Champaign in 1993, worked at
AT&T Bell Labs from 1993 to 1997, and joined Microsoft in 1998. His research
interests include security, systems management, dependability, home
networking, and distributed systems.
Roussi Roussev is finishing his Ph.D. at the Florida Institute of
Technology. His research interests include distributed systems, security,
dependability, and program verification.
Bibliography
[1] Allen, B., ``Monitoring Hard Disks with SMART,'' LINUX Journal,
2004, https://www.linuxjournal.com/article/6983.
[2] Arbaugh, W., et al., ``Windows of Vulnerability: A Case Study
Analysis,'' IEEE Computer, Vol. 33, Num. 12.
[3] Brodie, M., et al., ``Quickly Finding Known Software Problems via
Automated Symptom Matching,'' ICAC, Seattle, WA, 2005.
[4] Brown, A., et al., ``Accepting Failure: A Case for Recovery-Oriented
Computing (ROC),'' HPTPS, Asilomar, CA, 2001.
[5] Burgess, M., et al., ``Cfengine: a site configuration engine,''
USENIX Computing Systems, Vol. 8, Num. 3, 1995.
[6] Chen, S., et al., ``A Black-Box Tracing Technique to Identify Causes
of Least-Privilege Incompatibilities,'' NDSS, San Diego, CA, 2005.
[7] Douceur, J., et al., ``A Large-Scale Study of File-System Contents,''
SIGMETRICS, Atlanta, GA, 1999.
[8] Dunagan, J., et al., ``Towards a Self-Managing Software Patching
Process Using Black-box Persistent-state Manifests,'' ICAC, New York,
NY, 2004.
[9] Ganapathi, A., et al., ``Why PCs Are Fragile and What We Can Do About
It: A Study of Windows Registry Problems,'' ICAC, Florence, Italy,
2004.
[10] The GNU Project Debugger,
https://www.gnu.org/software/gdb/gdb.html.
[11] Gordon, L., et al., CSI/FBI Computer Crime and Security
Survey, 2004, https://www.gocsi.com.
[12] Hart, J. and J. D'Amelia, ``An Analysis of RPM Validation Drift,''
Proceedings of the 16th USENIX Conference on Systems Administration,
Berkeley, CA, 2002.
[13] Japan's Personal Information Protection Act, 2003 Law Num.
57, Japan, 2003.
[14] Japan's Personal Information Protection Act,
https://www.privacyinternational.org/survey/phr2003/countries/japan.htm.
[15] Jaques, R., ``Internal Hackers Pose The Greatest Threat,'' 23 Jun
2005, https://www.vnunet.com.
[16] Kim, G. H., ``The Design and Implementation of Tripwire: A File
System Integrity Checker,'' ACM Conference on Computer and Communications
Security, Fairfax, VA, 1994.
[17] McLean, P., Information Technology - AT Attachment-3 Interface
(ATA-3), X3T13 2008D Revision 7b, 1997.
[18] Microsoft Debugging Tools,
https://www.microsoft.com/whdc/devtools/debugging.
[19] Microsoft Application Compatibility, Toolkit,
https://www.microsoft.com/technet/prodtechnol/windows/appcompatibility.
[20] Microsoft Baseline Security Analyzer,
https://www.microsoft.com/technet/security/tools/mbsahome.mspx.
[21] Microsoft Management Products,
https://www.microsoft.com/management.
[22] Microsoft Program Analysis Projects,
https://www.microsoft.com/windows/cse/pa.
[23] Moshchuk, A., et al., ``Crawler-based Study of Spyware in the Web,''
NDSS, San Diego, CA, 2006.
[24] National Software Reference Library, https://www.nsrl.nist.gov.
[25] Oppenheimer, D., et al., ``Why do Internet services fail, and what
can be done about it?'' USITS, Seattle, WA, 2003.
[26] Oswald, E., ``Study: Adware Increasing Exponentially,''
BetaNews, September 11, 2006.
[27] Payment Card Industry (PCI) Data Security Standard, v1.1, Sep., 2006,
https://www.pcisecuritystandards.org/pdfs/pci_dss_v1-1.pdf.
[28] Rescorla, E., ``Security Holes . . . Who Cares?'' USENIX
Security, Washington, DC, 2003.
[29] Sarbanes-Oxley Act of 2002,
https://fl1.findlaw.com/news.findlaw.com/hdocs/docs/gwbush/sarbanesoxley072302.pdf.
[30] Sun, Y., et al., ``Global analysis of dynamic library dependencies,''
LISA, San Diego, CA, 2001.
[31] Syslog, https://en.wikipedia.org/wiki/Syslog.
[32] Verbowski, C., et al., ``Analyzing Persistent State Interactions to
Improve State Management,'' SIGMETRICS, Saint Malo, France, 2006.
[33] Verbowski, C., et al., ``Flight Data Recorder: Monitoring
Persistent-State Interactions to Improve Systems Management,'' OSDI,
Seattle, WA, 2006.
[34] Wang, H., et al., ``Shield: Vulnerability-Driven Network Filters for
Preventing Known Vulnerability Exploits,'' ACM SIGCOMM, Portland, OR,
2004.
[35] Wang, H., et al., ``Automatic Misconfiguration Troubleshooting with
PeerPressure,'' OSDI, San Francisco, CA, 2004.
[36] Wang, Y.-M., et al., ``STRIDER: A Black-box, State-based Approach to
Change and Configuration Management and Support,'' LISA, San Diego, CA,
2003.
[37] Wang, Y.-M., et al., ``Gatekeeper: Monitoring Auto-Start
Extensibility Points (ASEPs) for Spyware Management,'' LISA, Atlanta,
GA, 2004.
[38] Wang, Y.-M., et al., ``Automated Web Patrol with Strider
HoneyMonkeys: Finding Web Sites that Exploit Browser Vulnerabilities,''
NDSS, San Diego, CA, 2006.
[39] Windows Error Reporting,
https://www.microsoft.com/whdc/maintain/WERHelp.mspx.
[40] Windows Event Log, https://windowssdk.msdn.microsoft.com/en-us/library/ms732118.aspx.
[41] Windows Memory Diagnostic,
https://oca.microsoft.com/en/windiag.asp.
Footnotes:
Note 1: Examples can be found at https://support.microsoft.com/ via the
article IDs: 898582, 816598, 239291, 810932, 181008.