The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 28-31, 1996.

Using Latency to Evaluate Interactive System Performance

Yasuhiro Endo, Zheng Wang, J. Bradley Chen, and Margo Seltzer
Harvard University, Division of Engineering and Applied Sciences
{yaz, zhwang, bchen, margo}@eecs.harvard.edu

Abstract

The conventional methodology for system performance measurement, which relies primarily on throughput-sensitive benchmarks and throughput metrics, has major limitations when analyzing the behavior and performance of interactive workloads. The increasingly interactive character of personal computing demands new ways of measuring and analyzing system performance. In this paper, we present a combination of measurement techniques and benchmark methodologies that address these problems. We introduce several simple methods for making direct and precise measurements of event handling latency in the context of a realistic interactive application. We analyze how results from such measurements can be used to understand the detailed behavior of latency-critical events. We demonstrate our techniques in an analysis of the performance of two releases of Windows NT and Windows 95. Our experience indicates that latency can be measured for a class of interactive workloads, providing a substantial improvement in the accuracy and detail of performance information over measurements based strictly on throughput.

1 Introduction

Benchmarks are used in computer systems research to analyze design alternatives, identify performance problems, and motivate improvements in system design. Equally important, consumers use benchmarks to evaluate and compare computer systems. Current benchmarks typically report throughput, bandwidth, or end-to-end latency metrics. Though often successful in rating the throughput of transaction processing systems and/or the performance of a system for scientific computation, these benchmarks do not give a direct indication of performance that is relevant for interactive applications such as those that dominate modern desktop computing. The most important performance criterion for interactive applications is responsiveness, which determines the performance perceived by the user.

In this paper, we propose a set of new techniques for performance measurement in which latency is measured in the context of a workload that is realistic, both in terms of the application used and the rate at which user-initiated events are generated. We present low-overhead methods that require minimal modifications to the system for measuring latency for a broad class of interactive events. We use a collection of simple benchmark examples to characterize our measurement methodology.
Finally, we demonstrate the utility of our metrics by applying them in a comparison of Microsoft Win- dows NT versions 3.51 and 4.0 and Windows 95, using realistic interactive input to off-the-shelf applications. The remainder of this section provides background on the problem of measuring latency, including the motiva- tion for our new methodology based on an analysis of the current practice in performance measurement. Sec- tion 2 describes our methodology in detail. In Section 3, we discuss some of the issues in evaluating response time in terms of a user's experience. In Sections 4 and 5, we apply our methodology in a comparison of Windows NT versions 3.51 and 4.0 and Windows 95. Sections 6 and 7 discuss the limitations of our work and conclude. 1.1 The Irrelevance of Throughput Most macrobenchmarks designed for interactive sys- tems use throughput as the performance metric, measur- ing the time that the system takes to complete a sequence of user requests. A key feature of throughput as a performance metric is that it can be measured eas- ily, given an accurate timer and a computation that will do a fixed amount of work. Throughput metrics measure system performance for repetitive, synchronous sequences of requests. However, the results of these benchmarks do not correlate directly with user-per- ceived performance-a critical metric when evaluating interactive system performance. The performance of many modern applications depends on the speed at which the system can respond to an asynchronous stream of independent and diverse events that result from interactive user input or network packet arrival; we call this event handling latency. Throughput metrics are ill-equipped to characterize systems in such ways. More specifically, throughput benchmarks fail to provide enough information for evaluating interactive system performance and make inappropriate assumptions for measuring interactive systems. Information Lost The results of throughput benchmarks are often reduced to a single number that indicates how long a system took to complete a sequence of events. Although this can pro- vide information about the sum of the latencies for a sequence of events, it does not provide information about the variance in response time, which is an impor- tant factor in determining perceived interactive perfor- mance. The insufficient detail provided by throughput bench- marks can also mislead designers trying to identify the bottlenecks of a system. Since throughput benchmarks provide only end-to-end measures of activity, system activity generated by low-latency events cannot be dis- tinguished from that generated by longer-latency events, which have a much greater impact on user-perceived performance. Worse, if such a benchmark includes suffi- ciently many short-latency events, these short events can contribute significantly to elapsed time, leading design- ers to optimize parts of the system that have little or no impact on user-perceived performance. In an effort to compare favorably against other systems in throughput benchmarks, designers may even undertake such optimi- zations knowingly. In this case, bad benchmarking methodology hurts both system designers and end users. In addition, user interfaces tend to use features such as blinking cursors and interactive spelling checkers that have (or are intended to have) negligible impact on per- ceived interactive performance, yet may be responsible for a significant amount of the computation in the over- all activity of an application. 
Throughput measures provide no way to distinguish between these features and events that are less frequent but have a significant impact on user-perceived performance.

Inaccurate User Assumptions

Throughput benchmarks often drive the system by feeding user input as rapidly as the system can accept it, equivalent to modeling an infinitely fast user. Such an input stream is unrealistic and susceptible to generating misleading results. One of the sources for such errors is batching. Client-server systems such as Windows NT and the X Window System batch multiple client requests into a single message before sending them to the server. This reduces communication overhead and allows the server to apply optimizations to the request stream, such as removing operations that are overridden by later requests. Although batching improves throughput, it can have a negative effect on the responsiveness of the system. When a benchmark uses an uninterrupted stream of requests, the system batches requests more aggressively to improve throughput. Measurement results obtained while the system is operating in this mode are meaningless; users will never be able to generate such an input stream and achieve a similar level of batching in actual use. Disabling batching altogether is sometimes possible but does not fully address the problem. An ideal test input should permit a level of batching that is likely to occur in response to real user input.

Overall, throughput measures provide an indirect rather than a direct measure of latency, and as such they can give a distorted view of interactive performance. An ideal benchmarking methodology will drive the system in the same way that real users do and give designers a correct indication as to which parts of the system are responsible for delays or user-perceptible latency. Obtaining such figures requires that we drive the system using an input stream that closely resembles one that an interactive user might generate and, more importantly, an ability to measure the latency of individual events.

1.2 Related Work

Most of the benchmarks currently used for computer performance measurement can be categorized as either microbenchmarks or batch benchmarks. Microbenchmarks are commonly used to measure the performance of relatively simple, specific events [7,10]. They can be used to measure latency (e.g., 7.7 µs per IPC) or throughput (e.g., 129,000 local IPCs per second). Some examples of current popular microbenchmark suites are Winbench [14], the Bytemarks [3], and lmbench [7]. The shortcomings of microbenchmarks for general performance analysis are well documented [2]. The fundamental problem is that the computation they perform is not useful in and of itself and, as such, they fail to accurately reflect system behavior for any realistic situation.

Batch benchmarks address the shortcomings of microbenchmarks by measuring the time to run a complete, non-trivial computation from start to finish. Batch benchmarks are the dominant way of measuring system performance [1,13,14]. Portable benchmarks such as the SPEC95 suite are based on batch computations with non-interactive input. The portability requirement means that it must be possible to adapt these benchmarks to use new system APIs or to accommodate the idiosyncrasies of a given API implementation. To meet these requirements, the SPEC benchmarks are distributed in source form.
The central problem with bench- marks such as the SPEC suites is that, although they may represent a realistic batch load, they fail to model the behavior of an interactive user [8]. Benchmarks such as Winstone, BAPCo SYSmark NT and BAPCo SYSmark 32 sacrifice portability in order to use popular interactive applications. Winstone is spe- cific to PC-compatible hardware running Windows 95 or Windows NT. SYSmark NT provides somewhat more hardware portability than the Winstone suite, providing native executables for non-x86 systems that run Win- dows NT. Although these benchmark suites use interac- tive applications and simulated interactive input, they report performance in terms of throughput metrics and suffer from the problems presented in Section 1.1. Non- trivial batch computations are sometimes included in these workloads when they are consistent with realistic usage of the application. Examples are circuit board lay- out in MAXEDA (BAPCo SYSmark NT) and database queries in Microsoft Access (Winstone 95). Although these benchmarks use realistic applications, the input streams used to drive them model an infinitely fast user. BAPCo SYSmark for File Servers, Ziff-Davis Net- bench, SPEC SFS (LADDIS), and the TPC benchmarks measure performance of a server under client load. Both Netbench and SPEC SFS measure file server perfor- mance, and compute their results using the latency observed by clients. The intent of TPC-C is to measure on-line transaction processing performance. It reports performance in terms of complete transactions pro- cessed per minute. TPC-D is designed to measure deci- sion-support database performance. TPC-D specifies one metric (QppD) that measures "raw performance" or latency, and another metric (QthD) for throughput. Several of these benchmarks use client-observed latency in reporting performance, but only for client requests to a remote server. Our goal is to measure latency for a more general class of events. A prior study compared Windows NT, Windows for Workgroups 3.11, and NetBSD to explore how differ- ences in system structure affect overall performance [4]. The study used microbenchmarks to measure the latency of simple events and application workloads. Although some of the application workloads were based on inter- active applications, the applications were driven by an uninterrupted input stream, and the results were reported in terms of throughput. There was no attempt to measure interactive performance directly. Because of the require- ment that the workloads run on a Unix system, the study included no popular PC applications. Two prior studies influenced our work. Shand used a free-running counter in a tight loop to measure the latency of interrupts [11]. This methodology is similar to ours in that it measures the computation of interest by detecting lost time. We generalize the prior work by looking beyond interrupt handling to a broader range of events types, by avoiding the need for special-purpose hardware, and by looking at interactive events rather than the latency of interrupt handlers. Second, our tech- niques for visualizing latency were influenced by the work of O`Toole et al. [9] on reducing the pause times for garbage collection. 2 Methodology Our methodology must provide the ability to measure the latency of individual events that occur while execut- ing realistic interactive workloads. 
This poses the fol- lowing set of new challenges: · Interactive events are short in duration relative to the timer resolution provided by clock APIs in modern operating systems such as Windows and UNIX. Whereas a batch workload might run for millions of timer ticks, many interactive events last less than a single timer interval. · Under realistic load, there will often be only a fraction of a second between interactive events in which to record results and prepare for the next measurement. Therefore the measurement scheme must have quick turnaround time. · Perhaps the most challenging problem is collecting the requisite data without access to the source code of the applications or operating system. With source code, it is straightforward to instrument an application to generate timestamps at the beginning and ending points of every interactive event, but this is time consuming at best and not possible given our goal of measuring widely-available commercial software. Analyzing interactive applications is just as challenging as measuring them. The time during which an applica- tion is running can be divided into think time and wait time. Think time is the time during which the user is nei- ther making requests of the system nor waiting for the system to do something. Wait time is the time during which the system is responding to a request for which the user is waiting. Not all wait time is equivalent with respect to the user; wait time intervals shorter than a user's perception are irrelevant. We call these classes of wait time "unnoticeable." A good example of unnotice- able wait time is the time required to service a keystroke when a user is entering text. Although the system may require a few tens of milliseconds to respond to each keystroke, such small "waits" will be unnoticeable, as even the best typists require approximately 120 ms per keystroke [12]. Distinguishing between wait time and think time is non-trivial, and the quantity and distribu- tion of wait time is what the user perceives as an appli- cation's responsiveness. Our measurement methodology must help us recognize the wait time that is likely to irri- tate users. In the following sections, we describe the combination of tools and techniques that we use to measure and iden- tify event latency. 2.1 Experimental Systems We ran our experiments on a personal computer based on an Intel Premiere II motherboard, with the Intel Nep- tune chip set and a 100 MHz Pentium processor. Our machine was equipped with a 256KB asynchronous SRAM Level 2 cache, 32 MB of RAM, and a Diamond Stealth 64 DRAM display card. We used a dedicated 1GB Fujitsu disk (model M1606SAU) for each of the operating systems we tested. These disks were con- nected via a NCR825-based SCSI II host adapter. Both Windows NT systems used a NTFS file system, while the Windows 95 system used a FAT file system. The sig- nificant differences between Windows NT 3.51 and NT 4.0 are a change in GUI (NT 3.51 uses the traditional Windows GUI, while NT 4.0 uses a GUI similar to that of Windows 95) and the movement of some Win32 com- ponents into the kernel. 2.2 The Pentium Counters The Intel Pentium processor has several built-in hard- ware counters, including one 64-bit cycle counter and two 40-bit configurable event counters [5]. The counters can be configured to count any one of a number of dif- ferent hardware events (e.g., TLB misses, interrupts, or segment register loads). The Pentium counters make it possible to obtain accurate counts of a broad range of processor events. 
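For concreteness, the sketch below shows one way the cycle counter can be read, using the RDTSC instruction and GCC-style inline assembly; it is our own illustration rather than code from the measurement tools described in this paper. On the 100 MHz Pentium used here, one cycle corresponds to 10 ns.

    #include <stdint.h>
    #include <stdio.h>

    /* Read the Pentium time-stamp counter.  RDTSC places a 64-bit
     * cycle count in the EDX:EAX register pair. */
    static inline uint64_t read_cycle_counter(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t start = read_cycle_counter();
        /* ... operation being timed ... */
        uint64_t cycles = read_cycle_counter() - start;
        /* At 100 MHz, 100 cycles equal one microsecond. */
        printf("%llu cycles (%.2f us)\n",
               (unsigned long long)cycles, (double)cycles / 100.0);
        return 0;
    }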
Although the cycle counter can be accessed in user or system mode, the two event counters can only be read and configured from system mode.

2.3 Idle Loop Instrumentation

Our first measurement technique uses a simple model of user interaction to measure the duration of interactive events. In an interactive system, the CPU is mostly idle. When an interactive event arrives, the CPU becomes busy and then returns to the idle state when the event handling is complete. By recording when the processor leaves and returns to an idle state, we can measure the time it takes to handle an interactive event, and the time during which a user might be waiting.

The lack of kernel source code prevents us from instrumenting the kernel to identify the exact times at which the processor leaves or enters the idle loop. Instead, we replace the system's idle loop with our own low-priority process in each of the operating systems. These low-priority processes measure the time to complete a fixed computation: N iterations of a busy-wait loop. The instrumentation code logs the time required by the loop. The pseudocode is as follows:

    while (space_left_in_the_buffer) {
        for (i = 0; i < N; i++)
            ;
        generate_trace_record;
    }

We select the value of N such that the inner loop takes one ms to complete when the processor is idle. In this way we generate one trace record per millisecond of idle time. If the processor is taken away from the idle loop, the loop takes longer than one ms of elapsed time to complete. Any non-idle time manifests itself as an elongated time interval between two trace records. The larger we make N, the coarser the accuracy of our measurements; the smaller we make N, the finer the resolution of our measurements but the larger the trace buffer required for a given benchmark run.

We wrote and measured a simple microbenchmark to demonstrate and validate this methodology. It uses a program that waits for input from the user and, when the input is received, performs some computation, echoes the character to the screen, and then waits for the next input. We measured the time it took to process a keystroke in two ways. First, we used the idle loop method described above to measure the processing time. Figure 1 shows the times at which the samples were collected. For the sake of clarity only a few samples are shown. The figure shows that the system spent approximately one ms generating samples A, B, D, and E, indicating that the system was idle during the periods in which these samples were generated, but spent 10.76 ms generating sample C. The difference, (10.76 - 1) or 9.76 ms, represents the time required to handle the event.

Next, we used the traditional approach, recording one timestamp when the program received the character (i.e., after a call to getchar()) and a second timestamp after the character was echoed back to the screen. This measurement reported an event-handling latency of only 7.42 ms. The 2.34 ms discrepancy between the two measurements highlights a shortcoming of the conventional measurement methodology. Our test program calls the getchar() function to wait for user input. When the user enters a character, the system generates a hardware interrupt, which is first handled by the dynamically linked library KERNEL32.DLL. In the traditional approach, the measurement does not start until control is returned to the test program. Therefore, it fails to capture the system time required to process the interrupt and reschedule the benchmark thread.
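To make the technique concrete, the sketch below shows one possible shape for such a sampler. The calibration routine, buffer size, and trace format are illustrative assumptions, not the instrumentation actually used in these experiments.

    #include <stdint.h>
    #include <stddef.h>

    #define TRACE_CAPACITY (1 << 20)     /* trace records; sized for one run   */
    #define CYCLES_PER_MS  100000ULL     /* 100 MHz Pentium: 10 ns per cycle   */

    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    uint64_t trace[TRACE_CAPACITY];      /* one cycle-count timestamp per sample */
    size_t   trace_len;

    /* Busy-wait for n iterations; volatile keeps the loop from being
     * optimized away. */
    static void spin(long n)
    {
        for (volatile long i = 0; i < n; i++)
            ;
    }

    /* Pick N so that spin(N) takes about one millisecond on an otherwise
     * idle processor: double until the loop exceeds 1 ms, then scale back. */
    static long calibrate(void)
    {
        long n = 1024;
        uint64_t cycles;
        do {
            n *= 2;
            uint64_t start = rdtsc();
            spin(n);
            cycles = rdtsc() - start;
        } while (cycles < CYCLES_PER_MS);
        return (long)((double)n * (double)CYCLES_PER_MS / (double)cycles);
    }

    /* Run in place of the system idle loop at the lowest scheduling
     * priority.  Each trace record marks the completion of ~1 ms of idle
     * computation; an interval between records that is much longer than
     * 1 ms is time stolen from the idle loop, i.e., non-idle time. */
    void idle_sampler(void)
    {
        long n = calibrate();
        while (trace_len < TRACE_CAPACITY) {
            spin(n);
            trace[trace_len++] = rdtsc();
        }
    }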
In comparison, our idle loop methodology provides a more complete measurement of the computation required to process the keyboard event. Our idle loop methodology uses CPU busy time to rep- resent event latency, but there are several issues that pre- vent this from being an accurate measure of the user's perceived response time. One problem is that most graphics output devices refresh every 12-17 ms. In this research, we do not consider this effect. Another problem is that CPU busy time and CPU idle time do not equate directly with wait time and think time. First, synchronous I/O requests contribute to wait time, even though the CPU can be idle during these operations. Second, in the case of background process- ing, the user may not be waiting even though the CPU is busy. The first problem could be solved with system support for monitoring the I/O queue and distinguishing between synchronous and asynchronous requests. In order to address the second problem, we must consider how events are processed by the systems. When the user generates key strokes and mouse clicks, they are queued in a message queue to await processing. Therefore, when there are events queued, we can assume that the user is waiting. By combining CPU status (busy or idle), message queue status (empty or non-empty), and status for outstanding synchronous I/O (busy or idle), we can speculate during which time intervals the user is wait- ing. Figure 2 shows a state transition diagram for identifying think time and wait time in our system, using the param- eters: CPU state, message queue state, and synchronous I/O status. The diagram omits asynchronous I/O, which we assume is background activity, and assumes that users always wait for the completion of an event. In real- ity, we can never precisely distinguish think time from wait time, because we cannot know what the user is doing and whether the user is actually waiting for an event to complete or is thinking while an event is being processed. For simplicity, in the rest of this paper, we assume that the user waits for each event and report results in terms of event handling latency. In the next section, we describe how we obtain information about the status of the message queue. 2.4 Monitoring the Message API Win32 applications use the PeekMessage() and GetMes- sage() calls to examine and retrieve events from the message queue. We can monitor use of these API entries by intercepting the USER32.DLL calls. By monitoring use of these API entries, we can detect when an applica- tion is prepared to accept a new event and when it actu- ally receives an event. We correlate the trace of GetMessage() and PeekMessage() calls with our CPU profile to determine when the application begins han- dling a new request and when it completes a request. This allows us to distinguish between synchronous and asynchronous I/O. It is also useful for recognizing situa- tions where asynchronous computation is used to improve interactive response time. Figure 2 illustrates our design for a finite state machine that distinguishes think time from wait time in a latency- measurement system. In Sections 4, 5, and 6, we will demonstrate how to apply complete information about CPU state and partial information about message queue state to implement part of the FSM. Implementation of the full FSM requires additional system support for monitoring I/O and message queue state transitions. Implementation of such monitoring is part of our con- tinuing work at Harvard. 
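The classification that this FSM encodes can be sketched as a simple predicate over the three observable conditions. The sketch below is an approximation based only on the assumptions stated above (queued input or an outstanding synchronous I/O request means the user is waiting, and every foreground computation is waited for); Figure 2 gives the full transition diagram.

    #include <stdbool.h>

    /* Observable state sampled at each trace interval. */
    struct sample_state {
        bool cpu_busy;         /* idle-loop sample was elongated                 */
        bool queue_nonempty;   /* Message API log shows pending input events     */
        bool sync_io_pending;  /* an outstanding synchronous I/O request exists  */
    };

    enum user_state { THINK_TIME, WAIT_TIME };

    /* Approximate rule from Sections 2.3 and 2.4: charge an interval to
     * wait time whenever input is queued, synchronous I/O is outstanding,
     * or the CPU is busy handling an event; only a quiescent machine is
     * charged to think time.  Background computation would be
     * misclassified by this simplification, as noted in the text. */
    static enum user_state classify(const struct sample_state *s)
    {
        if (s->queue_nonempty || s->sync_io_pending || s->cpu_busy)
            return WAIT_TIME;
        return THINK_TIME;
    }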
Next, we present two simple example measurements to give some insight into the non-trivial aspects of interpreting the output of our measurements.

2.5 Idle System Profiles

In this section, we present measurement results for the background activity that occurs during periods of inactivity on Windows NT and Windows 95. This provides intuition about the measurement techniques as well as baseline information that is useful for interpreting latency measurements in realistic situations. Figure 3 shows the idle system profiles for the three test systems. To relate non-idle time to elapsed time, we plot elapsed time on the X-axis and CPU utilization on the Y-axis. Given that each sample represents 1 ms of idle time, the average CPU utilization during a sample interval can be calculated easily. For example, if the system spends 10 ms collecting a sample, and the sample includes 1 ms of idle time, the CPU utilization for that time interval is (10 - 1)/10 = 90%.

Both versions of Windows NT show bursts of CPU activity at 10 ms intervals due to hardware clock interrupts. Correlating the samples with a count of hardware interrupts from the Pentium performance counters shows that each burst of computation is accompanied by a hardware interrupt.

Although we have compensated for the overhead introduced by the user-level idle loop, Windows 95 shows a higher level of activity than both versions of the NT system. We do not know what causes this increased activity in Windows 95.

By coupling our idle-loop methodology with the Pentium counters, we were able to compute the interrupt handling overhead for various classes of interrupts, measurements that are difficult to obtain using conventional methods. For example, the smallest clock interrupt handling overhead under Windows NT 4.0 was about 400 cycles, or 4 µs.

2.6 Window Maximize

It is becoming common for interactive systems to perform extra work, such as animation, for the user's "viewing pleasure." The following experiment shows the effects of such features. We measured the system activity generated when maximizing a previously minimized window; the system displays a window that increases in size gradually as it is maximized. In order for the animation to be visible to the user, the system must pace itself, inserting delays. These delays cause the system to become idle, resulting in multiple idle-loop measurement samples that correspond to the single user event (i.e., maximize window).

Figures 4a and 4b show the results of our measurement at two different resolutions. Figure 4a shows the full 1 ms resolution of the data, while Figure 4b shows the CPU utilization averaged over 10 ms intervals. Both figures clearly show the 80 ms of 100% CPU utilization required to process the input event (from 100 to 180 ms) and another period of 100% CPU utilization starting around 400 ms to redisplay the page. The stair pattern between 180 and 400 ms illustrates the CPU activity required to perform the animation. From Figure 4a, we can observe that the bursts of CPU activity for performing animation are aligned on 10 ms boundaries, suggesting that they are scheduled by clock interrupts. Each step of the animation takes progressively longer to complete as the window outline increases in size.

This measurement shows that a single user event can correspond to multiple intervals of CPU busy time. Such events complicate the task of precisely identifying event boundaries.
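A naive reduction from idle-loop samples to event latencies shows why: busy bursts separated by short idle gaps, such as the paced animation above, must be merged by guessing a threshold. The sketch below illustrates this reduction over the trace produced by the sampler shown earlier; the merge threshold and record layout are our own choices for illustration only.

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    #define CYCLES_PER_MS 100000ULL     /* 100 MHz Pentium */

    /* One cycle-counter timestamp per completed 1 ms idle loop, as
     * produced by the sampler sketched in Section 2.3. */
    extern uint64_t trace[];
    extern size_t   trace_len;

    /* An interval is "busy" when consecutive samples are separated by
     * clearly more than the 1 ms an undisturbed idle loop takes. */
    static int is_busy_gap(uint64_t delta_cycles)
    {
        return delta_cycles > 2 * CYCLES_PER_MS;
    }

    /* Merge busy bursts separated by fewer than merge_ms idle samples
     * into a single event and report the busy time attributed to each.
     * A paced operation like the animated maximize appears as one event
     * only if merge_ms exceeds the pacing delay; choosing that threshold
     * is exactly the guesswork the Message API log removes. */
    void extract_events(unsigned merge_ms)
    {
        uint64_t event_busy = 0;    /* accumulated non-idle cycles   */
        unsigned idle_run = 0;      /* consecutive idle samples seen */
        int in_event = 0;

        for (size_t i = 1; i < trace_len; i++) {
            uint64_t delta = trace[i] - trace[i - 1];
            if (is_busy_gap(delta)) {
                event_busy += delta - CYCLES_PER_MS;  /* subtract idle work */
                idle_run = 0;
                in_event = 1;
            } else if (in_event && ++idle_run >= merge_ms) {
                printf("event latency: %.2f ms\n",
                       (double)event_busy / CYCLES_PER_MS);
                event_busy = 0;
                in_event = 0;
            }
        }
        if (in_event)
            printf("event latency: %.2f ms\n",
                   (double)event_busy / CYCLES_PER_MS);
    }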
Monitoring the Message API (Section 2.4) is one of the techniques that helps us pinpoint the beginning and ending of interactive events.

3 Benchmarks and Metrics

Our benchmark set is organized into three categories. Microbenchmarks are useful for understanding system behavior for simple interactive operations, such as interrupt handling and user-interface animation. By analyzing microbenchmarks, we develop an understanding of the low-level behavior of the system. We then extend our measurement to task-oriented benchmarks in order to understand the real impact of latency on the perceived interactive responsiveness of an application. These task-oriented benchmarks are based on applications from typical PC office suites and are designed to represent a realistic interactive computing workload. We further apply application microbenchmarks to evaluate isolated interactive events from the realistic workloads. Our application microbenchmarks include such computations as a page-down of a Powerpoint document and editing of an embedded OLE object.

We used Microsoft Visual Test to create most of our microbenchmarks and task-oriented benchmarks. MS Test provides a system for simulating user input events on a Windows system in a repeatable manner. Test scripts can specify the pauses between input events, generating minimal runtime overhead. However, in some cases, the way that Test drives applications alters the behavior of those applications. This effect is discussed in detail in Section 5.4.

3.1 Evaluating Response Time

Early in this project, we had planned to develop a new latency metric: a formula that could be used to summarize our measurements and provide a single scalar figure of merit to characterize the interactive performance of a given workload. Events that complete in 0.1 seconds or less are believed to have imperceptible latency and do not contribute to user dissatisfaction, whereas events in the 2-4 second range invariably irritate users [12]. Events that fall between these ranges may be acceptable but can correspond to perceptibly worse performance than events under 0.1 seconds. Our intuition is that a user-responsiveness metric would be a summation over all events of a penalty that is negligible for latencies below some threshold T and grows as latency exceeds that threshold. However, we also believe that the threshold, T, is a function of the type of event. For example, users probably expect keystroke event latency to be imperceptible, while they may expect that a print command will impose some delay. The issues of event types, user expectation, the precise tolerance of users for delay, and the limitations of human perception are beyond our field of expertise. Presented with these obstacles, we modified our plans and present latency measurements graphically. We trust that the issues in human-computer interaction can be resolved by specialists. In the meantime, our visualization of latency enables us to compare applications and develop an intuition for responsiveness without risking the inappropriate data reductions that could occur given our limited background in experimental psychology.

3.2 Visualizing Event Latency

Figure 5 is an example of the graphical representation of our raw data. Each vertical bar represents an event that began at the time represented by the X value and lasted for the period represented on the Y axis. Figure 5a shows the data for an entire Microsoft Word benchmark run, while Figure 5b shows a magnification of a two-second interval.
The complete event latency profile provides a very coarse view of the application, while the magnification provides the detail to explain the periodicity in the over- all pattern. By drawing a horizontal line at a given "irri- tation threshold," the frequency and distribution of irritating events is readily visible. For our task-oriented benchmarks, we use three graphi- cal representations to capture the responsiveness of an application. First, we present histograms, showing the number of events corresponding to each measured latency. This presents a detailed breakdown of the event latencies and provides some intuition into the different categories of events present in an application. Next, we integrate over the histogram presenting a cumulative latency graph. This provides the quantitative data indi- cating how events of a particular duration contribute to the overall time required to complete a task. Finally, we plot the cumulative latency as a function of the number of events, providing an intuition about the variance in response time perceived by the user. Note that in each of these cases, the events are sorted by their duration, not by their actual time of occurrence. 4 Microbenchmarks In this section, we present some basic measurements of simple interactive events. This helps us explore the char- acter of our tools and understand the kinds of things we can and cannot measure. Figure 6 shows the latencies for two simple interactive events, unbound key stroke and mouse click on the screen background, under the three operating systems. We were unable to measure the overhead of Microsoft Test for these microbenchmarks, so we were forced to use manual input. To compensate for the potential vari- ability introduced by a human user, we report the mean of 30-40 trials, ignoring cold cache cases. The most sig- nificant standard deviations occurred in the key click events for Windows NT 4.0 and Windows 95 (8%) while all the remaining standard deviations were under 2% of the mean. On the key stroke test, Windows 95 shows substantially worse performance than NT 4.0. This is a reflection of segment register loads (not shown) and other overhead associated with 16-bit windows code [4], which persist in Windows 95. The mouse click results are even more striking. The Windows 95 measurements are off the scale, because the system busy-waits between "mouse down" and "mouse up" events; therefore our measurement indicates the length of time the user took to perform the mouse click. This is much longer than the actual processing times of the NT systems and is not indicative of the actual Windows 95 performance. Our methodology provides little guidance in explaining the differences in performance between Windows NT 3.51 and NT 4.0, but it does highlight the fact that instructions and data references occur roughly in pro- portion to cycles across the systems for both of the sim- ple interactive events. Therefore, we conclude that in the warm cache case, the performance differences are a function of the code path lengths. It is possible that the difference in code path length stems from the change in GUI between NT 3.51 and NT 4.0. 5 Task-oriented Benchmarks In this section, we measure three task-oriented bench- marks, designed to model realistic tasks that users com- monly perform using the target applications. In using these longer running benchmarks we have two specific goals. The first is to measure the system performance for a realistic system state. 
An often-cited problem of microbenchmarks is that they tend to measure the sys- tem when various caches are already warm. However, measuring the system when all the caches are cold is also unrealistic. Neither extreme is representative of the system state in which the target micro-operations are invoked in common practice. By measuring the latency of micro-operations embedded in a longer realistic inter- active task, we measure each micro-operation under more realistic circumstances. The second goal is to iden- tify long-latency operations that users encounter as they perform tasks on the systems. Since these long-latency operations have a greater effect on how users perceive system performance than very short events, we will iso- late several such operations in our latency record and examine their activity in detail. We ran each benchmark five times using Microsoft Test and found that the results were consistent across runs. The standard deviations for the elapsed times and cumu- lative CPU busy times were 1-2%, and the event latency distributions were virtually identical. The graphical out- put shown in the following sections depicts one of the five runs for each benchmark. 5.1 Microsoft Notepad Notepad is a simple editor for ASCII text distributed with all versions of Microsoft Windows. Our Notepad benchmark models an editing session on a 56KB text file, which includes text entry of 1300 characters at approximately 100 words per minute, as well as cursor and page movement. With this benchmark, we demon- strate how differences in average response time across the three systems manifest themselves in our visual rep- resentation of latency and how they can be used to com- pare system performance. We used the same Notepad executable (the Windows 95 version) on all three sys- tems and used a Microsoft Test script to drive Notepad. Since virtually all Notepad activity is synchronous, we were able to collect the latency figures for every key- stroke that the user made in a straightforward way. By correlating our idle loop measurement with our monitor- ing of the PeekMessage() and GetMessage() API calls, we were able to clearly identify the Test overhead and remove it from the data presented in Figure 7. Notice that the Y scale in the histogram in Figure 7 is a logarithmic scale. The cumulative latency graph shows that for all three systems, over 80% of the latency of Notepad is due to low-latency (less than 10 ms) events. These short-latency events are the keystrokes that gener- ate printable ASCII characters. The remaining 20% of the total latency are due to the longer latency (at least 28 ms) keystrokes that cause "page down" or newline oper- ations. These keystrokes cause Notepad to refresh all or part of the screen. The smoothness of the curves in the bottommost graph in Figure 7 shows that there is little variance in either the long latency events or the short- latency events. Events of the same type contribute equally to the total latency. The latencies measured are relatively small for Notepad and reflect both the simplicity of the application and the relatively fast PC that we used for our experiments. Although these differences in latency are likely to go unnoticed by users of our test system, they might have a significant effect on user-perceived performance on a slower machine. 5.2 Microsoft Powerpoint Powerpoint, from the Microsoft Office suite, is a popu- lar application for creating presentation graphics. 
In our Powerpoint task scenario, the user starts Powerpoint immediately after powering up the machine and booting the operating system, so that all caches are cold. The user then loads a 46-page, 530KB presentation, and finds and modifies three OLE embedded Excel graph objects. Each of the OLE objects was of similar size and complexity. As with Notepad, we used a Microsoft Test script to drive the application and deliver keystrokes at a realistic rate, with each keystroke separated by at least 150 ms. An important property of the Powerpoint benchmark is that it has a number of events with easily perceptible latencies. Since we were mainly interested in longer events, we pre-processed our data to exclude events with latency of less than 50 ms. Figure 8 shows the results for the two versions of Windows NT. We were unable to run this experiment for Windows 95 due to limitations of Microsoft Test when manipulating OLE embedded objects on that system.

The shortest events in Figure 8 (with latency of less than one second) are due to "page down" operations and Excel operations. Both systems exhibited a similar latency distribution for these events. Six events had latencies greater than one second on both systems, in nearly the same relative order. Table 1 lists these long-latency events.

                                               latency (in seconds)
                                               NT 3.51      NT 4.0
    Save document                                8.082       9.580
    Start Powerpoint                             7.166       5.773
    Start OLE edit session (first time)          7.050       5.844
    Open document                                5.680       4.151
    Start OLE edit session (second object)       2.897       2.009
    Start OLE edit session (third object)        2.697       1.305

    Table 1. Powerpoint events with latency over one second.

All of the long-latency events required disk accesses, which are responsible for the majority of the latency for these events. The effects of the file system cache are most clearly observed in the latency for starting the second OLE edit, as more of the pages for the embedded Excel object editor become resident in the buffer cache. The cumulative latency graph shows that both versions of Windows NT demonstrate similar performance for the short-latency keystrokes, and the majority of the performance difference is a result of the ability of NT 4.0 to handle the long-latency events much more efficiently. We turn to application microbenchmarks to examine this phenomenon in more detail.

5.3 Powerpoint Microbenchmarks

Our Powerpoint task reveals significant differences in latency between the two versions of NT, but Powerpoint's complexity makes it difficult to reveal the source of these differences. We designed two application microbenchmarks to explore the page down and OLE edit operations in detail. Both operations have a perceptible response time and occur multiple times during our Powerpoint task. We measured each operation using Test and collected data from the Pentium performance counters to look for hints to identify the performance differences between the systems. Because of the simplicity of the application microbenchmarks, we were able to run them on the Windows 95 system as well as the two NT systems.

Page Down

For this benchmark, we measured the latency and the hardware events that occur during a "page down" to a page containing an OLE embedded graph. We repeated the test 10 times for each performance counter, so our measurements reflect warm cache behavior. The standard deviations are all below 3%. Figure 9 shows some of the hardware events observed in each system during the page-down operation.
The graph shows that NT 4.0 was able to handle the request in the shortest amount of time followed by Windows 95 and NT 3.51. The differ- ence in the latency between the two versions of Win- dows NT is explained by the differences in system architecture. In NT 3.51, the Win32 API is implemented by a user-level server. The negative performance effects of this server-based architecture were demonstrated in prior research [4]. In NT 4.0, some components of the Win32 API server are rumored to have been moved into the kernel. The improved locality from this change is reflected in reduced TLB misses for NT 4.0 compared to NT 3.51. A lower TLB miss rate implies fewer protec- tion domain crossings in Pentium processors, which flush the TLB on each crossing [5]. Using 20 cycles per miss as a lower bound on TLB miss handling latency, the extra TLB misses that occur for NT 3.51 (both instruction and data) account for at least 25% of the latency difference between NT 3.51 and NT 4.0. Comparing NT 4.0 and Windows 95, there are a rela- tively large number of unaligned data accesses and seg- ment register loads for Windows 95 (Figure 9). The high counts for these events are due to the large components of Windows 95 (such as the graphics API) that are implemented in 16-bit code. Windows 95 also incurs 93% more TLB misses than NT 4.0, although we do not have sufficient information to attribute this behavior to a specific architectural feature in the two systems. OLE Edit The long latency events in our Powerpoint task suggest that much of the OLE edit start-up latency is due to disk I/O. In this section, we exclude the effects of disk I/O by measuring an OLE edit start-up with a hot buffer cache. We also repeated this test 10 times for each counter, but we noticed that all of the events and the cycle counter increased steadily on subsequent runs. We speculate that this behavior is unintended, and therefore, report the results from the first run. Figure 10 shows some of the hardware event counts during the OLE edit start-up. Like the page down benchmark, NT 4.0 completes the operation with the shortest latency followed by Win- dows 95 and NT 3.51. Comparing NT 4.0 to NT 3.51, the observations made for the page down benchmark carry through to the OLE edit. Overhead from elevated TLB miss rates account for at least 23% of the latency difference between NT 3.51 and NT 4.0. In Windows 95, we observed a large number of segment register loads and unaligned data accesses, both of which can be attributed to code executing in 16-bit mode. 5.4 Microsoft Word Our task-oriented workload for Microsoft Word consists of text entry of a paragraph of approximately 1000 char- acters. It includes cursor movement with arrow keys and backspace characters to correct typing errors. The tim- ing between keystrokes was varied to simulate realistic pauses when composing a document, and line justifica- tion and interactive spell checking were enabled. We do not report results for Windows 95, because the system does not become idle immediately after Word finishes handling an event, making all event latencies appear to be several seconds long. Figure 11 shows results for Microsoft Test driven simu- lations on the two versions of Windows NT. Compared to Notepad, Word requires substantially more process- ing time per keystroke, due to additional functionality such as text formatting, variable-width fonts and inter- active spell checking. 
For the majority of interactive events, NT 4.0 exhibits shorter response time and lower variance than NT 3.51. The Microsoft Word benchmark demonstrates both the strengths and limitations of evaluating interactive per- formance using latency. Compared to throughput mea- surements, our latency analysis provides much more detailed information, such as variations in latency and the distribution of events with different latencies. How- ever, the structural features of Word push us to the limit of the behavior we are able to analyze. Our analysis indicates that Word uses a single system thread, but responds to input events and handles background com- putations asynchronously using an internal system of coroutines or user level threads. Distinguishing background activity from foreground activity in Word is challenging. We examined the results of hand-generated Word input under Windows NT 3.51, compared it to the Test-generated results, and found sig- nificant differences. For our hand-generated tests, we ran seven trials, with the same typist and input, and found that the event histograms appeared very similar and that the variation in cumulative latency and elapsed time was less than 4% across the runs. While the Test results showed that most events had latency between 80 and 100 ms, we measured a 32 ms typical latency for the hand-generated input. This difference in event latency was accompanied by a compensating difference in back- ground activity. The hand-generated input showed a higher level of background activity than the Test-gener- ated results. We also observed that carriage returns under the hand-generated input took longer than 200 ms to handle while the longest latency events we saw in the Test-generated runs were 140 ms. Our Message API log reveals that Test generates a WM_QUEUESYNC mes- sages after every keystroke. We hypothesize that these messages were responsible for the different behavior under Test and under manual typing. However, with our current tools, the complexity of Word makes it difficult to thoroughly analyze even the simple experiment we present here. 6 Discussion The tools and techniques we have discussed here are a first step towards understanding and quantifying interac- tive latency, but there remains much work to be done. In the absence of system and application source, better per- formance monitoring tools would be useful. Our mea- surements could be improved through API calls that return information about system state such as message queue lengths, I/O queue length, and the types of requests on the I/O queue. Currently, some of this infor- mation can be obtained, but it is painful (e.g., monitor- ing the GetMessage() and PeekMessage() calls). Even in the presence of rich APIs, the task of distin- guishing between wait time and think time is not always possible. There is no automatic way to detect exactly what a user is doing. Without user input, we can never tell whether a user is genuinely waiting while the system paints a complicated graphic on the screen or is busy thinking. For simulations using designed scripts, we can make assumptions about when users think and then ana- lyze performance based on those assumptions, but the most useful analysis will come from evaluating actual user interaction. One factor that contributes to user dissatisfaction is the frequency of long-latency events. We processed the Microsoft Word profile of Figure 5 to analyze the distri- bution of interarrival times of events above a given threshold. 
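This analysis reduces to filtering the event list by a latency threshold and summarizing the gaps between the start times of the surviving events. A sketch of the computation follows; the event record format is an assumption for illustration.

    #include <math.h>
    #include <stdio.h>
    #include <stddef.h>

    /* One interactive event recovered from the latency profile. */
    struct event {
        double start_ms;     /* time at which handling began   */
        double latency_ms;   /* time taken to handle the event */
    };

    /* Summarize the interarrival times of events whose latency exceeds
     * threshold_ms: count of long events, mean gap, and standard
     * deviation of the gap, as reported in Table 2. */
    void interarrival_summary(const struct event *ev, size_t n,
                              double threshold_ms)
    {
        double prev_start = 0.0, sum = 0.0, sumsq = 0.0;
        size_t above = 0, gaps = 0;

        for (size_t i = 0; i < n; i++) {
            if (ev[i].latency_ms <= threshold_ms)
                continue;
            if (above > 0) {             /* gap from the previous long event */
                double gap = ev[i].start_ms - prev_start;
                sum += gap;
                sumsq += gap * gap;
                gaps++;
            }
            prev_start = ev[i].start_ms;
            above++;
        }

        if (gaps > 0) {
            double mean = sum / gaps;
            double var = sumsq / gaps - mean * mean;
            printf("threshold %.0f ms: %zu events, mean gap %.1f s, "
                   "std dev %.1f s\n",
                   threshold_ms, above, mean / 1000.0,
                   sqrt(var > 0.0 ? var : 0.0) / 1000.0);
        }
    }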
Since most events in the Word benchmark were very short, we chose thresholds around 100 ms. Table 2 shows the summaries for these thresholds. Note that the standard deviations are of the same order of magnitude as the averages themselves, indicating that there is no strong periodicity between long-latency events.

    Threshold    Number of events          Interarrival times
    (in msec)    above threshold     Average (in sec)   Std. Dev. (in sec)
       100             101                  3.1                 3.1
       110              26                 12.4                10.6
       120               8                 41.1                48.8

    Table 2. Summary of interarrival distributions for the Microsoft Word
    benchmark on Windows NT 3.51.

Notice that an increase of 10% in the threshold (from 100 ms to 110 ms) reduces the number of above-threshold events by a factor of 4.

We then examined the truly long-latency events from the Powerpoint benchmark. Figure 12 shows the event latency profile for all events over 50 ms. Both systems show similar periodicity, with the better-performing NT 4.0 system demonstrating smaller interarrival times to match its shorter overall latency. In the case of Word, the interarrival times are clustered because most events have similar latency. In the case of Powerpoint, the interarrival times of long-latency events are simply the interarrival times of a few particular classes of events. The distribution of these events is entirely dependent upon when we issued such requests in our test script and is not necessarily indicative of the distribution that might be obtained from a real user. In this test, none of the simple keystroke events were responsible for generating long-latency events; rather, all the events with latencies over 50 ms result from major operations for which user expectation for response time is generally longer. Until our tools become sophisticated enough to examine long traces of complex events generated by a real user, further analysis of these interarrival times is not particularly productive.

Over time, our tools will become better able to deal with the sophisticated applications that we seek to analyze, but we need the human factors community to assist us in understanding the limits of human perception and the models of user tolerance. Some of the questions that must be answered are:

· What are the limits of human perception?
· How do the limits vary by task (e.g., typing versus mouse-tracking)?
· How do user expectation and tolerance for interactive response time vary by task?
· How does user dissatisfaction grow with increasing latency?
· How does user dissatisfaction grow with the variance of latency?
· What aspects of performance contribute the most to user satisfaction?

7 Conclusions

Latency, not throughput, is the key performance metric for interactive software systems. In this paper, we have introduced some tools and techniques for quantifying latency for a general class of realistic interactive applications. To demonstrate our methodology, we applied it to compare the responsiveness of realistic applications running on three popular PC operating systems. Whereas current measurements of latency are generally limited to microbenchmarks, our approach allows us to measure latency for isolated events in the context of realistic interactive tasks. Our latency measurements give a more accurate and complete picture of interactive performance than throughput measurements. We have combined a few simple ideas to get precise information about latency in interactive programs.
We have shown that using these ideas we can get accurate and meaningful information for simple applications and also, to a degree, for complex applications. The requirements of these techniques are not out of reach; in particular, a hardware cycle counter, a means for changing the system idle loop, and a mechanism for logging calls to system API routines are needed. Additional support for detecting the enqueuing of messages and the state of the I/O queue would provide a more complete framework for latency measurement. We have shown the limitations of our system for applications such as Microsoft Word that use batching and asynchronous computation.

Measuring latency for an arbitrary task and an arbitrary application remains a difficult problem. Our experience with Microsoft Word demonstrates that there are many difficult technical issues to be resolved before latency will become a practical metric for system design. Our graphical representation provides a great deal of information about program behavior to specialists, but is probably not appropriate for more widespread use. The two key components necessary to provide consumers with a single figure of merit are further work in human factors and some method for distinguishing user think time from user wait time.

8 References

[1] Business Applications Performance Corporation, "SYSmark for Windows NT," Press Release by IDEAS International, Santa Clara, CA, March 1995.
[2] Brian N. Bershad, Richard P. Draves, and Alessandro Forin, "Using Microbenchmarks to Evaluate System Performance," Proceedings of the Third Workshop on Workstation Operating Systems, IEEE, Key Biscayne, Florida, April 1992, pages 148-153.
[3] Ben Smith, "Ultrafast Ultrasparcs," Byte Magazine, January 1996, page 139. Additional information on the Bytemarks suite is available on the Internet: http://www.byte.com/bmark/bdoc.htm.
[4] J. Bradley Chen, Yasuhiro Endo, Kee Chan, David Mazieres, Antonio Dias, Margo Seltzer, and Michael D. Smith, "The Measured Performance of Personal Computer Operating Systems," ACM Transactions on Computer Systems 14, 1, February 1996, pages 3-40.
[5] Intel Corporation, Pentium Processor Family Developer's Manual. Volume 3: Architecture and Programming Manual, Intel Corporation, 1995.
[6] C. J. Lindblad and D. L. Tennenhouse, "The VuSystem: A Programming System for Compute-Intensive Multimedia," to appear in IEEE Journal on Selected Areas in Communications, 1996.
[7] Larry McVoy, "Lmbench: Portable tools for performance analysis," Proceedings of the 1996 USENIX Technical Conference, January 1996, pages 279-294.
[8] Jeffrey C. Mogul, "SPECmarks are leading us astray," Proceedings of the Third Workshop on Workstation Operating Systems, IEEE, Key Biscayne, Florida, April 1992, pages 160-161.
[9] James O'Toole, Scott Nettles, and David Gifford, "Concurrent Compacting Garbage Collection," Proceedings of the Fourteenth ACM Symposium on Operating System Principles, December 1993, pages 161-174.
[10] John K. Ousterhout, "Why Operating Systems Aren't Getting Faster As Fast As Hardware," Proceedings of the Summer 1991 USENIX Conference, June 1991, pages 247-256.
[11] Mark Shand, "Measuring Unix Kernel Performance with Reprogrammable Hardware," Digital Paris Research Lab, Research Report #19, August 1992.
[12] Ben Shneiderman, Designing the User Interface, Addison-Wesley, 1992.
[13] Jeff Reilly, "SPEC Discusses the History and Reasoning behind SPEC 95," SPEC Newsletter, 7(3):1-3, September 1995.
[14] M. L. VanName and B. Catchings, "Reaching New Heights in Benchmark Testing," PC Magazine, 13 December 1994, pages 327-332. Further information on Ziff-Davis benchmarks is available on the Internet: http://www.zdnet.com/zdbop/.

Figure 1. Validation of Idle-Loop Methodology. The system spent one ms collecting each of samples A, B, D, and E but spent 10.76 ms collecting sample C, indicating that the system performed 9.76 ms of work during this interval.

Figure 2. System State Transition Diagram. Monitoring CPU busy time, the I/O queues, and the message queue provides a framework for identifying wait time and think time. We assume that asynchronous I/O is always background activity (and is not shown) and that synchronous I/O causes long-latency events for which users wait.

Figure 3. Idle Loop Measurements for the three operating systems. These graphs provide a graphical representation of our idle-loop measurement and illustrate differences in the idle behavior of the three systems. For each recorded sample, we subtract the one ms of idle time that our busy-wait loop requires to obtain the amount of non-idle activity. During an idle period, this difference is zero, and we report a CPU utilization of zero. When there is other CPU activity, more than one ms of time will pass before the completion of one ms of idle time, and the CPU utilization during that interval is non-zero.

Figure 4. CPU Usage Profile for a Window Maximize Operation under Windows NT 4.0. Figure 4a shows the measurement of computation generated when a window is transformed from an icon in the NT 4.0 icon bar to an open window, using 1 ms resolution. The short spikes between 180 ms and 400 ms correspond to the animation that is used as the window outline moves from the menu bar to its destination. The upward slope occurs as the outline increases in size. The period of continuous computation from 400 ms to 600 ms corresponds to the redraw computation for the window. Figure 4b shows the same data as Figure 4a, but with the CPU utilization averaged over 10 ms intervals. The details in Figure 4a are useful, but become illegible for longer running tests.

Figure 5. Raw Data Representation. Figure 5a represents a 1000-event trace of Microsoft Word on Windows NT 3.51. The benchmark is sufficiently long that a complete profile does not lend itself to detailed analysis. The magnified profile (b) shows the periodicity of long- and short-latency events. Notice that the majority of the events fall below the 0.1 second threshold of user perception but that a significant number fall well above the threshold.

Figure 6. Latency for simple interactive events. This figure shows the latencies for a simple mouse click and a simple key stroke in each of the three systems. In Windows 95, the system remains busy the entire time between a "mouse down" and a "mouse up" event; therefore the latency of the event is a function of the duration of the user's press on the mouse. This is much larger than the actual processing times of the NT systems and is not indicative of Windows 95 performance. Each microbenchmark was repeated 30-40 times; the standard deviations were less than 8% for key stroke on Windows NT 4.0 and Windows 95 and less than 2% on the remaining tests.

Figure 7. Notepad Event Latency Summary. The bracketed numbers in the second graph report the total elapsed time for the benchmark run in seconds. Notice that Windows 95 has the smallest cumulative latency, but its elapsed time is larger than that of both NT 3.51 and NT 4.0.
This seemingly anomalous result is an artifact of Microsoft Test, which generates an extra WM_QUEUESYNC message after each input event. The time to process this message is longer under Windows 95 than under the NT systems. We do not include this time in the event latencies, but it does contribute to elapsed time.

Figure 8. Powerpoint Event Latency Summary. The bracketed numbers in the second graph report the total elapsed time for the benchmark run in seconds. While most of the events for Powerpoint are relatively short (under 500 ms), the majority of the time is spent in long-latency events.

Figure 9. Counter Measurements for the Powerpoint page down operation. The larger number of TLB misses in NT 3.51 relative to NT 4.0 indicates a greater number of protection domain crossings. The large number of segment register loads for Windows 95 results from code that runs in 16-bit mode and accounts for the majority of the difference between Windows 95 and the NT systems.

Figure 10. Counter Measurements for the Powerpoint OLE edit start-up operation. As in Figure 9, TLB misses account for a significant fraction of the difference between NT 3.51 and NT 4.0, and segment register loads account for the majority of the difference between Windows 95 and NT.

Figure 11. Microsoft Word Event Latency Summary. The bracketed numbers in the second graph report the total elapsed time for the benchmark run in seconds. Although NT 4.0 demonstrates uniformly better response time, both systems have most latencies below the threshold of user perception.

Figure 12. Time Series of long-latency events for Powerpoint. Both systems show similar distributions, with NT 4.0 having slightly shorter interarrival intervals.

This work was sponsored in part by grants from Bellcore, Intel, the National Science Foundation, Sun Microsystems Laboratories, and the Sloan Foundation.