2nd USENIX Windows NT Symposium
SEATTLE, WASHINGTON
August 3-5, 1998
CONFERENCE OVERVIEW
NTnix . . . You Are There
by George M. Jones
KEYNOTE ADDRESS
The New Power Behind Windows NT
Justin Rattner, Intel Corporation
Summary by Hui Qin Luo
Justin Rattner, fellow and director at Intel's Server Architecture Lab,
gave an excellent and informative talk on the technology of Intel's
Pentium II Xeon processor.
Rattner highlighted the Xeon's 400 MHz processor speed, support for a
100 MHz system bus, and support for 512KB or 1MB full speed Level 2
Cache. He continued with a description of Xeon's architecture and
Intel's market strategy for its processors in 1998 with one core
technology, the P6 microarchitecture. He mentioned the
cost-effectiveness of Xeon architecture relative to other architectures
in the market such as the Alpha, using CAD performance as an indicator.
Other features he highlighted, such as the die size and its 0.25 micron
CMOS, served as eye-candy more for electrical engineers than for
computer scientists.
The rest of the talk was about the VI (Virtual Interface) architecture
developed jointly by Microsoft, Compaq, and Intel. Based on VI
latency-frame size measurements, the performance of a VI-enabled switch
appeared to be close to the theoretical maximum. This was a very good
driver for VI-enabled products in the market, especially distributed
database environments. Many companies -- including Compaq, GigaNet,
and Dolphin -- already had products with VI support, with other
companies' products still pending. Already, Intel was leading the pack,
by promising native VI support in its future chipsets.
Certainly, the Xeon processor is powerful in comparison to its
predecessors. I believe it will be a major power behind PC-compatible
operating systems; even so, Rattner did not specifically mention how
Windows NT takes advantage of such power.
Session: Performance
Summary by Dan Mihai Dumitriu
A Performance Study of Sequential I/O on Windows NT 4
Erik Riedel, Carnegie Mellon University; Catherine van Ingen and Jim
Gray, Microsoft Research
Erik Riedel presented a very realistic analysis of the sequential I/O
performance of Windows NT 4. The motivation for this study was the
large number of applications -- such as databases, data mining, and
multimedia -- that require high bandwidth for sequential I/O. In a
typical system, the bottleneck in bandwidth is the storage subsystem;
that is the main focus of this study. Two factors to consider are PAP
(Peak Advertised Performance) and RAP (Real Application Performance).
The goal should be RAP = PAP/2, the so-called "half power point." The
experimental setup in this study consisted of modest 1997 technology:
P6-200 with 64 MB DRAM and Adaptec Ultra Wide SCSI adapters running NT
4.0 build 1381.
Riedel's study showed that out of the box, the performance of NTFS is
quite good; however, it has very large overhead for small requests.
Performance is good for reads but horrible for writes. The options for
improving performance are (1) read ahead, (2) WCE (Write Cache Enable),
(3) write coalescing, and (4) disk striping.
The study suggests a number of ways to improve sequential file access.
If small requests are absolutely necessary, deep parallelism makes a
tremendous speed improvement; WCE provides large benefits for small
requests; three disks can saturate a SCSI bus; file system buffering
coalesces small requests into 64KB disk requests, significantly improving performance for requests smaller than 64KB; for requests larger than 64KB, file system buffering degrades performance; when possible, files should
be allocated to their maximum size, because extending a file while
writing forces synchronization of the requests and significantly
degrades performance.
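For readers who want to see what these recommendations look like in practice, here is a minimal Win32 sketch (not from the paper) that combines two of them: preallocating the file to its final size and keeping several overlapped requests outstanding. Error handling is omitted.

  /* Sketch only: preallocate the file, then keep DEPTH overlapped 64KB
     writes in flight so the disk queue never drains. */
  #include <windows.h>

  #define REQ_SIZE (64 * 1024)     /* the 64KB "sweet spot" request size */
  #define DEPTH    4               /* requests kept outstanding at once  */

  int main(void)
  {
      HANDLE h;
      char *buf;
      OVERLAPPED ov[DEPTH];
      DWORD done;
      int i;

      h = CreateFile("bigfile.dat", GENERIC_WRITE, 0, NULL,
                     CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, NULL);

      /* Preallocate to the final size so writes never extend the file
         (extending a file forces synchronous metadata updates). */
      SetFilePointer(h, 16 * 1024 * 1024, NULL, FILE_BEGIN);
      SetEndOfFile(h);

      buf = VirtualAlloc(NULL, DEPTH * REQ_SIZE, MEM_COMMIT, PAGE_READWRITE);
      ZeroMemory(ov, sizeof(ov));

      for (i = 0; i < DEPTH; i++) {
          ov[i].Offset = i * REQ_SIZE;    /* each request at its own offset */
          ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
          WriteFile(h, buf + i * REQ_SIZE, REQ_SIZE, NULL, &ov[i]);
      }
      for (i = 0; i < DEPTH; i++)         /* wait; a real program reissues  */
          GetOverlappedResult(h, &ov[i], &done, TRUE);

      CloseHandle(h);
      return 0;
  }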
<www.pdl.cs.cmu.edu/nasd> (Riedel's research group)
<www.pdl.cs.cmu.edu/active> (Riedel's thesis)
Scalability of the
Microsoft Cluster Service
Werner Vogels, Dan Dumitriu, Ashutosh Agrawal, Teck Chia, and
Katherine Guo, Cornell University
Although he gave an overview of MSCS (Microsoft Cluster Service),
Werner Vogels did not talk much about the numbers in the paper.
Instead, he used the presentation as an opportunity to showcase his
vision of what cluster design and programming should be. He highly
recommended reading Gregory Pfister's "In Search of Clusters" as a
realistic presentation of cluster technology.
Vogels presented his research goals as (1) efficient distributed
management, (2) low-overhead scalability, (3) cluster collections, and
(4) cluster-aware programming tools, on which there is a research
project at Cornell called Quintet.
Vogels gave a brief introduction to MSCS. MSCS makes a group of
machines appear to the client as a single system and also allows the
group to be managed as such. Its design goals were to extend NT to
support high availability without requiring application modifications.
It is shared-nothing, not scalable, cannot run in lockstep, cannot move
running apps, and in its first release only supports clusters of two
nodes. With some modifications, however, it is possible to scale MSCS.
Some traditional methods are reducing the algorithmic dependence on the
number of nodes and reducing the overall system complexity. More
radical methods include epidemic data dissemination techniques such as
gossip and probabilistic multicast.
Although MSCS does not scale out of the box, the algorithms that it is
based on are intrinsically scalable. It is only their implementation
and the supporting code that make the system not scale well. For
example, MSCS uses only point-to-point messages for communication
between nodes. This makes join and regroup operations scale very poorly
because of the dependency between the number
of nodes and the number of messages generated.
Vogels presented a huge discrepancy between the worlds of parallel
computing and high-availability computing. Parallel computing runs on the order of 512 or more nodes, whereas high-availability computing runs on the order of only 16 nodes. He cited a lack of good cluster programming
tools as one of the main culprits, and he mentioned the Quintet project
at Cornell.
Evaluating the Importance of
User-Specific Profiling
Zheng Wang, Harvard University; Norm Rubin, Digital Equipment
Corporation
Zheng Wang presented an alternative to profiling applications with a "generic" user in mind. His research, based on the assumption that
individual users use programs differently, focuses on analyzing the
benefits of profiling for individual users or groups of users. The
target applications for this study are interactive applications on
Windows NT. The experimental setup used Windows NT 4 running on DEC
Alpha hardware using Digital's FX!32 emulation/binary translation
software.
FX!32 generates a profile for each x86 WIN32 module (executable, DLL,
etc.) that it runs on the Alpha platform. The only aspects of the
profiles that were analyzed were the procedures used, although the
FX!32-generated profiles contain more information.
The procedure analysis covered the unique procedures, common procedures, and subgroup procedures used by various users during normal application use, as well as pairwise similarities between users.
Session:
From NT to UNIX to NT
Summary by Jason Pettiss
Cygwin32: A Free Win32 Porting Layer for UNIX Applications
Geoffrey J. Noer, Cygnus Solutions
Geoffrey Noer gave an in-depth presentation of Cygwin32, a library that
facilitates the porting of most UNIX applications to Windows 95/98/NT.
Cygwin32 and its source code are freely available from Cygnus Solutions under the GNU General Public License; a commercial license is also
available.
Noer explained that Cygwin32 is a DLL that exports most common UNIX
calls. You use the Cygwin32 version of GCC to compile your UNIX code on
a Windows machine, and the resulting Windows program runs as you would
expect under UNIX. A "mixed flavors" Cygwin32 application can also make
Win32 API calls if desired.
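As a hypothetical illustration of such a "mixed flavors" program (not taken from the talk), the following few lines call both the POSIX API supplied by the Cygwin32 DLL and a raw Win32 function; it would be compiled with the Cygwin32 gcc, linking against user32 for MessageBox.

  /* Hypothetical example: fork() is emulated by the Cygwin32 DLL,
     MessageBox() is a direct Win32 call.
     Build (roughly):  gcc -o mixed mixed.c -luser32  */
  #include <stdio.h>
  #include <unistd.h>      /* POSIX calls provided by Cygwin32 */
  #include <windows.h>     /* native Win32 API */

  int main(void)
  {
      pid_t pid = fork();                  /* UNIX semantics on Win32 */
      if (pid == 0) {
          printf("child %d running\n", (int)getpid());
          _exit(0);
      }
      MessageBox(NULL, "Hello from a Cygwin32 program",
                 "mixed flavors", MB_OK);  /* plain Win32 call */
      return 0;
  }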
Noer sidetracked briefly to provide some background information on
Cygnus Solutions. It provides support and contracting services for the
GNUPro Toolkit and sells Source-Navigator, "a powerful source code
comprehension tool." 150 host/target combinations of the toolkit are
available, including Win32-hosted native and cross-platform development
tools. Cygnus engineers make over three-quarters of all changes to the
GNU compiler tools, maintained Noer.
Cygnus Solutions is conducting Internet beta testing, and source code
is freely available. The mailing list to subscribe to is:
<gnu-win32@cygnus.com>. "GNU-Win32" beta releases include native
Win32 development tools and all GNU utilities needed to run GNU
configure scripts, including UNIX shell, file, and text utilities.
Noer dived into the details of Cygwin32. Cygnus Solutions wanted to be
able to offer a Win32-hosted version of the GNUPro Toolkit, keeping the source code as simple as possible and the development time short. Also,
rebuilding tools on Win32 platforms required the GNU configure
mechanism. With that in mind Cygwin32 seemed like a good solution.
Development started in January 1995, and the library passed the Plum Hall Positive C conformance test suite in July 1996.
The architecture uses shared memory to store information needed by
multiple Cygwin32 processes, such as open file descriptors, and data
needed to assist fork and exec operations. Every process also gets a
structure containing pid, userid, and signal masks.
Cygwin32 applications see a POSIX view of the Win32 file system. The
provided "mount" utility may be used to select the Win32 path
interpreted as the root ('/') of this view and mount additional
arbitrary Win32 paths under it (such as different drive letters). UNIX
permissions are emulated from Win32 ones.
Filenames are case-preserving but case-insensitive. Difficulties
included handling the distinction between text and binary mode file
I/O, symbolic and hard links, and Win32 and POSIX path translation
issues. Other implementation details include the Cygnus ANSI C library ("newlib"), process creation (fork, exec, and spawn), signal handling, sockets on top of Winsock, and select() implemented using sub-threads.
Much software has been ported using Cygwin32, including X11R6 clients (xemacs, ghostview, xfig, xconq); the GNU inet utilities, which make remote logins to Win95/98/NT possible; KerbNet, Cygnus' Kerberos security implementation; CVS (Concurrent Versions System); the Perl 5 scripting language; shells such as bash, tcsh, ash, and zsh; the Apache Web server; and Tcl/Tk 8.
Future goals are to make Cygwin32 multiple-thread safe, improve runtime
performance, conform to the POSIX.1/90 standard (for example, setuid
support is missing), and to produce a true native Win32 compiler (where
the use of Cygwin32 would be optional).
A few technical questions were asked. Noer explained that under NT,
POSIX file attributes were originally stored in the flat NT Extended
Attributes file. On FAT partitions with thousands of files, that file
could grow beyond 200 megabytes, slowing file accesses down enormously.
As a result, this method is no longer being used by default. Instead,
only the standard Windows permissions are used. When asked about user
impersonation capabilities like "su", Noer gave a workaround --
using the inetd utilities to log in to the local machine as a different
user (which works because inetd runs as a privileged process). Noer
thinks conforming to the POSIX standard will not be difficult.
Win32 API Emulation on UNIX for Software DSM
Sven M. Paas, Thomas Bemmerl, and Karsten Scholtyssik, RWTH Aachen,
Lehrstuhl für Betriebssysteme
Sven Paas described a new Win32 API emulation layer called nt2unix that
runs on UNIX. This is a pure library solution that doesn't require a
change to compilers or target systems. His organization has been trying
to emulate a "reasonable subset" of the Win32 API to demonstrate that
emulation is possible under UNIX, and they have been able to implement
most low-level constructs of Win32.
The problem they originally set out to solve was: Given a console
application for Win32 written in Visual C++ 5.0, using the Standard Template Library (STL) and running under Windows 95 or NT,
compile and execute the same code on UNIX, with gcc 2.8.1 and STL, for
Solaris 2.6 (sparc/x86) or Linux 2.0 (x86).
By "a reasonable subset," Paas means implementation of a couple key
areas. The first is support for NT multithreading (i.e., the ability to
create, destroy, suspend, and resume preemptive threads) and for most
synchronization and thread-local storage (TLS) functions. To
demonstrate memory management abilities, they wanted to be able to
allocate, commit, and protect virtual memory on the page level, as well
as support memory mapping I/O and files. They wanted to provide user
level page fault handling with structured exception handling (SEH) to
emulate NT SEH. Finally (and ambitiously), they wanted to provide use
of the Winsock API for TCP/IP under the emulator.
Paas then went into the implementation details of nt2unix, comparing
code to accomplish common tasks such as creating a thread in NT versus
POSIX versus Solaris. Creating a thread is in itself very simple, but the differences between operating systems meant they had to ignore the security attributes (LPSECURITY_ATTRIBUTES) just as Windows 95 does.
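A rough sketch of what such a wrapper might look like (this is not the nt2unix source, and the types are simplified): a CreateThread-style call built on pthread_create, with the security-attributes argument accepted but ignored.

  /* Simplified sketch of a CreateThread-style wrapper over POSIX threads.
     The security-attributes argument is ignored, as under Windows 95. */
  #include <pthread.h>
  #include <stdlib.h>

  typedef void *HANDLE;
  typedef unsigned long DWORD;
  typedef DWORD (*LPTHREAD_START_ROUTINE)(void *);

  struct start_ctx { LPTHREAD_START_ROUTINE fn; void *arg; };

  static void *trampoline(void *p)
  {
      struct start_ctx *ctx = p;
      ctx->fn(ctx->arg);                   /* run the Win32-style entry point */
      free(ctx);
      return NULL;
  }

  HANDLE CreateThread(void *security /* ignored */, DWORD stack_size,
                      LPTHREAD_START_ROUTINE fn, void *arg,
                      DWORD flags, DWORD *thread_id)
  {
      pthread_t *tid = malloc(sizeof(*tid));
      struct start_ctx *ctx = malloc(sizeof(*ctx));
      ctx->fn = fn;
      ctx->arg = arg;
      pthread_create(tid, NULL, trampoline, ctx);  /* stack_size, flags unused */
      if (thread_id)
          *thread_id = 0;    /* no portable numeric id for a pthread_t */
      return (HANDLE)tid;
  }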
One major problem with thread synchronization is that suspending and
resuming threads is not possible under the POSIX thread API.
Additionally, some Win32 thread concepts are hard to implement
efficiently within POSIX. The NT kernel usually handles this, but in
UNIX it must be done manually, which implies some performance hits.
Memory management turned out to be fairly easy. Structured exception
handling wasn't as easy and couldn't be supported directly, since
supporting the keywords try and except would require a change in the
compiler. They decided to implement SetUnhandledExceptionFilter(),
which creates a global signal handler. Mapping NT exception codes to
UNIX signals, where there isn't always a good match, made this
difficult.
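A simplified sketch of the idea follows (not the nt2unix code; the real Win32 filter receives a pointer to an EXCEPTION_POINTERS structure rather than a bare code): one global signal handler translates UNIX signals into NT exception codes and hands them to the installed filter.

  /* Simplified emulation sketch: one global signal handler maps UNIX
     signals to NT exception codes for the user-installed filter. */
  #include <signal.h>
  #include <unistd.h>

  #define EXCEPTION_ACCESS_VIOLATION   0xC0000005UL
  #define EXCEPTION_INT_DIVIDE_BY_ZERO 0xC0000094UL
  #define EXCEPTION_EXECUTE_HANDLER    1

  typedef long (*TOP_LEVEL_FILTER)(unsigned long exception_code);
  static TOP_LEVEL_FILTER user_filter;

  static void global_handler(int sig)
  {
      unsigned long code;
      switch (sig) {
      case SIGSEGV: code = EXCEPTION_ACCESS_VIOLATION;   break;
      case SIGFPE:  code = EXCEPTION_INT_DIVIDE_BY_ZERO; break;
      default:      code = 0; break;      /* no good NT equivalent */
      }
      if (user_filter && user_filter(code) == EXCEPTION_EXECUTE_HANDLER)
          _exit(1);                        /* "handled": terminate as NT would */
  }

  TOP_LEVEL_FILTER SetUnhandledExceptionFilter(TOP_LEVEL_FILTER filter)
  {
      TOP_LEVEL_FILTER old = user_filter;
      user_filter = filter;
      signal(SIGSEGV, global_handler);
      signal(SIGFPE,  global_handler);
      return old;
  }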
To enable TCP/IP networking using Winsock, they decided to restrict
Winsock 2.0 to the BSD Sockets API. The bulk of the task was
translating data types, definitions, and error codes. Paas notes that
the pitfalls in this are that some types are hard to map, like fd_set:
Winsock's select() function is most definitely not BSD's select().
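Much of that translation is mechanical. A sketch of the flavor (hypothetical, not their source): Winsock names and error codes become thin aliases for their BSD counterparts, while fd_set and select() resist such treatment.

  /* Hypothetical translation-layer fragments: many Winsock names map
     directly onto BSD sockets, but not all. */
  #include <errno.h>
  #include <unistd.h>
  #include <sys/socket.h>

  typedef int SOCKET;                 /* sockets are plain file descriptors */
  #define INVALID_SOCKET  (-1)
  #define SOCKET_ERROR    (-1)

  #define WSAEWOULDBLOCK  EWOULDBLOCK /* error codes: mostly one-to-one */
  #define WSAECONNRESET   ECONNRESET
  #define WSAEINPROGRESS  EINPROGRESS

  int WSAGetLastError(void) { return errno; }
  int closesocket(SOCKET s) { return close(s); }

  /* fd_set is the hard case: Winsock's fd_set is an array of SOCKET
     handles, not a bitmap of small integers, so select() needs real
     conversion code rather than a #define. */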
To test their solutions, they emulated a 15,000-line native Win32
Visual C++ code module, SVMlib. This shared virtual-memory library is
all-software, user-level, and page-based. They ran this with nt2unix
with no source code changes. Initial time comparisons show satisfactory
behavior, the major reason for slightly slower performance than on a
Win32 platform being that UNIX signal handling is significantly more
expensive than Win32 event handling.
Paas's team concluded that Win32 API emulation under UNIX is very
possible, and that if the emulator is application-driven, it can be
implemented within finite time (three man-months). Paas says "nt2unix
is a reasonable first step to develop portable low level applications."
In the future they would like to implement a more complete set of Win32
base services, allowing more applications to be run under UNIX (NT
services could be run as UNIX daemons, for example).
nt2unix: <https://www.lfbs.rwth-aachen.de/~karsten/projects/nt2unix>
SVMlib: <https://www.lfbs.rwth-aachen.de/~karsten/projects/SVMlib>
NT-SwiFT: Software Implemented Fault Tolerance on Windows NT
Yennun Huang, P. Emerald Chung, and Chandra Kintala, Bell Labs,
Lucent Technologies; Chung-Yih Wang and De-Ron Liang, Institute of
Information Science, Academia Sinica
Yennun Huang presented NT-SwiFT, a group of software components
implemented to provide fault tolerance on Windows NT. These were
originally developed for UNIX and have been ported to NT with new
features added.
The problem is to make distributed applications highly available and
fault tolerant. Huang outlined three possible solutions: (1)
transaction processing as in Microsoft Transaction Server; (2) active
replication / virtual synchrony as in ISIS, HORUS, and Ensemble;
(3) checkpointing and rollback recovery, which is the approach SwiFT
takes.
SwiFT supports three types of process replication for rollback
recovery: cold, warm, and hot. Cold is fail-over with or without
checkpointing. Warm is primary backup with state transfer. Hot uses an
active process group with no shared memory. Regardless, the overall philosophy was to keep the error recovery mechanism transparent to client programs and to enhance server programs with fault-tolerance APIs.
After an extremely detailed catalog of the many components of SwiFT and
what they can be used for, Huang provided some general background. UNIX SwiFT has been used in Bell Labs for more
than five years and is used in more than 20 products and services. Its
technologies have been licensed to a few companies. A few projects in
Lucent are trying NT-SwiFT.
SwiFT was originally ported to NT on UWIN but was re-implemented with many new features. UWIN 1.33 almost works, but not quite. By writing new driver code, they gained fewer software dependencies, a new GUI, new features, and a much-needed understanding of NT internals.
The basic procedure is that when a process is initially created,
important system calls and events are intercepted and recorded. This is
a sneaky and very transparent solution: a whole process space can be
set up (using the NT calls VirtualQuery() and VirtualProtect()). Handles
can be restored with library injection techniques and modification of
import address tables. For client-server applications, they use an
intermediate NDIS driver, which allows them to set up a dispatcher and
server node with the same IP address. The dispatcher node can be failed
over.
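As a sketch of one piece of this (not the NT-SwiFT source), a checkpointer can walk the address space with VirtualQuery() and note every committed, writable region that would have to be saved:

  /* Sketch: enumerate committed, writable regions of the current
     process -- the memory a checkpoint must capture. */
  #include <windows.h>
  #include <stdio.h>

  void scan_address_space(void)
  {
      MEMORY_BASIC_INFORMATION mbi;
      char *addr = 0;

      while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
          if (mbi.State == MEM_COMMIT &&
              (mbi.Protect & (PAGE_READWRITE | PAGE_WRITECOPY))) {
              /* a real checkpointer would write this range out here */
              printf("save %p, %lu bytes\n", mbi.BaseAddress,
                     (unsigned long)mbi.RegionSize);
          }
          addr = (char *)mbi.BaseAddress + mbi.RegionSize;
      }
  }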
Huang gave a very interesting demo of the checkpoint and process-space
recovery features using the beloved minesweeper application. This was
both humorous and very effective: after placing a few mines, he took a
checkpoint, placed a few more, and, naturally, "blew up" when he
clicked a bad square. Then he restored from the last checkpoint, which,
he explained, actually launched the process, restored its process
space, and replayed a sequence of events. This allowed him to "try
again" and checkpoint again when he clicked a few good squares. While
trivial, it clearly demonstrated the possibilities of the system.
Huang expressed a few opinions about NT, namely that it has too many
APIs and libraries (really?), but that it is very powerful and that
"everything is possible in NT." It has many useful facilities, and
although the OS can be an esoteric maze at times, a good point is that
if you have a problem, someone somewhere has probably written some code
to solve it, and it's fairly easy to find free code samples.
Huang stated a few future goals for SwiFT. They'd like to bring it to
Windows 98. They are planning to add a few components (CosMic,
addrejuv), more NT system calls trapping, and more dispatching
algorithms for ONE-IP. They'd also like to see SwiFT for distributed
objects (CORBA, DCOM, and JAVA). Finally, they'd like to integrate
SwiFT with other tools (MSCS) and NT5.
In the Q/A session, someone wanted to know about availability. The
answer
wasn't very clear, but it's under license and at this point is not very
available (still in progress). People had many questions about the
Winmine demo. Huang made it clear that GDI objects like brushes cannot
be captured: there's no way to understand them outside of a process
space. For the demo, they use window handles only. Someone wanted to
know if it was possible to save on one machine and recover on another,
as this would be a very useful feature for load balancing. The answer
is yes, but it has to be exactly the same type of machine with the same
configuration because of memory internals. Someone wanted to know if
this would be able to run more than one process per server (for
example, could you run thousands of SwiFT-backed objects on a server?),
and could SwiFT run for days without crashing? The answer: "We're
working on it."
Session: Threads
Summary by Kevin Chipalowsky
A Thread Performance Comparison: Windows NT and Solaris on a
Symmetric Multiprocessor
Fabian Zabatta and Kevin Ying, Brooklyn College and CUNY Graduate
School
Kevin Ying began by observing that the cost of multiprocessing
equipment has dropped drastically over the past few years. A
dual-processor IBM SP2 sold for $130,000 in 1995, and a more powerful
system built with Pentium II processors costs around $13,000 today.
With this much computing power readily available, mainstream operating
systems need to support multithreading.
Both Windows NT and Solaris support kernel-level and user-level objects of execution within a process. Windows NT calls its kernel objects
"threads" and its user objects "fibers." The application programmer has
complete control over the scheduling of fibers. Solaris calls its
kernel objects "Light Weight Processes" (LWP) and its user objects
"threads." Unlike NT, Solaris provides a user level library to schedule
threads to run on LWPs.
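For readers unfamiliar with fibers, a minimal example of the Win32 calls involved (assuming NT 4's fiber API) shows that all scheduling is explicit: nothing runs until the application switches to it.

  /* Minimal fiber example: the application decides when each fiber runs. */
  #include <windows.h>
  #include <stdio.h>

  static void *main_fiber;

  static void WINAPI worker(void *arg)
  {
      printf("fiber %s running\n", (char *)arg);
      SwitchToFiber(main_fiber);               /* explicitly yield back */
  }

  int main(void)
  {
      void *f;
      main_fiber = ConvertThreadToFiber(NULL); /* current thread becomes a fiber */
      f = CreateFiber(0, worker, "A");         /* 0 = default stack size */
      SwitchToFiber(f);                        /* cooperative: no preemption */
      DeleteFiber(f);
      return 0;
  }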
In their research, Zabatta and Ying performed seven experiments to test
the relative multiprocessing performance of the two operating systems.
They tested kernel execution objects in Windows NT only, but tested
bound, unbound, and restricted concurrency level threading in Solaris.
A bound thread in Solaris's user level library is a single thread that
is always scheduled on a single LWP. An unbound thread is dynamically
scheduled on a dynamically chosen number of LWPs, but the concurrency
level can be restricted to a fixed number. In their concurrency
restricted case, Zabatta and Ying limited the library to four LWPs
(CL=4), because their experimental system had four processors.
Ying explained that neither operating system documented a limit on the
number of kernel execution objects it could create. Their first
experiment was to discover this limit. They found the Windows NT limit
to be around 9800, and the Solaris limit to be around 2200. The second
experiment tested normal thread creation speed. They wrote a simple
program to create threads in a loop. The performance of NT threads and Solaris bound threads was very comparable. However, unbound Solaris threads
could be created much faster. They also tested thread creation speed
while the system was under a heavy load. In this situation, the
creation of all types of Solaris threads was drastically faster than
creation of Windows NT threads. Ying informed us that this could be
expected, because NT gives a higher priority to threads that have been
running for a long time, while Solaris gives a higher priority to newly
created threads.
In the fourth experiment, performance was measured for an application
requiring no synchronization. They found no major differences in
running times by any of the threading models. This led them to conclude
that the Solaris threading library does not significantly affect
performance. Next, they tested performance in an application making
heavy use of synchronization. Windows NT has two different types of synchronization objects: a critical section has local (intra-process) scope, and a mutex has global scope. Solaris has only one type of synchronization object, with a creation flag that determines its scope. Zabatta and Ying found that Windows NT critical sections drastically outperform their local-scope Solaris counterparts. However, global synchronization objects in Solaris outperform the global ones in NT. In the sixth experiment, they tested performance
using the classic symmetric traveling salesman problem. The significant
result was an almost linear speedup with parallelism. All threading
models had very similar performance. The final experiment attempted to
mimic real world applications with CPU bursts. They tested each
threading model with drastic CPU bursts and found the restricted
concurrency Solaris threads slightly outperform the others. Ying
attributed this to Solaris' two-tier system.
Ying concluded by reiterating the scalability of each model, the
flexibility of Solaris's design, and the performance advantages of NT's
critical sections.
A System for Structured High-Performance Multithreaded Programming
in Windows NT
John Thornley, K. Mani Chandy, and Hiroshi Ishii, California
Institute of Technology
John Thornley opened by reminding us of a time-honored idea:
multithread programs on multiprocessor computers to make them run
faster. However, the idea of multithreaded programming has still had
very little impact on mainstream computing. Thornley asked why this is so and tried to explain it.
In his explanation, there are three types of obstacles to the
widespread adoption of multithreaded systems: the availability of
symmetric multiprocessing (SMP) computers, the lack of programming
systems, and the difficulty of software development. Until recently,
SMP technology has been very expensive and rare. Software tools were
always limited. Most were primitive and the product of academic
research. They have always been unreliable, nonportable, and difficult
to program. Things are changing. SMP computers are finally becoming
cheap enough for their use to spread beyond expensive research labs.
Commodity operating systems, notably Windows NT, support threaded
software. Multithreaded programming, however, remains difficult. This
is the focus of the research Thornley presented.
Why is programming so tough? Thornley argues that the problem is a lack
of structure. Current tools are designed for systems programming, which
is a small subset of all programming that could benefit from SMP
computers. Current synchronization operations are also very error-prone
because of their nondeterminism. These tools are at the level of "goto"
sequential programming. We need structured design techniques, modeled
after tried-and-true sequential techniques. We need direct control of
threads and synchronization. We need determinacy unless explicit
nondeterminacy is required. And we need performance that is portable
across different hardware and with different background loads.
The authors developed Sthreads, a new package of tools to deliver this
functionality. The programming model is "multithreaded program =
sequential program + pragmas + library calls." They claim that if a
programmer follows the rules of the model, multithreaded execution is
equivalent to sequential execution. This determinacy has many important
consequences simplifying software development. For example, a program
designed for multithreading can be run sequentially for debugging.
Sthreads is not a parallelizing compiler. Their pragmas are not hints;
they are specific directives. They are used around blocks and "for"
loops and indicate the section of code that should be explicitly
multithreaded. The Sthreads library provides counters to guarantee
correct order of execution, but also provides access to traditional
locks for nondeterministic programming.
Thornley presented a trivial code example that multiplies matrices. A
pragma indicating that it should be multithreaded precedes the outer
"for" loop. His second example was a little more complicated. It sums
up arrays of floating point numbers. Since floating point arithmetic is
not associative, the order of execution matters. To ensure equivalency
to sequential execution, Thornley's example uses a counter. It
guarantees sequential ordering and mutual exclusion in a section of
code. The use of Sthreads for this example is far simpler than using
the Win32 thread API.
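The exact Sthreads pragma and library names were not reproduced here, so the following is only an illustrative sketch of the model; the pragma spelling is assumed, not quoted from the paper.

  /* Illustrative sketch only; the pragma spelling below is assumed. */
  void matmul(int n, double *a, double *b, double *c)
  {
      int i, j, k;
      /* directive, not a hint: iterations of the outer loop may run as
         separate threads, with results equivalent to sequential order */
      #pragma multithreadable
      for (i = 0; i < n; i++)
          for (j = 0; j < n; j++) {
              double sum = 0.0;
              for (k = 0; k < n; k++)
                  sum += a[i * n + k] * b[k * n + j];
              c[i * n + j] = sum;
          }
  }

Because a compiler that does not recognize the pragma simply ignores it, the same file also builds as an ordinary sequential program, which is exactly the debugging property Thornley emphasized.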
The researchers theorize that this is all you need to make programs run
fast. They think that hardware and operating system software are ready
for multithreading of commodity software. Thornley made the very strong
statement that if multithreaded programming is not this simple, then it
will never become mainstream. He ended with the following testimonial.
They took a difficult aircraft route optimization problem and
implemented a solution using the Sthreads tools. In the end, their
solution running on an SMP system with four Intel processors
outperformed a Cray supercomputer solving the same problem using an
implementation designed with traditional programming techniques.
A Transparent Checkpoint Facility On NT
Johny Srouji, Paul Schuster, Maury Bach, and Yulik Kuzmin, Intel
Corporation
Paul Schuster and Johny Srouji presented their research and resulting
checkpoint tool. Checkpointing is the act of capturing the complete
state of a running process. Once captured, the state can later be used to resume the process on the same machine or to migrate it to another.
In the past, many checkpointing tools have been built for UNIX systems,
but this group began the development of their tool not knowing of any
others for Windows NT. Given NT's increased usage over the past few
years, they believed that such a tool was definitely needed.
The motivation for checkpointing is strong. It is a good way to prevent
the loss of data that is due to the failure of a long-running process.
It can also be used for debugging, to determine why that long-running
process failed and resulted in data loss. Most significantly,
checkpointing can be used to migrate a process from one machine to
another in a distributed environment, possibly to improve resource
utilization.
There were a number of design goals in the project. Foremost, it needed
to be transparent to the running application, so no source code changes
could be required. Obviously, it needed to be correct, but with a
minimal performance impact. It also had to be application-independent
and support multithreaded processes. The designers tried to make their
implementation as portable as possible, although they discussed only
the NT implementation.
For a checkpoint facility to be correct, it needs to capture the
complete state of a process. Schuster and Srouji illustrated a layering
of process state components. User objects, such as memory and thread
contexts, were on the bottom. System state objects were above those,
and GUI and external state objects were at the very top. Moving up the
layers, the complexity of capturing state increased. Schuster and
Srouji said they did not even attempt to capture state for the highest
layers. Their tool is limited to console applications.
It has both a user interface and a developer interface. The user runs
an application using an alternate loader, which configures the app to
run with automatic checkpointing. Alternatively, the application
developer can explicitly control when checkpointing occurs by using a
provided API.
Schuster and Srouji described the architecture of the checkpoint tool.
Their checkpoint DLL is loaded into the user memory space by the
loader. Its DllMain is called first, which rewrites the Import Address
Table (IAT). In doing so, it forces all Win32 API calls to be
redirected to checkpoint DLL functions.
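The general technique is well known; a condensed sketch (not the authors' code, with 32-bit assumptions, and with ordinal-only imports and error handling omitted) looks roughly like this:

  /* Sketch of IAT rewriting: find the import entry for one function in
     one DLL and overwrite it with a replacement routine. */
  #include <windows.h>
  #include <string.h>

  void patch_iat(HMODULE module, const char *dll, const char *func,
                 void *replacement)
  {
      char *base = (char *)module;
      IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
      IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);
      IMAGE_IMPORT_DESCRIPTOR *imp = (IMAGE_IMPORT_DESCRIPTOR *)(base +
          nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT]
              .VirtualAddress);

      for (; imp->Name != 0; imp++) {
          IMAGE_THUNK_DATA *names, *iat;
          if (_stricmp(base + imp->Name, dll) != 0)
              continue;
          /* OriginalFirstThunk lists names; FirstThunk is the IAT itself */
          names = (IMAGE_THUNK_DATA *)(base + imp->OriginalFirstThunk);
          iat   = (IMAGE_THUNK_DATA *)(base + imp->FirstThunk);
          for (; names->u1.AddressOfData != 0; names++, iat++) {
              IMAGE_IMPORT_BY_NAME *byname =
                  (IMAGE_IMPORT_BY_NAME *)(base + names->u1.AddressOfData);
              if (strcmp((char *)byname->Name, func) == 0) {
                  DWORD old;
                  VirtualProtect(&iat->u1.Function, sizeof(void *),
                                 PAGE_READWRITE, &old);
                  iat->u1.Function = (DWORD)replacement;  /* redirect calls */
                  VirtualProtect(&iat->u1.Function, sizeof(void *), old, &old);
              }
          }
      }
  }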
When it is time to perform a checkpoint, the tool has access to all
needed states. User state is in user memory, and since the tool is
implemented as a DLL running in user memory space, it can directly
access it. State associated with system calls is stored in system
memory. The checkpoint DLL does not have access to that memory, but it
can infer the system's internal state because it had a chance to see
all system calls.
To resume a process, the checkpoint tool loads the application
suspended. It rebuilds the state in the reverse order it captured it.
Finally, it releases the application threads, and they run as if they had never been stopped.
As implemented, Schuster and Srouji's work has a few limitations. Most
relate to external state, which they cannot control. If a process
creates any temporary files, they must still exist in their
checkpointed states during resume. Any applications that bypass the IAT
(by using GetProcAddress, for example) might not resume correctly. And
their tool does not even attempt to deal with simultaneously
checkpointing multiple processes that interact. In the future, they
plan to continue their research with more optimizations and more
comprehensive API support. They hope to improve performance with
incremental memory dumps. Eventually, they hope to use their
checkpointing tool for process migration.
Poster and Demo Session
Summary by Michael Panitz
The Poster and Demo session was a popular, well-attended event
featuring many research projects in a wide range of areas, from dynamic
optimization to process migration to NT-UNIX connectivity. The session
began with the session chair, John Bennett, inviting project presenters
to give a one-minute summary of their projects. After the summaries,
the audience was free to roam about, conversing both with the project
presenters and among themselves.
Many of the projects focused on network-related advances. A Bell
Labs/
Microsoft team collaborated on a project that exploited COM's custom
marshalling ability to run DCOM on RMTP, a multicast network protocol.
A group from Harvard presented a cluster-based Web server in which page
requests are preferentially forwarded to specific nodes, thus
significantly increasing performance by decreasing the total number of
pages each node is expected to serve. The Milan/Chime project
demonstrated a distributed shared memory system which was used to
support distributed preemptive scheduling (and task migration). Martin
Schulz, from the SMiLE group at TU-Munich, presented a system for
building a shared memory system, which supports transparent,
cluster-wide memory by exploiting SCI's hardware DSM support. Finally,
the Brazos parallel programming environment was presented, which
facilitates parallel programming by offering features such as being
able to run a Brazos program on uniprocessor, SMP, or clustered
computers without recompilation.
Two posters dealt with connecting NT to other systems. Motohiro Kanda
presented a mainframe file system browser, which allows one to access the file system of a Hitachi 7700 mainframe from Windows NT. Network
Appliance demonstrated a specialized, multiprotocol file server that
utilizes the SecureShare technology that was described in the Mixing
UNIX and NT technical session. The main feature of the NetApp file
server was that it supports both UNIX and NT file sharing. It allows NT
clients to access UNIX files and vice versa, in addition to NT to NT/
UNIX to UNIX access, all without client modification.
In a class by itself was SWiFT (not to be confused with the
checkpointing project NT-Swift), a toolkit to build adaptive systems.
The system is based on feedback control theory, and seeks to apply
hardware control theory to software problems. In doing so, it
facilitates the use of modular control components, explicitly specified
performance parameters, and from there, the automatic, dynamic
reconfiguration of software modules for good performance despite a
changing environment.
KEYNOTE ADDRESS
Buying Computing and Storage
by the Slice
Gary Campbell, Tandem Computers Inc.
Summary by Dan Mihai Dumitriu
Gary Campbell gave a compelling argument for cluster technology as a
more scalable and cost-effective replacement for SMPs, MPPs, and
Supercomputers. He argues that in today's computing world, SAN (System
Area Network) technology has matured to the point where it is a
feasible interconnect for clusters.
Key technologies that must exist in order for "computing by the slice"
to succeed are: x86 SMP systems, which are very inexpensive today;
balanced PCI; SAN interconnects such as VI (Virtual Interface)-based
solutions; and parallel programming standards.
Alternative technologies to clustering are SMPs, which do not scale
indefinitely and do not have the best price-performance curve, and
CC-nUMA (Cache Coherent Non-Uniform Memory Access). When applications
start to get broken up on a nUMA machine, it starts to look more and
more like a cluster. In addition, both of these technologies have
single points of failure, whereas clusters are architecturally ready
for fault-tolerant features.
The hardware necessary for computing by the slice is available, but the
software side still needs work. "Legacy" cluster systems -- such as
Tandem NSK, IBM SP2, and Digital UNIX -- are too difficult to
replicate. More recent products such as Microsoft's cluster service,
affectionately (?) called "Wolf Pair," do not scale well. Much work is
needed in the software and the distributed APIs.
Some case studies of "computing by the slice": The IBM DB2 database
running on 2 P6-200MHz machines interconnected with ServerNet gets 91%
scaling. The Inktomi Web search engine, which is a Berkeley NOW
derivative, is built out of 150 dual-processor UltraSparc II machines
connected with Myrinet. This highly parallel search engine can index
110 million documents and is highly economical. The Sandia Allegra
Model was originally built on a Cray and later on the Paragon. Now it is running on DEC Alphas interconnected with Myrinet.
The conclusion was that computing by the slice offers superior
price/performance, is architecturally ready, has lower time and cost to
market, and is even gaining ground in traditionally supercomputer
applications like sparse matrix computations, and also in commercial
parallel databases. The architecture is
primarily limited by the software. In
the future look for COM+, Java EJB,
and other perhaps more dramatic
developments.
KEYNOTE ADDRESS
Here Comes nUMA:
The Revolution in Computer Architecture and its OS Implications
Forest Baskett, Silicon Graphics Inc.
Summary by Dan Mihai Dumitriu
Forest Baskett presented an argument for CC-nUMA (Cache Coherent
Non-Uniform Memory Access). He asserted that the nUMA architecture is
inevitable in today's computing world, and he presented a successful
implementation of the hardware and software.
Baskett pointed out some problems that arise in modern systems: faster
buses must be shorter and run hotter, and faster CPUs run hotter and so
need more volume for cooling. Using point-to-point wires rather than
buses allows you to run them faster and cooler.
The SGI Origin 2000 is a successful implementation of CC-nUMA. It has
an integrated nUMA crossbar and a fat-hypercube interconnect. It has a
coherence and locality protocol, 64-bit PCI, and some fault-tolerant
features. Each node in the Origin 2000 has two MIPS R10000 CPUs.
The operating system is Irix 6.5. The computational and IO semantics of
the system are the same as for an SMP. For a small nUMA system an SMP
OS will work, as will SMP applications. Some disadvantages are
additional levels in memory and IO hierarchy. Even though the system
has a high-performance interconnect, latency in memory access is an
issue, as is the contention for resources between nodes.
According to Baskett, in order to optimize performance of parallel
applications, we want to be able to specify the topology of the system
as well as affinities for devices, and to be able to do this without
modifying binaries. Other issues that arise are page migration between
nodes and memory placement policies. Modifications to the OS kernel are
necessary to support being able to specify the initial placement of
applications consistent with the user-specified system topology,
replication of the kernel at boot time, a reverse page table, and a
page locking scheme. System management is also an issue with nUMAs. The
ability to partition systems so we can perform administrative shutdown
of parts of the system is desirable, as is a sophisticated batch system
that would enable users to see consistent running times.
SMP and DSM (Distributed Shared Memory) systems -- CC-nUMA is a variant of DSM -- are displacing vector supercomputers and MPPs.
Session: Mixing UNIX and NT
Summary by Michael Panitz
Merging NT and UNIX Filesystem Permissions
Dave Hitz, Bridget Allison, Andrea Borr, Rob Hawley, and Mark
Muhlestein, Network Appliance
Dave Hitz presented a fast and witty overview of the WAFL file system,
which enables network-based file sharing with both UNIX and NT clients.
Network Appliance has created a specialized file-sharing device that
uses WAFL to ease file sharing in a mixed NT/UNIX environment. The
three design goals of WAFL are: to make WinNT/95 users happy by
providing a security model that mimics NTFS; to keep UNIX users happy
by providing a security model that mimics NFS; and to allow Windows and
UNIX users to share files with each other.
Difficulties arise because UNIX (and its Network File System, NFS) and
NT (and its Common Internet File System, CIFS) are fundamentally
different, both in security models and in such aspects as case
sensitivity (NTFS is case-insensitive, NFS is case-sensitive). NFS divides permissions into (user, group, world), while CIFS uses Access
Control Lists (ACLs). CIFS uses a connection-based authentication
scheme, while NFS is stateless. WAFL was primarily designed to bridge
these two filesystems in the most secure manner possible, while
secondarily providing as intuitive an interaction as possible.
In addition to moderating access to files based on permissions, a
filesystem is expected to display permissions, to allow users to modify these permissions when appropriate, and to specify the default permissions to assign to a newly created file. WAFL uses both
permission mapping and user mapping to accomplish these goals. When a
UNIX client accesses an NT file, access is determined by UNIX-style
permissions that are generated from the ACL via a process called
"permission mapping." These "faked-up" permissions are guaranteed to be
at least as restrictive as the NT ACL. When an NT client requests
access to a UNIX file, access is determined by mapping the NT user to a
UNIX account, via a process called "user mapping." The presentation
argued that this was an effective, direct way to allow access in a
secure, reasonably intuitive manner.
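A conceptual sketch of the "at least as restrictive" idea (the types here are invented for illustration; WAFL's real algorithm is more involved): anything the ACL does not clearly grant is simply turned off in the faked-up mode bits.

  /* Conceptual sketch with invented types: derive r/w/x bits from a
     simplified ACL so the result is never more permissive than the ACL. */
  struct nt_ace { unsigned sid; int allow; unsigned rights; };
  #define NT_READ  0x1
  #define NT_WRITE 0x2
  #define NT_EXEC  0x4

  unsigned faked_up_bits(unsigned sid, const struct nt_ace *acl, int n)
  {
      unsigned granted = 0, denied = 0;
      int i;
      for (i = 0; i < n; i++) {
          if (acl[i].sid != sid)
              continue;
          if (acl[i].allow)
              granted |= acl[i].rights;
          else
              denied |= acl[i].rights;     /* deny entries always win */
      }
      granted &= ~denied;
      return ((granted & NT_READ)  ? 04 : 0) |  /* "r" */
             ((granted & NT_WRITE) ? 02 : 0) |  /* "w" */
             ((granted & NT_EXEC)  ? 01 : 0);   /* "x" */
  }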
The presentation finished by touching on some of the issues surrounding
WAFL, such as how to store NT ACLs, and on the administrative protocols
used by NT.
Pluggable Authentication Modules
for Windows NT
Naomaru Itoi and Peter Honeyman, University of Michigan
Naomaru Itoi began the presentation with an anecdote about the
motivation for creating a Pluggable Authentication Module (PAM) on NT.
At the University of Michigan there existed authentication modules for
both Kerberos and NetWare. This was great, but the authors wanted a
module that provided authentication for both Kerberos and NetWare
together; the only way to accomplish this was to create a third module.
To create this under NT would have been difficult and time-consuming.
They wanted an authentication system that would allow the user to log on
once, yet use many services ("single sign-on"), a system that would be
easy to administer, and a system that would be relatively easy to
develop new authentication modules for. What was wanted was a dynamic
security system for NT, much like the PAM system that provides dynamic
security for Linux and Solaris.
After explaining why such a system would be useful, the speaker gave an
overview of PAM, which is a de facto standard for administration, being
part of Linux, Solaris, and the Common Desktop Environment (CDE), and
also being standardized by the IETF. PAM allows for security
(re)configuration via a simple text file, which allows the
administrator to specify such things as which services (Kerberos,
NetWare, etc.) are required for, say, a logon attempt, or ftp session;
which are optional; which services should be logged on to using the
login password the user provides; which should be logged in to using a
password stored in a password file, etc. PAM also provides a high-level
API for authentication, so that different services can be wrapped and
then configured without a recompilation.
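For comparison, this is roughly what the application-side UNIX PAM API that NI_PAM mirrors looks like (the standard pam_start/pam_authenticate calls; misc_conv is the stock conversation function shipped with Linux-PAM):

  /* The application names a service and lets whatever modules the
     administrator configured for it do the real work. */
  #include <security/pam_appl.h>
  #include <security/pam_misc.h>

  static struct pam_conv conv = { misc_conv, NULL };   /* prompts the user */

  int authenticate(const char *user)
  {
      pam_handle_t *pamh = NULL;
      int ret = pam_start("login", user, &conv, &pamh); /* reads the config */
      if (ret == PAM_SUCCESS)
          ret = pam_authenticate(pamh, 0);  /* Kerberos, NetWare, ... modules */
      if (ret == PAM_SUCCESS)
          ret = pam_acct_mgmt(pamh, 0);     /* account checks */
      pam_end(pamh, ret);
      return ret == PAM_SUCCESS;
  }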
Itoi outlined a plan of attack by next explaining GINA, the
administrator-replaceable "Graphical Identification and Authentication"
user authentication component. GINA enables the administrator to
replace the default user authentication module with another, but still
suffers from the problem of having to write one module for the Kerberos
service, one for the NetWare service, and a third for Kerberos and
NetWare. Further, each module would have to be configured in its own
way, thus making administration of any significant number of NT
machines nearly impossible. Last, custom GINA modules require special
debugging tools and the use of difficult techniques, since GINA is run
before anyone logs in. The plan was to build a custom GINA that
implemented a subset of PAM, so that NT could be used, administered,
and developed for as easily as the UNIX PAM systems.
The design and implementation of the NT PAM, named NI_PAM, was presented, including the API it supports; a diagram showing which DLLs replace GINA.dll, and how they interact, was explained.
Itoi reported that much of PAM has been successfully implemented,
though more features need to be implemented, and more testing needs to
be done, before a large-scale rollout can take place. The presentation
concluded with some thoughts on alternate means of implementing PAM on
NT, and possible future directions of the work, such as use of
smartcards and screen saver locks.
Montage - An ActiveX Container for Dynamic Interfaces
Gordon Woodhull and Steven C. North, AT&T Laboratories --
Research
Montage grew out of an effort to create a Windows-friendly port of an
abstract graph/network editor from UNIX. The Windows graph editor would
integrate with Windows applications using ActiveX (also known as OLE -
Object Linking and Embedding), a runtime, object-oriented technology.
The edges and nodes of the graph would be embedded objects, and the
application itself would be an embeddable ActiveX container, which
sounded easy enough.
Unlike previously available containers, Montage separates both the
layout and control of the contained objects and the interface used to
control them from the container. Thus, Montage is actually an
externally controllable object container that is being used to create a
graph application, Dynagraph. Montage is itself an embeddable,
customizable ActiveX object, and allows dynamic changes to the layout
of contained objects. Thus, Montage could be used to display the
current state of a computer network, unlike something like dotty, which
is used to generate static graphs. All policy decisions (i.e., which
objects should be placed where, what size should they be, etc.) are
implemented in objects independent of the Montage objects, thus allowing one to change the style of layout without recompiling Montage (unlike a VB or MFC application).
Montage exploits the OCX96 technology of "transparent" (windowless) controls to provide different modes of interaction with the objects. This
allows Montage to provide a "Viewing Mode," in which the user can view
but not change the graph, and an "Editing Mode," in which the user can
both view and edit the graph. At the same time, the contained objects
themselves are allowed to request that their properties be set to a
certain value. A contained object could, for example, request to be
moved to point (x, y), and its foreground color set to blue, or to be
brought to the front. Montage then forwards this request to the layout
control engine, which then has the option of ignoring the request or
interpreting it if it so chooses.
The presentation included an impressive live demo, which showed
embedding a Word snippet into Montage, and then embedding a Montage
graph into Word.
Session: Networking and Distributed Systems
Summary by Hui Qin Luo
SecureShare: Safe UNIX/Windows File Sharing through Multiprotocol
Locking
Andrea J. Borr, Network Appliance, Inc.
Dennis Chapman, who made the presentation for author Andrea Borr,
employed illustrative examples to demonstrate the capabilities of
SecureShare. SecureShare is Network Appliance's solution to
multiprotocol file sharing between two different file systems, UNIX's
Network File System (NFS) and the Windows Common Internet File System
(CIFS) or "(PC)NFS."
SecureShare is a Multiprotocol Lock Manager providing file-sharing
capabilities between UNIX clients using NFS and Windows clients using
CIFS without violating data integrity. CIFS has hierarchical locking
and mandatory locking functionality that requires file-open and lock
retrieval before performing any operations such as reading, writing, or
byte-range locking. Unlike CIFS, UNIX's NFS has a nonhierarchical, advisory locking mechanism with no notion of file-open. It has no predeclarative functionality that specifies the kind of access a client needs to a file. These differences make file sharing in a mixed network environment difficult, if not impossible.
SecureShare's main selling point is the preservation of multiprotocol
data integrity by reconciling the locking mechanisms and file-open
semantics between the two different file systems, and multiprotocol
oplock ("opportunistic locks") management involving oplock requests
from NFS to CIFS oplock break.
CIFS opportunistic locks (with the exception of level II oplocks)
represent the equivalent of a file open with Read-Write/Deny-All access
mode. However, access attempts by other clients (using either CIFS or
NFS) to the oplocked file can cause the server to revoke the oplock
through an oplock break protocol. The client who obtained the oplock
gains the ability to read ahead on the open file, to cache write operations to the file, and to cache lock requests. In this way, the network
traffic to the file server is minimized. Chapman discussed the oplock
break protocol in a mixed CIFS and NFS environment. When another client
wishes to access the file, the client's request is suspended.
Afterwards, an oplock break message is sent to the operating system of
the CIFS client holding the oplock. The client operating system can
close the file and pipe all the changes of the file stored in the cache
to the file server. It can also pipe all the cached changes and remove
all the read-ahead data. It then transmits a reply to the fileserver
acknowledging the break.
One of the concerns brought up in the Q/A session was the handling of a
situation in which a client fails to respond to the oplock break
request sent by the server due to attempted access to the oplocked file
by NFS. Chapman claimed that there is an automatic session timeout on
the oplock held by the client's operating system, which, if triggered,
automatically relinquishes the stale batch oplock.
Harnessing User-Level Networking Architectures for Distributed
Object Computing over High Speed Networks
Rajesh S. Madukkarumukumana, Intel Corp.; Calton Pu, Oregon Graduate
Institute of Science and Technology; Hemal V. Shah, Intel Corp.
The introduction of high-performance user-level networking
architectures such as the Virtual Interface (VI) lays the groundwork for
improving the performance of distributed object systems. This
presentation by Rajesh Madukkarumukumana examined the potential of
custom object marshalling using VI, along with issues involved in the
overall integration of user-level networking into high-level
applications.
Component-based software like Distributed Component Object Model (DCOM)
uses remote procedure call (RPC) mechanisms to facilitate distributed
computing. Although distributed computing has matured over time, the
protocols that are relied on to transport data have remained virtually
unchanged, hindering the overall performance of networks such as SANs
(System Area Networks). The low-latency of user-level architectures
provides an attractive solution to the problem in SAN environments.
Madukkarumukumana chose to use DCOM and VI as the subjects of his
research. He presented his methodology for integrating a VI-based transport and a preliminary analysis of the performance results.
The VI architecture provides the illusion that each process owns the
network; many performance bottlenecks are bypassed, including the
operating system, to achieve this low latency, high bandwidth
connection. At the heart of the standard lie two queues for each
process, one for sending data, the other for receiving it; the queues
contain descriptors that state the work that needs to be done. Prior to
data transfer operations, a process called memory registration is
performed, allowing the user process to attach physical addresses to
virtual ones. Unique memory regions are referenced by these address
pairs, eliminating any further bookkeeping. Two data transfer
operations are accounted for -- the
standard send/receive operations, and Remote DMA (RDMA) read/write
operations.
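To make the queue-and-descriptor idea concrete, here is a toy, purely illustrative model (these are not the VI Provider Library calls): each connection owns a send queue and a receive queue of descriptors, and registered memory is referenced only by a handle.

  /* Toy model only -- not the real VI API. */
  #include <stdio.h>

  struct descriptor { void *addr; unsigned len; unsigned mem_handle; };
  struct queue { struct descriptor slots[16]; int tail; };
  struct vi { struct queue send_q, recv_q; };   /* per process/connection */

  static void post(struct queue *q, void *addr, unsigned len, unsigned h)
  {
      struct descriptor *d = &q->slots[q->tail++ % 16];
      d->addr = addr;          /* "work to be done": where the data lives, */
      d->len = len;            /* how much of it, and which registered     */
      d->mem_handle = h;       /* memory region the NIC may DMA directly   */
  }

  int main(void)
  {
      struct vi conn = {0};
      char msg[] = "hello";
      unsigned handle = 1;     /* stands in for a memory-registration handle */

      post(&conn.send_q, msg, sizeof(msg), handle);   /* outgoing message */
      post(&conn.recv_q, msg, sizeof(msg), handle);   /* buffer for input */
      printf("%d send, %d recv descriptors posted\n",
             conn.send_q.tail, conn.recv_q.tail);
      return 0;
  }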
DCOM is a network version of COM, used for the development of component
software. The network extensions in DCOM allow for all objects to be
addressed the same way, hiding their location. Encoding and decoding
data for transfer is called marshalling and unmarshalling,
respectively; the process of marshalling and unmarshalling creates a
stub object in the server process, and a proxy object in the client
process. Basically, three types of marshalling are used, but the one
that Madukkarumukumana discussed is custom marshalling: it allows for
the object to dynamically choose how its interface pointers are
marshalled.
In order for VI to do its job, the interface that DCOM uses to generate
the stub and proxy, referred to as IMarshal, has to be exposed.
Specialization in the object implementation is used to expose the
IMarshal interface. By exposing the parameters of the IMarshal
interface, new methods can be written to make use
of the VI send and receive queues. Information can therefore be sent
using the VI standard instead of the old UDP protocol. Since VI provides its own delivery guarantees, much of the overhead and interrupt handling involved in UDP is eliminated.
In discussing the results of his research, Madukkarumukumana stated
that one-way transfer latency dropped by about 30% to 60% in some cases, even under VI emulation alone. The existence of
core VI hardware provides a further dramatic increase in performance,
and an even greater performance boost may be expected if new procedures
catering to distributed computing systems are implemented within VI
(results forthcoming).
Implementing IPv6 for Windows NT
Richard P. Draves, Microsoft Research; Allison Mankin, University of
Southern California; Brian D. Zill, Microsoft Research
This presentation focused on the implementation and design details of
IPv6 for Windows NT as well as the common pitfalls/challenges
encountered in the process. IPv6 is the next generation Internet
Protocol (IP) worked by the IETF. The major driver behind it and some
of its key features were briefly mentioned in the paper; however,
anyone new to IPv6 who wishes to know more about its history and the
IPv6 specification should consult the relevant RFC documents referenced
in the paper.
The presentation started with an excellent overview of the Windows NT
networking architecture and how the IPv6 protocol stack can be/is
integrated into it. This was followed by a high-level overview of the
presenters' IPv6 implementation and some discussions of four
challenges/issues encountered and the specific solution used. The
presentation ended with some notes on the implementation's performance.
The segment on NT networking internals was very informative, especially
for novices. The details on the interfaces (documented or undocumented)
and protocols presented, along with the helpful references mentioned in
the paper, will prove useful for anyone trying to implement a different
network protocol stack for NT and even for Windows 95 (due to similar
network architectures). The integration of IPv6 into this networking
architecture was also highlighted. Mainly, a Winsock DLL module was
added to provide user-level socket functionality for IPv6 addresses,
and a new TCPIP protocol driver to replace the IPv4. The implementation
was "single stack," supporting only IPv6, and though not efficient was
useful in isolating problems in the IPv6 stack during testing. The
logical layout of the implementation was divided into three layers
similar to IPv4 -- the link layer, the network layer (IP), and the
transport or upper layer which includes protocols such as TCP, UDP, and
ICMP.
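From an application's point of view, using the new stack looks like ordinary Winsock code with the IPv6 address family; the sketch below assumes the usual AF_INET6/sockaddr_in6 names (the MSR release's headers may differ in detail).

  /* Sketch: an IPv6 listening socket via the ordinary Winsock calls. */
  #include <winsock2.h>
  #include <ws2tcpip.h>
  #include <string.h>

  SOCKET open_v6_listener(unsigned short port)
  {
      WSADATA wsa;
      SOCKET s;
      struct sockaddr_in6 addr;

      WSAStartup(MAKEWORD(2, 0), &wsa);
      s = socket(AF_INET6, SOCK_STREAM, 0);   /* IPv6 instead of AF_INET */

      memset(&addr, 0, sizeof(addr));
      addr.sin6_family = AF_INET6;
      addr.sin6_port = htons(port);           /* sin6_addr zeroed = any  */

      bind(s, (struct sockaddr *)&addr, sizeof(addr));
      listen(s, 5);
      return s;
  }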
Four noteworthy problems and their implemented solutions were
discussed. They range from inefficiencies in lower-level network device
handlers during a receive cycle to deadlock avoidance issues.
Performance measurements for the implementation were taken using TCP
throughput as the indicator metric and compared to the IPv4 stack.
Results showed marginal performance degradation with the IPv6 stack
(2.5% over a 10Mb/s LAN), somewhat higher than the 1.4% expected from
the increased IPv6 header length alone. This is understandable, as the
developers never intended this to be an optimized implementation, but
rather a base for further research and a push toward an eventual
Microsoft product release. Whether we will
see better performance results in Microsoft's official product
implementation of IPv6 is still to be determined. Comparative
performance measurements against other IPv6 implementations (Solaris,
Digital Unix, BSD variants, etc.) were omitted. Comparing each
implementation's performance relative to its IPv4 counterpart could
serve as an indicator. Direct IPv6 TCP throughput comparisons might
not be fruitful because of differences in the O/S architecture each
implementation was targeted for, unless IPv4 performance was similar
across these platforms. Source code size comparisons were done against
another publicly available IPv6 implementation (INRIA IPv6).
"Great sample code" is available at
<https://research.microsoft.com/msripv6> for anyone who wishes to
dabble in Windows NT network protocol development or who wants a
starting code base for further IPv6 research and experimentation. A more
full-fledged release with security, authentication, and mobility
support is expected to be available in the future.
Session: Real-time Scheduling
Summary by Jason Pettiss
A Soft Real-time Scheduling Server on Windows NT
Chih-han Lin, Hao-hua Chu, and Klara Nahrstedt, University of
Illinois
Hao-hua Chu spoke about his group's implementation of a software
realtime CPU scheduler for Windows NT. NT schedules applications
indiscriminately, using multi-user time-sharing. Multimedia performs
poorly under these conditions, especially when time-insensitive but
CPU-hungry tasks such as compilation are running in the background.
The scheduling server is a daemon from
which applications can request and acquire periodic processing time.
The scheduler requires no kernel modifications, uses the rate monotonic
algorithm, supports multiple processors (SMP model), and provides
guarantees for timeshare processes so that they aren't starved by
realtime tasks. Chu says there is "reasonable" performance at this
point, the main problem being limited overrun protection due to the
fact that the scheduler itself is a process and sometimes isn't woken
up on time by NT.
The architecture consists of a broker and a dispatcher. The broker
handles reservation requests, builds a dispatch table, and fills the
table's available slots; the dispatcher reads the table and responds
appropriately. The dispatch table is configurable for the number of
processors, the number of available slots, and the slot time-slice.
Dispatching occurs by changing the thread and process priority of
participating applications between idle and highest priority realtime
(1-31).
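
The priority-toggling dispatch step can be pictured with the ordinary
Win32 priority APIs. The sketch below simply raises one participating
thread to time-critical realtime priority for its slot and drops it
back to idle afterwards; it is a reconstruction of the mechanism as
described, not the UIUC code.

// Sketch of the dispatch idea: a "scheduled" thread is raised to the top
// of the realtime range for its slot and dropped to idle afterwards.
#include <windows.h>

// Give the target thread the CPU for one slot of 'sliceMs' milliseconds.
void RunSlot(HANDLE hProcess, HANDLE hThread, DWORD sliceMs) {
    // Move the owning process into the realtime class so thread priorities
    // map onto the upper (16-31) range, then raise the thread itself.
    SetPriorityClass(hProcess, REALTIME_PRIORITY_CLASS);
    SetThreadPriority(hThread, THREAD_PRIORITY_TIME_CRITICAL);

    Sleep(sliceMs);   // the slot: the thread runs ahead of timeshare work

    // Demote the thread again so other reservations (and timeshare
    // processes) get their turn.
    SetThreadPriority(hThread, THREAD_PRIORITY_IDLE);
    SetPriorityClass(hProcess, IDLE_PRIORITY_CLASS);
}

int main() {
    // Demonstration on the calling process/thread itself; a real dispatcher
    // would open handles to the participating application instead.
    RunSlot(GetCurrentProcess(), GetCurrentThread(), 20);
    return 0;
}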
To test, Chu's team used a dual Pentium 200 with 96 MB RAM (an HP
Vectra XU). The time-slice was set to 20ms. They ran two processes
running MPEG decoders at FPS, one Visual C++ compilation of the MPEG
decoder, and four processes computing sine and cosine tables.
Dispatch latency worked out to be about 640 microseconds, which was
longer than they would have liked but not large enough to disrupt
scheduling. Performance of the two time-sensitive processes was
improved, Chu noted.
The main problem, reiterated Chu, was that NT sometimes did not wake up
the dispatcher on time. Also, the dispatcher, being an NT process
itself, cannot preempt real-time processes. This means there is weak
overrun protection. The provided NT timers weren't accurate enough, so
they used the Realtime Extension (RTX) from VenturCom to get under 1ms
resolution.
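
For comparison, stock NT can be pushed somewhat closer to fine-grained
periodic wakeups without a third-party extension, for example by
raising the multimedia timer resolution and blocking on a waitable
timer, as sketched below. This is only an approximation of the idea; it
offers none of the sub-millisecond guarantees the group obtained from
RTX.

// A periodic wakeup loop using only stock Win32 services: raise the
// multimedia timer resolution, then block on a waitable timer. On NT this
// typically gets wakeups near 1 ms granularity, with no hard guarantees.
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

int main() {
    timeBeginPeriod(1);                        // request ~1 ms timer resolution

    HANDLE timer = CreateWaitableTimer(nullptr, FALSE, nullptr);
    LARGE_INTEGER due;
    due.QuadPart = -10000LL;                   // first fire in 1 ms (100-ns units, relative)
    SetWaitableTimer(timer, &due, 1 /* period in ms */, nullptr, nullptr, FALSE);

    for (int tick = 0; tick < 1000; ++tick) {
        WaitForSingleObject(timer, INFINITE);  // wake once per period
        // ... dispatcher work for this slot would go here ...
    }

    CloseHandle(timer);
    timeEndPeriod(1);                          // restore the default resolution
    return 0;
}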
They have much future work planned. Support is planned for varying
processing time per period and for a process service class, similar to
ATM traffic classes. They hope to run conformance tests. Also, they
would like to adapt a multimedia decoder to increase reservation or
decrease quality as necessary. An additional feature, probing and
profiling, could be added to figure out how much processing time to
reserve on a per-application basis.
Vassal: Loadable Scheduler Support for Multi-Policy Scheduling
George M. Candea, Oracle Corporation; Michael B. Jones, Microsoft
Research
George Candea presented Vassal, a system that utilizes loadable
schedulers to enable multi-policy, application-specific scheduling on
Windows NT. He led off with the example of a speaker late for a
presentation (an application) that knows where it needs to be and when,
and a cab driver (the OS) that can get him there on time if he can
communicate with him. Windows NT is more like a cab driver who hasn't
learned English yet -- the operating system multiplexes the CPU
among tasks, unaware of their individual scheduling needs.
Since no single algorithm is good enough for all task mixes, explained
Candea, a compromise would be to hardcode more than one scheduling
policy into the kernel. But even better would be a dynamically
extensible set of policies, made possible by separating policy
(scheduling) from mechanism (dispatching). This lets a developer
concentrate on coding policy. It also would allow different
applications to communicate with their preferred policy to bargain for
scheduling time. Custom schedulers are special Windows NT drivers that
coexist with the default NT scheduler, and which should have negligible
impact on global performance.
Candea then reviewed the current state of Windows NT scheduling. The
basic schedulable unit is the thread, which acquires CPU time based on
priority levels of two classes: variable, which uses a dynamic priority
round-robin, and realtime, which uses a fixed priority round-robin.
Interrupts and deferred procedure calls (DPCs) have precedence over
threads, so scheduling predictability is limited. Scheduling events are
triggered by: the end of a thread quantum, priority or affinity
changes, transition to wait state, or a thread waking up.
NT timers use the hardware abstraction layer (HAL), which provides the
kernel with a periodic timer of variable resolution. Candea noted that
most HALs have resolution between 1 and 15 ms, but some HALs are worse
than others -- some can be set to only powers of two, while others
are fixed at 10 ms. This is certainly another limitation to scheduling
with any policy under NT.
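
From user mode one can at least observe the granularity the platform is
willing to provide, for example with the multimedia-timer capability
query below. This probes the timer services layered above the HAL
rather than the HAL itself, so it only approximates the limitation
Candea described.

// Query the range of timer periods the platform will honor from user mode.
#include <windows.h>
#include <mmsystem.h>
#include <stdio.h>
#pragma comment(lib, "winmm.lib")

int main() {
    TIMECAPS tc;
    if (timeGetDevCaps(&tc, sizeof(tc)) == TIMERR_NOERROR)
        printf("timer period: %u ms (min) to %u ms (max)\n",
               tc.wPeriodMin, tc.wPeriodMax);
    return 0;
}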
Vassal separates policy from mechanism. While the NT scheduler consists
of thread dispatch and a default scheduler, Vassal consists of many
schedulers (policy modules) arranged hierarchically and a separate
dispatch module in charge of preempting and awakening threads. Standard
NT policies remain in the kernel so that applications with no special
needs are handled as usual.
The schedulers register decision-making routines with the dispatcher.
The dispatcher invokes these when scheduling events occur. Threads can
communicate with schedulers to request services. These new features
require some interface modifications, with the addition of three system
calls.
As proof-of-concept, the Vassal team implemented a sample scheduler
that can be loaded in addition to the default NT policy. The sample
allows threads to get scheduled at application-specified time
instances, which is simplistic, Candea admitted, but demonstrates
potential for more interesting time-based policies.
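
The summary does not give the actual system-call names or scheduler
entry points, but the division of labor can be sketched in miniature.
Everything named below (VASSAL_SCHEDULER, VslRegisterScheduler, the
callback signature) is invented for illustration, and this mock runs
entirely in user mode, whereas the real schedulers are kernel drivers.

// Hypothetical, user-mode mock of the Vassal idea: a dispatcher
// (mechanism) that knows nothing about policy, and a loadable policy
// module that registers decision routines with it.
#include <stdio.h>

typedef int THREAD_ID;                       // stand-in for a kernel thread object

// Decision-making routines a policy module registers with the dispatcher.
struct VASSAL_SCHEDULER {
    // Called on scheduling events (quantum end, wait, wakeup, ...).
    // Returns the thread to run next, or -1 to defer to the default policy.
    THREAD_ID (*PickNextThread)(void);
};

// --- Mechanism: the dispatcher just asks whatever policy is loaded. ---
static const VASSAL_SCHEDULER *g_policy = nullptr;

void VslRegisterScheduler(const VASSAL_SCHEDULER *policy) { g_policy = policy; }

void OnSchedulingEvent() {
    THREAD_ID next = g_policy ? g_policy->PickNextThread() : -1;
    if (next < 0)
        printf("dispatch: default NT policy handles this event\n");
    else
        printf("dispatch: loaded policy chose thread %d\n", next);
}

// --- Policy: a trivial module that always picks thread 7. ---
static THREAD_ID AlwaysThreadSeven() { return 7; }

int main() {
    OnSchedulingEvent();                     // no policy loaded: default behavior

    VASSAL_SCHEDULER policy = { AlwaysThreadSeven };
    VslRegisterScheduler(&policy);           // "load" the scheduler
    OnSchedulingEvent();                     // now the loaded policy decides
    return 0;
}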
Minimal NT kernel changes were required: they added 188 lines of C
code, added 61 assembly instructions, and replaced 6 assembly
instructions. The scheduler itself was only 116 lines of C code,
required no assembly language, and consisted purely of policy code.
Performance results showed that the system was no slower when no
specialized scheduler was loaded, with only 8% overhead in their
untuned prototype. With the special scheduler loaded, the
predictability of periodic wakeup times significantly improved, and
there were no longer early wakeups. There were a few slightly late
wakeups (still less late than without the prototype loaded), and these
were caused by unscheduled events such as interrupts and DPCs.
Candea emphasized that the Vassal take-home points are that it demonstrates the
viability and positive impact of loadable schedulers, and that it frees
the OS from anticipating all possible application scheduling
requirements. It also encourages interesting research in this area by
making it easy to develop and test new policies, and doesn't adversely
affect the OS.
Related work includes Solaris, which maps scheduling decisions onto a
global priority space; extensible OS work such as SPIN, Exokernel, and
Vino; and hierarchical schedulers such as Utah's CPU scheduling work
and the UIUC Windows NT soft real-time scheduler.
Many questions followed. Someone suggested that two different
schedulers must be compatible or there will be trouble. Candea agreed
that this was an interesting problem that could be solved by allowing
schedulers to talk to each other and "negotiate" or to use their
descriptions and decide whether a conflict will occur. Another
questioner noted that a driver has limited visibility into the NT
kernel and asked whether this limits the power of a scheduler. The
answer is yes; ideally these special drivers would be able to see into
kernel data structures to gain real power. The moderator asked what can
be done about predicting DPCs and interrupts. Candea didn't think
prediction was necessary, noting that these are best left to perform
their crucially important tasks when they need to.
<https://pdos.lcs.mit.edu/~candea/research.html>
<https://research.microsoft.com/~mbj>
Session: NT Futures
Tom Phillips and Felipe Cabrera, Microsoft Corporation
Summary by Kevin Chipalowsky
Felipe Cabrera and Tom Phillips demonstrated the upcoming Windows NT
5.0 operating system and the NT Services for UNIX add-on. The two-hour
session was very open, informal, and at times emotional for some in the
audience. Microsoft gave a presentation while inviting just about any
type of question regarding the future of their flagship operating
system. Cabrera and Phillips fielded very spirited questions and
comments.
Phillips began his NT 5.0 presentation by describing its new support
for upgrading from Win 9x. The setup program first scans a system for
compatibility before installing anything. It will now migrate
applications as part of the upgrade process, using plug-ins to support
third-party software. The system configuration is also preserved during
the transition to NT.
Next, Cabrera talked about the new volume-management infrastructure.
The new version of Windows NT will have many new storage management
features. In most cases, hard-drive partitions can be manipulated
without requiring a reboot. For example, partition size can grow and
shrink dynamically. There are also new reliability and security
features, such as a "change" journal and file encryption. In response to a
question about the type of encryption, Cabrera explained that it uses
public key cryptography and is designed to prevent thieves from
examining data on a stolen laptop.
Cabrera also talked about the new file-based services sported by NT
5.0. It provides a new content indexing tool, which can be used to
search for files based on their content instead of just their filename.
It tracks common types of embedded file links and updates them when a
data file is moved to a different volume or even a different machine.
There is also a new automated recovery system to revive a computer
that otherwise will not boot.
The speakers then presented NT Services for UNIX. Microsoft developed
it in response to the growing adoption of NT by previous UNIX users. It
will be available for Windows NT 4.0 with Service Pack 3 for $149 per
client. It is in beta, and anyone interested in being a beta tester
should email <gregsu@microsoft.com> with "sfubeta" as the
subject.
NT Services for UNIX will make an NT machine feel more like a UNIX one.
It allows users to access NFS partitions like any NTFS or FAT
partition. A new Telnet client and server are also included, so
administrators can remotely Telnet into a Windows NT system and run
console-based applications. Microsoft has licensed a Korn shell
implementation and a few dozen familiar UNIX tools from MKS. The
audience loudly applauded this part of the presentation.
Next, Cabrera and Phillips revealed storage features; the new Microsoft
Management Console (MMC) was the center of attention. It is a
management container that provides Microsoft and third-party developers
the opportunity to plug in software to manage just about anything.
The first demonstration was of RAID-5 support. One hard drive of a
stripe volume was removed and later plugged back into the system.
Although the underlying file system seemed to handle the intentional
fault, MMC simply crashed and the computer needed a reboot. After this
small setback, the demo moved on.
Hierarchical Storage Management (HSM) is another feature new to NT 5.0.
It makes use of the observation that the most commonly used files are
usually the most recently used files. When a hard-drive partition
becomes full, the filesystem offloads older files to a tape backup
system to free up hard-drive space. Although Microsoft is
not the first to attempt to build HSM support into NT, they believe
they will be the most successful. They have complete control over the
operating system and can fix all of the related utilities that would
otherwise have difficulty with the extremely long latency that results
from trying to open certain files.
Networking in NT 5.0 has also received an overhaul. As with volume
management, most network configuration changes can now be made without
rebooting the system. Microsoft also claims an improved programmable
network infrastructure. The TCP/IP stack is also enhanced; it runs
faster and supports security and QoS protocols.