2nd USENIX Windows NT Symposium
SEATTLE, WASHINGTON
August 3-5, 1998
CONFERENCE OVERVIEW
NTnix . . . You Are There
by George M. Jones
KEYNOTE ADDRESS
The New Power Behind Windows NT
Justin Rattner, Intel Corporation
Summary by Hui Qin Luo
Justin Rattner, fellow and director at Intel's Server Architecture Lab,
gave an excellent and informative talk on the technology of Intel's
Pentium II Xeon processor.
Rattner highlighted the Xeon's 400 MHz processor speed, support for a
100 MHz system bus, and support for 512KB or 1MB full speed Level 2
Cache. He continued with a description of Xeon's architecture and
Intel's market strategy for its processors in 1998 with one core
technology, the P6 microarchitecture. He mentioned the
cost-effectiveness of Xeon architecture relative to other architectures
in the market such as the Alpha, using CAD performance as an indicator.
Other features he highlighted, such as the die size and its 0.25 micron
CMOS, served as eye-candy more for electrical engineers than for
computer scientists.
The rest of the talk was about the VI (Virtual Interface) architecture
developed jointly by Microsoft, Compaq, and Intel. Based on VI
latency-frame size measurements, the performance of a VI-enabled switch
appeared to be close to the theoretical maximum. This was a very good
driver for VI-enabled products in the market, especially distributed
database environments. Many companies -- including Compaq, GigaNet,
and Dolphin -- already had products with VI support, with other
companies' products still pending. Already, Intel was leading the pack,
by promising native VI support in its future chipsets.
Certainly, the Xeon processor is powerful in comparison to its
predecessors. I believe it will be a major power behind PC-compatible
operating systems; even so, Rattner did not specifically mention how
Windows NT takes advantage of such power.
Session: Performance
Summary by Dan Mihai Dumitriu
A Performance Study of Sequential I/O on Windows NT 4
Erik Riedel, Carnegie Mellon University; Catherine van Ingen and Jim
Gray, Microsoft Research
Erik Riedel presented a very realistic analysis of the sequential I/O
performance of Windows NT 4. The motivation for this study was the
large number of applications -- such as databases, data mining, and
multimedia -- that require high bandwidth for sequential I/O. In a
typical system, the bottleneck in bandwidth is the storage subsystem;
that is the main focus of this study. Two factors to consider are PAP
(Peak Advertised Performance) and RAP (Real Application Performance).
The goal should be RAP = PAP/2, the so-called "half power point." The
experimental setup in this study consisted of modest 1997 technology:
P6-200 with 64 MB DRAM and Adaptec Ultra Wide SCSI adapters running NT
4.0 build 1381.
Riedel's study showed that out of the box, the performance of NTFS is
quite good; however, it has very large overhead for small requests.
Performance is good for reads but horrible for writes. The options for
improving performance are (1) read ahead, (2) WCE (Write Cache Enable),
(3) write coalescing, and (4) disk striping.
The study suggests a number of ways to improve sequential file access.
If small requests are absolutely necessary, deep parallelism makes a
tremendous speed improvement; WCE provides large benefits for small
requests; three disks can saturate a SCSI bus; file system buffering
coalesces small requests into 64KB disk requests, significantly improving performance for requests smaller than 64KB; for requests larger than 64KB, file system buffering degrades performance; when possible, files should
be allocated to their maximum size, because extending a file while
writing forces synchronization of the requests and significantly
degrades performance.
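For readers who want to see what these recommendations look like in practice, here is a minimal Win32 sketch (not from the paper) that combines two of them: preallocating the file to its final size and keeping several overlapped requests outstanding. Error handling is omitted.

  /* Sketch only: preallocate the file, then keep DEPTH overlapped 64KB
     writes in flight so the disk queue never drains. */
  #include <windows.h>

  #define REQ_SIZE (64 * 1024)     /* the 64KB "sweet spot" request size */
  #define DEPTH    4               /* requests kept outstanding at once  */

  int main(void)
  {
      HANDLE h;
      char *buf;
      OVERLAPPED ov[DEPTH];
      DWORD done;
      int i;

      h = CreateFile("bigfile.dat", GENERIC_WRITE, 0, NULL,
                     CREATE_ALWAYS, FILE_FLAG_OVERLAPPED, NULL);

      /* Preallocate to the final size so writes never extend the file
         (extending a file forces synchronous metadata updates). */
      SetFilePointer(h, 16 * 1024 * 1024, NULL, FILE_BEGIN);
      SetEndOfFile(h);

      buf = VirtualAlloc(NULL, DEPTH * REQ_SIZE, MEM_COMMIT, PAGE_READWRITE);
      ZeroMemory(ov, sizeof(ov));

      for (i = 0; i < DEPTH; i++) {
          ov[i].Offset = i * REQ_SIZE;    /* each request at its own offset */
          ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
          WriteFile(h, buf + i * REQ_SIZE, REQ_SIZE, NULL, &ov[i]);
      }
      for (i = 0; i < DEPTH; i++)         /* wait; a real program reissues  */
          GetOverlappedResult(h, &ov[i], &done, TRUE);

      CloseHandle(h);
      return 0;
  }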
<www.pdl.cs.cmu.edu/nasd> (Riedel's research group)
<www.pdl.cs.cmu.edu/active> (Riedel's thesis)
Scalability of the
Microsoft Cluster Service
Werner Vogels, Dan Dumitriu, Ashutosh Agrawal, Teck Chia, and
Katherine Guo, Cornell University
Although he gave an overview of MSCS (Microsoft Cluster Service),
Werner Vogels did not talk much about the numbers in the paper.
Instead, he used the presentation as an opportunity to showcase his
vision of what cluster design and programming should be. He highly
recommended reading Gregory Pfister's "In Search of Clusters" as a
realistic presentation of cluster technology.
Vogels presented his research goals as (1) efficient distributed
management, (2) low-overhead scalability, (3) cluster collections, and
(4) cluster-aware programming tools, on which there is a research
project at Cornell called Quintet.
Vogels gave a brief introduction to MSCS. MSCS makes a group of
machines appear to the client as a single system and also allows the
group to be managed as such. Its design goals were to extend NT to
support high availability without requiring application modifications.
It is shared-nothing, not scalable, cannot run in lockstep, cannot move
running apps, and in its first release only supports clusters of two
nodes. With some modifications, however, it is possible to scale MSCS.
Some traditional methods are reducing the algorithmic dependence on the
number of nodes and reducing the overall system complexity. More
radical methods include epidemic data dissemination techniques such as
gossip and probabilistic multicast.
Although MSCS does not scale out of the box, the algorithms that it is
based on are intrinsically scalable. It is only their implementation
and the supporting code that make the system not scale well. For
example, MSCS uses only point-to-point messages for communication
between nodes. This makes join and regroup operations scale very poorly
because of the dependency between the number
of nodes and the number of messages generated.
Vogels presented a huge discrepancy between the worlds of parallel
computing and high-availability computing. Parallel computing runs on the order of 512 or more nodes, whereas high-availability computing runs on the order of only 16 nodes. He cited a lack of good cluster programming
tools as one of the main culprits, and he mentioned the Quintet project
at Cornell.
Evaluating the Importance of
User-Specific Profiling
Zheng Wang, Harvard University; Norm Rubin, Digital Equipment
Corporation
Zheng Wang presented an alternative to profiling applications with a "generic" user in mind. His research, based on the assumption that
individual users use programs differently, focuses on analyzing the
benefits of profiling for individual users or groups of users. The
target applications for this study are interactive applications on
Windows NT. The experimental setup used Windows NT 4 running on DEC
Alpha hardware using Digital's FX!32 emulation/binary translation
software.
FX!32 generates a profile for each x86 WIN32 module (executable, DLL,
etc.) that it runs on the Alpha platform. The only aspects of the
profiles that were analyzed were the procedures used, although the
FX!32-generated profiles contain more information.
The procedure analysis covered the unique procedures, common procedures, and subgroup procedures used by various users during normal application use, as well as pairwise similarities between users.
Session:
From NT to UNIX to NT
Summary by Jason Pettiss
Cygwin32: A Free Win32 Porting Layer for UNIX Applications
Geoffrey J. Noer, Cygnus Solutions
Geoffrey Noer gave an in-depth presentation of Cygwin32, a library that
facilitates the porting of most UNIX applications to Windows 95/98/NT.
Cygwin32 and its source code are freely available from Cygnus Solutions under the GNU General Public License; a commercial license is also
available.
Noer explained that Cygwin32 is a DLL that exports most common UNIX
calls. You use the Cygwin32 version of GCC to compile your UNIX code on
a Windows machine, and the resulting Windows program runs as you would
expect under UNIX. A "mixed flavors" Cygwin32 application can also make
Win32 API calls if desired.
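As a hypothetical illustration of such a "mixed flavors" program (not taken from the talk), the following few lines call both the POSIX API supplied by the Cygwin32 DLL and a raw Win32 function; it would be compiled with the Cygwin32 gcc, linking against user32 for MessageBox.

  /* Hypothetical example: fork() is emulated by the Cygwin32 DLL,
     MessageBox() is a direct Win32 call.
     Build (roughly):  gcc -o mixed mixed.c -luser32  */
  #include <stdio.h>
  #include <unistd.h>      /* POSIX calls provided by Cygwin32 */
  #include <windows.h>     /* native Win32 API */

  int main(void)
  {
      pid_t pid = fork();                  /* UNIX semantics on Win32 */
      if (pid == 0) {
          printf("child %d running\n", (int)getpid());
          _exit(0);
      }
      MessageBox(NULL, "Hello from a Cygwin32 program",
                 "mixed flavors", MB_OK);  /* plain Win32 call */
      return 0;
  }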
Noer sidetracked briefly to provide some background information on
Cygnus Solutions. It provides support and contracting services for the
GNUPro Toolkit and sells Source-Navigator, "a powerful source code
comprehension tool." 150 host/target combinations of the toolkit are
available, including Win32-hosted native and cross-platform development
tools. Cygnus engineers make over three-quarters of all changes to the
GNU compiler tools, maintained Noer.
Cygnus Solutions is conducting Internet beta testing, and source code
is freely available. The mailing list to subscribe to is:
<gnu-win32@cygnus.com>. "GNU-Win32" beta releases include native
Win32 development tools and all GNU utilities needed to run GNU
configure scripts, including UNIX shell, file, and text utilities.
Noer dived into the details of Cygwin32. Cygnus Solutions wanted to be
able to offer a Win32-hosted version of the GNUPro Toolkit, keeping the source code as simple as possible and the development time short. Also,
rebuilding tools on Win32 platforms required the GNU configure
mechanism. With that in mind Cygwin32 seemed like a good solution.
Development started in January 1995, and the library passed the Plum Hall Positive C conformance test suite in July 1996.
The architecture uses shared memory to store information needed by
multiple Cygwin32 processes, such as open file descriptors, and data
needed to assist fork and exec operations. Every process also gets a
structure containing pid, userid, and signal masks.
Cygwin32 applications see a POSIX view of the Win32 file system. The
provided "mount" utility may be used to select the Win32 path
interpreted as the root ('/') of this view and mount additional
arbitrary Win32 paths under it (such as different drive letters). UNIX
permissions are emulated from Win32 ones.
Filenames are case-preserving but case-insensitive. Difficulties
included handling the distinction between text and binary mode file
I/O, symbolic and hard links, and Win32 and POSIX path translation
issues. Other implementation details include the Cygnus ANSI C library ("newlib"), process creation (fork, exec, and spawn), signal handling, sockets on top of Winsock, and select() implemented using sub-threads.
Much software has been ported using Cygwin32, including X11R6 clients (xemacs, ghostview, xfig, xconq); the GNU inet utilities, which make remote logins to Win95/98/NT possible; KerbNet, Cygnus' Kerberos security implementation; CVS (Concurrent Versions System); the Perl 5 scripting language; shells such as bash, tcsh, ash, and zsh; the Apache Web server; and Tcl/Tk 8.
Future goals are to make Cygwin32 multiple-thread safe, improve runtime
performance, conform to the POSIX.1/90 standard (for example, setuid
support is missing), and to produce a true native Win32 compiler (where
the use of Cygwin32 would be optional).
A few technical questions were asked. Noer explained that under NT,
POSIX file attributes were originally stored in the flat NT Extended
Attributes file. On FAT partitions with thousands of files, that file
could grow beyond 200 megabytes, slowing file accesses down enormously.
As a result, this method is no longer being used by default. Instead,
only the standard Windows permissions are used. When asked about user
impersonation capabilities like "su", Noer gave a workaround --
using the inetd utilities to log in to the local machine as a different
user (which works because inetd runs as a privileged process). Noer
thinks conforming to the POSIX standard will not be difficult.
Win32 API Emulation on UNIX for Software DSM
Sven M. Paas, Thomas Bemmerl, and Karsten Scholtyssik, RWTH Aachen,
Lehrstuhl für Betriebssysteme
Sven Paas described a new Win32 API emulation layer called nt2unix that
runs on UNIX. This is a pure library solution that doesn't require a
change to compilers or target systems. His organization has been trying
to emulate a "reasonable subset" of the Win32 API to demonstrate that
emulation is possible under UNIX, and they have been able to implement
most low-level constructs of Win32.
The problem they originally set out to solve was: Given a console
application for Win32 written in Visual C++ 5.0, using the Standard Template Library (STL) and running under Windows 95 or NT,
compile and execute the same code on UNIX, with gcc 2.8.1 and STL, for
Solaris 2.6 (sparc/x86) or Linux 2.0 (x86).
By "a reasonable subset," Paas means implementation of a couple key
areas. The first is support for NT multithreading (i.e., the ability to
create, destroy, suspend, and resume preemptive threads) and for most
synchronization and thread-local storage (TLS) functions. To
demonstrate memory management abilities, they wanted to be able to
allocate, commit, and protect virtual memory on the page level, as well
as support memory mapping I/O and files. They wanted to provide user
level page fault handling with structured exception handling (SEH) to
emulate NT SEH. Finally (and ambitiously), they wanted to provide use
of the Winsock API for TCP/IP under the emulator.
Paas then went into the implementation details of nt2unix, comparing
code to accomplish common tasks such as creating a thread in NT versus
POSIX versus Solaris. Creating a thread is in itself very simple, but the differences between operating systems meant they had to ignore the security attributes (LPSECURITY_ATTRIBUTES) just as Windows 95 does.
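A rough sketch of what such a wrapper might look like (this is not the nt2unix source, and the types are simplified): a CreateThread-style call built on pthread_create, with the security-attributes argument accepted but ignored.

  /* Simplified sketch of a CreateThread-style wrapper over POSIX threads.
     The security-attributes argument is ignored, as under Windows 95. */
  #include <pthread.h>
  #include <stdlib.h>

  typedef void *HANDLE;
  typedef unsigned long DWORD;
  typedef DWORD (*LPTHREAD_START_ROUTINE)(void *);

  struct start_ctx { LPTHREAD_START_ROUTINE fn; void *arg; };

  static void *trampoline(void *p)
  {
      struct start_ctx *ctx = p;
      ctx->fn(ctx->arg);                   /* run the Win32-style entry point */
      free(ctx);
      return NULL;
  }

  HANDLE CreateThread(void *security /* ignored */, DWORD stack_size,
                      LPTHREAD_START_ROUTINE fn, void *arg,
                      DWORD flags, DWORD *thread_id)
  {
      pthread_t *tid = malloc(sizeof(*tid));
      struct start_ctx *ctx = malloc(sizeof(*ctx));
      ctx->fn = fn;
      ctx->arg = arg;
      pthread_create(tid, NULL, trampoline, ctx);  /* stack_size, flags unused */
      if (thread_id)
          *thread_id = 0;    /* no portable numeric id for a pthread_t */
      return (HANDLE)tid;
  }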
One major problem with thread synchronization is that suspending and
resuming threads is not possible under the POSIX thread API.
Additionally, some Win32 thread concepts are hard to implement
efficiently within POSIX. The NT kernel usually handles this, but in
UNIX it must be done manually, which implies some performance hits.
Memory management turned out to be fairly easy. Structured exception
handling wasn't as easy and couldn't be supported directly, since
supporting the keywords try and except would require a change in the
compiler. They decided to implement SetUnhandledExceptionFilter(),
which creates a global signal handler. Mapping NT exception codes to
UNIX signals, where there isn't always a good match, made this
difficult.
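A simplified sketch of the idea follows (not the nt2unix code; the real Win32 filter receives a pointer to an EXCEPTION_POINTERS structure rather than a bare code): one global signal handler translates UNIX signals into NT exception codes and hands them to the installed filter.

  /* Simplified emulation sketch: one global signal handler maps UNIX
     signals to NT exception codes for the user-installed filter. */
  #include <signal.h>
  #include <unistd.h>

  #define EXCEPTION_ACCESS_VIOLATION   0xC0000005UL
  #define EXCEPTION_INT_DIVIDE_BY_ZERO 0xC0000094UL
  #define EXCEPTION_EXECUTE_HANDLER    1

  typedef long (*TOP_LEVEL_FILTER)(unsigned long exception_code);
  static TOP_LEVEL_FILTER user_filter;

  static void global_handler(int sig)
  {
      unsigned long code;
      switch (sig) {
      case SIGSEGV: code = EXCEPTION_ACCESS_VIOLATION;   break;
      case SIGFPE:  code = EXCEPTION_INT_DIVIDE_BY_ZERO; break;
      default:      code = 0; break;      /* no good NT equivalent */
      }
      if (user_filter && user_filter(code) == EXCEPTION_EXECUTE_HANDLER)
          _exit(1);                        /* "handled": terminate as NT would */
  }

  TOP_LEVEL_FILTER SetUnhandledExceptionFilter(TOP_LEVEL_FILTER filter)
  {
      TOP_LEVEL_FILTER old = user_filter;
      user_filter = filter;
      signal(SIGSEGV, global_handler);
      signal(SIGFPE,  global_handler);
      return old;
  }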
To enable TCP/IP networking using Winsock, they decided to restrict
Winsock 2.0 to the BSD Sockets API. The bulk of the task was
translating data types, definitions, and error codes. Paas notes that
the pitfalls in this are that some types are hard to map, like fd_set:
Winsock's select() function is most definitely not BSD's select().
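Much of that translation is mechanical. A sketch of the flavor (hypothetical, not their source): Winsock names and error codes become thin aliases for their BSD counterparts, while fd_set and select() resist such treatment.

  /* Hypothetical translation-layer fragments: many Winsock names map
     directly onto BSD sockets, but not all. */
  #include <errno.h>
  #include <unistd.h>
  #include <sys/socket.h>

  typedef int SOCKET;                 /* sockets are plain file descriptors */
  #define INVALID_SOCKET  (-1)
  #define SOCKET_ERROR    (-1)

  #define WSAEWOULDBLOCK  EWOULDBLOCK /* error codes: mostly one-to-one */
  #define WSAECONNRESET   ECONNRESET
  #define WSAEINPROGRESS  EINPROGRESS

  int WSAGetLastError(void) { return errno; }
  int closesocket(SOCKET s) { return close(s); }

  /* fd_set is the hard case: Winsock's fd_set is an array of SOCKET
     handles, not a bitmap of small integers, so select() needs real
     conversion code rather than a #define. */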
To test their solutions, they emulated a 15,000-line native Win32
Visual C++ code module, SVMlib. This shared virtual-memory library is
all-software, user-level, and page-based. They ran this with nt2unix
with no source code changes. Initial time comparisons show satisfactory
behavior, the major reason for slightly slower performance than on a
Win32 platform being that UNIX signal handling is significantly more
expensive than Win32 event handling.
Paas's team concluded that Win32 API emulation under UNIX is very
possible, and that if the emulator is application-driven, it can be
implemented within finite time (three man-months). Paas says "nt2unix
is a reasonable first step to develop portable low level applications."
In the future they would like to implement a more complete set of Win32
base services, allowing more applications to be run under UNIX (NT
services could be run as UNIX daemons, for example).
nt2unix: <https://www.lfbs.rwth-aachen.de/~karsten/projects/nt2unix>
SVMlib: <https://www.lfbs.rwth-aachen.de/~karsten/projects/SVMlib>
NT-SwiFT: Software Implemented Fault Tolerance on Windows NT
Yennun Huang, P. Emerald Chung, and Chandra Kintala, Bell Labs,
Lucent Technologies; Chung-Yih Wang and De-Ron Liang, Institute of
Information Science, Academia Sinica
Yennun Huang presented NT-SwiFT, a group of software components
implemented to provide fault tolerance on Windows NT. These were
originally developed for UNIX and have been ported to NT with new
features added.
The problem is to make distributed applications highly available and
fault tolerant. Huang outlined three possible solutions: (1)
transaction processing as in Microsoft Transaction Server; (2) active
replication / virtual synchrony as in ISIS, HORUS, and Ensemble;
(3) checkpointing and rollback recovery, which is the approach SwiFT
takes.
SwiFT supports three types of process replication for rollback
recovery: cold, warm, and hot. Cold is fail-over with or without
checkpointing. Warm is primary backup with state transfer. Hot uses an
active process group with no shared memory. Regardless, the overall philosophy was to keep the error recovery mechanism transparent to client programs and to enhance server programs with fault-tolerance APIs.
After an extremely detailed catalog of the many components of SwiFT and
what they can be used for, Huang provided some general background. UNIX SwiFT has been used in Bell Labs for more
than five years and is used in more than 20 products and services. Its
technologies have been licensed to a few companies. A few projects in
Lucent are trying NT-SwiFT.
SwiFT was originally ported to NT on UWIN but was re-implemented with many new features. UWIN 1.33 almost works, but not quite. By writing new driver code, they gained fewer software dependencies, a new GUI, new features, and a much-needed understanding of NT internals.
The basic procedure is that when a process is initially created,
important system calls and events are intercepted and recorded. This is
a sneaky and very transparent solution: a whole process space can be
set up (using the NT calls VirtualQuery() and VirtualProtect()). Handles
can be restored with library injection techniques and modification of
import address tables. For client-server applications, they use an
intermediate NDIS driver, which allows them to set up a dispatcher and
server node with the same IP address. The dispatcher node can be failed
over.
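As a sketch of one piece of this (not the NT-SwiFT source), a checkpointer can walk the address space with VirtualQuery() and note every committed, writable region that would have to be saved:

  /* Sketch: enumerate committed, writable regions of the current
     process -- the memory a checkpoint must capture. */
  #include <windows.h>
  #include <stdio.h>

  void scan_address_space(void)
  {
      MEMORY_BASIC_INFORMATION mbi;
      char *addr = 0;

      while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
          if (mbi.State == MEM_COMMIT &&
              (mbi.Protect & (PAGE_READWRITE | PAGE_WRITECOPY))) {
              /* a real checkpointer would write this range out here */
              printf("save %p, %lu bytes\n", mbi.BaseAddress,
                     (unsigned long)mbi.RegionSize);
          }
          addr = (char *)mbi.BaseAddress + mbi.RegionSize;
      }
  }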
Huang gave a very interesting demo of the checkpoint and process-space
recovery features using the beloved minesweeper application. This was
both humorous and very effective: after placing a few mines, he took a
checkpoint, placed a few more, and, naturally, "blew up" when he
clicked a bad square. Then he restored from the last checkpoint, which,
he explained, actually launched the process, restored its process
space, and replayed a sequence of events. This allowed him to "try
again" and checkpoint again when he clicked a few good squares. While
trivial, it clearly demonstrated the possibilities of the system.
Huang expressed a few opinions about NT, namely that it has too many
APIs and libraries (really?), but that it is very powerful and that
"everything is possible in NT." It has many useful facilities, and
although the OS can be an esoteric maze at times, a good point is that
if you have a problem, someone somewhere has probably written some code
to solve it, and it's fairly easy to find free code samples.
Huang stated a few future goals for SwiFT. They'd like to bring it to
Windows 98. They are planning to add a few components (CosMic,
addrejuv), more NT system calls trapping, and more dispatching
algorithms for ONE-IP. They'd also like to see SwiFT for distributed
objects (CORBA, DCOM, and JAVA). Finally, they'd like to integrate
SwiFT with other tools (MSCS) and NT5.
In the Q/A session, someone wanted to know about availability. The
answer
wasn't very clear, but it's under license and at this point is not very
available (still in progress). People had many questions about the
Winmine demo. Huang made it clear that GDI objects like brushes cannot
be captured: there's no way to understand them outside of a process
space. For the demo, they use window handles only. Someone wanted to
know if it was possible to save on one machine and recover on another,
as this would be a very useful feature for load balancing. The answer
is yes, but it has to be exactly the same type of machine with the same
configuration because of memory internals. Someone wanted to know if
this would be able to run more than one process per server (for
example, could you run thousands of SwiFT-backed objects on a server?),
and could SwiFT run for days without crashing? The answer: "We're
working on it."
Session: Threads
Summary by Kevin Chipalowsky
A Thread Performance Comparison: Windows NT and Solaris on a
Symmetric Multiprocessor
Fabian Zabatta and Kevin Ying, Brooklyn College and CUNY Graduate
School
Kevin Ying began by observing that the cost of multiprocessing
equipment has dropped drastically over the past few years. A
dual-processor IBM SP2 sold for $130,000 in 1995, and a more powerful
system built with Pentium II processors costs around $13,000 today.
With this much computing power readily available, mainstream operating
systems need to support multithreading.
Both Windows NT and Solaris support kernel-level and user-level objects of execution within a process. Windows NT calls its kernel objects
"threads" and its user objects "fibers." The application programmer has
complete control over the scheduling of fibers. Solaris calls its
kernel objects "Light Weight Processes" (LWP) and its user objects
"threads." Unlike NT, Solaris provides a user level library to schedule
threads to run on LWPs.
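For readers unfamiliar with fibers, a minimal example of the Win32 calls involved (assuming NT 4's fiber API) shows that all scheduling is explicit: nothing runs until the application switches to it.

  /* Minimal fiber example: the application decides when each fiber runs. */
  #include <windows.h>
  #include <stdio.h>

  static void *main_fiber;

  static void WINAPI worker(void *arg)
  {
      printf("fiber %s running\n", (char *)arg);
      SwitchToFiber(main_fiber);               /* explicitly yield back */
  }

  int main(void)
  {
      void *f;
      main_fiber = ConvertThreadToFiber(NULL); /* current thread becomes a fiber */
      f = CreateFiber(0, worker, "A");         /* 0 = default stack size */
      SwitchToFiber(f);                        /* cooperative: no preemption */
      DeleteFiber(f);
      return 0;
  }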
In their research, Zabatta and Ying performed seven experiments to test
the relative multiprocessing performance of the two operating systems.
They tested kernel execution objects in Windows NT only, but tested
bound, unbound, and restricted concurrency level threading in Solaris.
A bound thread in Solaris's user level library is a single thread that
is always scheduled on a single LWP. An unbound thread is dynamically
scheduled on a dynamically chosen number of LWPs, but the concurrency
level can be restricted to a fixed number. In their concurrency
restricted case, Zabatta and Ying limited the library to four LWPs
(CL=4), because their experimental system had four processors.
Ying explained that neither operating system documented a limit on the
number of kernel execution objects it could create. Their first
experiment was to discover this limit. They found the Windows NT limit
to be around 9800, and the Solaris limit to be around 2200. The second
experiment tested normal thread creation speed. They wrote a simple
program to create threads in a loop. The performance of NT threads and Solaris bound threads was very comparable. However, unbound Solaris threads
could be created much faster. They also tested thread creation speed
while the system was under a heavy load. In this situation, the
creation of all types of Solaris threads was drastically faster than
creation of Windows NT threads. Ying informed us that this could be
expected, because NT gives a higher priority to threads that have been
running for a long time, while Solaris gives a higher priority to newly
created threads.
In the fourth experiment, performance was measured for an application
requiring no synchronization. They found no major differences in
running times by any of the threading models. This led them to conclude
that the Solaris threading library does not significantly affect
performance. Next, they tested performance in an application making
heavy use of synchronization. Windows NT has two different types of synchronization objects: a critical section has local (intra-process) scope, and a mutex has global scope. Solaris has only one type of synchronization object, with a creation flag that determines its scope. Zabatta and Ying found that Windows NT critical sections drastically outperform their local-scope Solaris counterparts. However, global synchronization objects in Solaris outperform the global ones in NT. In the sixth experiment, they tested performance
using the classic symmetric traveling salesman problem. The significant
result was an almost linear speedup with parallelism. All threading
models had very similar performance. The final experiment attempted to
mimic real world applications with CPU bursts. They tested each
threading model with drastic CPU bursts and found the restricted
concurrency Solaris threads slightly outperform the others. Ying
attributed this to Solaris' two-tier system.
Ying concluded by reiterating the scalability of each model, the
flexibility of Solaris's design, and the performance advantages of NT's
critical sections.
A System for Structured High-Performance Multithreaded Programming
in Windows NT
John Thornley, K. Mani Chandy, and Hiroshi Ishii, California
Institute of Technology
John Thornley opened by reminding us of a time-honored idea:
multithread programs on multiprocessor computers to make them run
faster. However, the idea of multithreaded programming has still had
very little impact on mainstream computing. Thornley asked why this is so and tried to explain it.
In his explanation, there are three types of obstacles to the
widespread adoption of multithreaded systems: the availability of
symmetric multiprocessing (SMP) computers, the lack of programming
systems, and the difficulty of software development. Until recently,
SMP technology has been very expensive and rare. Software tools were
always limited. Most were primitive and the product of academic
research. They have always been unreliable, nonportable, and difficult
to program. Things are changing. SMP computers are finally becoming
cheap enough for their use to spread beyond expensive research labs.
Commodity operating systems, notably Windows NT, support threaded
software. Multithreaded programming, however, remains difficult. This
is the focus of the research Thornley presented.
Why is programming so tough? Thornley argues that the problem is a lack
of structure. Current tools are designed for systems programming, which
is a small subset of all programming that could benefit from SMP
computers. Current synchronization operations are also very error-prone
because of their nondeterminism. These tools are at the level of "goto"
sequential programming. We need structured design techniques, modeled
after tried-and-true sequential techniques. We need direct control of
threads and synchronization. We need determinacy unless explicit
nondeterminacy is required. And we need performance that is portable
across different hardware and with different background loads.
The authors developed Sthreads, a new package of tools to deliver this
functionality. The programming model is "multithreaded program =
sequential program + pragmas + library calls." They claim that if a
programmer follows the rules of the model, multithreaded execution is
equivalent to sequential execution. This determinacy has many important
consequences simplifying software development. For example, a program
designed for multithreading can be run sequentially for debugging.
Sthreads is not a parallelizing compiler. Their pragmas are not hints;
they are specific directives. They are used around blocks and "for"
loops and indicate the section of code that should be explicitly
multithreaded. The Sthreads library provides counters to guarantee
correct order of execution, but also provides access to traditional
locks for nondeterministic programming.
Thornley presented a trivial code example that multiplies matrices. A
pragma indicating that it should be multithreaded precedes the outer
"for" loop. His second example was a little more complicated. It sums
up arrays of floating point numbers. Since floating point arithmetic is
not associative, the order of execution matters. To ensure equivalency
to sequential execution, Thornley's example uses a counter. It
guarantees sequential ordering and mutual exclusion in a section of
code. The use of Sthreads for this example is far simpler than using
the Win32 thread API.
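The exact Sthreads pragma and library names were not reproduced here, so the following is only an illustrative sketch of the model; the pragma spelling is assumed, not quoted from the paper.

  /* Illustrative sketch only; the pragma spelling below is assumed. */
  void matmul(int n, double *a, double *b, double *c)
  {
      int i, j, k;
      /* directive, not a hint: iterations of the outer loop may run as
         separate threads, with results equivalent to sequential order */
      #pragma multithreadable
      for (i = 0; i < n; i++)
          for (j = 0; j < n; j++) {
              double sum = 0.0;
              for (k = 0; k < n; k++)
                  sum += a[i * n + k] * b[k * n + j];
              c[i * n + j] = sum;
          }
  }

Because a compiler that does not recognize the pragma simply ignores it, the same file also builds as an ordinary sequential program, which is exactly the debugging property Thornley emphasized.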
The researchers theorize that this is all you need to make programs run
fast. They think that hardware and operating system software are ready
for multithreading of commodity software. Thornley made the very strong
statement that if multithreaded programming is not this simple, then it
will never become mainstream. He ended with the following testimonial.
They took a difficult aircraft route optimization problem and
implemented a solution using the Sthreads tools. In the end, their
solution running on an SMP system with four Intel processors
outperformed a Cray supercomputer solving the same problem using an
implementation designed with traditional programming techniques.
A Transparent Checkpoint Facility On NT
Johny Srouji, Paul Schuster, Maury Bach, and Yulik Kuzmin, Intel
Corporation
Paul Schuster and Johny Srouji presented their research and resulting
checkpoint tool. Checkpointing is the act of capturing the complete
state of a running process. Once captured, the state can later be used to resume the process on the same machine or to migrate it to another.
In the past, many checkpointing tools have been built for UNIX systems,
but this group began the development of their tool not knowing of any
others for Windows NT. Given NT's increased usage over the past few
years, they believed that such a tool was definitely needed.
The motivation for checkpointing is strong. It is a good way to prevent
the loss of data that is due to the failure of a long-running process.
It can also be used for debugging, to determine why that long-running
process failed and resulted in data loss. Most significantly,
checkpointing can be used to migrate a process from one machine to
another in a distributed environment, possibly to improve resource
utilization.
There were a number of design goals in the project. Foremost, it needed
to be transparent to the running application, so no source code changes
could be required. Obviously, it needed to be correct, but with a
minimal performance impact. It also had to be application-independent
and support multithreaded processes. The designers tried to make their
implementation as portable as possible, although they discussed only
the NT implementation.
For a checkpoint facility to be correct, it needs to capture the
complete state of a process. Schuster and Srouji illustrated a layering
of process state components. User objects, such as memory and thread
contexts, were on the bottom. System state objects were above those,
and GUI and external state objects were at the very top. Moving up the
layers, the complexity of capturing state increased. Schuster and
Srouji said they did not even attempt to capture state for the highest
layers. Their tool is limited to console applications.
It has both a user interface and a developer interface. The user runs
an application using an alternate loader, which configures the app to
run with automatic checkpointing. Alternatively, the application
developer can explicitly control when checkpointing occurs by using a
provided API.
Schuster and Srouji described the architecture of the checkpoint tool.
Their checkpoint DLL is loaded into the user memory space by the
loader. Its DllMain is called first, which rewrites the Import Address
Table (IAT). In doing so, it forces all Win32 API calls to be
redirected to checkpoint DLL functions.
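The general technique is well known; a condensed sketch (not the authors' code, with 32-bit assumptions, and with ordinal-only imports and error handling omitted) looks roughly like this:

  /* Sketch of IAT rewriting: find the import entry for one function in
     one DLL and overwrite it with a replacement routine. */
  #include <windows.h>
  #include <string.h>

  void patch_iat(HMODULE module, const char *dll, const char *func,
                 void *replacement)
  {
      char *base = (char *)module;
      IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
      IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);
      IMAGE_IMPORT_DESCRIPTOR *imp = (IMAGE_IMPORT_DESCRIPTOR *)(base +
          nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT]
              .VirtualAddress);

      for (; imp->Name != 0; imp++) {
          IMAGE_THUNK_DATA *names, *iat;
          if (_stricmp(base + imp->Name, dll) != 0)
              continue;
          /* OriginalFirstThunk lists names; FirstThunk is the IAT itself */
          names = (IMAGE_THUNK_DATA *)(base + imp->OriginalFirstThunk);
          iat   = (IMAGE_THUNK_DATA *)(base + imp->FirstThunk);
          for (; names->u1.AddressOfData != 0; names++, iat++) {
              IMAGE_IMPORT_BY_NAME *byname =
                  (IMAGE_IMPORT_BY_NAME *)(base + names->u1.AddressOfData);
              if (strcmp((char *)byname->Name, func) == 0) {
                  DWORD old;
                  VirtualProtect(&iat->u1.Function, sizeof(void *),
                                 PAGE_READWRITE, &old);
                  iat->u1.Function = (DWORD)replacement;  /* redirect calls */
                  VirtualProtect(&iat->u1.Function, sizeof(void *), old, &old);
              }
          }
      }
  }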
When it is time to perform a checkpoint, the tool has access to all
needed states. User state is in user memory, and since the tool is
implemented as a DLL running in user memory space, it can directly
access it. State associated with system calls is stored in system
memory. The checkpoint DLL does not have access to that memory, but it
can infer the system's internal state because it had a chance to see
all system calls.
To resume a process, the checkpoint tool loads the application
suspended. It rebuilds the state in the reverse order it captured it.
Finally, it releases the application threads, and they run as if they had never been stopped.
As implemented, Schuster and Srouji's work has a few limitations. Most
relate to external state, which they cannot control. If a process
creates any temporary files, they must still exist in their
checkpointed states during resume. Any applications that bypass the IAT
(by using GetProcAddress, for example) might not resume correctly. And
their tool does not even attempt to deal with simultaneously
checkpointing multiple processes that interact. In the future, they
plan to continue their research with more optimizations and more
comprehensive API support. They hope to improve performance with
incremental memory dumps. Eventually, they hope to use their
checkpointing tool for process migration.
Poster and Demo Session
Summary by Michael Panitz
The Poster and Demo session was a popular, well-attended event
featuring many research projects in a wide range of areas, from dynamic
optimization to process migration to NT-UNIX connectivity. The session
began with the session chair, John Bennett, inviting project presenters
to give a one-minute summary of their projects. After the summaries,
the audience was free to roam about, conversing both with the project
presenters and among themselves.
Many of the projects focused on network-related advances. A Bell
Labs/
Microsoft team collaborated on a project that exploited COM's custom
marshalling ability to run DCOM on RMTP, a multicast network protocol.
A group from Harvard presented a cluster-based Web server in which page
requests are preferentially forwarded to specific nodes, thus
significantly increasing performance by decreasing the total number of
pages each node is expected to serve. The Milan/Chime project
demonstrated a distributed shared memory system which was used to
support distributed preemptive scheduling (and task migration). Martin
Schulz, from the SMiLE group at TU-Munich, presented a system for
building a shared memory system, which supports transparent,
cluster-wide memory by exploiting SCI's hardware DSM support. Finally,
the Brazos parallel programming environment was presented, which
facilitates parallel programming by offering features such as being
able to run a Brazos program on uniprocessor, SMP, or clustered
computers without recompilation.
Two posters dealt with connecting NT to other systems. Motohiro Kanda
presented a mainframe file system browser, which allows one to access the file system of a Hitachi 7700 mainframe from Windows NT. Network
Appliance demonstrated a specialized, multiprotocol file server that
utilizes the SecureShare technology that was described in the Mixing
UNIX and NT technical session. The main feature of the NetApp file
server was that it supports both UNIX and NT file sharing. It allows NT
clients to access UNIX files and vice versa, in addition to NT to NT/
UNIX to UNIX access, all without client modification.
In a class by itself was SWiFT (not to be confused with the
checkpointing project NT-Swift), a toolkit to build adaptive systems.
The system is based on feedback control theory, and seeks to apply
hardware control theory to software problems. In doing so, it
facilitates the use of modular control components, explicitly specified
performance parameters, and from there, the automatic, dynamic
reconfiguration of software modules for good performance despite a
changing environment.
KEYNOTE ADDRESS
Buying Computing and Storage
by the Slice
Gary Campbell, Tandem Computers Inc.
Summary by Dan Mihai Dumitriu
Gary Campbell gave a compelling argument for cluster technology as a
more scalable and cost-effective replacement for SMPs, MPPs, and
Supercomputers. He argues that in today's computing world, SAN (System
Area Network) technology has matured to the point where it is a
feasible interconnect for clusters.
Key technologies that must exist in order for "computing by the slice"
to succeed are: x86 SMP systems, which are very inexpensive today;
balanced PCI; SAN interconnects such as VI (Virtual Interface)-based
solutions; and parallel programming standards.
Alternative technologies to clustering are SMPs, which do not scale
indefinitely and do not have the best price-performance curve, and
CC-nUMA (Cache Coherent Non-Uniform Memory Access). When applications
start to get broken up on a nUMA machine, it starts to look more and
more like a cluster. In addition, both of these technologies have
single points of failure, whereas clusters are architecturally ready
for fault-tolerant features.
The hardware necessary for computing by the slice is available, but the
software side still needs work. "Legacy" cluster systems -- such as
Tandem NSK, IBM SP2, and Digital UNIX -- are too difficult to
replicate. More recent products such as Microsoft's cluster service,
affectionately (?) called "Wolf Pair," do not scale well. Much work is
needed in the software and the distributed APIs.
Some case studies of "computing by the slice": The IBM DB2 database
running on 2 P6-200MHz machines interconnected with ServerNet gets 91%
scaling. The Inktomi Web search engine, which is a Berkeley NOW
derivative, is built out of 150 dual-processor UltraSparc II machines
connected with Myrinet. This highly parallel search engine can index
110 million documents and is highly economical. The Sandia Allegra
Model was originally built on a Cray and later on the Paragon. Now it is running on DEC Alphas interconnected with Myrinet.
The conclusion was that computing by the slice offers superior
price/performance, is architecturally ready, has lower time and cost to
market, and is even gaining ground in traditionally supercomputer
applications like sparse matrix computations, and also in commercial
parallel databases. The architecture is
primarily limited by the software. In
the future look for COM+, Java EJB,
and other perhaps more dramatic
developments.
KEYNOTE ADDRESS
Here Comes nUMA:
The Revolution in Computer Architecture and its OS Implications
Forest Baskett, Silicon Graphics Inc.
Summary by Dan Mihai Dumitriu
Forest Baskett presented an argument for CC-nUMA (Cache Coherent
Non-Uniform Memory Access). He asserted that the nUMA architecture is
inevitable in today's computing world, and he presented a successful
implementation of the hardware and software.
Baskett pointed out some problems that arise in modern systems: faster
buses must be shorter and run hotter, and faster CPUs run hotter and so
need more volume for cooling. Using point-to-point wires rather than
buses allows you to run them faster and cooler.
The SGI Origin 2000 is a successful implementation of CC-nUMA. It has
an integrated nUMA crossbar and a fat-hypercube interconnect. It has a
coherence and locality protocol, 64-bit PCI, and some fault-tolerant
features. Each node in the Origin 2000 has two MIPS R10000 CPUs.
The operating system is Irix 6.5. The computational and IO semantics of
the system are the same as for an SMP. For a small nUMA system an SMP
OS will work, as will SMP applications. Some disadvantages are
additional levels in memory and IO hierarchy. Even though the system
has a high-performance interconnect, latency in memory access is an
issue, as is the contention for resources between nodes.
According to Baskett, in order to optimize performance of parallel
applications, we want to be able to specify the topology of the system
as well as affinities for devices, and to be able to do this without
modifying binaries. Other issues that arise are page migration between
nodes and memory placement policies. Modifications to the OS kernel are
necessary to support being able to specify the initial placement of
applications consistent with the user-specified system topology,
replication of the kernel at boot time, a reverse page table, and a
page locking scheme. System management is also an issue with nUMAs. The
ability to partition systems so we can perform administrative shutdown
of parts of the system is desirable, as is a sophisticated batch system
that would enable users to see consistent running times.
SMP and DSM (Distributed Shared Memory) systems -- CC-nUMA is a variant of DSM -- are displacing vector supercomputers and MPPs.
Session: Mixing UNIX and NT
Summary by Michael Panitz
Merging NT and UNIX Filesystem Permissions
Dave Hitz, Bridget Allison, Andrea Borr, Rob Hawley, and Mark
Muhlestein, Network Appliance
Dave Hitz presented a fast and witty overview of the WAFL file system,
which enables network-based file sharing with both UNIX and NT clients.
Network Appliance has created a specialized file-sharing device that
uses WAFL to ease file sharing in a mixed NT/UNIX environment. The
three design goals of WAFL are: to make WinNT/95 users happy by
providing a security model that mimics NTFS; to keep UNIX users happy
by providing a security model that mimics NFS; and to allow Windows and
UNIX users to share files with each other.
Difficulties arise because UNIX (and its Network File System, NFS) and
NT (and its Common Internet File System, CIFS) are fundamentally
different, both in security models and in such aspects as case
sensitivity (NTFS is case-insensitive, NFS is case-sensitive). NFS divides permissions into (user, group, world), while CIFS uses Access
Control Lists (ACLs). CIFS uses a connection-based authentication
scheme, while NFS is stateless. WAFL was primarily designed to bridge
these two filesystems in the most secure manner possible, while
secondarily providing as intuitive an interaction as possible.
In addition to moderating access to files based on permissions, a
filesystem is expected to display permissions, to allow users to modify these permissions when appropriate, and to specify the default permissions to assign to a newly created file. WAFL uses both
permission mapping and user mapping to accomplish these goals. When a
UNIX client accesses an NT file, access is determined by UNIX-style
permissions that are generated from the ACL via a process called
"permission mapping." These "faked-up" permissions are guaranteed to be
at least as restrictive as the NT ACL. When an NT client requests
access to a UNIX file, access is determined by mapping the NT user to a
UNIX account, via a process called "user mapping." The presentation
argued that this was an effective, direct way to allow access in a
secure, reasonably intuitive manner.
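A conceptual sketch of the "at least as restrictive" idea (the types here are invented for illustration; WAFL's real algorithm is more involved): anything the ACL does not clearly grant is simply turned off in the faked-up mode bits.

  /* Conceptual sketch with invented types: derive r/w/x bits from a
     simplified ACL so the result is never more permissive than the ACL. */
  struct nt_ace { unsigned sid; int allow; unsigned rights; };
  #define NT_READ  0x1
  #define NT_WRITE 0x2
  #define NT_EXEC  0x4

  unsigned faked_up_bits(unsigned sid, const struct nt_ace *acl, int n)
  {
      unsigned granted = 0, denied = 0;
      int i;
      for (i = 0; i < n; i++) {
          if (acl[i].sid != sid)
              continue;
          if (acl[i].allow)
              granted |= acl[i].rights;
          else
              denied |= acl[i].rights;     /* deny entries always win */
      }
      granted &= ~denied;
      return ((granted & NT_READ)  ? 04 : 0) |  /* "r" */
             ((granted & NT_WRITE) ? 02 : 0) |  /* "w" */
             ((granted & NT_EXEC)  ? 01 : 0);   /* "x" */
  }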
The presentation finished by touching on some of the issues surrounding
WAFL, such as how to store NT ACLs, and on the administrative protocols
used by NT.
Pluggable Authentication Modules
for Windows NT
Naomaru Itoi and Peter Honeyman, University of Michigan
Naomaru Itoi began the presentation with an anecdote about the
motivation for creating a Pluggable Authentication Module (PAM) on NT.
At the University of Michigan there existed authentication modules for
both Kerberos and NetWare. This was great, but the authors wanted a
module that provided authentication for both Kerberos and NetWare
together; the only way to accomplish this was to create a third module.
To create this under NT would have been difficult and time-consuming.
They wanted an authentication system that would allow the user to log on
once, yet use many services ("single sign-on"), a system that would be
easy to administer, and a system that would be relatively easy to
develop new authentication modules for. What was wanted was a dynamic
security system for NT, much like the PAM system that provides dynamic
security for Linux and Solaris.
After explaining why such a system would be useful, the speaker gave an
overview of PAM, which is a de facto standard for administration, being
part of Linux, Solaris, and the Common Desktop Environment (CDE), and
also being standardized by the IETF. PAM allows for security
(re)configuration via a simple text file, which allows the
administrator to specify such things as which services (Kerberos,
NetWare, etc.) are required for, say, a logon attempt, or ftp session;
which are optional; which services should be logged on to using the
login password the user provides; which should be logged in to using a
password stored in a password file, etc. PAM also provides a high-level
API for authentication, so that different services can be wrapped and
then configured without a recompilation.
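For comparison, this is roughly what the application-side UNIX PAM API that NI_PAM mirrors looks like (the standard pam_start/pam_authenticate calls; misc_conv is the stock conversation function shipped with Linux-PAM):

  /* The application names a service and lets whatever modules the
     administrator configured for it do the real work. */
  #include <security/pam_appl.h>
  #include <security/pam_misc.h>

  static struct pam_conv conv = { misc_conv, NULL };   /* prompts the user */

  int authenticate(const char *user)
  {
      pam_handle_t *pamh = NULL;
      int ret = pam_start("login", user, &conv, &pamh); /* reads the config */
      if (ret == PAM_SUCCESS)
          ret = pam_authenticate(pamh, 0);  /* Kerberos, NetWare, ... modules */
      if (ret == PAM_SUCCESS)
          ret = pam_acct_mgmt(pamh, 0);     /* account checks */
      pam_end(pamh, ret);
      return ret == PAM_SUCCESS;
  }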
Itoi outlined a plan of attack by next explaining GINA, the
administrator-replaceable "Graphical Identification and Authentication"
user authentication component. GINA enables the administrator to
replace the default user authentication module with another, but still
suffers from the problem of having to write one module for the Kerberos
service, one for the NetWare service, and a third for Kerberos and
NetWare. Further, each module would have to be configured in its own
way, thus making administration of any significant number of NT
machines nearly impossible. Last, custom GINA modules require special
debugging tools and the use of difficult techniques, since GINA is run
before anyone logs in. The plan was to build a custom GINA that
implemented a subset of PAM, so that NT could be used, administered,
and developed for as easily as the UNIX PAM systems.
The design and implementation of the NT PAM, named NI_PAM, was presented, including the API it supports; a diagram showing which DLLs replace GINA.dll, and how they interact, was explained.
Itoi reported that much of PAM has been successfully implemented,
though more features need to be implemented, and more testing needs to
be done, before a large-scale rollout can take place. The presentation
concluded with some thoughts on alternate means of implementing PAM on
NT, and possible future directions of the work, such as use of
smartcards and screen saver locks.
Montage - An ActiveX Container for Dynamic Interfaces
Gordon Woodhull and Steven C. North, AT&T Laboratories --
Research
Montage grew out of an effort to create a Windows-friendly port of an
abstract graph/network editor from UNIX. The Windows graph editor would
integrate with Windows applications using ActiveX (also known as OLE -
Object Linking and Embedding), a runtime, object-oriented technology.
The edges and nodes of the graph would be embedded objects, and the
application itself would be an embeddable ActiveX container, which
sounded easy enough.
Unlike previously available containers, Montage separates both the
layout and control of the contained objects and the interface used to
control them from the container. Thus, Montage is actually an
externally controllable object container that is being used to create a
graph application, Dynagraph. Montage is itself an embeddable,
customizable ActiveX object, and allows dynamic changes to the layout
of contained objects. Thus, Montage could be used to display the
current state of a computer network, unlike something like dotty, which
is used to generate static graphs. All policy decisions (i.e., which
objects should be placed where, what size should they be, etc.) are
implemented in objects independent of the Montage objects, thus allowing one to change the style of layout without recompiling Montage (unlike a VB or MFC application).
Montage exploits the OCX96 technology of "transparent" (windowless) controls to provide different modes of interaction with the objects. This
allows Montage to provide a "Viewing Mode," in which the user can view
but not change the graph, and an "Editing Mode," in which the user can
both view and edit the graph. At the same time, the contained objects
themselves are allowed to request that their properties be set to a
certain value. A contained object could, for example, request to be
moved to point (x, y), and its foreground color set to blue, or to be
brought to the front. Montage then forwards this request to the layout
control engine, which then has the option of ignoring the request or
interpreting it if it so chooses.
The presentation included an impressive live demo, which showed
embedding a Word snippet into Montage, and then embedding a Montage
graph into Word.
Session: Networking and Distributed Systems
Summary by Hui Qin Luo
SecureShare: Safe UNIX/Windows File Sharing through Multiprotocol
Locking
Andrea J. Borr, Network Appliance, Inc.
Dennis Chapman, who made the presentation for author Andrea Borr,
employed illustrative examples to demonstrate the capabilities of
SecureShare. SecureShare is Network Appliance's solution to
multiprotocol file sharing between two different file systems, UNIX's
Network File System (NFS) and the Windows Common Internet File System
(CIFS) or "(PC)NFS."
SecureShare is a Multiprotocol Lock Manager providing file-sharing
capabilities between UNIX clients using NFS and Windows clients using
CIFS without violating data integrity. CIFS has hierarchical locking
and mandatory locking functionality that requires file-open and lock
retrieval before performing any operations such as reading, writing, or
byte-range locking. Unlike CIFS, UNIX's NFS has a nonhierarchical, advisory locking mechanism with no notion of file-open. It has no predeclarative functionality that specifies the kind of access a client needs to a file. These differences make file sharing in a mixed network environment difficult, if not impossible.
SecureShare's main selling point is the preservation of multiprotocol
data integrity by reconciling the locking mechanisms and file-open
semantics between the two different file systems, and multiprotocol
oplock ("opportunistic locks") management involving oplock requests
from NFS to CIFS oplock break.
CIFS opportunistic locks (with the exception of level II oplocks)
represent the equivalent of a file open with Read-Write/Deny-All access
mode. However, access attempts by other clients (using either CIFS or
NFS) to the oplocked file can cause the server to revoke the oplock
through an oplock break protocol. The client who obtained the oplock
gains the ability to read ahead on the open file, to cache write operations to the file, and to cache lock requests. In this way, the network
traffic to the file server is minimized. Chapman discussed the oplock
break protocol in a mixed CIFS and NFS environment. When another client
wishes to access the file, the client's request is suspended.
Afterwards, an oplock break message is sent to the operating system of
the CIFS client holding the oplock. The client operating system can
close the file and pipe all the changes of the file stored in the cache
to the file server. It can also pipe all the cached changes and remove
all the read-ahead data. It then transmits a reply to the fileserver
acknowledging the break.
One of the concerns brought up in the Q/A session was the handling of a
situation in which a client fails to respond to the oplock break
request sent by the server due to attempted access to the oplocked file
by NFS. Chapman claimed that there is an automatic session timeout on
the oplock held by the client's operating system, which, if triggered,
automatically relinquishes the stale batch oplock.
Harnessing User-Level Networking Architectures for Distributed
Object Computing over High Speed Networks
Rajesh S. Madukkarumukumana, Intel Corp.; Calton Pu, Oregon Graduate
Institute of Science and Technology; Hemal V. Shah, Intel Corp.
The introduction of high-performance user-level networking
architectures such as the Virtual Interface (VI) lays the groundwork for
improving the performance of distributed object systems. This
presentation by Rajesh Madukkarumukumana examined the potential of
custom object marshalling using VI, along with issues involved in the
overall integration of user-level networking into high-level
applications.
Component-based software like Distributed Component Object Model (DCOM)
uses remote procedure call (RPC) mechanisms to facilitate distributed
computing. Although distributed computing has matured over time, the
protocols that are relied on to transport data have remained virtually
unchanged, hindering the overall performance of networks such as SANs
(System Area Networks). The low-latency of user-level architectures
provides an attractive solution to the problem in SAN environments.
Madukkarumukumana chose to use DCOM and VI as the subjects of his
research. He presented his methodology for integrating a VI-based transport and a preliminary analysis of the performance results.
The VI architecture provides the illusion that each process owns the
network; many performance bottlenecks are bypassed, including the
operating system, to achieve this low latency, high bandwidth
connection. At the heart of the standard lie two queues for each
process, one for sending data, the other for receiving it; the queues
contain descriptors that state the work that needs to be done. Prior to
data transfer operations, a process called memory registration is
performed, allowing the user process to attach physical addresses to
virtual ones. Unique memory regions are referenced by these address
pairs, eliminating any further bookkeeping. Two data transfer
operations are accounted for -- the
standard send/receive operations, and Remote DMA (RDMA) read/write
operations.
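To make the queue-and-descriptor idea concrete, here is a toy, purely illustrative model (these are not the VI Provider Library calls): each connection owns a send queue and a receive queue of descriptors, and registered memory is referenced only by a handle.

  /* Toy model only -- not the real VI API. */
  #include <stdio.h>

  struct descriptor { void *addr; unsigned len; unsigned mem_handle; };
  struct queue { struct descriptor slots[16]; int tail; };
  struct vi { struct queue send_q, recv_q; };   /* per process/connection */

  static void post(struct queue *q, void *addr, unsigned len, unsigned h)
  {
      struct descriptor *d = &q->slots[q->tail++ % 16];
      d->addr = addr;          /* "work to be done": where the data lives, */
      d->len = len;            /* how much of it, and which registered     */
      d->mem_handle = h;       /* memory region the NIC may DMA directly   */
  }

  int main(void)
  {
      struct vi conn = {0};
      char msg[] = "hello";
      unsigned handle = 1;     /* stands in for a memory-registration handle */

      post(&conn.send_q, msg, sizeof(msg), handle);   /* outgoing message */
      post(&conn.recv_q, msg, sizeof(msg), handle);   /* buffer for input */
      printf("%d send, %d recv descriptors posted\n",
             conn.send_q.tail, conn.recv_q.tail);
      return 0;
  }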
DCOM is a network version of COM, used for the development of component
software. The network extensions in DCOM allow for all objects to be
addressed the same way, hiding their location. Encoding and decoding
data for transfer is called marshalling and unmarshalling,
respectively; the process of marshalling and unmarshalling creates a
stub object in the server process, and a proxy object in the client
process. Basically, three types of marshalling are used, but the one
that Madukkarumukumana discussed is custom marshalling: it allows for
the object to dynamically choose how its interface pointers are
marshalled.
In order for VI to do its job, the interface that DCOM uses to generate
the stub and proxy, referred to as IMarshal, has to be exposed.
Specialization in the object implementation is used to expose the
IMarshal interface. By exposing the parameters of the IMarshal
interface, new methods can be written to make use
of the VI send and receive queues. Information can therefore be sent
using the VI standard instead of the old UDP protocol. Since VI provides its own delivery guarantees, much of the overhead and interrupt handling involved in UDP is eliminated.
In discussing the results of his research, Madukkarumukumana stated
that one-way transfer latency dropped by about 30% to 60% in some cases, even under VI emulation alone. The existence of
core VI hardware provides a further dramatic increase in performance,
and an even greater performance boost may be expected if new procedures
catering to distributed computing systems are implemented within VI
(results forthcoming).
Implementing IPv6 for Windows NT
Richard P. Draves, Microsoft Research; Allison Mankin, University of
Southern California; Brian D. Zill, Microsoft Research
This presentation focused on the implementation and design details of
IPv6 for Windows NT as well as the common pitfalls/challenges
encountered in the process. IPv6 is the next generation Internet
Protocol (IP) worked by the IETF. The major driver behind it and some
of its key features were briefly mentioned in the paper; however,
anyone new to IPv6 who wishes to know more about its history and the
IPv6 specification should consult the relevant RFC documents referenced
in the paper.
The presentation started with an excellent overview of the Windows NT
networking architecture and how the IPv6 protocol stack can be/is
integrated into it. This was followed by a high-level overview of the
presenters' IPv6 implementation and some discussions of four
challenges/issues encountered and the specific solution used. The
presentation ended with some notes on the implementation's performance.
The segment on NT networking internals was very informative, especially
for novices. The details on the interfaces (documented or undocumented)
and protocols presented, along with the helpful references mentioned in
the paper, will prove useful for anyone trying to implement a different
network protocol stack for NT and even for Windows 95 (due to similar
network architectures). The integration of IPv6 into this networking
architecture was also highlighted. Mainly, a Winsock DLL module was
added to provide user-level socket functionality for IPv6 addresses,
and a new TCPIP protocol driver to replace the IPv4. The implementation
was "single stack," supporting only IPv6, and though not efficient was
useful in isolating problems in the IPv6 stack during testing. The
logical layout of the implementation was divided into three layers
similar to IPv4 -- the link layer, the network layer (IP), and the
transport or upper layer which includes protocols such as TCP, UDP, and
ICMP.
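From an application's point of view, using the new stack looks like ordinary Winsock code with the IPv6 address family; the sketch below assumes the usual AF_INET6/sockaddr_in6 names (the MSR release's headers may differ in detail).

  /* Sketch: an IPv6 listening socket via the ordinary Winsock calls. */
  #include <winsock2.h>
  #include <ws2tcpip.h>
  #include <string.h>

  SOCKET open_v6_listener(unsigned short port)
  {
      WSADATA wsa;
      SOCKET s;
      struct sockaddr_in6 addr;

      WSAStartup(MAKEWORD(2, 0), &wsa);
      s = socket(AF_INET6, SOCK_STREAM, 0);   /* IPv6 instead of AF_INET */

      memset(&addr, 0, sizeof(addr));
      addr.sin6_family = AF_INET6;
      addr.sin6_port = htons(port);           /* sin6_addr zeroed = any  */

      bind(s, (struct sockaddr *)&addr, sizeof(addr));
      listen(s, 5);
      return s;
  }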
Four noteworthy problems and their implemented solutions were
discussed. They range from inefficiencies in lower-level network device
handlers during a receive cycle to deadlock avoidance issues.
Performance measurements for the implementation were taken using TCP
throughput as the indicator metric and compared to the IPv4 stack.
Results showed marginal performance degradation with the IPv6 stack
(2.5% over a 10Mb/s LAN), somewhat higher than the 1.4% expected from
the increased IPv6 header length alone. This is understandable, as the
developers never intended this to be an optimized implementation, but
rather a base for further research and a push toward an eventual
Microsoft product release. Whether we will
see better performance results in Microsoft's official product
implementation of IPv6 is still to be determined. Comparative
performance measurements against other IPv6 implementations (Solaris,
Digital Unix, BSD variants, etc.) were omitted. Comparing each
implementation's performance relative to its IPv4 counterpart could
serve as an indicator. Direct IPv6 TCP throughput comparisons might
not be fruitful because of differences in the O/S architecture each
implementation was targeted for, unless IPv4 performance was similar
across these platforms. Source code size comparisons were done against
another publicly available IPv6 implementation (INRIA IPv6).
"Great sample code" is available at
<https://research.microsoft.com/msripv6> for anyone who wishes to
dabble in Windows NT network protocol development or who wants a
starting code base for further IPv6 research and experimentation. A more
full-fledged release with security, authentication, and mobility
support is expected to be available in the future.
Session: Real-time Scheduling
Summary by Jason Pettiss
A Soft Real-time Scheduling Server on Windows NT
Chih-han Lin, Hao-hua Chu, and Klara Nahrstedt, University of
Illinois
Hao-hua Chu spoke about his group's implementation of a software
realtime CPU scheduler for Windows NT. NT schedules applications
indiscriminately, using multi-user time-sharing. Multimedia performs
poorly under these conditions, especially when time-insensitive but
CPU-hungry tasks such as compilation are running in the background.
The scheduling server is a daemon from
which applications can request and acquire periodic processing time.
The scheduler requires no kernel modifications, uses the rate monotonic
algorithm, supports multiple processors (SMP model), and provides
guarantees for timeshare processes so that they aren't starved by
realtime tasks. Chu says there is "reasonable" performance at this
point, the main problem being limited overrun protection due to the
fact that the scheduler itself is a process and sometimes isn't woken
up on time by NT.
The architecture consists of a broker and a dispatcher. The broker
handles reservation requests, builds a dispatch table, and fills the
table's available slots; the dispatcher reads the table and responds
appropriately. The dispatch table is configurable for the number of
processors, the number of available slots, and the slot time-slice.
Dispatching occurs by changing the thread and process priority of
participating applications between idle and highest priority realtime
(1-31).
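
The priority-toggling dispatch step can be pictured with the ordinary
Win32 priority APIs. The sketch below simply raises one participating
thread to time-critical realtime priority for its slot and drops it
back to idle afterwards; it is a reconstruction of the mechanism as
described, not the UIUC code.

// Sketch of the dispatch idea: a "scheduled" thread is raised to the top
// of the realtime range for its slot and dropped to idle afterwards.
#include <windows.h>

// Give the target thread the CPU for one slot of 'sliceMs' milliseconds.
void RunSlot(HANDLE hProcess, HANDLE hThread, DWORD sliceMs) {
    // Move the owning process into the realtime class so thread priorities
    // map onto the upper (16-31) range, then raise the thread itself.
    SetPriorityClass(hProcess, REALTIME_PRIORITY_CLASS);
    SetThreadPriority(hThread, THREAD_PRIORITY_TIME_CRITICAL);

    Sleep(sliceMs);   // the slot: the thread runs ahead of timeshare work

    // Demote the thread again so other reservations (and timeshare
    // processes) get their turn.
    SetThreadPriority(hThread, THREAD_PRIORITY_IDLE);
    SetPriorityClass(hProcess, IDLE_PRIORITY_CLASS);
}

int main() {
    // Demonstration on the calling process/thread itself; a real dispatcher
    // would open handles to the participating application instead.
    RunSlot(GetCurrentProcess(), GetCurrentThread(), 20);
    return 0;
}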
To test, Chu's team used a dual Pentium 200 with 96 MB RAM (an HP
Vectra XU). The time-slice was set to 20ms. They ran two processes
running MPEG decoders at FPS, one Visual C++ compilation of the MPEG
decoder, and four processes computing sine and cosine tables.
Dispatch latency worked out to be about 640 microseconds, which was
longer than they would have liked but not large enough to disrupt
scheduling. Performance of the two time-sensitive processes was
improved, Chu noted.
The main problem, reiterated Chu, was that NT sometimes did not wake up
the dispatcher on time. Also, the dispatcher, being an NT process
itself, cannot preempt real-time processes. This means there is weak
overrun protection. The provided NT timers weren't accurate enough, so
they used the Realtime Extension (RTX) from VenturCom to get under 1ms
resolution.
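
For comparison, stock NT can be pushed somewhat closer to fine-grained
periodic wakeups without a third-party extension, for example by
raising the multimedia timer resolution and blocking on a waitable
timer, as sketched below. This is only an approximation of the idea; it
offers none of the sub-millisecond guarantees the group obtained from
RTX.

// A periodic wakeup loop using only stock Win32 services: raise the
// multimedia timer resolution, then block on a waitable timer. On NT this
// typically gets wakeups near 1 ms granularity, with no hard guarantees.
#include <windows.h>
#include <mmsystem.h>
#pragma comment(lib, "winmm.lib")

int main() {
    timeBeginPeriod(1);                        // request ~1 ms timer resolution

    HANDLE timer = CreateWaitableTimer(nullptr, FALSE, nullptr);
    LARGE_INTEGER due;
    due.QuadPart = -10000LL;                   // first fire in 1 ms (100-ns units, relative)
    SetWaitableTimer(timer, &due, 1 /* period in ms */, nullptr, nullptr, FALSE);

    for (int tick = 0; tick < 1000; ++tick) {
        WaitForSingleObject(timer, INFINITE);  // wake once per period
        // ... dispatcher work for this slot would go here ...
    }

    CloseHandle(timer);
    timeEndPeriod(1);                          // restore the default resolution
    return 0;
}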
They have much future work planned. Support is planned for varying
processing time per period and for a process service class, similar to
ATM traffic classes. They hope to run conformance tests. Also, they
would like to adapt a multimedia decoder to increase reservation or
decrease quality as necessary. An additional feature, probing and
profiling, could be added to figure out how much processing time to
reserve on a per-application basis.
Vassal: Loadable Scheduler Support for Multi-Policy Scheduling
George M. Candea, Oracle Corporation; Michael B. Jones, Microsoft
Research
George Candea presented Vassal, a system that utilizes loadable
schedulers to enable multi-policy, application-specific scheduling on
Windows NT. He led off with the example of a speaker late for a
presentation (an application) that knows where it needs to be and when,
and a cab driver (the OS) that can get him there on time if he can
communicate with him. Windows NT is more like a cab driver who hasn't
learned English yet -- the operating system multiplexes the CPU
among tasks, unaware of their individual scheduling needs.
Since no single algorithm is good enough for all task mixes, explained
Candea, a compromise would be to hardcode more than one scheduling
policy into the kernel. But even better would be a dynamically
extensible set of policies, made possible by separating policy
(scheduling) from mechanism (dispatching). This lets a developer
concentrate on coding policy. It also would allow different
applications to communicate with their preferred policy to bargain for
scheduling time. Custom schedulers are special Windows NT drivers that
coexist with the default NT scheduler, and which should have negligible
impact on global performance.
Candea then reviewed the current state of Windows NT scheduling. The
basic schedulable unit is the thread, which acquires CPU time based on
priority levels of two classes: variable, which uses a dynamic priority
round-robin, and realtime, which uses a fixed priority round-robin.
Interrupts and deferred procedure calls (DPCs) have precedence over
threads, so scheduling predictability is limited. Scheduling events are
triggered by: the end of a thread quantum, priority or affinity
changes, transition to wait state, or a thread waking up.
NT timers use the hardware abstraction layer (HAL), which provides the
kernel with a periodic timer of variable resolution. Candea noted that
most HALs have resolution between 1 and 15 ms, but some HALs are worse
than others -- some can be set to only powers of two, while others
are fixed at 10 ms. This is certainly another limitation to scheduling
with any policy under NT.
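
From user mode one can at least observe the granularity the platform is
willing to provide, for example with the multimedia-timer capability
query below. This probes the timer services layered above the HAL
rather than the HAL itself, so it only approximates the limitation
Candea described.

// Query the range of timer periods the platform will honor from user mode.
#include <windows.h>
#include <mmsystem.h>
#include <stdio.h>
#pragma comment(lib, "winmm.lib")

int main() {
    TIMECAPS tc;
    if (timeGetDevCaps(&tc, sizeof(tc)) == TIMERR_NOERROR)
        printf("timer period: %u ms (min) to %u ms (max)\n",
               tc.wPeriodMin, tc.wPeriodMax);
    return 0;
}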
Vassal separates policy from mechanism. While the NT scheduler consists
of thread dispatch and a default scheduler, Vassal consists of many
schedulers (policy modules) arranged hierarchically and a separate
dispatch module in charge of preempting and awakening threads. Standard
NT policies remain in the kernel so that applications with no special
needs are handled as usual.
The schedulers register decision-making routines with the dispatcher.
The dispatcher invokes these when scheduling events occur. Threads can
communicate with schedulers to request services. These new features
require some interface modifications, with the addition of three system
calls.
As proof-of-concept, the Vassal team implemented a sample scheduler
that can be loaded in addition to the default NT policy. The sample
allows threads to get scheduled at application-specified time
instances, which is simplistic, Candea admitted, but demonstrates
potential for more interesting time-based policies.
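
The summary does not give the actual system-call names or scheduler
entry points, but the division of labor can be sketched in miniature.
Everything named below (VASSAL_SCHEDULER, VslRegisterScheduler, the
callback signature) is invented for illustration, and this mock runs
entirely in user mode, whereas the real schedulers are kernel drivers.

// Hypothetical, user-mode mock of the Vassal idea: a dispatcher
// (mechanism) that knows nothing about policy, and a loadable policy
// module that registers decision routines with it.
#include <stdio.h>

typedef int THREAD_ID;                       // stand-in for a kernel thread object

// Decision-making routines a policy module registers with the dispatcher.
struct VASSAL_SCHEDULER {
    // Called on scheduling events (quantum end, wait, wakeup, ...).
    // Returns the thread to run next, or -1 to defer to the default policy.
    THREAD_ID (*PickNextThread)(void);
};

// --- Mechanism: the dispatcher just asks whatever policy is loaded. ---
static const VASSAL_SCHEDULER *g_policy = nullptr;

void VslRegisterScheduler(const VASSAL_SCHEDULER *policy) { g_policy = policy; }

void OnSchedulingEvent() {
    THREAD_ID next = g_policy ? g_policy->PickNextThread() : -1;
    if (next < 0)
        printf("dispatch: default NT policy handles this event\n");
    else
        printf("dispatch: loaded policy chose thread %d\n", next);
}

// --- Policy: a trivial module that always picks thread 7. ---
static THREAD_ID AlwaysThreadSeven() { return 7; }

int main() {
    OnSchedulingEvent();                     // no policy loaded: default behavior

    VASSAL_SCHEDULER policy = { AlwaysThreadSeven };
    VslRegisterScheduler(&policy);           // "load" the scheduler
    OnSchedulingEvent();                     // now the loaded policy decides
    return 0;
}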
Minimal NT kernel changes were required: they added 188 lines of C
code, added 61 assembly instructions, and replaced 6 assembly
instructions. The scheduler itself was only 116 lines of C code,
required no assembly language, and consisted purely of policy code.
Performance results showed that the system was no slower when no
specialized scheduler was loaded, with only 8% overhead in their
untuned prototype. With the special scheduler loaded, the
predictability of periodic wakeup times significantly improved, and
there were no longer early wakeups. There were a few slightly late
wakeups (still less late than without the prototype loaded), and these
were caused by unscheduled events such as interrupts and DPCs.
Candea emphasized that the Vassal take-home points are that it demonstrates the
viability and positive impact of loadable schedulers, and that it frees
the OS from anticipating all possible application scheduling
requirements. It also encourages interesting research in this area by
making it easy to develop and test new policies, and doesn't adversely
affect the OS.
Related work includes Solaris, which maps scheduling decisions onto a
global priority space; extensible OS work such as SPIN, Exokernel, and
Vino; and hierarchical schedulers such as Utah's CPU scheduling work
and the UIUC Windows NT soft real-time scheduler.
Many questions followed. Someone suggested that two different
schedulers must be compatible or there will be trouble. Candea agreed
that this was an interesting problem that could be solved by allowing
schedulers to talk to each other and "negotiate" or to use their
descriptions and decide whether a conflict will occur. Another
questioner noted that a driver has limited visibility into the NT
kernel and asked whether this limits the power of a scheduler. The
answer is yes; ideally these special drivers would be able to see into
kernel data structures to gain real power. The moderator asked what can
be done about predicting DPCs and interrupts. Candea didn't think
prediction was necessary, noting that these are best left to perform
their crucially important tasks when they need to.
<https://pdos.lcs.mit.edu/~candea/research.html>
<https://research.microsoft.com/~mbj>
Session: NT Futures
Tom Phillips and Felipe Cabrera, Microsoft Corporation
Summary by Kevin Chipalowsky
Felipe Cabrera and Tom Phillips demonstrated the upcoming Windows NT
5.0 operating system and the NT Services for UNIX add-on. The two-hour
session was very open, informal, and at times emotional for some in the
audience. Microsoft gave a presentation while inviting just about any
type of question regarding the future of their flagship operating
system. Cabrera and Phillips fielded very spirited questions and
comments.
Phillips began his NT 5.0 presentation by describing its new support
for upgrading from Win 9x. The setup program first scans a system for
compatibility before installing anything. It will now migrate
applications as part of the upgrade process, using plug-ins to support
third-party software. The system configuration is also preserved during
the transition to NT.
Next, Cabrera talked about the new volume-management infrastructure.
The new version of Windows NT will have many new storage management
features. In most cases, hard-drive partitions can be manipulated
without requiring a reboot. For example, partition size can grow and
shrink dynamically. There are also new reliability and security
features, such as a "change" journal and file encryption. In response to a
question about the type of encryption, Cabrera explained that it uses
public key cryptography and is designed to prevent thieves from
examining data on a stolen laptop.
Cabrera also talked about the new file-based services sported by NT
5.0. It provides a new content indexing tool, which can be used to
search for files based on their content instead of just their filename.
It tracks common types of embedded file links and updates them when a
data file is moved to a different volume or even a different machine.
There is also a new automated recovery system to revive a computer
that otherwise will not boot.
The speakers then presented NT Services for UNIX. Microsoft developed
it in response to the growing adoption of NT by previous UNIX users. It
will be available for Windows NT 4.0 with Service Pack 3 for $149 per
client. It is in beta, and anyone interested in being a beta tester
should email <gregsu@microsoft.com> with "sfubeta" as the
subject.
NT Services for UNIX will make an NT machine feel more like a UNIX one.
It allows users to access NFS partitions like any NTFS or FAT
partition. A new Telnet client and server are also included, so
administrators can remotely Telnet into a Windows NT system and run
console-based applications. Microsoft has licensed a Korn shell
implementation and a few dozen familiar UNIX tools from MKS. The
audience loudly applauded this part of the presentation.
Next, Cabrera and Phillips revealed storage features; the new Microsoft
Management Console (MMC) was the center of attention. It is a
management container that provides Microsoft and third-party developers
the opportunity to plug in software to manage just about anything.
The first demonstration was of RAID-5 support. One hard drive of a
stripe volume was removed and later plugged back into the system.
Although the underlying file system seemed to handle the intentional
fault, MMC simply crashed and the computer needed a reboot. After this
small setback, the demo moved on.
Hierarchical Storage Management (HSM) is another feature new to NT 5.0.
It makes use of the observation that the most commonly used files are
usually the most recently used files. When a hard-drive partition
becomes full, the filesystem offloads older files to a tape backup
system to free up hard-drive space. Although Microsoft is
not the first to attempt to build HSM support into NT, they believe
they will be the most successful. They have complete control over the
operating system and can fix all of the related utilities that would
otherwise have difficulty with the extremely long latency that results
from trying to open certain files.
Networking in NT 5.0 has also received an overhaul. As with volume
management, most network configuration changes can now be made without
rebooting the system. Microsoft also claims an improved programmable
network infrastructure. The TCP/IP stack is also enhanced; it runs
faster and supports security and QoS protocols.