
USENIX 1998 Annual Technical Conference

June 15-19, 1998


Science and the Chimera
James "The Amazing" Randi

Summary by Peter Collinson

James Randi's keynote talk kicked the 1998 Technical Conference into life. His talk zipped by, and we walked out at the end having seen several examples of his work, his magic, and his mission to expose the quacks and charlatans who often employ technology to fool us, the gullible public.

James's roving, debunking eye has been aimed in many directions, from harmless mentalists to the somewhat more serious faith healers in both the US and the Philippines whose activities not only net them huge sums of money, but also give false hope to people whom medical care cannot cure. He looked at homeopathic cures which are now sold in many drugstores. These cures are simply water because the original chemical has been diluted 10^2500 times.

His talk ended on some serious notes:

  • We need to teach our children to think critically so they stop being fooled.
  • We need to stand up and expose fraudulent use of science and technology.

His Web site is well worth a visit.


Session: Performance 1

Summary by Tom M. Kroeger

Scalable Kernel Performance for Internet Servers Under Realistic Loads

Gaurav Banga, Rice University; Jeffrey C. Mogul, Digital Equipment Corp., Western Research Lab

This work, presented by Gaurav Banga, earned both the Best Paper and Best Student Paper awards at the conference. It examined an inconsistency between the observed performance of event-driven servers under standard benchmarks, like SPECweb96, and real workloads. Banga and Mogul observed that in a WAN environment, which is characterized by inherent delays, a server is forced to manage a large number of connections simultaneously. Because commonly used benchmarks lack slow client connections, they fail to test system behavior under such conditions. From this observation the authors developed a new WAN benchmark that tries to model slow connections.

They then profiled the Squid proxy server under both a standard benchmark and their new benchmark. The standard benchmark showed no specific procedure in the system to be a bottleneck, but in the WAN benchmark the kernel procedures for select and file descriptor allocation (ufalloc) accounted for 40% of the CPU time. With this information Banga explained how they examined the implementations of select and ufalloc.

The select system call in Digital UNIX (and in fact in most UNIX variants) was designed at a time when several hundred connections as an input argument would have seemed extreme. Banga explained how the current implementations scale poorly because of linear searches, layered demultiplexing, the linear system call interface, and the pressure that these functions put on the CPU data cache. The key insight here is that select wastes a significant portion of time rediscovering information that previous stages of protocol processing had available. Using hints to transfer this state information, they were able to prune the scans that select needed to perform.
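The hint-passing idea can be sketched as follows (in Python for brevity; the class and method names are invented and this is not the paper's Digital UNIX code). Protocol processing records which descriptors became ready, so select consults a small hint set rather than rescanning every descriptor of interest:

```python
# Sketch of hint-based select: instead of a linear scan over the whole
# interest set, protocol processing marks descriptors as they become
# ready, and select() intersects the hints with the caller's set.
class HintedSelect:
    def __init__(self):
        self.ready_hints = set()   # filled in by earlier protocol stages

    def mark_ready(self, fd):
        # called when data arrives for fd, during protocol processing
        self.ready_hints.add(fd)

    def select(self, interest_set):
        # O(|hints|) work instead of a scan over all of interest_set
        ready = self.ready_hints & interest_set
        self.ready_hints -= ready
        return ready
```

The point of the sketch is that the expensive rediscovery step disappears: readiness information computed once, earlier in the stack, is carried forward instead of being thrown away.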

Next Banga explained how they examined ufalloc. This procedure is called every time a new file descriptor is allocated. Again a linear search was at the heart of the problem. UNIX semantics state that ufalloc must provide the lowest free descriptor; this prevents the use of a free-list. To solve this problem, the authors reimplemented ufalloc, adding a two-level tree of bitmaps to indicate available descriptors. This new implementation changed ufalloc's complexity from O(n) to O(log n). It also provided for better cache behavior requiring two memory reads vis-à-vis a sequential scan that would thrash the entire data cache. Lastly, it provided better locking behavior because of a shorter critical section.
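A two-level bitmap of the kind described can be illustrated like this (an invented toy, not the Digital UNIX source): the top level has one bit per group of 32 descriptors, set when the group contains at least one free slot, so finding the lowest free descriptor touches two words instead of scanning the whole table:

```python
# Illustrative two-level bitmap for lowest-free-descriptor allocation.
class DescriptorTable:
    GROUP = 32

    def __init__(self, ngroups):
        self.level2 = [(1 << self.GROUP) - 1] * ngroups  # 1 bit = free slot
        self.level1 = (1 << ngroups) - 1                  # 1 bit = group has a free slot

    def alloc_lowest(self):
        if not self.level1:
            raise OSError("out of descriptors")
        g = (self.level1 & -self.level1).bit_length() - 1   # lowest group with space
        word = self.level2[g]
        b = (word & -word).bit_length() - 1                 # lowest free bit in group
        self.level2[g] = word & ~(1 << b)
        if not self.level2[g]:
            self.level1 &= ~(1 << g)                        # group now full
        return g * self.GROUP + b

    def free(self, fd):
        g, b = divmod(fd, self.GROUP)
        self.level2[g] |= 1 << b
        self.level1 |= 1 << g
```

Because allocation reads one word per level, the cache footprint stays small regardless of how many descriptors are in use, which is the behavior the authors were after.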

Banga explained how they set up two testbeds to evaluate the effect of these changes. First, using the WAN benchmark on both the Squid proxy and the thttpd Web server, they showed that scalability with respect to connection rate and connection count was significantly improved. They then tested their changes under a live load, i.e., on Digital's Palo Alto Web proxies. Again, they found significant improvements in performance from the modified systems.

Tribeca: A System for Managing Large Databases of Network Traffic

Mark Sullivan, Juno Online Services; Andrew Heybey, Niksun, Inc.

Mark Sullivan presented the Tribeca system for network traffic analysis, which the authors developed after noting how the use of ad hoc analysis programs resulted in redundant efforts. They have developed a general query system for an environment where network traffic data streams by at rates of up to 155 megabits per second. They observed that a typical relational database system would not be effective for network analysis for the following reasons. Both the data and storage media are stream oriented. Relational database systems do not normally handle tape data well, and tape data are commonly used for network traffic analysis. The operators needed are more like those in temporal and sequential databases. Traffic analysis commonly requires running several queries during a single pass. Lastly, relational database systems rarely consider the memory capacity of the system on which they are running. The Tribeca system addresses all of these issues, but also differs from conventional relational databases in that it does not support random access to data, transactional updates, conventional indices, or traditional joins. Tribeca takes its source data from either a live network adapter or tape data.

The query language in Tribeca is based on a data description language. The different protocols are expressed as different data types; this language then allows the user to create new types by extending compiled-in types. This language also provides support for inheritance, arbitrary offsets, and bit fields. Each query has one source stream and one or more result streams. To manipulate these streams, Tribeca provides three basic operators: qualification, projection, and aggregation. Additionally, the query language provides for stream demultiplexing and remultiplexing. Finally, the language also provides a method for operating on windows over the stream.
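The three basic operators compose naturally as a stream pipeline. The toy below (record format and function names invented for illustration, not Tribeca's query language) shows qualification, projection, and aggregation applied in one pass over a packet stream:

```python
# A toy pipeline in the spirit of Tribeca's three stream operators.
def qualify(stream, pred):
    # qualification: keep only records satisfying a predicate
    return (r for r in stream if pred(r))

def project(stream, fields):
    # projection: keep only the selected fields of each record
    return ({f: r[f] for f in fields} for r in stream)

def aggregate(stream, key, field):
    # aggregation: sum a field per key, consuming the stream once
    totals = {}
    for r in stream:
        totals[r[key]] = totals.get(r[key], 0) + r[field]
    return totals

packets = [
    {"src": "a", "proto": "tcp", "bytes": 100},
    {"src": "b", "proto": "udp", "bytes": 40},
    {"src": "a", "proto": "tcp", "bytes": 60},
]
tcp_only = qualify(packets, lambda r: r["proto"] == "tcp")
slim = project(tcp_only, ["src", "bytes"])
per_src = aggregate(slim, "src", "bytes")
```

Generators keep the intermediate state per record rather than per stream, loosely mirroring Tribeca's concern that a query's working set fit in available memory.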

Tribeca's implementation shares several similarities with traditional relational database systems. Queries are compiled into directed acyclic graphs. These graphs are then optimized to improve performance. The basic data management for Tribeca makes use of existing operating system support for sequential I/O, and because data are not updated, no support for concurrency control is needed. Special attention was paid in the implementation to minimize internal data copying. Additionally, the optimizer also works to ensure that a query's intermediate state can be fit into the memory available.

The authors presented some basic tests to examine the overhead incurred. They compared the overhead for a basic query with that of the standard UNIX program dd: dd used 68% of the CPU on a Sparc 10, while Tribeca used 70%-75% of the CPU time. Lastly, they compared the performance of Tribeca to that of programs written directly to execute a specific query. The results showed that Tribeca incurs between 1% and 7% overhead. The authors concluded by noting that the increased flexibility and convenience provided by Tribeca are well worth the minimal overhead introduced.

Transparent Result Caching

Amin Vahdat, University of California, Berkeley; Thomas Anderson, University of Washington

Amin Vahdat presented a system (TREC) developed by the authors to track output of a process's execution based on the inputs. Using this information, TREC provides a framework to make use of previously existing outputs and observe process lineage and file dependencies.

Implemented through the use of the proc file system under Solaris, TREC intercepts the open, fork, fork1, creat, unlink, exec, execve, rename, and exit system calls. By catching these calls TREC is able to record a process's input files, child processes, environment variables, command line parameters, and output files.

After explaining the basic architecture, Vahdat addressed the limitations of TREC. To address concerns about the performance overhead of intercepting system calls, the authors measured the added overhead for a test program that simply called open and close in a loop and for two typical application executions. The test program saw 54% overhead, but the typical applications saw only 13% and 3%. The authors observed that the overhead is directly proportional to the system call rate and noted that a kernel-level implementation would significantly reduce it.

The authors also noted several requirements for TREC to produce correct results. The program itself must be deterministic and repeatable; it cannot rely on user input, and interaction with environment variables must be reproducible. File contents must be static; files such as /dev/rmt0 could produce incorrect results. File contents must be changed locally; for example, NFS-mounted files could be modified on a remote machine without being reported to TREC. Processes that base their results on communication with remote servers cannot, in general, be correctly tracked. Lastly, a program must complete successfully.

After detailing the limitations of this system, the authors provided three examples of applications that use the TREC framework: unmake, transparent make, and a Web cache that enables server and proxy caching of dynamically generated Web content.

Unmake provides a facility for users to query the TREC framework to determine how specific output files were created, as well as to answer questions about process lineage. Transparent make provides an alternative to make that automatically determines file dependencies. Instead of providing a possibly complicated Makefile, the user provides a simple shell script that performs the complete build sequence. Once transparent make has observed this shell script and each program's inputs and resulting outputs, it can be used for subsequent builds to execute only those commands for which the inputs have changed. This system has the following advantages: user errors in dependency specification are avoided, dependencies are updated as the input changes (e.g., a header file is added to a program being compiled), and users are saved from learning the Makefile specification language. Transparent make provides two variants: a passive version that updates output files when executed and an active version that registers a callback with the TREC framework. In the active version, when TREC observes a change to an input file for which a callback was registered, transparent make prompts the user to reexecute the registered module.
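The rebuild-only-what-changed logic can be sketched as follows (all names invented; this is a toy model, not the authors' implementation). Each command's observed input files are fingerprinted, and a command reruns only when its recorded fingerprints no longer match:

```python
# Toy model of the transparent-make idea: rerun only commands whose
# observed inputs have changed since the previous build.
import hashlib

def digest(path, contents):
    # fingerprint a file's contents (contents: path -> file text)
    return hashlib.sha256(contents[path].encode()).hexdigest()

def build(commands, contents, last_inputs):
    """commands: list of (name, input_paths) observed by a first full run;
    last_inputs: name -> {path: digest} recorded by the previous build."""
    ran = []
    for name, inputs in commands:
        now = {p: digest(p, contents) for p in inputs}
        if last_inputs.get(name) != now:    # some input changed (or first build)
            ran.append(name)                # (re)execute the command here
            last_inputs[name] = now
        # unchanged inputs: skip the command, reuse its previous outputs
    return ran
```

On the first build every command runs; afterward, editing a single header reruns only the commands that were observed reading it, with no Makefile ever written.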

The third example of the uses for the TREC framework is a modification to the Apache Web server to cache the results of cgi scripts. The authors modified an Apache server to store copies of the results from a cgi program's execution indexed by the program's parameters. When the cgi program is called, the server first checks for a pregenerated result for the requested program and parameters. If these exist, it responds with the contents of that file instead of executing the cgi script. To invalidate these dynamic cache entries, the TREC framework is then used to profile the execution of each cgi program. When an input to this program is observed to change, TREC notices a registered callback similar to those for the active version of transparent make. This callback invalidates the cached result. Comparing the two servers (caching versus forking) with a basic cgi script, the authors observed a 39% improvement in average response time.
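The cache-plus-invalidation flow can be modeled like this (a sketch with invented names, not the authors' Apache patch): results are indexed by program and parameters, and a TREC-style callback invalidates entries when a tracked input file changes:

```python
# Sketch of CGI result caching with callback-based invalidation.
class CgiCache:
    def __init__(self):
        self.cache = {}        # (prog, params) -> stored output
        self.watchers = {}     # input file -> set of cache keys to invalidate

    def run(self, prog, params, inputs, execute):
        key = (prog, tuple(sorted(params.items())))
        if key not in self.cache:
            self.cache[key] = execute(params)      # real execution, once
            for f in inputs:                       # register invalidation callbacks
                self.watchers.setdefault(f, set()).add(key)
        return self.cache[key]

    def file_changed(self, path):
        # TREC-style callback: an observed input changed, drop stale results
        for key in self.watchers.pop(path, ()):
            self.cache.pop(key, None)

calls = []
def render(params):                  # stands in for running the CGI program
    calls.append(1)
    return "page for %r" % (params,)

cache = CgiCache()
cache.run("report.cgi", {"q": "x"}, ["db.txt"], render)
cache.run("report.cgi", {"q": "x"}, ["db.txt"], render)   # served from cache
hits_after_two = len(calls)          # executed only once so far
cache.file_changed("db.txt")         # invalidate via callback
cache.run("report.cgi", {"q": "x"}, ["db.txt"], render)   # re-executes
```

The server-side check-then-serve step corresponds to the cache hit above; the TREC profiling of the CGI program is what supplies the `inputs` list.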

Session: Extensibility

Summary by Karen Reid

In her introduction to this session, session chair Terri Watson Rashid noted that the three papers represent a wide range of how extensibility can be used. The first paper discusses a way to extend operating systems; the second describes an extension to the SPIN operating system; and the final paper presents methods of extending applications.

SLIC: An Extensibility System for Commodity Operating Systems

Douglas P. Ghormley, University of California, Berkeley; David Petrou, Carnegie Mellon University; Steven H. Rodrigues, Network Appliance, Inc.; Thomas E. Anderson, University of Washington

The extension mechanism described by David Petrou makes it possible to add functionality to commodity operating systems with no changes to the operating system. These extensions allow system administrators to easily add trusted code to the kernel to fix security flaws, take advantage of the latest research, or better support demanding applications.

SLIC is built using the technique of interposition: capturing system events and passing them to the extensions. The system has two components: the dispatchers, which catch events, call the extensions, and provide the API for the extension framework; and the extensions themselves, which implement new functionality.

Petrou described three examples of extensions implemented using SLIC. The first fixes a security flaw in the Solaris admintool. The second extension adds encryption to the NFS file system. The third one implements a restricted execution environment by filtering system calls.
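The dispatcher/extension split can be sketched as follows (invented names, not SLIC's real API): a dispatcher wraps each "system call" and passes it through a chain of extensions, each of which may transform the arguments, deny the call, or pass it through. The toy extension echoes the third example above, a restricted execution environment built by filtering system calls:

```python
# Minimal interposition sketch: a dispatcher threads each call through
# registered extensions before reaching the underlying implementation.
class Dispatcher:
    def __init__(self, syscalls):
        self.syscalls = syscalls      # name -> underlying implementation
        self.extensions = []          # applied in registration order

    def register(self, ext):
        self.extensions.append(ext)

    def call(self, name, *args):
        def invoke(i, args):
            if i == len(self.extensions):
                return self.syscalls[name](*args)   # fall through to the kernel
            return self.extensions[i](name, args, lambda a: invoke(i + 1, a))
        return invoke(0, args)

def jail(name, args, next_call):
    # toy restricted-execution extension: confine open() to /tmp
    if name == "open" and not args[0].startswith("/tmp/"):
        raise PermissionError(args[0])
    return next_call(args)

d = Dispatcher({"open": lambda path: "fd(%s)" % path})
d.register(jail)
```

Note how the extension never touches the underlying implementation directly; it only sees the event and a continuation, which is what lets SLIC add trusted functionality without changing the operating system itself.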

Petrou and his co-author/slide-turner, Steven Rodrigues, pulled out an extra slide to answer the first question of how to manage the order that the interpositions were applied. Unfortunately, the answer was that although the syntax for composing interpositions is not a problem, determining that a series of interpositions is semantically correct is a difficult, and as yet unsolved, problem.

The second question confirmed that interpositions can be applied only to interfaces at the kernel boundary. Petrou noted that SLIC could not be used to change the paging algorithm for an application.

When asked whether the extensions would primarily be useful to prototype kernel extensions or for low-volume extensions, Petrou claimed that the examples showed that extensions could be used to solve a wide range of problems, not just those in a research environment.

David Korn asked if interpositions could store data on a per process basis. Petrou replied that the extension can store per thread state.

The remaining questions concerned the portability of the interposition system. Petrou argued that the extensions should be quite portable, but that the dispatchers needed to be ported to other architectures. They are currently working on porting the dispatchers to Linux.

A Transactional Memory Service in an Extensible Operating System

Yasushi Saito and Brian Bershad, University of Washington

Yasushi Saito presented Rhino, an extension for the SPIN operating system that implements a transactional memory service. A transactional memory service uses memory-mapped files and loads and stores to implement transactions that are atomic, isolated, and durable (ACID). Transactional memory can be used to support object-oriented databases and persistent low-level data structures such as filesystem metadata.

Saito contrasted his work with user-level implementations of transactional memory by highlighting several problems with the user-level approach. First, context switches for the signal handler and mprotect() incur too much overhead. Also, the user-level implementation requires fast interprocess communication. Finally, buffering problems occur because the user-level process that is mapping database files into memory has no control over the paging. Double buffering occurs when the memory system decides to reclaim pages and swaps out database pages instead of writing them back to the database file.

The approach taken by the authors to solve these problems is to do everything in the kernel. The SPIN extension gives them fast access detection through the kernel page fault handler and efficient memory-mapped buffers through cooperation with the kernel memory manager.

Three options for buffer management were discussed. The first relies on the user to notify the extension (by calling trans_setrange()) about a region that will be modified. This method is efficient for small transactions, but doesn't scale well when the number of setrange() calls is high. The second option is to log the entire page when at least one byte is modified. This approach works well for large transactions, but is costly for small transactions. The third method computes and logs the diffs between the page and the updates. Page diffing combines the advantages of the previous two approaches, but incurs significant overhead.
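The third option, page diffing, can be illustrated with a small sketch (buffer layout and function name invented): keep a copy of the page taken at transaction start, and at commit log only the byte runs that differ, rather than relying on setrange() calls or logging whole pages:

```python
# Sketch of page diffing for a transactional memory log.
def page_diff(before, after):
    """Return (offset, bytes) runs where `after` differs from `before`."""
    diffs, start = [], None
    for i in range(len(after)):
        same = i < len(before) and before[i] == after[i]
        if not same and start is None:
            start = i                              # a differing run begins
        elif same and start is not None:
            diffs.append((start, after[start:i]))  # run ended, log it
            start = None
    if start is not None:
        diffs.append((start, after[start:]))       # run extends to page end
    return diffs

before = b"hello world page"   # before-image saved at transaction start
after  = b"hello WORLD page"   # page contents at commit
log = page_diff(before, after)
```

The comparison pass over the whole page is the "significant overhead" the speaker mentioned; in exchange, small updates to large pages produce small log records.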

Saito compared the performance of the SPIN-based transactional memory service to one implemented at user-level on UNIX and to ObjectStore, a database management system. The SPIN-based system consistently outperformed the other two for the given workloads.

Terri Watson Rashid asked Saito to comment on his experiences implementing kernel extensions on SPIN. Saito was reluctant to make any strong comparisons between implementing the user-level UNIX implementation of Rhino and the SPIN extension, but said that debugging facilities for SPIN extensions made writing kernel-level code much easier.

Dynamic C++ Classes

Gísli Hjálmtysson, AT&T Labs-Research; Robert Gray, Dartmouth College

This work, presented by Gísli Hjálmtysson, was motivated by the desire to allow "hot" updates of running software. In other words, they wanted a system that allows users to insert or replace components of a software system while it continues to run. This technique can be applied to domains such as network processing, where it is often highly undesirable to halt and restart programs.

The authors achieved their goal of updating running code by implementing a library to support dynamic C++ classes. This approach was chosen because C++ is widely used, high performance can be maintained, and program abstractions can be preserved. Dynamic classes allow for runtime updates at the class granularity. New versions of existing classes can be installed, and new classes can be added. However, they require that class interfaces be immutable.

One big question is how to dynamically link in new or updated classes. Hjálmtysson proposes three different approaches to updating objects: imposing a barrier, where no new objects may be created until all objects of an older version have expired; migrating old objects to their new version; and allowing multiple concurrent versions of each class. The disadvantage of the barrier approach is that it is equivalent in some ways to halting the program and restarting. The migration approach is hard to automate efficiently, so the authors chose the third approach of allowing concurrent versions of a class.

Dynamic classes have two parts: an abstract interface class and an implementation class. An interface monitor, implemented as a class proxy, screens messages that pass through dynamic class interfaces and direct them to the correct version of the class. Two levels of indirection are used: one to map to the implementation and the other to map the methods within a version. This approach requires that all public methods of a dynamic class be virtual.

Using the factory pattern, an object of a dynamic class is constructed by calling the default constructor, which locates and loads the dynamic shared library, calls the constructor for the correct version, and stores a pointer to the version of the implementation class in the proxy.
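The proxy's version dispatch can be sketched in a language-agnostic way (Python stands in for C++ here purely for brevity; all names are invented). New objects bind to the newest installed implementation, while existing objects keep the version they were constructed with, giving multiple concurrent versions:

```python
# Sketch of a version-dispatching class proxy for "hot" updates.
class DynamicProxy:
    versions = {}                 # version number -> implementation class
    latest = None

    @classmethod
    def install(cls, number, impl):
        # a new version is loaded at runtime (stands in for dlopen'ing
        # a shared library and registering its constructor)
        cls.versions[number] = impl
        cls.latest = number

    def __init__(self):
        # factory: bind this object to the newest installed version
        self.impl = self.versions[self.latest]()

    def __getattr__(self, method):
        # second level of indirection: route the call to this object's
        # bound implementation version
        return getattr(self.impl, method)

class GreeterV1:
    def greet(self): return "hello"

class GreeterV2:
    def greet(self): return "hello, v2"

DynamicProxy.install(1, GreeterV1)
old = DynamicProxy()
DynamicProxy.install(2, GreeterV2)   # hot update while "running"
new = DynamicProxy()
```

In the real C++ design the same effect requires that all public methods be virtual, since the dispatch must go through the proxy's indirection rather than statically bound calls.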

Three different templates for proxies are defined. They differ in ease of use, functionality, and performance. The high-functionality version of the template allows multiple implementations of a dynamic class interface as well as multiple active versions of an implementation. The medium-functionality version allows multiple versions, but not multiple implementations, of a dynamic class. Both the medium- and high-functionality versions implement methods to invalidate other dynamic class versions. Finally, the low-functionality, but highest performance, version of the template allows multiple concurrent versions of a dynamic class, but old versions cannot be invalidated.

The flexibility of dynamic classes does not come without a cost. Each instance requires space for three or four extra pointers. The method invocation overhead is approximately doubled because of the extra checks required and because some of these checks cannot be optimized (by most compilers) because a method that can throw an exception cannot be inlined.

Hjálmtysson used mobile agents as an example of how dynamic classes could be used. Different versions of the agents for different architectures can be downloaded so that agents can be instantiated on a variety of platforms.

Hjálmtysson concluded by describing dynamic classes as a lightweight mechanism to update running code that preserves type safety and class abstractions. It compiles on SGI, Sun, and Windows 95/NT and is available from AT&T by contacting the authors.

During the question period, the similarity between dynamic classes and Corba or ActiveX was noted. Hjálmtysson acknowledged the similarity and claimed that dynamic classes have less fireworks around them and are a more lightweight approach. Not surprisingly, other questions focused on how a similar system might be written for Java, whether the Java Virtual Machine (JVM) would have to be modified, and how the JVM might be forced to unload classes. Hjálmtysson said he believed that it may be possible to implement dynamic classes without modifying the JVM, but that the class loader would need to be modified, making it less portable.

Session: Commercial Applications

Summary by Brian Kurotsuchi

Each of the papers in this session dealt with low-level, behind-the-scenes operating system internals, specifically the filesystem and virtual memory subsystems.

Fast Consistency Checking for the Solaris File System

J. Kent Peacock, Ashvin Kamaraju, Sanjay Agrawal, Sun Microsystems Computer Company

Kent Peacock presented his group's work with the optimization of the native Solaris UFS filesystem to improve performance while supporting the semantics of NFS services. He explained that NFS semantics require data to be committed to stable secondary storage before the NFS transaction can be completed. This requirement unfortunately precludes the use of filesystem block caches, which are generally used to improve read/write performance. In order to overcome the synchronous write requirement, they decided to use some type of fast NVRAM storage medium to provide safe buffering of the physical storage device; they first used a UPS on the system, then actual NVRAM boards. With this NVRAM solution, they gained performance by not having to wait for slow secondary storage to complete before acknowledging the NFS transaction. Peacock also mentioned that they tried traditional logging (journaling) to the NVRAM, but were unable to meet performance requirements using that approach.

The second issue that Peacock addressed was filesystem performance, both at runtime and when fsck is required to check the filesystem. To improve it, they added data structures to the on-disk filesystem representation and modified some of the ways in which metadata are juggled. The areas Peacock focused on were busy bitmaps and changes in the use of indirect blocks.

The Solaris UFS filesystem is divided into cylinder groups, each of which contains a bitmap of free blocks. An fsck involves checking this data in each cylinder group on the disk, an operation that can take some time. In order to reduce the number of metadata structures that need to be checked during an fsck run, they added special bitmaps in parallel (physically) with the free block bitmap. These new bitmaps indicate which blocks and i-nodes in that cylinder group are being modified (are busy). Each cylinder group can then be flushed and marked as "stable" asynchronously by a kernel thread. This can greatly reduce the time needed to do an fsck because only cylinder groups that are still marked "busy" need to be checked.
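The busy-bitmap scheme can be modeled with a toy (structure and method names invented, not Sun's code): a group is marked busy before its metadata is touched and marked stable once flushed, so a post-crash check visits only the groups still marked busy:

```python
# Toy model of per-cylinder-group busy flags for fast fsck.
class CylinderGroup:
    def __init__(self):
        self.busy = False

class FileSystem:
    def __init__(self, ngroups):
        self.groups = [CylinderGroup() for _ in range(ngroups)]

    def start_update(self, g):
        self.groups[g].busy = True    # set before modifying the group's metadata

    def flush(self, g):
        self.groups[g].busy = False   # kernel thread marks the group stable

    def fsck(self):
        # after a crash, only groups still marked busy need their
        # bitmaps and i-nodes verified
        return [i for i, g in enumerate(self.groups) if g.busy]
```

On a mostly idle filesystem almost every group is stable at crash time, which is where the large fsck speedup comes from.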

An interesting variation that Peacock's group came up with is the handling of indirect block maps to reduce the number of writes to disk. Indirect blocks are normally stored separately from the i-node, hence in a different block on the disk. Updating a large file that requires the use of the indirect blocks incurs a read and write of at least two blocks instead of one (i-node only vs. i-node + indirect block[s]). To defer the need to deal with the additional blocks, temporary indirect block storage is interleaved on odd i-nodes in the i-node table. Each time an indirect block is needed, it is written into the i-node slot adjacent to the file's i-node, requiring only a single write operation. When the adjacent i-node storing the indirect pointers is full, it is flushed to the traditional indirect block (hence deferring all indirect block I/O operations until this time).

In conclusion, Peacock reminded us that NFS is inherently disk bound because of the synchronous write requirements. His group was able to overcome this by using NVRAM storage to satisfy NFS semantics and attain high throughput performance. On top of that, they were able to make additional gains by modifying UFS to use the indirect block cache and busy maps. The data gathered by Peacock's group seem to indicate runtime and fsck performance above and beyond that of standard UFS and the widely used Veritas File System. This modified filesystem is in use on Sun's Netra NFS server products and may appear in a future Solaris release.

The audience questions indicated some skepticism about the Veritas benchmarks presented. An important question concerned NFS version 2 versus version 3, for which Peacock said they found a smaller performance gap between their Netra NFS implementation and NFS version 3.

General Purpose Operating System Support for Multiple Page Sizes

Narayanan Ganapathy and Curt Schimmel, Silicon Graphics Computer Systems, Inc.

Narayanan Ganapathy gave an excellent presentation that outlined the advantages of using virtual memory page sizes above the normal 4k and a walk-through of how they implemented this idea in IRIX (v6.4 and 6.5). Some applications could see performance improvements if they could use large pages of memory instead of small (4k) pages: much of the overhead for an application that deals with large data sets can be TLB misses. Ganapathy explained the rationale for the work and the experience they had at SGI while retrofitting the IRIX virtual memory system to allow processes to use multiple page sizes.

One of the goals in designing this multiple-page-size system was to minimize changes to existing operating system code while maintaining flexibility and compatibility with existing application binaries. The implementation they chose makes changes at a high level in the virtual memory subsystem, in the per-process page table entries (PTEs). This is the map of all pieces of memory that a process can access. To support large pages, each PTE has a field that states the size of that page (4k-16M on the R10000). The memory area to which that page refers may be handled at a lower level by a pfdat (page frame data) structure, which they chose to keep as statically sized 4k pieces for compatibility. A major advantage of doing things this way is that multiple processes can still share memory, but the size of the area that each of them sees in its page table does not have to be the same. One process can map 16k pages while another maps 4k pages, both of them ultimately referring to the same 4k pfdat structures (in effect the same physical memory).
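The PTE-to-pfdat relationship can be shown with a few lines of arithmetic (field names invented): a single large PTE and several small PTEs can cover exactly the same fixed 4k frames:

```python
# Toy illustration: variable-size PTEs over fixed 4k pfdat frames.
PFDAT = 4096                       # fixed low-level frame size

def pte_frames(paddr, page_size):
    """List the 4k pfdat indices covered by one PTE of `page_size`
    starting at physical address `paddr`."""
    first = paddr // PFDAT
    return list(range(first, first + page_size // PFDAT))

# one process maps the region with a single 16k page...
big = pte_frames(0x40000, 16 * 1024)

# ...while another maps the same region with four 4k pages
small = [pte_frames(0x40000 + i * PFDAT, PFDAT) for i in range(4)]
```

Both mappings resolve to the same pfdat frames, which is why sharing keeps working even when the two processes disagree about page size.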

Allowing processes to manipulate their PTEs in this way produced some interesting problems: memory fragmentation, fast processing of TLB misses, and the need for additional system calls and tools to manipulate page sizes. Fragmentation is avoided by intelligent allocation of pages, through the use of maps for free segments of memory of similar size and a "coalescing daemon" that defragments memory in the background, using page migration to rearrange it. To prevent all processes from going through extra code even when they are using the default page size, IRIX provides the ability to assign a TLB miss handler on a per-process basis. A system call has been provided to change page sizes, plus tools to allow normal binaries to be configured to use large pages without recompilation.

In closing, Ganapathy mentioned the possibility of intelligent kernels that could automatically choose page sizes for a process based upon TLB misses.

Implementation of Multiple Pagesize Support in HP-UX

Indira Subramanian, Cliff Mather, Kurt Peterson, and Balakrishna Raghunath, Hewlett-Packard Company

The final presentation in this session was given by Indira Subramanian. Although it covered the same subject as the previous one by the SGI group, the two were well coordinated and did not feel redundant.

As in the Silicon Graphics implementation, the HP group wanted to minimize kernel VM subsystem changes. Their implementation also avoids changes to the low-level VM data structures and implements variable sized pages at the PTE level.

Allocation and fragmentation management is governed by an implementation of the buddy system, with pools representing free memory regions from 4k to 64M in size. The page management subsystem uses several strategies, in order, to satisfy requests for new pages. First, the VM system automatically combines pages to create larger pages as soon as a page is freed. Next, pages not currently in the cache are evicted and coalesced into the free pool. The last resort is to evict and coalesce pages currently in the cache. This ordering gives the greatest chance for pages to be retrieved from the cache.
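A minimal buddy-system sketch (page addresses and class names invented, not HP-UX code) shows the coalescing step that rebuilds large pages: a freed block merges with its "buddy" (address XOR size) whenever that buddy is also free, repeatedly doubling in size:

```python
# Minimal buddy allocator: orders are powers of two in units of pages.
class BuddyAllocator:
    def __init__(self, max_order):
        self.max_order = max_order
        self.free = {o: set() for o in range(max_order + 1)}
        self.free[max_order].add(0)       # one maximal block at start

    def alloc(self, order):
        o = order
        while o <= self.max_order and not self.free[o]:
            o += 1                        # find the smallest big-enough block
        addr = self.free[o].pop()
        while o > order:                  # split down, freeing each buddy half
            o -= 1
            self.free[o].add(addr + (1 << o))
        return addr

    def release(self, addr, order):
        while order < self.max_order:
            buddy = addr ^ (1 << order)   # buddy address at this order
            if buddy not in self.free[order]:
                break                     # buddy busy: stop coalescing
            self.free[order].remove(buddy)
            addr = min(addr, buddy)       # merged block starts at lower half
            order += 1
        self.free[order].add(addr)
```

Freeing two adjacent minimal blocks cascades all the way back up to a single maximal block, which is how the VM system regrows large pages as memory is returned.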

A single modified HP-UX page fault handler is used for all page faults that occur in the system. It is capable of dealing with copy-on-write, copy-on-reference, zero filling, and retrieval of large pages when necessary. It is possible to provide the page fault handler with a page size hint either through the use of a utility program (chatr) or through an intelligent memory region allocation routine. This "hint" allows the page fault handler to bypass page size calculation and allocation if it can determine that the default of 4k is going to be used. The basic idea was that if the page size hint shows that the new page was probably going to be greater than 4k, the fault handler would take the following steps: (1) calculate the size, (2) allocate the necessary region, (3) add the necessary translations, and (4) zero fill the page if needed. The size calculation and large region allocation can be completely skipped if the new page will be a simple 4k, hence preserving performance in those cases.

Perhaps the most gratifying part of this presentation was where Subramanian spent much of the time: the graphs. By experimenting with applications that use memory in different ways, she showed that no single large page size was suitable for all situations. In one application, there was a very high TLB miss rate with 4k pages, but a much better hit rate with 4M pages. As you would expect, the law of diminishing returns kicked in when excessive page sizes were selected.

In conclusion, Subramanian reminded us that the page sizes are not promoted (combined to make larger pages) except when the page fault handler identifies regions experiencing large TLB miss rates. Reducing TLB misses was the project goal, which they accomplished through adding the ability to use large pages and by having the VM system dynamically monitor memory usage and adjusting page sizes to reduce TLB misses at runtime.

One unfortunate aspect of this presentation was the lack of time. Our presenter was fielding some very interesting questions from the audience and flipping at blinding speed between graphs of data right up to the end.

Session: Performance II

Summary by Vikas Sinha

This was the second of the sessions devoted to system performance. The papers focused on simulation as a way to understand otherwise obscure program-execution events, on cache design for servers, and on messaging techniques for exploiting current networking technology. The session was chaired by Mike Nelson of Silicon Graphics.

SimICS/sun4m: A Virtual Workstation

Peter S. Magnusson, Fredrik Larsson, Andreas Moestedt, Bengt Werner, Swedish Institute of Computer Science; Fredrik Dahlgren, Magnus Karlsson, Fredrik Lundholm, Jim Nilsson, Per Stenström, Chalmers University of Technology; Håkan Grahn, University of Karlskrona/Ronneby

Peter Magnusson presented the paper describing the capabilities and the current status of the instruction-set simulator SimICS/sun4m, which has been developed by his research group at the Swedish Institute of Computer Science (SICS) over the past several years.

Simulation is essentially running a program on a simulator, hosted on some arbitrary computer, so that it behaves like the program actually running on a specific target computer. Simulation focuses on capturing characteristics, such as the hardware events a program induces on the target platform and details of the running software, that are otherwise difficult to gather. Gathering such detailed characteristics with a simulator typically slows program execution by two to three orders of magnitude compared to execution on native hardware.

System-level simulators facilitate understanding the intricacies of program execution on a target system because of their capability to re-create an accurate and complete replica of the program behavior. Such simulators have thus been an indispensable tool for computer architects and system software engineers for studying architectural design alternatives, debugging, and system performance tuning.

SimICS/sun4m is an instruction-set simulator that supports more than one SPARC V8 processor and is fully compatible with the sun4m architecture from Sun Microsystems. It is capable of booting unmodified operating systems like Linux 2.0.30 and Solaris 2.6 directly from dumps of the partitions that would boot a target machine. Binary compatible simulators for devices like SCSI, console, interrupt, timers, EEPROM, and Ethernet have been implemented by Magnusson's research group. SimICS can extensively profile data and instruction cache misses, translation look-aside buffer (TLB) misses, and instruction counts. It can run realistic workloads like the database benchmark TPC-D or interactive applications such as Mozilla.

A noteworthy application of the SimICS/sun4m platform is its use for evaluating design alternatives for multiprocessors. The evaluation of the memory hierarchy of a shared-memory multiprocessor running a database application was presented as a case study.

In the presentation the performance of the SimICS/sun4m simulator was demonstrated by comparing the execution time of the SPECint95 programs on the target and host, using the train dataset. The slowdown was in the range of 26-75 over native execution for the test environment chosen.

SimICS/sun4m is available for the research community at <>. The presentation slides are available at <>. Magnusson also welcomed those interested in knowing more about his work to contact him at <>.

A few interesting questions were asked after the presentation. To demonstrate user- and system-mode debugging, the talk had shown Mozilla running on top of Solaris on SimICS. It was also noted that reloading a page required 214 million SPARC instructions, about 25% of which were spent in the idle loop. Asked whether it was clear why so much time was spent in the idle loop, Magnusson said the answer wasn't clear to them; to answer such questions, they are adding high-end scripting tools to the simulator, because the current tools are not sufficient for detailed analysis.

In reply to the question of what was the hardest problem to solve in the work, Magnusson said that from an engineering point of view it was modelling devices at a level sufficient to run real SCSI devices, real Ethernet drivers, and so forth; from a research point of view it was designing a memory system fast enough and flexible enough to yield the desired information. Asked about the use of interpreters versus realtime code generation, Magnusson said that although the "religious belief" among programmers that realtime code generation is faster holds true, he wasn't aware of any group that had actually implemented it with the desired stability. He added that one reason they are going commercial (Virtutech is the new company their group has founded) is the hope of gaining access to the resources required to address such issues, which is often not feasible in an academic research environment.

High-Performance Caching With The Lava Hit-Server

Jochen Liedtke, Vsevolod Panteleenko, Trent Jaeger, and Nayeem Islam, IBM T.J. Watson Research Center

Jochen Liedtke presented the results of an ongoing experiment at the T.J. Watson Research Center on the architecture design for a high-performance server capable of efficiently serving future local clusters of network computers and other future thin clients (PDAs, laptops, pagers, printers, etc.). The key component in their architecture is a generic cache module designed to fully utilize the available bandwidth.

Liedtke's group envisions future local networks serving thousands to hundreds of thousands of resource-poor clients with little or no disk storage. In such scenarios the clients will download significant amounts of data from the server, whose performance can become the bottleneck. They suggest that high-performance customizable servers will be required, capable of handling tens of thousands of transactions per second (Tps) with bandwidths on the order of gigabytes per second.

The basic goals of their research were to find the upper bounds, both theoretical and practical, and to find a highly customizable, scalable architecture for such a scenario.

They based their work on the well-established approach of increasing server performance via efficient cache design. The fundamental idea behind their work is separating the hit-server from the miss-server. The hit-server is connected to both the pool of clients and the miss-server using different Ethernet connections. There could be several Ethernet cards on the hit-server, each connecting several clients. If the desired object is in the hit-server, it is accessed using standard get and put commands; otherwise the miss-server is signalled.

Because the hit-server is vital for performance, they make it general and policy-free so that it can adapt to any application. The hit-server allows get/put operations on entire as well as partial objects, besides providing support for active objects. Miss handling and replacement policy are handled in the customizable miss-servers. To achieve scalability, multiple customized miss-servers (e.g., file servers, Web proxy servers) can be implemented, or more hit-servers can be incorporated in the design to increase the overall cache size. The paper describes the mechanisms that allow the miss-servers to support the desired consistency protocol per object.
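
The split between a policy-free hit-server and customizable miss-servers can be sketched as follows. The classes and method names here are hypothetical stand-ins, not the Lava interfaces:

```python
class MissServer:
    """Customizable backend (e.g. a file server or Web proxy); it owns
    miss handling, replacement, and consistency policy."""
    def __init__(self, store):
        self.store = store          # backing store, modeled as a dict

    def fetch(self, key):
        return self.store.get(key)  # None models a hard miss

class HitServer:
    """Generic, policy-free front end serving get/put on whole or
    partial objects; misses are signalled to the miss-server."""
    def __init__(self, miss_server):
        self.cache = {}
        self.miss_server = miss_server

    def get(self, key, offset=0, length=None):
        if key not in self.cache:              # signal the miss-server
            obj = self.miss_server.fetch(key)
            if obj is None:
                return None
            self.cache[key] = obj
        data = self.cache[key]
        end = len(data) if length is None else offset + length
        return data[offset:end]                # entire or partial object

    def put(self, key, data):
        self.cache[key] = data
```

Scaling out then means adding miss-servers with different policies behind the same generic hit path, or adding hit-servers to grow the cache.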

Throughputs of up to 624 Mbps are possible over the 1 Gbps PCI bus, yet current commercial and research servers achieve only about 2,100 and 700 Tps, respectively, for moderately small 1K objects. It was demonstrated that the problem lies not with the network hardware but with the memory bus. It is therefore imperative to minimize memory bus accesses, which slow down performance: the CPU should limit itself to the L1 and L2 caches as far as possible. Lazy evaluation techniques, precompiled packets, and precomputed packet digests facilitate this, and L2 misses can be minimized by proper data structuring. Lava's get performance is 59,000 Tps for 1K objects and 8,000 Tps for 10K objects. Liedtke explained that the throughput limit of 624 Mbps suggested in the paper was incorrect because they had based their measurements on the time to transmit a single packet over the PCI bus, without considering the interval between the "start transmit" signal to the controller and the start of the transmission, which another packet could use when multiple packets are transmitted.

A simple application in which multiple clients boot by each fetching 5M-15M of data over a short five-minute interval was shown to have an average latency of about 1.5 s for 1,000 clients.

Before concluding, Liedtke posed some open questions: whether the hit-server could be used in a WAN environment where different protocols are prevalent, how cache-friendly future applications will be and whether the system can be customized for them, and whether it will be possible to integrate dynamic applications such as databases into the design.

Liedtke concluded that the lesson his group had learned from the implementation is that designing from scratch pays. He also suggested that it is a good strategy to separate the generic-fast-simple from the customizable-complicated-slower, noting that generality goes with simplicity. Finally, he said that even though an ideal-case analysis might be wrong, it is essential, and that designing before implementing should be done whenever possible.

Cheating the I/O Bottleneck: Network Storage with Trapeze/Myrinet

Darrell C. Anderson, Jeffrey S. Chase, Syam Gadde, Andrew J. Gallatin, and Kenneth G. Yocum, Duke University; Michael J. Feeley, University of British Columbia

Darrell Anderson presented a messaging system designed to deliver the high-performance potential of current hardware for network storage systems, including cluster filesystems and network memory.

They note that the I/O bottleneck arises because disks are inherently slow due to mechanical factors. Very fast networks like Myrinet, on which their work is based, offer point-to-point connections capable of 1 GB/s bandwidths for large file transfers and latencies of 5-10 microseconds for small messages. The network can therefore be viewed as the primary I/O path in a cluster, with the goal of achieving I/O at gigabit network speeds for unmodified applications. By directing all I/O to and from network memory, the I/O bottleneck can be cheated, and by pipelining the network with sequential read-ahead and write-behind, high-bandwidth file access through the network can be achieved. They rely on the Global Memory Service (GMS), developed at the University of Washington, Seattle, to provide the I/O via the network. Myrinet provides link speeds matching PCI bandwidth, link-level flow control, and a programmable network interface card (NIC), which is vital in their environment. Their firmware runs on the NIC, they modify the kernel RPC, and they treat the file and virtual memory systems as an extension of the underlying gigabit network protocol stack. Their firmware and Myrinet messaging system is called Trapeze. They have achieved sequential file access bandwidths of 96 MB/s using GMS over Trapeze.

The GMS system that has been used in Anderson's research lets the system see the I/O through the network. GMS is integrated with the file and VM systems such that whenever a file block or virtual memory page is discarded on a node, it is in fact pushed over the network to some other node, where later cache-misses or page-faults can retrieve it with a network transfer.

In the request-response model on which network storage systems are based, a small request solicits a relatively large page or file block in response. In their work they address the challenges in designing an RPC for network storage and its requirements for low overhead, low latency, and high bandwidth. Support for RPC variations, like multiway RPC for directory lookup and request delegation, is provided. Nonblocking RPC used for implementing read-ahead, write-behind is also supported.

Their Trapeze messaging system is reportedly the highest-bandwidth Myrinet messaging system. It consists of two parts: the firmware running in the NIC and the messaging layer used for kernel or user communication. It supports IP through sockets as well as kernel-to-kernel messaging and is optimized for block and page transfers. It provides zero-copy communication through unified buffering with the system page frame pool and through an Incoming Payload Table (IPT) that maps specific frames to receive into. The key Trapeze data structures reside in the NIC, where they are used by the firmware, but they are also accessible to the messaging layer. The send and receive rings in the NIC point to aligned system page frames, which are used to send and receive pages using DMA; these page frames can also be mapped into user space. A particular incoming message can be tagged with a token that, when used in conjunction with the Trapeze IPT, delivers the message data into a specific frame. This is used to implement their zero-copy RPC. Their zero-copy TCP/IP over Trapeze can deliver a highly respectable bandwidth of 86 MB/s for 8 KB data transfers.

They short-circuit the IP layer, which is nevertheless available to user applications over the socket layer, in their integration of RPC with the network interface. This avoids the costly copying at the IP layer in the standard page fetch using RPC over IP.

They report the highest bandwidths and lowest overheads when using the file-mapping mmap() system call.

Anderson referred those interested in learning more about their work to their Web site <>.

Anderson was surprised by a question about how IP performance could be improved; he answered that their MTU size is 8 KB and that page remapping avoids costly data copying, improving overall performance. Answering questions on the reliability of their RPC and on data corruption in the underlying hardware, he said that because Myrinet provides a hardware checksum and link-level flow control, messages are not corrupted or dropped in the network.

Session: Neat Stuff

Summary by Kevin Fu

This session consisted of a collection of interesting utilities. Pei Cao from the University of Wisconsin maintained order as the session chair.

Mhz: Anatomy of a Micro-benchmark

Carl Staelin, Hewlett-Packard Laboratories; Larry McVoy, BitMover, Inc.

Carl Staelin talked about mhz, a utility that determines processor clock speed in a platform-independent way. Mhz takes several measurements of simple C expressions, then finds the greatest common divisor (GCD) of the results to compute the duration of one clock tick.

Measuring a single clock tick is difficult because clock resolution is often too coarse. One could measure the execution time of a simple expression repeated many times, then divide by the number of iterations, but this too has complications: a compiler may optimize away an expression like "a++" run many times in a loop, and interrupts muddle the measurements by randomly grabbing CPU time.

Staelin proposed a solution based on ideas learned from high school chemistry and physics to determine atomic weights. Measure the time of simple operations; then use the GCD to determine the duration of one clock tick. Mhz uses nine C expressions for time measurements. The expressions have inter- and intra-expression dependencies to prevent the compiler from overlapping execution of expressions. The operations must also be immune to optimization and be of varying complexity.

Mhz requires the operations to have relatively prime execution times. Measurements, however, show variance and fluctuations, so mhz must minimize noise and detect when a measurement is incorrect. Mhz prunes incorrect results by measuring many executions of a particular expression; if any particular execution is off by more than a factor of five compared with the others, the result is disregarded. Mhz calculates the duration of one clock tick using many subsets of the nine measurements and takes the mode of the calculations to produce a final answer.
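
The GCD trick can be illustrated with a small Python sketch. The timings below are hypothetical, and the real mhz uses a careful C timing harness rather than anything this simple:

```python
# Sketch of the mhz idea: given noisy timings (say, in ns) of expressions
# whose true costs are small, relatively prime multiples of the clock
# period, recover the period as a GCD, taking the mode over many subsets
# to damp the effect of occasional bad measurements.

from itertools import combinations
from math import gcd
from statistics import mode

def clock_period(timings):
    """Estimate one clock tick from per-expression timings (integers)."""
    ints = [round(t) for t in timings]
    estimates = [gcd(a, b) for a, b in combinations(ints, 2)]
    return mode(estimates)   # most pair GCDs agree on the true period

# Hypothetical example: a 5 ns clock (200 MHz) timing expressions that
# cost 2, 3, 5, and 7 ticks, plus one noisy outlier (36).
print(clock_period([10, 15, 25, 35, 36]))  # -> 5
```

The outlier pairs yield bogus GCDs (1, 2, 3), but the mode over all pairs still lands on the true 5 ns tick.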

The mhz utility is accurate and OS/CPU independent; it works on x86, Alpha, PowerPC, SPARC, and PA-RISC processors under Linux, HP-UX, SunOS, AIX, and IRIX. Mhz also works on Windows NT, but NT does not offer the gettimeofday() call, so Staelin used NT's native, something-left-to-be-desired timer. Mhz produced correct results there, but Staelin did not report them because he does not want to support NT; porting is painful for a variety of reasons.

Staelin was also asked about loop overhead and interrupts. Mhz was developed with a timing harness that performs a variety of experiments to detect clock granularity. Mhz can remove the overhead caused by the gettimeofday() call. Interrupts are random and hence dealt with by using multiple experiments.

An audience member asked whether mhz could produce more accurate results when given more time to compute. Staelin responded, "Good benchmarking hygiene should be good. We wanted something that would work in a second or so."

There may be other areas of computer performance where this method has applicability. This is a trick that can go into your mental toolkit. See <> for the source code.

Automatic Program Transformation with JOIE

Geoff A. Cohen and Jeffrey S. Chase, Duke University; David L. Kaminsky, IBM Application Development Technology Institute

Geoff Cohen, an IBM graduate fellow and doctoral student at Duke, talked about load-time transformations with the Java Object Instrumentation Environment (JOIE), a toolkit for transforming compiled Java bytecode. Transportable code allows sharing of code from multiple sources.

There already exist binary transformation tools such as OM/ATOM, EEL, and Purify. BIT and BCA allow transformations in Java. However, BCA does not modify bytecodes, and BIT only inserts method calls into bytecodes. Neither is a general transformation tool.

There are a few kinds of transformers. A symbolic transformation could add interfaces or change names, types, or superclasses. A structural transformation could add new fields, methods, or constructors. Bytecode transformation allows for insertion, reordering, or replacement of bytecodes within method implementations. This last transformation significantly distinguishes JOIE from BCA.

Such transformers can extend Java to support generic types and new primitives. For instance, transformers can work with caching, security, and visualization for system integration. Moreover, transformers can add functionality such as versioning or logging.

Load-time transformations with JOIE are incremental, automatic, and universal. JOIE is an enabling technology that gives users more control over programs. Related issues are performance, security, safety, and ease of use.

An audience member asked why not perform transformations in the JIT/JVM. Cohen responded that such an approach would not be platform independent and would be harder to write; in the JIT, symbols may already have been lost.

JOIE is written in Java. Performance is currently on the order of single-digit seconds, but once time allows for some tuning, Cohen expects JOIE to run in hundreds of milliseconds.

Responding to a question, Cohen said that it is possible to debug transformed code, but it is very hard. At runtime, the JVM should prevent anything unsafe that JOIE creates (e.g., reading /etc/passwd).

Finally, an audience member asked about reversibility: could a transformation be undone by another transformation? In theory this is possible, but some functions are one-way.

See <>.

Deducing Similarities in Java Sources from Bytecodes

Brenda S. Baker, Bell Laboratories, Lucent Technologies; Udi Manber, University of Arizona

Brenda Baker spoke about how to detect similarities in Java bytecode. Her interests include string matching and Web-based proxy services. Java is the juggernaut: it is expected to be widespread and ubiquitous, and bytecode is typically distributed without the source code when programmers want to keep the source secret. Baker's goal is, given a set of bytecode files, to discover similarities among them that reflect similarities among their Java source files, all without access to the sources.

Detecting similarities has application to plagiarism detection, program management to find common sources, program reuse and reengineering, uninstallation, and security (similar to known malicious code). For instance, one could detect the incorporation of benchmarks into programs or whether JOIE was applied. There is also a potential battle against code obfuscators.

Baker adapted three tools: siff finds pairs of text files that contain a significant number of common blocks (Manber), dup finds all longest parameterized matches (Baker), and diff is a dynamic programming tool to identify line-by-line changes (GNU).

Siff and diff are not very useful on raw bytecode, even when the bytecode is disassembled. When a 4 was changed to a 5 in two lines of a 182-line Java file, diff generated 1,100 lines of output on the disassembled bytecode, and siff found less than 1% similarity.

Baker described three experiments. The first involved random changes to a Java source file (insertion, deletion, substitution within statements). The bytecode was compiled, disassembled, then encoded. The average similarity in the disassembled code as measured by siff was never more than 9% off from the same measurement on the Java source. Averages stayed very close, making this a promising approach.

In the second experiment, Baker's group tried to find similarities in 2,056 files from 38 collections. Thresholds were set as follows: siff reported pairs with at least 50% similarity, dup reported pairs matching at least 200 lines. Nine pairs of programs across different collections were reported as similar by both siff and dup. Eight of these had the same name. One program had the same implementation of the MD5 hash algorithm.

One pair was reported only by dup, probably a false positive. However, siff reported 23 pairs unreported by dup. Some had similar names, while the other pairs consisted of one very small file and one very large file; the small/large file pairs are probably false positives.

Experiment three involved false negatives. Baker's group asked friends to randomly pick 10 programs from a set of 765 Java programs, make random changes, and then compile the Java code, possibly with different compiler versions. The bytecode was then returned in random order.

Of the 12 pairs of similar code, siff found nine of 12 pairs with a threshold of 65%; dup found eight pairs with a threshold of 100 lines. Together siff and dup found 10 pairs. There is a trade-off between false positives/negatives and the threshold.

Baker found the offsets to be important for matching. Siff can handle large amounts of code, but diff requires the most intensive computation and blows up quadratically with the number of input files. When analyzing many files, use siff first, then dup, then diff.

An audience member asked whether Baker had tried comparing the output of two different compilers. Baker doubts they would find much similarity, but if you have the source, you can compile it with another compiler to test. As for false positives, lowering the threshold too far could yield hundreds of them. Moving code around will not affect dup but will affect siff, depending on the threshold. Using siff, dup, and diff in combination makes detection more powerful.

In further research, Baker's group hopes to use additional information in bytecode such as the constant pool.

Session: Networking

Summary by Jon Howell

The networking session was chaired by Elizabeth Zwicky of Silicon Graphics.

Transformer Tunnels: A Framework for Providing Route-Specific Adaptation

Pradeep Sudame and B. R. Badrinath, Rutgers University

Pradeep Sudame presented the concept of transformer tunnels as a way to provide better service to mobile hosts that encounter diverse networks. In a day, a mobile host might access the network at large over a modem, a cellular phone, a wireless LAN in the office, and a high-speed wired LAN. Each network has different properties, and transformer tunnels provide a way to manipulate the traffic going over the mobile host's link to minimize certain undesirable effects.

The mechanics of transformer tunnels are as follows: a routing table entry at the source end of the tunnel indicates that packets bound for a given link should be transformed by a certain function. The source node transforms the packet payload, rewrites the header to point to the tunnel destination, rewrites the protocol number to arrange for the transformation to be inverted at the far end, and attaches the original header information to the end of the packet so it can be reconstructed at the other end.

When the packet arrives at the destination, its protocol number directs it to the appropriate inverse transformation function. The reconstructed packet is delivered to IP, where it is delivered in the usual way to an application on the local host or forwarded on into the network.
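
A toy model of these mechanics, using the XOR proof-of-concept transform the authors mentioned (the protocol number, header layout, and JSON trailer encoding here are all hypothetical):

```python
import json

XFORM_PROTO = 253  # hypothetical protocol number assigned to the transform

def xor_transform(payload, key=0x5A):
    """Stand-in transform: XOR is its own inverse."""
    return bytes(b ^ key for b in payload)

def encapsulate(header, payload, tunnel_exit):
    """Source end: transform the payload, append the original header as a
    trailer, and retarget the packet at the tunnel exit."""
    trailer = json.dumps(header).encode()
    new_header = dict(header, dst=tunnel_exit, proto=XFORM_PROTO)
    body = xor_transform(payload) + trailer + len(trailer).to_bytes(2, "big")
    return new_header, body

def decapsulate(header, body):
    """Tunnel exit: the protocol number selects the inverse transform;
    the trailer reconstructs the original header for delivery to IP."""
    assert header["proto"] == XFORM_PROTO
    tlen = int.from_bytes(body[-2:], "big")
    payload, trailer = body[:-2 - tlen], body[-2 - tlen:-2]
    return json.loads(trailer.decode()), xor_transform(payload)
```

The reconstructed header and payload are then handed to IP, which delivers or forwards them as usual.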

Sudame gave interesting examples of how transformer tunnels can provide useful trade-offs for mobile hosts on links with different characteristics. A compression module is useful on slow links, trading off host-processing overhead. A big packet buffer compensates well for links with bursty losses (such as during a cell handoff), trading off memory requirements. A tunnel that aggregates packets to reduce radio transmitter on-time reduces energy consumption, trading off an increase in the burstiness of the link.

Joe Pruett, in the Q/A period, asked how the transformer deals with a maximally sized packet to which it needs to add overhead. Sudame responded that, for optional optimizations, it would be passed on unchanged; for mandatory transformations such as encryption, it would be transformed and then fragmented.

Ian Vaughn asked if IPsec was used for encryption, to which Sudame replied that they used only a simple exclusive-OR as a proof-of-concept.

Elizabeth Zwicky asked how difficult it was for an unfamiliar programmer to write a transformation function. Sudame replied that it required the programmer be only somewhat aware of systems programming concepts.

Sudame provided the following URLs for more information and indicated that the group would like many people to try out the code and comment on it. <> and <>.

The Design and Implementation of an IPv6/IPv4 Network Address and Protocol Translator

Marc E. Fiuczynski, Vincent K. Lam, and Brian N. Bershad, University of Washington

Marc Fiuczynski discussed an IPv6/IPv4 Network Address and Protocol Translator (NAPT). He identified three possible scenarios in which one might configure a NAPT: use within an intranet, providing your shiny new IPv6 systems with access to the existing IPv4 Internet, and duct-taping your rusty old IPv4 systems to the emerging IPv6 Internet. As he began his talk, Fiuczynski fumbled with the pointer, but then fell back on his Jedi light saber training, muttering, "Luke, use the laser pointer."

He outlined the project's goals for a translator: a translator should be transparent so that the end host is oblivious of its presence. It must scale with the size of the network it is serving. It should be failure resilient, in that it can restore service after a reboot. It should, of course, perform suitably. And finally, it should deploy easily.

A translator must attend to several issues. It needs to preserve the meaning of header bits across the header formats. It translates addresses between the IPv4 space and the IPv6 space, which it can do using a stateful or stateless server. And it also needs to translate certain higher-level protocols such as ICMP and DNS that encode IP addresses in the IP payload.

The group built two translators, one stateful and one stateless. The stateful translator has a table of IPv4 to IPv6 address mappings. It attempts to garbage-collect IPv4 addresses to reduce the number needed to serve a site. This garbage collection was challenging because "you might break someone's ongoing communication ... that would be bad ... that's definitely not a goal of the translator." However, because the translator is stateful, it is not scalable or fault resilient; because it requires rewriting some transport protocol headers, it is not transparent.

The stateless translator uses special IPv6 addresses that map one-to-one with IPv4 addresses. It is scalable, fault resilient, transparent, and has no need to garbage-collect IPv4 addresses. However, using the special compatibility addresses means that routers will still have the "stress" of routing IPv4-like addresses, a problem IPv6 addresses are designed to relieve. Fiuczynski concluded that a stateless translator is best suited to connecting an IPv6 site to the IPv4 Internet or for translating within an intranet.
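
The one-to-one mapping that makes the stateless translator possible can be sketched with IPv4-embedded IPv6 addresses. The ::ffff:0:0/96 prefix below is used purely for illustration and is not necessarily the address format the Washington translator used:

```python
# Stateless address translation sketch: every IPv4 address corresponds
# one-to-one to a special IPv6 address, so no mapping table (and hence
# no per-connection state) is needed at the translator.

import ipaddress

PREFIX = int(ipaddress.IPv6Address("::ffff:0:0"))  # illustrative /96 prefix

def v4_to_v6(v4):
    """Embed an IPv4 address in the special IPv6 prefix."""
    return ipaddress.IPv6Address(PREFIX | int(ipaddress.IPv4Address(v4)))

def v6_to_v4(v6):
    """Recover the IPv4 address from a prefix-embedded IPv6 address."""
    addr = int(ipaddress.IPv6Address(v6))
    assert addr >> 32 == PREFIX >> 32, "not an IPv4-embedded address"
    return ipaddress.IPv4Address(addr & 0xFFFFFFFF)
```

Because the mapping is a pure function of the address bits, a rebooted translator resumes service immediately, which is exactly the fault resilience claimed for the stateless design.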

Joe Pruett asked about DNS translation and whether all internal IPv6 nodes could be reachable from the outside network using IPv4 addresses. Fiuczynski replied that the stateful translator would have to garbage-collect addresses to share them among internal hosts and translate (or directly answer?) DNS queries according to the current mapping.

Greg Minshall asked what the difference was between IPv4 to IPv4 translators and Washington's IPv6 to IPv4 translators. The reply was that IPv4 NATs are stopgap measures with no clear replacement, but IPv6 translators are a transitional mechanism meant to be eventually removed.

Dave Presotto asked whether the system was rule based, that is, whether he could add new translation functions, other than IPv4 to IPv6 translation, in order to perform other functions using the same system. Fiuczynski expressed confidence that such an extension would be straightforward.

B.R. Badrinath asked if multicast address translation would be a problem, to which Fiuczynski offered a succinct "yes."

The work is documented at <>, and source will be available there soon.

Increasing Effective Link Bandwidth by Suppressing Replicated Data

Jonathan Santos and David Wetherall, Massachusetts Institute of Technology

Jonathan Santos spoke about his group's work in identifying and suppressing replicated data crossing a given network link. The work applies to any link that is a bottleneck due to cost or congestion reasons. The novel approach of the project was to identify redundancy in packet payloads traversing the link without using protocol-specific knowledge.

Santos defined "replicated data" as a packet payload that is byte-for-byte identical to a previously encountered payload. The researchers studied a packet trace from an Internet gateway at MIT and discovered that 20% of the outbound volume and 7% of the inbound volume met this definition. HTTP traffic was responsible for 87% of the replication found in the outbound trace, and 97% of the volume of replicated data was delivered in packets larger than 500 bytes, indicating that per-packet compression savings would dwarf any added overhead.

To identify whether the replication could be detected and exploited in an online system, they graphed replicated volume against window size. The graph had a knee at around 100-200MB, signifying that most of the available redundancy could be exploited with a cache of that size.

Their technique for redundancy suppression involved caching payloads at both ends of the link and transmitting a 128-bit MD5 fingerprint to represent replicated payloads. One issue involved retransmitting the payload when the initial packet (containing the real payload) is lost. They also prevent corruption due to fingerprint collisions (the unlikely possibility that two payloads share the same MD5 checksum) in the absence of message loss. (Greg Rose from Qualcomm Australia pointed out that RSA, Inc., issues a cash prize if you discover an MD5 collision. Hopefully, the software reports any collisions to the system administrator.)
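The fingerprint-caching idea can be sketched in a few lines. This is only an illustration of caching payloads at both ends of a link and substituting a fingerprint for repeats, not the authors' implementation; the class and method names are invented, with Python's hashlib supplying MD5:

```python
import hashlib

class SuppressingSender:
    """Hypothetical sketch: replace a previously seen payload with its
    128-bit MD5 fingerprint before it crosses the bottleneck link."""
    def __init__(self):
        self.seen = set()  # fingerprints of payloads already sent

    def encode(self, payload: bytes):
        fp = hashlib.md5(payload).digest()
        if fp in self.seen:
            return ("FP", fp)        # 16-byte token instead of the payload
        self.seen.add(fp)
        return ("DATA", payload)     # first occurrence: send in full

class SuppressingReceiver:
    """Far end of the link: caches payloads so tokens can be expanded."""
    def __init__(self):
        self.cache = {}  # fingerprint -> payload

    def decode(self, kind, body):
        if kind == "DATA":
            self.cache[hashlib.md5(body).digest()] = body
            return body
        return self.cache[body]      # replay the cached payload
```

A real system would also need the retransmission path described above for the case where the initial packet carrying the full payload is lost.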

Santos concluded that their system was a cost-effective alternative to purchasing link bandwidth and that it complements link-level compression well.

Fred Douglis inquired whether they might be able to identify and compress very similar but not identical packets in an online fashion. Santos suggested using fingerprinting at a finer granularity (over parts of packets).

Nick Christenson pointed out that most of the replication is due to outbound HTTP traffic and asked whether it might have been nearly as effective to simply use a Web cache on the outbound end of the link. Santos said they assumed client-side and proxy caches were in use when the traces were taken. [This does not account for the redundancy available if all clients passed through the same Web cache at the outbound end of the link, which appeared to be Christenson's point.]

Andy Chu pointed out that, to save bandwidth on a typically congested link to an ISP, one must funnel all data through one link and put the box at your ISP. [Also observe that the cost savings will apply only to the link bandwidth; the ISP will surely still desire compensation for the now-increased use of its upstream link.]

Session: Security

Summary by Kevin Fu

The papers in this session dealt with controlled execution of untrusted code. Specifically, the papers discuss how to confine untrusted code to a safe environment. Fred Douglis from AT&T Labs-Research served as the session chair.

Implementing Multiple Protection Domains in Java

Chris Hawblitzel, Chi-Chao Chang, Grzegorz Czajkowski, Deyu Hu, and Thorsten von Eicken, Cornell University

Chris Hawblitzel gave a confident, well-paced presentation of the J-Kernel, a portable protection system written completely in Java. The J-Kernel allows programmers to launch multiple protection domains within a single Java Virtual Machine (JVM) while maintaining communication and sharing of objects and classes in a controlled way.

Hawblitzel listed three ways an applet security model can enforce security:

  • restrict which classes an applet can access (type hiding)
  • restrict which objects an applet can access (capabilities)
  • perform additional checks (stack inspection)

However, a problem persists in that applets have no way to communicate in a secure, controlled way. Therefore, the J-Kernel group decided on three requirements for their protection system:

  1. Revocation. Java provides no way to revoke references to objects. Therefore, the J-Kernel must provide its own revocation mechanism on top of Java.
  2. Termination. If one merely stops a domain's threads, there may still be reachable objects from other domains. Such domains will not be garbage-collected. Therefore, the J-Kernel must free up objects when a domain terminates.
  3. Protection of threads. In maintaining control over a thread, ownership must change during a boundary crossing of a cross-domain call. Java lets you stop and change the priority of threads. This could allow for malicious behavior. The J-Kernel should not allow outside changes to a thread when another domain is in control.
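A common way to meet the revocation requirement is to hand out an indirection that the granting domain can later cut. The sketch below is a hypothetical Python analogue of that general capability pattern, not the J-Kernel's actual Java mechanism; all names are invented:

```python
class RevokedError(Exception):
    """Raised when a call is made through a revoked capability."""

class Capability:
    """Illustrative revocable reference: calls are forwarded through an
    indirection that the granting domain can sever at any time."""
    def __init__(self, target):
        self._target = target

    def invoke(self, method, *args):
        if self._target is None:
            raise RevokedError(method)
        return getattr(self._target, method)(*args)

    def revoke(self):
        # After this, the holder can no longer reach the target object,
        # so the target is free to be garbage-collected.
        self._target = None
```

Because domains hold only the wrapper, revoking it also lets objects be reclaimed when a domain terminates, addressing the second requirement above.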

The J-Kernel distinguishes between objects and classes that can be shared between domains and those that are private to a single domain. This distinction means the J-Kernel requires a revocation mechanism only for shared objects, simplifies the security analysis of communication channels, and allows the runtime system to know which objects are shared.

Hawblitzel noted that it can be hard to maintain a distinction between shared and private information. Private objects must not be passed through method invocations on shared objects to other domains. The J-Kernel solves this by passing shared objects by reference; private objects are passed by copy.

Using Microsoft's JVM or Sun's JDK with Symantec's JIT compiler on a 300MHz Pentium II running Windows NT 4.0, a null J-Kernel local RMI takes about 60x to 180x longer than a regular method invocation. This result is mostly due to thread management and locking when entering a call; synchronization comprises 60-80% of the overhead. The J-Kernel suffers some performance loss because it is written in Java. See the paper for a table of performance results.

The J-Kernel group created a few sample applications as well. They finished an extensible Web server and are working on a telephony server. Private domains interface to the Web, PBX, and phone lines, while user domains run servlets to process requests and calls. New services can then be uploaded safely. Related work includes Java sandbox extensions, object references treated as capabilities (e.g., Spin, Odyssey, E), safe language technology (e.g., Java), and capability systems (e.g., Hydra, Amoeba).

One audience member asked how the J-Kernel copies parameters and how it handles data considered to be transient. Hawblitzel explained that the J-Kernel can use serialization to copy objects (the objects are serialized into a big byte array, and then the byte array is deserialized to generate new copies of the objects), or it can generate specialized copy routines that are faster than serialization because they do not go through the intermediate byte array.
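The two copying strategies Hawblitzel described can be mimicked in Python, with pickle standing in for Java serialization; the helper names and the sample object shape are invented for illustration:

```python
import pickle

def copy_by_serialization(obj):
    # The general mechanism: serialize the object graph into a byte
    # array, then deserialize to produce independent copies.
    return pickle.loads(pickle.dumps(obj))

def copy_point_specialized(point):
    # A hand-written copy routine for one known object shape skips the
    # intermediate byte array entirely, which is why it is faster.
    return {"x": point["x"], "y": point["y"]}
```

The specialized routine trades generality for speed, mirroring the talk's point that generated copy routines outperform generic serialization.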

Source code and binaries are available for NT and Solaris. For more information, see <>.

The Safe-Tcl Security Model

Jacob Y. Levy, Laurent Demailly, Sun Microsystems Laboratories; John Ousterhout and Brent B. Welch, Scriptics Inc.

Safe-Tcl allows execution of untrusted Tcl scripts while preventing damage to the environment or leakage of private information. Safe-Tcl uses a padded cell approach (as opposed to pure sandboxing or code signing). Each script (applet) operates in a safe interpreter where it cannot interact directly with the rest of the application. Safe-Tcl's main claim to fame is its flexibility: the capabilities given to the script can be increased or decreased based on the degree to which the script is trusted.

Unfortunately, there was some confusion among the authors about who was supposed to present, with the result that no one showed up at the session. However, the paper is well written and worth the read. You can find related material from Scriptics <>, the Tcl Resource Center at Scriptics <>, the Tcl Consortium <>, or the Tcl plugin download page <>. The plugin is the best example of an application using Safe-Tcl and is a good starting point for people who want to learn more about Safe-Tcl.

Session: Work-in-Progress Reports

Summaries provided by the authors, and edited and compiled by session chair Terri Watson Rashid

Jon Howell <> described Snowflake, a project to allow users to build single-system images that span administrative domains. The central concept is that users must perform the aggregation of their single-system image in order to freely aggregate resources supplied by multiple administrators.

Ray Pereda <> talked about a new programming language that he and Clint Jeffery developed at the University of Texas at San Antonio. The language, called Godiva, is a very high-level dialect of Java.

Bradley Kuhn <> discussed his work at the University of Cincinnati on Java language optimization. The goal of this research is to create a framework for implementation of both static and dynamic optimizations for Java. Such a framework will allow for testing and benchmarking of both new and old dynamic and static optimizations proposed for object-oriented languages in the literature. The framework will build upon existing pieces of the free Java environment, such as Guavac and Japhar, to make implementations of new optimizations for Java easy and accessible to all.

Jun-ichiro Itoh <> talked about his IPv6 and IPsec effort in Japan. The project, called the KAME Project, is trying to establish BSD-copyrighted, export-control-free reference network code for Internet researchers as well as for commercial use. They also intend to incorporate the best code for recent topics, such as QoS, IP over satellite, etc.

Oleg Kiselyov <> spoke about an HTTP Virtual File System for Midnight Commander (MC). The VFS lets the MC treat remote directories as if they were on a local filesystem. A user can then view and copy files to and from a remote computer and even between remote boxes of various kinds. A remote system can be an arbitrary UNIX/Win95/WinNT box with an HTTP server capable of running a simple, easily configurable sh or Perl script.

Ian Brown <> described ongoing work on signatures using PGP that can be checked only by people designated by the signer. Typical digital signatures on messages can be checked by anyone. This is useful for contracts, but for confidential messages senders may not want recipients to be able to prove to anyone what they wrote. Comments during the WIP session pointed out that the current approach did not check integrity of encrypted but unsigned data. The authors noted that this is a general PGP problem and have since augmented their design to fix this.

Tom M. Kroeger <> from the University of California, Santa Cruz, presented some preliminary work on efficiently modeling I/O reference patterns in order to improve caching decisions. This work attempts to use models from data compression to learn the relationships that exist between file accesses. They have been addressing the issues of model space and adapting to changing patterns by partitioning the data model and limiting the size of each partition. They are working to implement these techniques within the Linux operating system.

Poul-Henning Kamp <> talked about "timecounters," a new concept for tracking real time in UNIX kernels. With this code, NTP can track any current or future time signal into the 1E-18 second regime, limited only by hardware issues. A couple of plots showed how NTP tracked a GPS receiver with approximately 10 nsec of noise.

James Armitage <> and Bari Perelli-Minetti <> spoke about their research with John Rulnick in the Network Operations Research Lab at WPI concerning the causes of soft (transient) errors in workstation and server memory and the effects of these errors on system operation. The techniques being used to explore the effects of soft errors were also briefly presented. One member of the audience provided information on related experiments on errors occurring in satellite circuits due to cosmic rays in space.

Kostas Magoutis <> briefly talked about his work on eVINO, an extensible embedded kernel for intelligent I/O based on the VINO operating system for task management, extensibility, and networking. He argued that I/O platforms (IOP) in the form of add-on adapters with fast, programmable I/O processors are effective in helping servers face demands of today's gigabit networks and RAID storage systems, offloading interrupt processing and allowing them to better multiplex their resources to application programs. eVINO will focus on I/O and provide extensibility on the IOP with applications such as active networking and Web server caching.


Repetitive Strain Injury: Causes, Treatment, and Prevention

Jeff Okamoto, Hewlett-Packard

Summary by Eileen Cohen

Jeff Okamoto, a lively speaker with a sense of humor that helped lighten a grim topic, spoke about Repetitive Strain Injury (RSI) from the depths of personal experience. He worked hard for ten years before his injury (which, he said, may take as long as another ten years to go away) started to occur. At that point he decided to educate himself about RSI, and he used his talk to give the audience the lessons he learned "the hard way, and with ignorance and some amount of fear."

RSI is a topic of serious import to computing professionals. Not only is more work being done at desktop machines than ever before, but many people are also working longer and longer hours at their computers. It was estimated in 1994 that over 700,000 RSIs occur every year, with a total annual cost of over $20 billion. RSI can be devastating, affecting one's ability not only to do a job, but also to perform the basic tasks, and enjoy the pleasures, of daily life.

Okamoto began with facts about human anatomy that explain why RSIs occur, then moved on to discuss ergonomics. Companies spend a lot on ergonomics, and even though they're not all doing it the right way, he urges participating in any ergonomic assessment program your employer offers: "it can be a lifesaver." He provided detailed tips about hand position and motion, criteria for a good chair, monitor position, and use of pointing devices.

After explaining the range of possible RSI diagnoses and treatments, Okamoto emphasized that if an injury happens on the job, the only way to get proper medical help without paying out of your own pocket is to open a worker's compensation case. (Many people resist doing this.) Employers are legally bound to provide the necessary paperwork you need to file a claim. Unfortunately, said Okamoto, "having to deal with the worker's comp system is the worst thing in the world for me." He gave valuable advice, based on his experience in California, on negotiating the system, in particular how to choose your own physician instead of using one from the short list the state provides, who is "likely to be biased against you."

"An RSI is something I wouldn't wish on my worst enemy," said Okamoto. As a closing note, he raised the specter of what will happen if future computer users, who are getting younger and younger, are not trained as children to type and point properly. "By the time they get out of college, they'll be 90% on the road to injury."

The slides from this talk are available at <>. They contain pointers to books, Web resources, and mailing lists on the topic.

Mixing UNIX and PC Operating Systems via Microkernels: Experiences Using Rhapsody for Apple Environments and INTERIX for NT Systems

Stephen R. Walli, Softway Systems, Inc.; Brett R. Halle, Apple Computer, Inc.

Summary by Kevin Fu

Stephen Walli, vice president of research and development at Softway Systems, started the session by discussing INTERIX, a system that allows UNIX application portability to Windows NT. For the second half of the invited talk, Brett R. Halle, manager of CoreOS at Apple Computer, talked about the Rhapsody architecture.

Walli first noted that INTERIX is the product formerly known as OpenNT. Just this week the product was renamed to avoid confusion with Microsoft products. Walli further noted, "This is not a talk about NT being great."

Walli explained his "first law" of application portability: every useful application outlives the platform on which it was developed and deployed. There exist migration alternatives to rewriting applications for Win32. For instance, one could use UNIX emulation, a common portability library, the MS POSIX subsystem, or INTERIX. On another side note, Walli exclaimed, "MS POSIX actually got a lot right. Originally, the ttyname function just returned NULL! There were stupid little things, but the signalling and forking were done right."

However, there is a problem with rewriting an application to Win32. The cost of rewriting increases with the complexity of the application. This led into Walli's discussion of the design goals of INTERIX:

  • Complete porting of runtime environment for UNIX.
  • Provide true UNIX semantics for the system services.
  • Ensure that changes to application code make it more, not less, portable.
  • Maintain good performance.
  • Do not compromise the security of NT.
  • Integrate INTERIX cleanly into an NT world.

The first big step was implementing Berkeley sockets. This Walli called "a big win." System V IPC was a big win, too. Other interesting tidbits about INTERIX include:

  • ACLs that map to file permissions
  • no /etc/passwd or /etc/group
  • no superuser

Walli tried not to "mess with the plumbing," but the INTERIX development team did have to make a /proc system for gdb.

Asked why INTERIX does not implement SUID capabilities, Walli explained that INTERIX did not implement SUID because of implications to the filesystem. If INTERIX provided an interface, it would have to provide complete semantics. As an alternative, INTERIX created a SetUser environment.

Another audience member asked about memory requirements to run INTERIX. Walli noted that NT itself requires more resources when moving from NT 3.51 to 4.0. INTERIX does not need much more space after getting enough memory for NT. 32MB is sufficient.

The INTERIX group has ported SSH, but Walli's CEO got paranoid and said "not in our Web site." SSH is available in Finland because of export laws.

Walli concluded with his "second law" of application portability: useful applications seldom live in a vacuum. After Berkeley sockets were implemented, the Apache port required just hours. Early porting experiences include 4.4BSD-Lite, GNU source, Perl 5, Apache, and xv.
See <>.

For the second half of the session, Halle discussed the Rhapsody architecture. He mainly summarized where Rhapsody came from and what it includes. Rhapsody evolved from NextStep 4.2 (Mach 2.x, BSD 4.3, OpenStep) and later became MacOS X (Mach 3.0, Carbon, Blue Box). Portions of the code came from NetBSD, FreeBSD, and OpenBSD. CoreOS provides preemption and protection; supports application environments, processor, and hardware abstractions; and offers flexibility and scalability. Rhapsody runs on Intel boxes as well as the PowerPC. Rhapsody includes several filesystems, such as UFS, ISO 9660, and NFS. However, it does not support hard links in HFS+ or UFS. The networking is based on the BSD 4.4 TCP/IP stack and the ANS 700 Appletalk stack.

The audience then bombarded Halle with questions. Halle said that the yellow box will be available for Win95 and WinNT.

An audience member noted that Halle hinted at the cathedral and bazaar model. Apple could kickstart free software as far as GUIs go. When asked if there are any rumblings about giving back to the community, Halle replied that MKLinux is a prime example.

Another audience participant questioned Rhapsody's choice to omit some of the most common tools found in UNIX. "If you want to leverage applications, then don't make UNIX shell tools optional," said the participant. Halle responded with an example where UNIX tools could hurt acceptance by the general population. Your grandma or kid could be using this operating system. Halle reasoned that full UNIX does not make a lot of sense in all environments. You want the right minimal set, but there should be a way to obtain the tools.

Rhapsody does include SSH and TCP wrappers. For more information on Rhapsody, see Programming under Mach by Joseph Boykin et al., <>, or <>.

Random, quotable quotation:

Walli: "NT is the world's fattest microkernel with maybe 36 million lines of code. Now that's a microkernel."

Succumbing to the Dark Side of the Force: The Internet as Seen from an Adult Web Site

Daniel Klein, Erotika

Summary by David M. Edsall

The weather wasn't the only thing that was hot and steamy during the USENIX conference in New Orleans. In one of the most popular invited talks of the conference, Dan Klein entertained and educated a packed auditorium with his discussion of what is necessary to carry the world's oldest profession to the realm of the world's latest technology.

Humorously introduced as a "purveyor of filth" and a "corrupter of our nation's youth," Klein went on to show us he was anything but and why everyone around him, including his mother, thinks it is OK. Klein has given talks worldwide and is a freelance software developer. But he is probably best known to the USENIX community as the tutorial coordinator, a skill he used well in teaching all of us everything we always wanted to know about Internet sex but were afraid to ask.

He began by reminding us of the stereotypes of the adult entertainment industry. Images of Boogie Nights, dark alleys, and ladies of the evening all come to mind. What we don't realize is that there are "ordinary people" working in this industry as well. The owner/administrators of two of the more popular Web sites, Persian Kitty and Danni's Hard Drive, were each housewives before their online business skyrocketed.

Klein then discussed the two tiers into which the industry can be split. Tier 1 consists mainly of magazine publishers, filmmakers, and prostitutes, while Tier 2 includes resellers such as Klein and phone sex operators. In his explanation of the "product" purveyed by the adult Web business, he stated, "If there is a philia, phobia, mania, or pathia for it, it's out there. All the world's queer except thee and me, and I'm not so sure about thee. It's not bad, just different." In his opinion, "You can stick whatever you want wherever you want to stick it so long as what you stick it in wants to get stuck."

How much money can be made from the online sex industry? Examples Klein gave included Persian Kitty, which earned $900,000 in its first year and now pulls in $1.5 million selling only links. Another company, UltraPics, has 14,000 members at $12 per member. Club Love had 20 times more hits in one day than <> did in a month. Klein himself is in a four-person partnership, and his share is up to $3,000 in a good month.

Where does one obtain the product? Some companies simply scan the pictures from magazines in blatant violation of copyright law. Some use Web mirroring to essentially steal their content from similar Web sites. Others, such as Klein's company, download noncopyrighted images from USENET newsgroups and repackage them. Klein's group also provides original content, hiring models and photographers. This comes with its own complications. Klein described the need for qualified photographers, proper lighting, props, costumes, and a director.

Running an adult Web site requires a variety of technologies. To conserve resources, it helps to use compressed images, and Klein is convinced that the adult industry is one of the major influences driving digital compression. It also helps to split the server load among several machines; he elaborated on a number of ways to accomplish this, including DNS round robin and front-end CGIs. In addition, good programming is useful for automating your site, relieving you of tedious tasks such as wading through USENET postings, administering memberships, handling site updates, logging, reporting, accounting, and checking for people who have shared their passwords.
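DNS round robin, one of the load-splitting techniques mentioned, simply hands out mirrored servers' addresses in rotation. A minimal sketch of the effect, with made-up addresses and no claim to reflect Klein's actual setup:

```python
import itertools

# Hypothetical mirror pool; a DNS round-robin A-record set rotates
# through its addresses in the same way this generator does.
SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
_rotation = itertools.cycle(SERVERS)

def pick_server():
    """Return the next server address in round-robin order."""
    return next(_rotation)
```

Each new client lands on the next mirror, spreading requests roughly evenly without any coordination between servers.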

Klein discussed, to the dismay of a few members of the audience, ways in which you can boost the visibility of your Web site in top ten lists and search engines. One method uses "click bots" to artificially increase the number of hits your pages receive. Another well-known trick is including a popular word, trademark, or brand in META tags for broader exposure to search engines. Klein also described nastier techniques, such as DNS cache poisoning attacks, misleading ads, and domain name spoofing.

All of this does not come without a price. Klein described the importance of making sure you abide by the law. He described his own methods of USENET scanning as acting as a common carrier, ignorant of whether or not the material is copyright protected, and making sure the images they display carry no copyright notice. When photographing models, his company goes to great lengths to make sure they are of legal age and that they are videotaped to prevent lawsuits. He stressed the importance of reporting all income and paying the IRS their due. Most of all, Klein emphasized, "Don't tempt fate. If [they] look too young, they probably are."

In the brief question-and-answer session which followed, one attendee asked if the adult Web market has peaked. Klein responded by saying, "I think it can still triple before that happens. It's like the Red Light District in Amsterdam. Eventually, people stopped being really interested, but it still is thriving after years." How will Klein deal with it? "I'm honest and fair and not ashamed of it."

ADAPT: A Flexible Solution for Managing the DNS

Jim Reid and Anton Holleman, ORIGIN TIS-INS

Summary by Jim Simpson

As more and more domains and networks come online, the amount of DNS management involved will only increase. Reid and Holleman implemented a large-scale solution for DNS management for a production network at Philips. Their presentation was given in two parts: first, Holleman gave a demure explanation of the new DNS architecture, and then Reid gave a sometimes amusing account of the tool they developed and the problems they had deploying it, especially when describing an interaction with a particular NT client.

ORIGIN is a global IT service company created by the merger of two parts of the Philips group. Philips is a large corporation, and with a large, far-reaching, and networked corporation came the need for a large DNS. They use a split DNS policy for security and billing purposes, but the old-style DNS architecture in such an environment grew to the point where zone transfers on the nameservers were failing because of resource problems, making the DNS service erratic. This created the need and desire to reimplement their DNS architecture; the criteria that had to be met included:

  • The systems needed to be adaptable because Philips is diverse in its DNS needs.
  • In providing service to Philips, ORIGIN cannot impose a solution; it must fulfill a need.
  • Fast lookups are necessary with minimal impact upon WAN.
  • The system must be scalable and robust, yet simple, and cannot rely on delegated nameservers being available 24 hours a day, 7 days a week.

They decided to create a centrally managed backbone. They used BSD/OS as their platform because it was already used by large ISPs, the VM subsystem is nameserver-friendly, and it's commercially supported, though some system administrators still pined for Solaris. BIND8 was their choice for software; despite poor documentation they were happy with it during testing. Because BSD/OS runs on i386, there was no real choice in hardware, but this also worked out well due to the low prices associated with Intel-based machines and their ease of replacement. Nameservers with good network connectivity to the global backbone in fixed locations were located and installed in pairs for redundancy.

The actual setup consists of three parts: DNS server architecture, DNS resolver architecture, and DNS contents architecture, parts of which were designed to be centrally managed. In order to manage DNS and move it along, they had to come up with a tool, which they ended up calling ADAPT. In their scheme, ADAPT eliminates the need for local DNS administration; the local admin controls local data, and the backbone people control the "glue." If local admins update something, they send it to the backbone using dnssend. If the data are good, they are put into the DNS database.

There are still some unresolved problems, a few being:

  • A lot of sites still run brain-dead nameservers.
  • There are strange interactions with WINS.
  • There are bizarre nameserver configurations.

However, they met their design goals: costs are low and service is stable.

The session ended with a myriad of questions:

  • Q: How do you cope with caching nameservers? A: Don't use dynamic templates; use notify instead.
  • Q: How do you deal with errors in the host file? A: Their makefile creates backups of previous good data; servers are installed in pairs.
  • Q: How is that done, and what happens when one goes down? A: They're installed manually, and people have to go in and reboot them by hand.

Panel Discussion: Is a Clustered Computer In Your Future?

Moderator: Clem Cole, Digital Equipment Corporation
Panelists: Ron Minnich, Sarnoff Corporation; Fred Glover, Digital Equipment Corporation; Bruce Walker, Tandem Computers, Inc.; Rick Rashid, Microsoft, Inc.; Frank Ho, Hewlett-Packard Company.

Summary by Steve Hanson

This panel was interesting in that it presented clustered computing from a number of different viewpoints. It was light on detailed technical information, but made it clear that the panelists, all representing major manufacturers or trends in clustered computing, were looking at the topic from very different viewpoints and were solving different problems. First each panelist presented information on the clustering product produced or used by his company. These presentations were followed by a roundtable discussion of clustering, including a question-and-answer period.

Fred Glover explained that Digital introduced the TruCluster system in spring 1994. TruCluster had limited capabilities at introduction, but now provides a highly available and scalable computing environment. TruCluster supports up to tens of nodes, which may be SMP systems. Normal single-host applications run in this environment. TruCluster provides a single-system view of the cluster. Standard Digital systems are used in the cluster and are connected with a high-speed interconnect. A Cluster Tool Kit adds distributed APIs so that applications may be coded to better take advantage of the environment. Digital's emphasis is on running commercial applications in this environment, essentially providing a more scalable computing environment than is available in a single machine as well as higher availability through redundant resources.

The primary motivations for HP's cluster environments are to provide higher availability and capacity while lowering the cost of management and providing better disaster recovery. Frank Ho stated that 60% of servers are now in mission-critical environments and that 24x7 operation is increasingly important. Downtime has a significant impact on companies. HP's goal is to guarantee 99.999% uptime, equivalent to about five minutes of downtime per year. Today's high-availability clusters guarantee 99.95% uptime and are generally taken down only for major upgrades. HP clearly has emphasized high availability, which is the primary thrust of its marketing to commercial customers.

Bruce Walker's talk was primarily about the comeback of servers and their evolution from single boxes to clusters. According to Walker, clustering currently falls into two categories: fail-over clusters, which provide high availability, and NUMA clusters, which provide higher performance. Tandem claims that its full SSI (Single System Image) clustering provides both parallel performance and availability. Tandem currently ships 2-6 node clusters with up to 24 CPUs. Tandem is working with SCO on its operating system, which is based on UnixWare.

The Microsoft agenda for clustered computing is to introduce clustered computing technology in stages. Clustered computing currently is a very small portion of the marketplace. According to Rick Rashid, the current thrust of Microsoft's strategy is to work on high availability solutions (which are available currently in NT) and to introduce scalability of clustering in 1999.

Ron Minnich spoke about the implementation of high-powered computing clusters on commodity PC systems. There is a history of implementing high-powered computer clusters on commodity systems. FERMILAB and other high-energy physics sites have for years done the bulk of their computing on clusters of small UNIX workstations. The availability of free, stable operating systems on very inexpensive hardware has allowed the design of very high powered computing clusters at comparatively low prices. Minnich made the disputable statement that PC/Open UNIX computing is about 10 times as reliable as proprietary UNIX systems and 100 times as reliable as NT. Having formerly been an administrator for large computer clusters on proprietary UNIX systems at FERMILAB, I somehow doubt that, because the UNIX clusters we were using almost always failed due to hardware failure, not OS failure. I find it difficult to believe that $1,000 PC systems are more than 10 times more reliable than entry-level UNIX desktops. However, I think the point is well taken that this is a means of building reliable computer engines of very high power at a very low price. Red Hat currently offers a $29.95 Beowulf cluster CD that includes clustering software for Linux.

A question-and-answer period followed the presentations. The questions indicated that many organizations in the real world are asking for more from the clustering vendors than they currently provide. Questions were asked about a means of developing and debugging software for a high-availability environment and about how to establish a high-availability network across a large geographic area. The responses of the panel members seemed to indicate that they hadn't gotten that far yet.

Other discussions involved recommendations of other approaches to clustering, including use of Platform Computing's LSF software product <> as well as the University of Wisconsin's Condor software, which excels for use in environments where unused hours of CPU on desktop systems can be harvested for serious compute cycles <>.

Berry Kercheval asked whether it is important for a cluster to provide a single system image. As an example he mentioned Condor computing, which is not a single system image design, but provides an environment that looks like a single system to the application software. He also made the point that SSI clusters are likely to be more viable because they are a simpler mechanism for replacing mainframes in a commercial environment.

The Future of the Internet

John S. Quarterman, Matrix Information and Directory Services (MIDS)

Summary by David M. Edsall

Death of the Internet! Details at eleven! That may be the conclusion you would have drawn had you read Bob Metcalfe's op-ed piece in the December 4, 1995, issue of InfoWorld <>, where he predicted the collapse of the Internet in 1996. Fortunately, his dire predictions did not come true. In his talk in New Orleans, Internet statistician John Quarterman showed us why.

Quarterman is president of Matrix Information and Directory Services (MIDS) in Austin, Texas, a company that studies the demographics and performance of the Internet and other networks worldwide. Drawing upon the large collection of resources and products available from MIDS, Quarterman educated an attentive audience on the past and current growth of the Internet and other networks before taking the risk of making predictions of his own. (The slides presented by Quarterman, including the graphs, are available on both the USENIX '98 conference CD and at <>.)

Quarterman began his talk by discussing the current number of Internet hosts worldwide. Not surprisingly, most of the hosts are located in the industrialized countries, with a dense concentration in large urban centers. What is exciting is the number of hosts popping up in some of the more remote areas of the world. As Quarterman said, "Geographically, the Internet is not a small thing anymore."

Next, he discussed the history of the growth of computer networking from the humble beginnings of the ARPANET with two nodes at UCLA and SRI, through the split of the ARPANET in 1983 and the subsequent creation of the NSFNET, to the eventual domination of the Internet over all of the other networks. Quarterman showed how many of the other networks (UUCP, BITNET, and FIDONET) have reached a plateau and are now declining in use, while the Internet continues to increase its number of hosts at an exponential rate. He similarly showed the parallel growth in the number of Matrix users (users who use email regardless of the underlying network protocol used for its transmission), with the Matrix users increasing more rapidly due to the multiplicity of networks available in the 1980s.

Quarterman next showed the growth of the Internet in countries worldwide. As expected, the United States currently leads the rest of the world in total number of hosts and has a growth rate similar to that of the Internet as a whole. He attributed the slow growth in the number of Japanese hosts to the difficulty that Japanese ISPs had in obtaining licenses, a restriction that was eased in 1994, leading to a large spurt in Japan's current growth rate.

Moving on to the present day, Quarterman displayed an interesting plot that reflects the latency vs. time of various hosts on the Internet <>. It was this graphic that persuaded Bob Metcalfe that large latencies do not remain constant, and hence there will be no global breakdown. (But Metcalfe still maintains that there could be a local crash; this is a much less controversial position, because few people would disagree that there are often localized problems in the Internet.) Even more interesting was an animated image of a rotating globe with growing and shrinking circles <> representing latencies in various parts of the world during the course of a day. This image showed that the latencies undulate on a daily basis much like the circadian rhythm obeyed by the human body. The image also shows different patterns, depending on which country you study. Latencies appear to increase at noon in Japan and decrease in the afternoon in Spain.

With the past and the present behind, what lies in the future for the Internet? Using a deliberately naive extrapolation of the data presented, Quarterman predicted that, by the year 2005, the number of hosts in the world will nearly equal the world's population. He also predicted that the US will continue to have the dominant share of the world's Internet hosts, but will eventually reach a saturation point. But it is difficult to say where any bends in the curve will come.
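The kind of deliberately naive projection Quarterman described amounts to fitting a straight line to the logarithm of the host count and extending it forward, with no allowance for saturation or "bends in the curve." A minimal sketch of that idea follows; the host counts are illustrative placeholders, not MIDS data.

```python
import math

# Illustrative (year, host-count) samples -- placeholder values, not MIDS data.
samples = [(1993, 1_300_000), (1995, 4_900_000), (1997, 19_500_000)]

# Fit log10(hosts) = a*year + b by ordinary least squares.
xs = [year for year, _ in samples]
ys = [math.log10(hosts) for _, hosts in samples]
n = len(samples)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def projected_hosts(year):
    """Naive exponential extrapolation: assumes growth never levels off."""
    return 10 ** (a * year + b)

print(f"Projected hosts in 2005: {projected_hosts(2005):.3g}")
```

With exponential placeholder data like this, the projection for 2005 lands in the billions, which is the flavor of result that makes the "hosts approach world population" prediction plausible while also showing why it is fragile: any future bend in the curve invalidates the fit.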

His confidence in his projected growth of the Internet is based partly on comparisons he has made between the Internet and other technologies of the past. He presented a plot of the number of telephones, televisions, and radios in the United States vs. time alongside the growth of the world's population and the growth of the Internet. The growth of the Internet has been much faster than the growth of any of these industries. All three of the older technologies eventually levelled off and paralleled the world's population growth, whereas the Internet shows no signs of doing so soon. He has yet to find any technology whose growth compares with that of the Internet. At this point, Quarterman asked the audience to suggest other technologies whose growth could be compared to the Internet. Ideas, both serious and humorous, included the sales of Viagra, production of CPU chips, automobile purchases, and size of the UNIX kernel. A grateful Quarterman stated that no other audience had ever given him so many suggestions.

Quarterman finished by discussing possible future problems that may hinder the growth and use of the Internet. In a scatter plot of the number of hosts per country per capita and per gross national product, he showed the audience that the growth of the Internet is also dependent on economic and political conditions around the world. Countries with large numbers of hosts tend to have higher standards of living and less internal strife. He also stressed the importance of the social behavior of those using the Internet. In a long discussion of spam, Quarterman revealed that he prefers that the governments of the world adopt a hands-off approach, leaving the policing of the net to those who have the ability to control mail relays. You can find more information at <>.

Will the Internet eventually collapse and fold? Stay tuned.


Machine-Independent DMA Framework for NetBSD

Jason R. Thorpe, NASA Ames Research Center

Summary by Keith Parkins

Jason Thorpe spoke about the virtues of and reasons for developing a machine-independent DMA framework in the NetBSD kernel. The inspiration seems fairly obvious: if you were involved with porting a kernel to different architectures, wouldn't you want to keep one DMA mapping abstraction rather than one per architecture? Given the fact that many modern machines and machine families share common architectural features, the implementation of a single abstraction seemed the way to go.

Thorpe walked through the different DMA mapping scenarios, bus details, and design considerations before unveiling the bus access interface in NetBSD. A couple of questions were asked during the session, but most of the answers can be found in the paper.

Thorpe seems to have followed the philosophy of spending his time sharpening his axe before chopping at the tree. He noted that while the implementation of the interface took a long time, it worked without code hacks on architectures with varying cache implementations, such as MIPS and Alpha.

Examples of the front end of the interface can be found at <> in the pci, isa, eisa, and tc directories. For examples of the back end, look in the ic directory.

Linux Operating System

Theodore Ts'o, MIT; Linus Torvalds, Transmeta

Summary by Keith Parkins

They had to take down the walls that separated the Mardi Gras room into three smaller cells for the Linux state-of-the-union talk. Instead of being shunted off into seats that would put them out of visual range, audience members chose to position themselves against the back wall for a closer view. This was not what the speakers had expected, because they had initially envisioned the talk as a BOF.

Theodore Ts'o began the aforementioned "state of the union." He started out by citing a figure gathered by Bob Young, CEO of Red Hat, 18 months earlier. The figure in question was the number of Linux users at that time, which is not an easy figure to derive, because one purchased copy of Linux can sit on any number of machines and people are also free to download it. The best results derived from various metrics showed users at that time to number between three and five million. Today, similar metrics show this number at six to seven million, although Corel, in its announcement of a Linux port of its office suite, claimed the number to be eight million.

Because of these rising figures and subsequent rising interest by commercial developers, Ts'o noted that the most exciting work in the Linux universe was taking place in userland, not in the kernel, as had historically been the case. He noted the development of the rival desktop/office environments, KDE and GNOME, and new administration tools that make it easy for "mere mortals" to maintain their systems. Ts'o also talked briefly about the Linux Standard Base, an attempt to keep a standard set of shared libraries and filesystem layouts in every Linux distribution so that developers don't lose interest because of having to port their software to each distribution of Linux.

Ts'o then spoke about the ext2 filesystem. The first thing he emphasized was that although most distributions use ext2fs, it is not the only filesystem used with Linux. He touched upon ideas such as using b-trees everywhere to show the other work out there. While work continues on ext2fs, Ts'o stated that the developers' number one priority is to ensure that any new versions do not endanger stored data. This means extensive testing before placing the filesystem in a stable release, probably not until 2.3 or 2.4. Features to come include metadata logging to speed up recovery, storing capabilities in the filesystem (a POSIX security feature to divide the setuid bit into discrete capabilities), and a logical volume manager.

When Linus Torvalds took the floor, he too expressed his hope that the presentation would be a BOF. In keeping with this hope, he kept his portion of the state of the Linux union brief before opening the floor to questions. Instead of focusing on what people can expect in future releases, he concentrated on two differences between 2.2 and earlier releases. He quickly noted that the drop in memory prices had encouraged the Linux kernel developers to exploit the fact that many machines have a lot more memory than they used to. He elaborated that the kernel will still be frugal with memory resources, but that it seemed poor form not to exploit this trend. He then noted that although earlier releases had been developed for Intel and Alpha machines, 2.2 will add SPARC, PowerPC, ARM, and others to the list. As Torvalds put it, "anything that is remotely interesting, and some [things] that aren't" will be supported.

There were many good questions and answers when the floor was opened. When asked if Torvalds and company would make it easier for other flavors of UNIX to emulate Linux, Torvalds replied that although he was not trying to make matters difficult for others, he was not going to detract time from making clean and stable kernels to make Linux easier to emulate. He also noted that the biggest stumbling block for others in this task would be Linux's implementation of threads.

On a question concerning his stance on licensing issues, Torvalds stated simply that he is developing kernels because it was what he enjoys doing. He went on to state that he personally does not care what people do with his end product or what they call it.

When asked about the Linux kernel running on top of Mach, Torvalds stated that he feels the Mach kernel is a poor product and that he has not heard a good argument for placing Linux on top of it. At one point, Torvalds had considered treating the Mach kernel as just another hardware port. He said that he later changed his mind, when he saw that a kernel running natively on an architecture runs much faster. This initial question led to a question as to whether Linux will become a true distributed operating system. Torvalds stated that he feels it does not make sense to do all the distribution at such a low level and that it makes more sense to make the hard decisions at a higher level, with some kernel support.

The closing comment from the floor was a thank-you to Torvalds for bringing the world Linux. Torvalds graciously responded by saying that while he is happy to accept the thanks, he was not as involved in the coding process, and the thanks should go out to all the people involved with writing the kernel and applications as well as himself.

Panel Discussion: Whither IPSec?

Moderator: Angelos D. Keromytis, University of Pennsylvania
Panelists: Hugh Daniel, Linux FreeS/WAN Project; John Ioannidis, AT&T Labs - Research; Theodore Ts'o, MIT; Craig Metz, Naval Research Laboratory.

Summary by Kevin Fu

Angelos Keromytis moderated a lively discussion on IPSec's past, present, and future. In particular, the panelists addressed problems of IPSec deployment. The panel included four individuals intimately involved with IPSec. IPSec is mandatory in all IPv6 implementations.

A jolly John Ioannidis claims to have been involved with IPSec "since the fourteenth century." He became involved with IPSec before the name IPSec existed. As a graduate student, Ioannidis spoke with other security folks at an IETF BOF in 1992. Later, Matt Blaze, Phil Karn, and Ioannidis talked about an encapsulating protocol. Finally, to win a bet, Ioannidis distributed a floppy disk with a better implementation of swIPe, a network-layer security protocol for the IP protocol suite.

Theodore Ts'o pioneered the ext2 filesystem for Linux, works on Kerberos applications, and currently chairs the IPSec working group at the IETF. He gave a short "this is your life" history of IPSec. After the working group formed in late 1993, arguments broke out over packet formats. However, the hard part turned out to be the management of all the keys. Two rival solutions appeared: Photuris by Phil Karn and SKIP by Sun Microsystems. SKIP had a zero round-trip setup time, but made some assumptions that were probably not applicable for wide-scale deployment on the Internet. Then ISAKMP/Oakley, backed by the NSA, developed as a third camp. Ironically, the ISAKMP protocol was designed to be modular, but the implementations are not so modular; Microsoft did not implement modularity, in order to make the software more easily exportable. Ts'o described the current status of IPSec as the "labor" phase for key management and procedural administrivia. Looking to the future, Ts'o noted there is some interest in multicast, but he worries about the trust model of multicast: if 1,000 friends share a secret, it can't be all that secret. Ts'o also stressed the difference between host- and user-level security. Are we authenticating the host or the user? Will IPSec be used to secure VPNs and routing, or the user? Right now the answer is VPNs.

Hugh Daniel is the "mis-manager" for a free Linux version of the Secure Wide Area Network (FreeS/WAN). Because of the lords in DC, foreigners coded all of FreeS/WAN outside the US. There is a key-management daemon called Pluto for ISAKMP/Oakley and a Linux kernel module providing an IPSec framework.

Craig Metz then gave a short slide presentation on NRL's IPSec implementation. Conference attendees should note a change in slide 7 on page 120 of the FREENIX proceedings: it now supports FreeBSD and OpenBSD.

Keromytis opened with a question about deployment. People went through lots of trouble to get IPSec standardized. What are the likely main problems in deployment and use of IPSec? Metz answered that getting the code into the hands of potential users is the hardest part. IPSec does not have to be the greatest, but it has to be in the hands of the users. IPSec does not equal VPN; IPSec can do more and solves real-world problems. Ioannidis commented that the problem with IPSec is that some people want perfection before releasing code. If only three people have IPSec, it is not too useful. This is just like the fax: you need a pool of users before IPSec becomes useful. Key management is also a hard problem.

The next question involved patches. Are patches accepted from people in the US? Daniel replied that you can whine on the bug mailing list, but you cannot say what line the bug is on or what the bug is. Linux FreeS/WAN will not take patches from US citizens. Ts'o explained that MIT develops code in the US, but does not give permission to export. When Ts'o receives a patch from Italy, he is not able to tell if it came from a legal export license. Besides, no one would commit such a "violent, despicable act."

Metz was asked why the government is interested in IPSec. Metz answered that many people in the government want the ability to buy IPSec off the shelf. A lot of the custom-built stuff for the government leaves something to be desired. Metz further explained, "If we can help get IPSec on the street, the government can get higher quality network security software for a lower cost."

Ts'o said that Microsoft NT 5.0 is shipping with IPSec. Microsoft wants IPSec more than VPNs. Interoperability with the rest of the world will be interesting. Microsoft has a lot of good people; UNIX people should hurry up.

An audience member asked about the following situation: suppose a packet requires encryption to be sent over a link. Is there a defined ICMP packet that says "oops, can't get encryption on this link"? What is the kernel supposed to return when this happens? Daniel reported that this is not properly defined yet. Metz expanded that there is a slight flame war about this now: some believe such a mechanism would allow an active attacker to discover encryption policies. Should it exist? The answer is likely to be "maybe."

Ts'o responded, "Think SYN flood. Renegotiating allows for denial of service. This is not as simple as you might think." Daniel substantiated this with figures for bulk encryption: when you deal with PKCS and elliptic curves, encryption can bring a 500MHz Alpha to its knees. It could take five minutes! Metz mentioned a hardware solution for operations such as modular exponentiation.

Dummynet and Forward Error Correction

Luigi Rizzo, Dip. di Ingegneria dell'Informazione, Università di Pisa

Summary by Jason Peel

Luigi Rizzo took the floor and immediately asked the audience: "how [may we] study the behavior of a protocol or an application in a real network?" His answer? A flexible and cost-effective evaluation tool dubbed "dummynet," designed to help developers study the behavior of software on networks.

In the scrutiny of a particular protocol or application, simulation is often not feasible: it would require building a simulated model of a system whose features may not even be known. Research on an actual network can be plagued by problems as well, perhaps because the proper equipment is not available or because of difficulties in configuration. The solution presented in dummynet combines the advantages of both model simulation and actual network test beds.

With a real, physical network as a factor in the communication of multiple processes, traffic can be affected by propagation delays, queuing delays (due to limited-capacity network channels), packet losses (caused by bounded-size queues and possibly noise), and reordering (because of multiple paths between hosts). These phenomena are replicated in dummynet by passing packets coming into or going out of the protocol stack through size-, rate-, and delay-configurable queues that simulate network latency, dropped packets, and packet reordering. Dummynet has been implemented as an extension of the ipfw firewall code so as to take advantage of its packet-filtering capabilities and, as such, allows configuration with which the developer may already be acquainted.
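The queue model just described can be sketched in a few lines. This is an illustrative toy in Python, not dummynet's actual code (which is C living inside the FreeBSD kernel alongside ipfw), and all the names and parameters here are invented for the sketch.

```python
import random
from collections import deque

class Pipe:
    """Toy model of a dummynet-style pipe: a bounded queue drained at a fixed
    bandwidth, followed by a fixed propagation delay and random packet loss.
    Illustrative only -- not the real kernel implementation."""

    def __init__(self, bw_bytes_per_tick, delay_ticks, queue_len, loss_rate, seed=0):
        self.bw = bw_bytes_per_tick
        self.delay = delay_ticks
        self.queue = deque()            # bounded queue -> queuing delay + tail drops
        self.queue_len = queue_len
        self.loss = loss_rate
        self.in_flight = []             # (deliver_at_tick, packet) pairs
        self.now = 0
        self.rng = random.Random(seed)

    def send(self, packet):
        if len(self.queue) >= self.queue_len:
            return False                # tail drop: queue overflow
        self.queue.append(packet)
        return True

    def tick(self):
        """Advance one clock tick; return the packets delivered this tick."""
        self.now += 1
        budget = self.bw                # bytes the link can carry this tick
        while self.queue and len(self.queue[0]) <= budget:
            pkt = self.queue.popleft()
            budget -= len(pkt)
            if self.rng.random() >= self.loss:          # survives random loss
                self.in_flight.append((self.now + self.delay, pkt))
        delivered = [p for t, p in self.in_flight if t <= self.now]
        self.in_flight = [(t, p) for t, p in self.in_flight if t > self.now]
        return delivered

# Push 20 packets through a slow, lossy, high-latency pipe.
pipe = Pipe(bw_bytes_per_tick=1500, delay_ticks=5, queue_len=10, loss_rate=0.1)
sent = sum(pipe.send(b"x" * 1000) for _ in range(20))
got = []
for _ in range(40):
    got.extend(pipe.tick())
print(f"queued {sent} of 20, delivered {len(got)}")
```

Because the bounded queue, drain rate, delay, and loss probability are independent knobs, the same object can model a congested WAN link, a clean LAN, or a noisy wireless hop, which is the flexibility the dummynet approach offers.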

The other tool Rizzo presented is an implementation of a particular class of algorithm known as an erasure code. Erasure codes such as his are used in a technique called Forward Error Correction (FEC) to eliminate the need for retransmissions caused by losses in transmission. In certain situations, particularly asymmetric communication channels or multicast applications with a large number of receivers, FEC can be used to encode data redundantly, such that the source data can be successfully reconstructed even if some packets are lost. As useful as this may seem, however, FEC has only rarely been implemented in network protocols because of the perceived high computational cost of encoding and decoding. With his implementation, Rizzo demonstrated how FEC can be exploited on commonly available hardware without a significant performance hit.

To develop this linear-algebra erasure code, known technically as a "Vandermonde" code, Rizzo faced several obstacles. First, the implementation of such a code requires exact arithmetic; this was addressed by splitting packets into smaller units. Second, operand expansion results in a large number of bits; by performing computations in a finite ("Galois") field, this too was overcome. Lastly, a systematic code, one in which the encoded packets include a verbatim copy of the source packets, may at times be desired so that no decoding effort is necessary in the absence of errors. By using various algebraic manipulations, Rizzo was able to obtain a systematic code.
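As a toy illustration of the systematic-code idea (far simpler than Rizzo's Vandermonde codes, which work over a Galois field and tolerate many simultaneous losses), a single XOR parity block already lets a receiver rebuild any one lost packet. All names here are invented for the sketch.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def fec_encode(blocks):
    """Systematic encode: the k source blocks verbatim plus one XOR parity
    block, so any single lost block can be reconstructed."""
    return list(blocks) + [xor_blocks(blocks)]

def fec_decode(received, k):
    """received: (index, block) pairs for any k of the k+1 encoded blocks
    (index k is the parity). Returns the k source blocks."""
    blocks = dict(received)
    if all(i in blocks for i in range(k)):
        return [blocks[i] for i in range(k)]   # nothing lost: no decoding work
    missing = next(i for i in range(k) if i not in blocks)
    # XOR of the parity with the surviving source blocks yields the lost one.
    blocks[missing] = xor_blocks(list(blocks.values()))
    return [blocks[i] for i in range(k)]

source = [b"ABCD", b"EFGH", b"IJKL"]
encoded = fec_encode(source)
lost_one = [(i, b) for i, b in enumerate(encoded) if i != 1]  # block 1 lost in transit
print(fec_decode(lost_one, k=3))
```

The systematic property shows up in the fast path of `fec_decode`: when all source blocks arrive, they are returned verbatim with no arithmetic at all, which is exactly why Rizzo wanted a systematic construction.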

Nearing the close of his session, Rizzo used a FreeBSD-powered palmtop to demonstrate the ease with which dummynet can simulate various network scenarios. He then used this virtual test-bed network to show RMDP (a multicast file-transfer application making use of FEC) in action. The crowd was enthused, and Rizzo was let off the hook with only one question to answer: "Where can we get this?"

Arla - A Really Likable AFS-Client

Johan Danielsson, Parallelldatorcentrum KTH; Assar Westerlund, Swedish Institute of Computer Science

Summary by Gus Shaffer

Johan and Assar gave a very exciting presentation about their new, free, and portable AFS client, Arla, which is based on the Andrew File System version 3.

A major part of their presentation explained the difference in structure between Transarc's AFS and Arla. Arla exports most of its internals to a highly portable userland daemon, as opposed to Transarc's massive kernel module. The presenters admitted that this change did bring up some performance issues, but porting became much faster: they already support six platforms, with four more on the way.

Johan and Assar also mentioned that students at the University of Michigan's Center for Information Technology Integration (CITI) <> have incorporated disconnected-client modifications originally written for AFS, and they hope to eventually merge this code into the main Arla source tree.

The most exciting announcement of the presentation concerned the other half of the client-server architecture: the developers have an almost-working fileserver! They presently need to add the database servers (volume server and protection server) to have a free, fully functional AFS server.

The presentation drew questions from such noted people as CITI's Jim Rees <> and produced tongue-in-cheek comments along the lines of, "Production use means someone bitches wildly if it doesn't work."

ISC DHCP Distribution

Ted Lemon, Internet Software Consortium

Summary by Branson Matheson

Ted Lemon gave a good talk about his Dynamic Host Configuration Protocol implementation. About 50 people attended the discussion. He described the benefits of a DHCP server, which include giving users plug-and-play ability, making address assignment easier for network and system administrators, and preventing address conflicts. He also discussed potential improvements for version 3, including authentication, dynamic DNS, and a failover protocol. Lemon also mentioned that his implementation is an ISC project and that ISC is assisting with standards work and some financing. The question-and-answer session was full of good comments and discussion, with quite a bit of talk about the different ways people have implemented a dynamic DNS setup and about how to identify the client requesting an IP address.

Heimdal - An independent implementation of Kerberos 5

Johan Danielsson, Parallelldatorcentrum KTH; Assar Westerlund, Swedish Institute of Computer Science

Summary by Branson Matheson

Johan Danielsson and Assar Westerlund travelled from Sweden to discuss their free implementation of Kerberos 5. Heimdal, named after the watchman on the bridge to Asgard, was developed independently and is available internationally. They described Kerberos and Heimdal in general and then discussed some of the additions and improvements they had made, including 3DES, secure X11, IPv6, and firewall support. They also discussed some of the problems they had with the implementation and how they solved them, including how to get secure packets across a firewall. During the question-and-answer session, there was lots of discussion of S/Key and OPIE, encryption, and proxy authentication. Although the language barrier sometimes seemed to be a factor, the discussion went well and was well received.

Samba as WNT Domain Controller

John Blair, University of Alabama

Summary by Steve Hanson

John Blair is the primary author of the excellent documentation for the freeware Samba package. I find this interesting because the Samba documentation is not only a fine example of how good the documentation for a freeware package can be, but is also one of the best sources for information on how Microsoft's Win95/NT file sharing works. Unfortunately, Blair was so busy working on documentation for the Samba project and his new book on Samba (Samba: Integrating UNIX and Windows) that he did not complete a paper for the conference. Therefore the talk didn't relate to a paper, but was a more general discussion of Samba's progress toward having domain controller capabilities, as well as Samba development in general. This was a very interesting talk, more as a general viewpoint on the motivations of the Samba project than anything else.

Due to the extra time created by the missing first presentation, Blair chose to hold a question-and-answer period before beginning his talk. Many questions were asked, primarily about trust relationships with NT servers and using Samba as a domain controller. Blair responded that although the code for domain control exists in the current Samba release, it isn't considered production quality; many sites are successfully using the current code, however. In addition to some details on how to enable the domain-controller code in Samba, Blair presented some information on the difficulty of creating interoperability between the NT world and UNIX. This includes mapping 32-bit NT IDs to UNIX IDs, having to develop by reverse engineering, bugs in NT that cause unpredictable behavior, and so on. The inevitable discussion of NT security was interwoven into the talk, particularly in regard to new potential security issues and possible exploits that were discovered while reverse engineering the domain controller protocols.

The balance of Blair's talk was devoted to the Samba project in general. Several issues about the need for Samba were raised. Although there are a number of commercial software packages allowing SMB file sharing from UNIX, Samba holds a unique place in the world because the code is freely available and well documented. Samba code and documentation are the best window into determining how Microsoft networking actually works. In some cases bugs and potential exploits have been discovered, some of which Microsoft has fixed. Public scrutiny of the NT world is possible only through projects such as this. It seems unlikely that corporate America will learn to say no to the Windows juggernaut, but at least this sort of review stands some chance of opening the Windows world of networking to review.

Samba is also important because the UNIX platforms on which it runs are more scalable than the current NT platforms. The release dates for NT 5.0 and Merced processors seem to be continually receding over the horizon, so interoperable UNIX platforms at least offer a scalable, stable place to host services in the meantime. The work being done here also raises the interesting prospect of being able to administer the NT world from the UNIX platform.

Further information on Samba is available at <>.

Using FreeBSD for a Console Server

Branson Matheson, Ferguson Enterprises, Inc.

Summary by Branson Matheson

Branson Matheson gave a discussion of his implementation of a console server using FreeBSD. He described the hardware and software requirements and the problems associated with the installation of a console server. The problems included security concerns, layout, and planning. He went into specifics over the implementation of the software and hardware. There was some good discussion about other implementations during the question-and-answer session. Security seemed to be the central theme of the questions: maintaining the security of the consoles while giving the system administrator the necessary privileges and functionality.


Reconfigured Self as Basis for Humanistic Intelligence

Steve Mann, University of Toronto

Summary by Jim Simpson

As we spin and hurtle ourselves faster toward the future, we find the tools helping us there can now be used against us in a myriad of ways. Steve Mann offered a very sharp, pointed, and humorous presentation about taking technology back, through Humanistic Intelligence.

Humanistic Intelligence is the interaction between a human and a computer, and encapsulates the two into a single functioning unit. The ideal is for the computer to augment the reality of the human working with it. It sounds like that goal is well on the way; Mann typed most of his thesis while standing in line, noting his primary task was to stand and wait in a line, but that WearComp, his implementation of Humanistic Intelligence, allowed for a secondary task where he could be creative. WearComp consists of a host computer on the person's body, a pair of customized glasses and connectivity; specifics about WearComp are at <>. Note that WearComp runs on an OS, and not a virus. Despite large evolutionary strides, Mann commented about the setup, "The problem with wires is you get tangled up in them."

A few of the more interesting scenarios and uses of WearComp involve visual mediation. Say you don't wish to see a particular ad: you have your image processor map a texture over it. Or imagine you were about to be mugged on the street: you could simply beam back an image of the perpetrator. Finally, and perhaps most important, people could better understand each other. Mann illustrated this with a story about being late. Whoever is waiting could simply see the world through your eyes and, instead of being suspicious or upset, know the person is being genuine with the explanation.

We were then treated to an excerpt from a video documentary Mann made, called Shooting Back. It demonstrated the modern double standard we're held to: as Mann asked about surveillance cameras in everyday stores, he was bounced from person to person. Mann turned the tables, and when those persons were asked how they felt about being videotaped, they had the same reaction that prompted Mann's deployment of a video camera. What's more interesting is that while pretending to begin recording the other party, he had been surreptitiously recording everything with WearComp all along.

Toward the end of the session, the chair began to check his watch nervously; it seemed almost awkward when the chair had to tell Mann the time, because Mann was well aware of it already: the time was happily ticking away in the form of an xclock on the other side of his glasses.

Because this is a working product, Mann answered a few questions about WearComp and how it has fared. What operating system do you use? RedHat Linux 2.0, though Mann has written custom software such as DSP code. Has the system ever cut out? Yes, there are dark crashes. The most common cause is the battery dying, but there is a 30-minute warning system; you don't want to be in mediated reality, walk across the street, and have the system cut out the moment a truck is barreling toward you. Can you show us what you're seeing? No, the video output is in a special format that won't hook up to a standard VGA projector.

Free Stuff

Opinion by Peter H. Salus

[Editor's Note: While Peter is director pro tem of the Tcl/Tk Consortium, he is not an employee of any of the companies mentioned in this report.]

The Association held its June meeting in hot, steamy New Orleans. I emerged from the hotel into the humid heat only twice in four days. Inside the hotel it was cooler and there were lots of folks to talk to.

However, for the first time in a dozen years, I hardly attended any mainline technical papers: I went to the parallel FREENIX track. I learned about NetBSD, FreeBSD, Samba, and OpenBSD. I went to the "Historical UNIX" session (it's 20 years since Dennis Ritchie and Steve Johnson ported UNIX to the Interdata 8/32 and Richard Miller and Juris Reinfelds ported it to the Interdata 7/32) and to the 90-minute history BOF that extended to nearly three hours. And I was present at the awards, the Tcl BOF, Linus Torvalds's talk, and James Randi's entertaining, but largely irrelevant, keynote.

There was also a session on Eric Raymond's "The Cathedral and the Bazaar," which was largely a love-in conducted by Eric and Kirk McKusick until the very last minutes, which were occupied by a lengthy flame from Rob Kolstad. More heat was radiated than light was shed.

If you know me, you will see the connecting motif: ever since I saw UNIX in the late 1970s, I have been interested in the way systems develop: in Raymond's terms, I'm much more a bazaar person than a cathedral architect. (And remember that although treasures can be found in a bazaar, Microsoft products are misshapen in a cathedral in Washington.)

UNIX was the first operating system to be successfully ported. And it was ported to two different machines (the Interdata 7/32 and 8/32; later Perkin-Elmer machines) virtually simultaneously and independently, by teams half the planet apart. Not only that, but V7 contained awk, lint, make, uucp (from Bell Labs); Bill Joy's changes in data block size; Ed Gould's movement of the buffers out of kernel space; John Lions's (New South Wales!) new procedure for directory pathnames; Bruce Borden's symorder program; etc., etc. A bazaar stretching from Vienna through the US to Australia. I have outlined the contributions to the development of UNIX in several places, but the important thing is to recognize the bazaarlike activity in the 1970s and 1980s. With Linux, we progress into the 1990s.

NetBSD, OpenBSD, FreeBSD, BSDI, SVR5, and the various Linuxes are the result of this bazaar, with AT&T, Bell Labs, Western Electric, UNIX International, X/Open, OSF, and (now) the Open Group flailing about to get the daemons back in the licensing box. No hope.

John Ousterhout (who received the annual Software Tools User Group Award) nodded at both open development and Eric Raymond at the Tcl BOF, saying that he was slightly toward the cathedral side of the middle. By this he meant that he welcomed extensions and applications to Tcl/Tk, but that he reserved the right to decide what was included in any "official" distribution. Because Ousterhout is an intelligence whom I would entrust with such a role, I foresee no problem. But what if Tcl were usurped by an evil empire?

Cygnus, RedHat, Walnut Creek, Scriptics, etc., are examples that money can be made from "free" source. (This is blasphemy to "pure" FSFers, who think that the taint of, say, MetroX in RedHat's Linux distribution poisons all of RedHat. They're extremists.) Integrating free software with solid, useful proprietary software is a good thing: it tests the proprietary software among the wider user community, and it spreads the free software to the users of the proprietary stuff.

This aside, I thought the two papers on IPsec (by Angelos Keromytis and Craig Metz) were quite good. Thorpe on NetBSD and de Raadt on OpenBSD were quite lucid, as was Matheson on FreeBSD. Blair on Samba was as good as I had hoped. Because the other author in the session was a no-show, we had an open Q&A and discussion for nearly 90 minutes.

Microsoft may control 90% of the world's desktops, but all the important developments in OSes are clearly taking place in the bazaar.

First posted: 27th October 1998 jr
Last changed: 27th October 1998 jr