USENIX 1998 Annual Technical Conference
NEW ORLEANS, LOUISIANA
June 15-19, 1998
KEYNOTE ADDRESS
Science and the Chimera
James "The Amazing" Randi
Summary by Peter Collinson
James Randi's keynote talk kicked the 1998 Technical Conference into
life. His talk zipped by, and we walked out at the end having seen
several examples of his work, his magic, and his mission to expose the
quacks and charlatans who often employ technology to fool us, the
gullible public.
James's roving, debunking eye has been aimed in many directions, from
harmless mentalists to the somewhat more serious faith healers in both
the US and the Philippines whose activities not only net them huge sums
of money, but also give false hope to people whom medical care cannot
cure. He looked at homeopathic cures which are now sold in many
drugstores. These cures are simply water because the original chemical
has been diluted 10^2500 times.
His talk ended on some serious notes:
- We need to teach our children to think critically so they stop being fooled.
- We need to stand up and expose fraudulent use of science and technology.
His Web site is <https://www.randi.org> and is well worth a visit.
REFEREED TRACK
Session: Performance 1
Summary by Tom M. Kroeger
Scalable Kernel Performance for Internet Servers Under Realistic Loads
Gaurav Banga, Rice University; Jeffrey C. Mogul, Digital Equipment
Corp., Western Research Lab
This work, presented by Gaurav Banga, earned both the Best Paper and
Best Student Paper awards at the conference. It examined an
inconsistency between the observed performance of event-driven servers
under standard benchmarks, like SPECWeb96, and real workloads. Banga
and Mogul observed that in a WAN environment, which is characterized by
inherent delays, a server is forced to manage a large number of
connections simultaneously. Because commonly used benchmarks lack slow
client connections, they failed to test system behavior under such
conditions. From this observation they developed a new WAN benchmark
that tried to model slow connections.
They then profiled the Squid proxy server under both a standard
benchmark and their new benchmark. The standard benchmark showed no
specific procedure in the system to be a bottleneck, but in the WAN
benchmark the kernel procedures for select and file descriptor
allocation (ufalloc) accounted for 40% of the CPU time. With
this information Banga explained how they examined the implementations
of select and ufalloc.
The select system call in Digital UNIX (and in fact in most
UNIX variants) was designed at a time when several hundred connections
as an input argument would have seemed extreme. Banga explained how the
current implementations scale poorly because of linear searches,
layered demultiplexing, the linear system call interface, and the
pressure that these functions put on the CPU data cache. The key
insight here is that select wastes a significant portion of
time rediscovering information that previous stages of protocol
processing had available. Using hints to transfer this state
information, they were able to prune the scans that select
needed to perform.
Next Banga explained how they examined ufalloc. This procedure
is called every time a new file descriptor is allocated. Again a linear
search was at the heart of the problem. UNIX semantics state that
ufalloc must provide the lowest free descriptor; this prevents
the use of a free-list. To solve this problem, the authors
reimplemented ufalloc, adding a two-level tree of bitmaps to
indicate available descriptors. This new implementation changed
ufalloc's complexity from O(n) to O(log n). It also provided
better cache behavior, requiring only two memory reads versus a
sequential scan that would thrash the entire data cache. Lastly, it
provided better locking behavior because of a shorter critical section.
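To make the idea concrete, here is a rough user-level sketch in C++ of a
two-level bitmap that returns the lowest free descriptor. It is not the
Digital UNIX code; all names and sizes are invented for illustration.

    #include <cstdint>
    #include <cstdio>

    // Sketch of a two-level bitmap for "lowest free descriptor" lookups,
    // in the spirit of the ufalloc fix described above.  Bit set == free.
    class FdBitmap {
        static const int kLeafBits = 64;          // descriptors per leaf word
        static const int kLeaves   = 64;          // 64 * 64 = 4096 descriptors
        uint64_t summary;                         // bit i set => leaf i has a free slot
        uint64_t leaf[kLeaves];

        static int lowest_set(uint64_t w) {       // index of lowest set bit (w != 0)
            int i = 0;
            while ((w & 1) == 0) { w >>= 1; ++i; }
            return i;
        }

    public:
        FdBitmap() : summary(~0ULL) {
            for (int i = 0; i < kLeaves; ++i) leaf[i] = ~0ULL;   // everything free
        }

        // Return the lowest free descriptor and mark it used, or -1 if none.
        int alloc() {
            if (summary == 0) return -1;
            int l = lowest_set(summary);          // first leaf with a free slot
            int b = lowest_set(leaf[l]);          // lowest free slot in that leaf
            leaf[l] &= ~(1ULL << b);
            if (leaf[l] == 0) summary &= ~(1ULL << l);
            return l * kLeafBits + b;
        }

        void release(int fd) {
            int l = fd / kLeafBits, b = fd % kLeafBits;
            leaf[l] |= 1ULL << b;
            summary |= 1ULL << l;
        }
    };

    int main() {
        FdBitmap fds;
        int a = fds.alloc(), b = fds.alloc(), c = fds.alloc();   // 0, 1, 2
        fds.release(b);
        printf("%d %d %d reuse=%d\n", a, b, c, fds.alloc());     // reuse prints 1
        return 0;
    }

Each allocation touches one summary word and one leaf word, which is the
"two memory reads" behavior described above.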
Banga explained how they set up two testbeds to evaluate the effect of
these changes. First, using the WAN benchmark on both the Squid proxy
and the thttpd Web server, they showed that scalability with respect to
connection rate and connection count was significantly improved. They
then tested their changes under a live load, i.e., on Digital's Palo
Alto Web proxies. Again, they found significant improvements in
performance from the modified systems.
Tribeca: A System for Managing Large Databases of Network Traffic
Mark Sullivan, Juno Online Services; Andrew Heybey, Niksun, Inc.
Mark Sullivan presented the Tribeca system for network traffic
analysis, which the authors developed after noting how the use of ad
hoc analysis programs resulted in redundant efforts. They have
developed a general query system for an environment where network
traffic data streams by at rates of up to 155 megabits per second.
They observed that a typical relational database system would not be
effective for network analysis for the following reasons. Both the data
and storage media are stream oriented. Relational database systems do
not normally handle tape data well, and tape data are commonly used for
network traffic analysis. The operators needed are more like those in
temporal and sequential databases. Traffic analysis commonly requires
running several queries during a single pass. Lastly, relational
database systems rarely consider the memory capacity of the system on
which they are running. The Tribeca system addresses all of these
issues, but also differs from conventional relational databases in that
it does not support random access to data, transactional updates,
conventional indices, or traditional joins. Tribeca takes its source
data from either a live network adapter or tape data.
The query language in Tribeca is based on a data description language.
The different protocols are expressed as different data types; this
language then allows the user to create new types by extending
compiled-in types. This language also provides support for inheritance,
arbitrary offsets, and bit fields. Each query has one source stream and
one or more result streams. To manipulate these streams, Tribeca
provides three basic operators: qualification, projection, and
aggregation. Additionally, the query language provides for stream
demultiplexing and remultiplexing. Finally, the language also provides
a method for operating on windows over the stream.
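As a toy illustration of the three operators (not Tribeca's query
language, whose types come from its data description language), the
following C++ fragment qualifies a packet stream, projects out two
fields, and aggregates bytes per source; the record layout is invented.

    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <vector>

    // Toy illustration of qualification, projection, and aggregation over a
    // packet stream.  Record fields are invented for this example.
    struct Packet { uint32_t src; uint16_t dport; uint32_t bytes; };

    int main() {
        std::vector<Packet> stream = {
            {0x0a000001, 80, 1500}, {0x0a000002, 25, 400},
            {0x0a000001, 80,  600}, {0x0a000003, 80,  900},
        };

        std::map<uint32_t, uint64_t> bytes_per_src;   // aggregation state
        for (const Packet& p : stream) {
            if (p.dport != 80) continue;              // qualification: HTTP only
            // projection: keep only (src, bytes); aggregation: sum bytes per src
            bytes_per_src[p.src] += p.bytes;
        }
        for (auto& kv : bytes_per_src)
            printf("src %08x: %llu bytes\n",
                   (unsigned)kv.first, (unsigned long long)kv.second);
        return 0;
    }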
Tribeca's implementation shares several similarities with traditional
relational database systems. Queries are compiled into directed acyclic
graphs. These graphs are then optimized to improve performance. The
basic data management for Tribeca makes use of existing operating
system support for sequential I/O, and because data are not updated, no
support for concurrency control is needed. Special attention was paid
in the implementation to minimize internal data copying. Additionally,
the optimizer works to ensure that a query's intermediate state fits
into the available memory.
The authors presented some basic tests to examine the overhead
incurred. They compared the overhead for a basic query with that of the
standard UNIX program dd: dd used 68% of the CPU on a
Sparc 10, while Tribeca used 70%-75% of the CPU time. Lastly, they
compared the performance of Tribeca to that of programs written
directly to execute a specific query. The results showed that Tribeca
incurs between 1% and 7% overhead. The authors concluded by noting that
the increased flexibility and convenience provided by Tribeca are well
worth the minimal overhead introduced.
Transparent Result Caching
Amin Vahdat, University of California, Berkeley; Thomas Anderson,
University of Washington
Amin Vahdat presented a system (TREC) developed by the authors to track
the outputs of a process's execution based on its inputs. Using this
information, TREC provides a framework for reusing previously
generated outputs and for observing process lineage and file dependencies.
Implemented through the use of the /proc file system under
Solaris, TREC intercepts the open, fork,
fork1, creat, unlink, exec,
execve, rename, and exit system calls. By
catching these calls TREC is able to record a process's input files,
child processes, environment variables, command line parameters, and
output files.
After explaining the basic architecture, Vahdat addressed the
limitations of TREC. In order to address concerns about the performance
overhead from intercepting system calls, the authors examined the added
overhead for a test program that simply called open and
close in a loop and two typical application executions. The
test program saw 54% overhead but the typical applications saw only 13%
and 3%. The authors observed that the overhead is directly proportional
to the system call rate and noted that a kernel-level implementation
would significantly reduce this overhead.
The authors also noted several requirements for TREC to produce correct
results. The program itself must be deterministic and repeatable, it
cannot rely on user input, and interaction with environment variables
must be reproducible. File contents must be static; files such as
/dev/rmt0 could produce incorrect results. Files must
be changed only locally; for example, NFS-mounted files could be modified on
a remote machine without TREC being notified. Processes that base their
results on communication with remote servers cannot, in general, be
correctly tracked. Lastly, a program must complete successfully.
After detailing the limitations of this system, the authors provided
three examples of applications that use the TREC framework: unmake,
transparent make, and a Web cache that enables server and proxy caching
of dynamically generated Web content.
Unmake provides a facility for users to query the TREC framework to
determine how specific output files were created, as well as to answer
questions about process lineage. Transparent make provides an alternative to make
that automatically determines file dependencies. Instead of providing a
possibly complicated Makefile, the user provides a simple shell script
that performs the complete build sequence. Once transparent make has
observed this shell script and each program's input and resulting
outputs, transparent make can be used for subsequent builds to execute
only those commands for which the inputs have changed in some manner.
This system has the following advantages: user errors in dependency
specification are avoided, dependencies are updated as the input is
changed (e.g., a header file is added to a program being compiled), and
the users are saved from learning the Makefile specification language.
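A minimal sketch of the rebuild rule, assuming the dependencies have
already been recorded by tracing; here the check is a simple
modification-time comparison, and the structures and names are invented
rather than TREC's actual format.

    #include <cstdio>
    #include <cstdlib>
    #include <string>
    #include <sys/stat.h>
    #include <vector>

    // A recorded command is reexecuted only if one of its recorded input
    // files has changed since the recorded outputs were produced.
    struct TracedCommand {
        std::string cmd;
        std::vector<std::string> inputs;
        std::vector<std::string> outputs;
    };

    static time_t mtime_of(const std::string& path) {
        struct stat st;
        return stat(path.c_str(), &st) == 0 ? st.st_mtime : 0;
    }

    static bool needs_rerun(const TracedCommand& c) {
        time_t oldest_output = 0;
        for (const auto& o : c.outputs) {
            time_t t = mtime_of(o);
            if (t == 0) return true;                       // missing output
            if (oldest_output == 0 || t < oldest_output) oldest_output = t;
        }
        for (const auto& i : c.inputs)
            if (mtime_of(i) > oldest_output) return true;  // input newer than output
        return false;
    }

    int main() {
        TracedCommand cc = {"cc -c foo.c", {"foo.c", "foo.h"}, {"foo.o"}};
        if (needs_rerun(cc)) {
            printf("rerunning: %s\n", cc.cmd.c_str());
            // system(cc.cmd.c_str());   // would reexecute the recorded command
        }
        return 0;
    }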
Transparent make provides two current variants: a passive version that
will update output files when executed and an active version that
registers a callback with the TREC framework. In the active version,
when TREC observes a change to an input file for which a callback was
registered, transparent make prompts the user to reexecute the
registered module.
The third example of the uses for the TREC framework is a modification
to the Apache Web server to cache the results of cgi scripts. The
authors modified an Apache server to store copies of the results from a
cgi program's execution indexed by the program's parameters. When the
cgi program is called, the server first checks for a pregenerated
result for the requested program and parameters. If one exists, it
responds with the contents of that file instead of executing the cgi
script. To invalidate these dynamic cache entries, the TREC framework
is then used to profile the execution of each cgi program. When an
input to this program is observed to change, TREC invokes a registered
callback similar to those used by the active version of transparent make.
This callback invalidates the cached result. Comparing the two servers
(caching versus forking) with a basic cgi script, the authors observed
a 39% improvement in average response time.
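The server-side check can be pictured with a small sketch like the
following, where the cache key is the script name plus its parameters;
the file naming, the stand-in "run the CGI" step, and the invalidation
hook are all simplifications of what the authors describe.

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Look up a cached result keyed by (script, query string) and serve it if
    // present; otherwise "run" the script and store its output.
    static std::string cache_path(const std::string& script, const std::string& query) {
        std::string key = script + "_" + query;
        for (char& c : key)
            if (c == '/' || c == '?' || c == '=' || c == '&') c = '_';
        return "/tmp/cgicache_" + key;
    }

    static std::string serve(const std::string& script, const std::string& query) {
        std::ifstream cached(cache_path(script, query).c_str());
        if (cached) {                              // cache hit: no fork/exec of the script
            std::ostringstream body;
            body << cached.rdbuf();
            return body.str();
        }
        std::string result = "<html>generated by " + script + "</html>";  // stand-in for running the CGI
        std::ofstream out(cache_path(script, query).c_str());
        out << result;                             // store the result; a TREC callback would later remove it
        return result;
    }

    int main() {
        std::cout << serve("/cgi-bin/report", "user=alice") << "\n";  // miss, then cached
        std::cout << serve("/cgi-bin/report", "user=alice") << "\n";  // served from the cache
        return 0;
    }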
Session: Extensibility
Summary by Karen Reid
In her introduction to this session, session chair Terri Watson Rashid
noted that the three papers represent a wide range of how extensibility
can be used. The first paper discusses a way to extend operating
systems; the second describes an extension to the SPIN operating
system; and the final paper presents methods of extending applications.
SLIC: An Extensibility System for Commodity Operating Systems
Douglas P. Ghormley, University of California, Berkeley; David Petrou,
Carnegie-Mellon University; Steven H. Rodrigues, Network Appliance,
Inc.; Thomas E. Anderson, University of Washington
The extension mechanism described by David Petrou makes it possible to
add functionality to commodity operating systems without changing the
operating system itself. These extensions allow system administrators to
easily add trusted code to the kernel to fix security flaws, take
advantage of the latest research, or better support demanding
applications.
SLIC is built using the technique of interposition: capturing system
events and passing them to the extensions. The system has two
components: the dispatchers, which catch events, call the extensions, and
provide the API for the extension framework; and the extensions
themselves, which implement the new functionality.
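A toy rendering of the interposition idea, with invented event and API
names (SLIC itself interposes on the real kernel interfaces):

    #include <cstdio>
    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    // A dispatcher sits on the event path, hands each event to registered
    // extensions in order, and only then calls the original handler.
    struct SysEvent { std::string name; int uid; std::string arg; };

    enum Verdict { PASS, DENY };
    using Extension = std::function<Verdict(SysEvent&)>;

    class Dispatcher {
        std::map<std::string, std::vector<Extension>> exts;
    public:
        void interpose(const std::string& event, Extension e) { exts[event].push_back(e); }
        bool dispatch(SysEvent ev) {
            for (auto& e : exts[ev.name])
                if (e(ev) == DENY) return false;     // an extension vetoed the event
            printf("original handler: %s(%s)\n", ev.name.c_str(), ev.arg.c_str());
            return true;
        }
    };

    int main() {
        Dispatcher d;
        // A restricted-execution extension: unprivileged users may not unlink
        // files under /etc (loosely analogous to the syscall-filter example).
        d.interpose("unlink", [](SysEvent& ev) {
            return (ev.uid != 0 && ev.arg.compare(0, 5, "/etc/") == 0) ? DENY : PASS;
        });
        printf("unlink /etc/passwd: %s\n",
               d.dispatch({"unlink", 1000, "/etc/passwd"}) ? "allowed" : "denied");
        printf("unlink /tmp/scratch: %s\n",
               d.dispatch({"unlink", 1000, "/tmp/scratch"}) ? "allowed" : "denied");
        return 0;
    }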
Petrou described three examples of extensions implemented using SLIC.
The first fixes a security flaw in the Solaris admintool. The second
extension adds encryption to the NFS file system. The third one
implements a restricted execution environment by filtering system
calls.
Petrou and his co-author/slide-turner, Steven Rodrigues, pulled out an
extra slide to answer the first question, about how to manage the order
in which interpositions are applied. Unfortunately, the answer was
that although the syntax for composing interpositions is not a problem,
determining that a series of interpositions is semantically correct is
a difficult, and as yet unsolved, problem.
The second question confirmed that interpositions can be applied only
to interfaces at the kernel boundary. Petrou noted that SLIC could not
be used to change the paging algorithm for an application.
When asked whether the extensions would primarily be useful to
prototype kernel extensions or for low-volume extensions, Petrou
claimed that the examples showed that extensions could be used to solve
a wide range of problems, not just those in a research environment.
David Korn asked if interpositions could store data on a per-process
basis. Petrou replied that the extension can store per-thread state.
The remaining questions concerned the portability of the interposition
system. Petrou argued that the extensions should be quite portable, but
that the dispatchers needed to be ported to other architectures. They
are currently working on porting the dispatchers to Linux.
A Transactional Memory Service in an Extensible Operating System
Yasushi Saito and Brian Bershad, University of Washington
Yasushi Saito presented Rhino, an extension for the SPIN operating
system that implements a transactional memory service. A transactional
memory service uses memory-mapped files and loads and stores to
implement transactions that are atomic, isolated, and durable (ACID).
Transactional memory can be used to support object-oriented databases
and persistent low-level data structures such as filesystem metadata.
Saito contrasted his work with user-level implementations of
transactional memory by highlighting several problems with the
user-level approach. First, context switches for the signal handler and
mprotect() incur too much overhead. Also, the user-level
implementation requires fast interprocess communication. Finally,
buffering problems occur because the user-level process that is mapping
database files into memory has no control over the paging. Double
buffering occurs when the memory system decides to reclaim pages and
swaps out database pages instead of writing them back to the database
file.
The approach taken by the authors to solve these problems is to do
everything in the kernel. The SPIN extension gives them fast access
detection through the kernel page fault handler and efficient
memory-mapped buffers through cooperation with the kernel memory
manager.
Three options for buffer management were discussed. The first relies on
the user to notify the extension (by calling trans_setrange())
about a region that will be modified. This method is efficient for
small transactions, but doesn't scale well when the number of
setrange() calls is high. The second option is to log the
entire page when at least one byte is modified. This approach works
well for large transactions, but is costly for small transactions. The
third method computes and logs the diffs between the page and the
updates. Page diffing combines the advantages of the previous two
approaches, but incurs significant overhead.
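A sketch of the page-diffing option, assuming a pristine copy of the
page is saved when the transaction first touches it; the page size and
log format here are illustrative.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Compare the modified page against its pristine copy and log only the
    // byte ranges that changed.
    const size_t kPageSize = 4096;

    struct DiffEntry { size_t offset; std::vector<uint8_t> bytes; };

    std::vector<DiffEntry> diff_page(const uint8_t* before, const uint8_t* after) {
        std::vector<DiffEntry> log;
        size_t i = 0;
        while (i < kPageSize) {
            if (before[i] == after[i]) { ++i; continue; }
            size_t start = i;
            while (i < kPageSize && before[i] != after[i]) ++i;   // extent of the change
            log.push_back({start, std::vector<uint8_t>(after + start, after + i)});
        }
        return log;
    }

    int main() {
        std::vector<uint8_t> before(kPageSize, 0), after(kPageSize, 0);
        memcpy(&after[100], "hello", 5);                 // a small in-place update
        auto log = diff_page(before.data(), after.data());
        for (auto& d : log)
            printf("log %zu bytes at offset %zu\n", d.bytes.size(), d.offset);
        return 0;
    }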
Saito compared the performance of the SPIN-based transactional memory
service to one implemented at user-level on UNIX and to ObjectStore, a
database management system. The SPIN-based system consistently
outperformed the other two for the given workloads.
Terri Watson Rashid asked Saito to comment on his experiences
implementing kernel extensions on SPIN. Saito was reluctant to make any
strong comparisons between implementing the user-level UNIX
implementation of Rhino and the SPIN extension, but said that debugging
facilities for SPIN extensions made writing kernel-level code much
easier.
Dynamic C++ Classes
Gísli Hjálmtysson, AT&T Labs-Research; Robert Gray,
Dartmouth College
This work, presented by Gísli Hjálmtysson, was motivated
by the desire to allow "hot" updates of running software. In other
words, they wanted a system that allows users to insert or replace
components of a software system while it continues to run. This
technique can be applied to domains such as network processing, where
it is often highly undesirable to halt and restart programs.
The authors achieved their goal of updating running code by
implementing a library to support dynamic C++ classes. This approach
was chosen because C++ is widely used, high performance can be
maintained, and program abstractions can be preserved. Dynamic classes
allow for runtime updates at the class granularity. New versions of
existing classes can be installed, and new classes can be added.
However, they require that class interfaces be immutable.
One big question is how to dynamically link in new or updated classes.
Hjálmtysson proposes three different approaches to updating
objects: imposing a barrier, where no new objects may be created until
all objects of an older version have expired; migrating old objects to
their new version; and allowing multiple concurrent versions of each
class. The disadvantage of the barrier approach is that it is
equivalent in some ways to halting the program and restarting. The
migration approach is hard to automate efficiently, so the authors
chose the third approach of allowing concurrent versions of a class.
Dynamic classes have two parts: an abstract interface class and an
implementation class. An interface monitor, implemented as a class
proxy, screens messages that pass through dynamic class interfaces and
directs them to the correct version of the class. Two levels of
indirection are used: one to map to the implementation and the other to
map the methods within a version. This approach requires that all
public methods of a dynamic class be virtual.
Using the factory pattern, an object of a dynamic class is constructed
by calling the default constructor, which locates and loads the dynamic
shared library, calls the constructor for the correct version, and
stores a pointer to the version of the implementation class in the
proxy.
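A condensed C++ sketch of the proxy-and-factory arrangement; in the real
system the implementation comes from a dynamically loaded shared
library, whereas here two statically linked versions stand in for it.

    #include <cstdio>
    #include <memory>

    // Client code sees only the abstract interface; the proxy binds each new
    // object to whatever implementation version is current at construction time.
    struct Counter {                       // abstract interface (all methods virtual)
        virtual ~Counter() {}
        virtual void bump() = 0;
        virtual int value() const = 0;
    };

    struct CounterV1 : Counter {           // version 1 implementation
        int n = 0;
        void bump() { n += 1; }
        int value() const { return n; }
    };
    struct CounterV2 : Counter {           // version 2: counts by two
        int n = 0;
        void bump() { n += 2; }
        int value() const { return n; }
    };

    static int g_current_version = 1;      // updated when a new version is "loaded"

    struct CounterProxy : Counter {        // proxy forwards to the bound version
        std::unique_ptr<Counter> impl;
        CounterProxy() {                   // factory: pick the current version
            impl.reset(g_current_version == 1
                           ? static_cast<Counter*>(new CounterV1)
                           : static_cast<Counter*>(new CounterV2));
        }
        void bump() { impl->bump(); }
        int value() const { return impl->value(); }
    };

    int main() {
        CounterProxy a; a.bump();          // bound to version 1
        g_current_version = 2;             // "hot" update installed
        CounterProxy b; b.bump();          // new objects use version 2
        printf("a=%d b=%d\n", a.value(), b.value());   // prints a=1 b=2
        return 0;
    }

The second level of indirection described in the paper corresponds to
the ordinary virtual dispatch within the chosen implementation.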
Three different templates for proxies are defined. They differ in ease
of use, functionality, and performance. The high-functionality version
of the template allows multiple implementations of a dynamic class
interface as well as multiple active versions of an implementation. The
medium-functionality version allows multiple versions, but not multiple
implementations of a dynamic class. Both the medium- and
high-functionality versions implement methods to invalidate other
dynamic class versions. Finally, the low-functionality, but highest
performance, version of the template allows multiple concurrent
versions of a dynamic class, but old versions cannot be invalidated.
The flexibility of dynamic classes does not come without a cost. Each
instance requires space for three or four extra pointers. The method
invocation overhead is approximately doubled because of the extra
checks required and because some of these checks cannot be optimized
(by most compilers) because a method that can throw an exception cannot
be inlined.
Hjálmtysson used mobile agents as an example of how dynamic
classes could be used. Different versions of the agents for different
architectures can be downloaded so that agents can be instantiated on a
variety of platforms.
Hjálmtysson concluded by describing dynamic classes as a
lightweight mechanism to update running code that preserves type safety
and class abstractions. It compiles on SGI, Sun, and Windows 95/NT and
is available from AT&T by contacting the authors.
During the question period, the similarity between dynamic classes and
CORBA or ActiveX was noted. Hjálmtysson acknowledged the
similarity and claimed that dynamic classes have fewer fireworks around
them and are a more lightweight approach. Not surprisingly, other
questions focused on how a similar system might be written for Java,
whether the Java Virtual Machine (JVM) would have to be modified, and
how the JVM might be forced to unload classes. Hjálmtysson said
he believed that it may be possible to implement dynamic classes
without modifying the JVM, but that the class loader would need to be
modified, making it less portable.
Session: Commercial Applications
Summary by Brian Kurotsuchi
Each of the papers in this session dealt with low-level,
behind-the-scenes operating system internals, specifically the
filesystem and virtual memory subsystems.
Fast Consistency Checking for the Solaris File System
J. Kent Peacock, Ashvin Kamaraju, Sanjay Agrawal, Sun Microsystems
Computer Company
Kent Peacock presented his group's work on optimizing the
native Solaris UFS filesystem to improve performance while supporting
the semantics of NFS services. He explained that NFS semantics require
data to be committed to stable secondary storage before the NFS
transaction can be completed. This requirement unfortunately precludes
the use of filesystem block caches, which are generally used to improve
read/write performance. In order to overcome the synchronous write
requirement, they decided to use some type of fast NVRAM storage medium
to provide safe buffering of the physical storage device; they first
used a UPS on the system, then actual NVRAM boards. With this NVRAM
solution, they gained performance by not having to wait for slow
secondary storage to complete before acknowledging the NFS transaction.
Peacock also mentioned that they tried traditional logging (journaling)
to the NVRAM, but were unable to meet performance requirements using
that approach.
The second issue that Peacock addressed was filesystem
performance, both at runtime and when fsck is required to check the
filesystem. To improve both, they added data structures
to the on-disk filesystem representation and modified some of the ways
in which metadata are juggled. The areas Peacock focused on were the
busy bitmaps and the changes in the use of indirect blocks.
The Solaris UFS filesystem is divided into cylinder groups, each of
which contains a bitmap of free blocks. An fsck involves
checking this data in each cylinder group on the disk, an operation
that can take some time. In order to reduce the number of metadata
structures that need to be checked during an fsck run, they added
special bitmaps physically parallel to the free-block
bitmap. These new bitmaps indicate which blocks and i-nodes in that
cylinder group are being modified (are busy). Each cylinder group can
then be flushed and marked as "stable" asynchronously by a kernel
thread. This can greatly reduce the time needed to do an fsck
because only cylinder groups that are still marked "busy" need to be
checked.
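The effect can be sketched as follows, with illustrative structures
rather than the on-disk UFS format.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Groups are marked busy while their metadata are being modified, flushed
    // and marked stable in the background, and a check after a crash only has
    // to walk the groups still marked busy.
    struct CylinderGroup {
        bool busy = false;
        std::vector<bool> busy_blocks;     // parallel to the free-block bitmap
        CylinderGroup() : busy_blocks(1024, false) {}
    };

    void modify_block(CylinderGroup& cg, int blk) {
        cg.busy = true;                    // mark before touching metadata
        cg.busy_blocks[blk] = true;
    }

    void flush_group(CylinderGroup& cg) {  // what the kernel thread does, asynchronously
        std::fill(cg.busy_blocks.begin(), cg.busy_blocks.end(), false);
        cg.busy = false;                   // group is now "stable" on disk
    }

    int fast_fsck(const std::vector<CylinderGroup>& groups) {
        int checked = 0;
        for (const auto& cg : groups)
            if (cg.busy) ++checked;        // only busy groups need full checking
        return checked;
    }

    int main() {
        std::vector<CylinderGroup> groups(128);
        modify_block(groups[3], 42);
        modify_block(groups[7], 10);
        flush_group(groups[3]);
        printf("groups to check after crash: %d of %zu\n",
               fast_fsck(groups), groups.size());       // 1 of 128
        return 0;
    }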
An interesting variation that Peacock's group came up with is the
handling of indirect block maps to reduce the number of writes to disk.
Indirect blocks are normally stored separately from the i-node, hence
in a different block on the disk. Updating a large file that requires
the use of the indirect blocks incurs a read and write of at least two
blocks instead of one (i-node only vs. i-node + indirect block[s]). To
defer the need to deal with the additional blocks, temporary indirect
block storage is interleaved on odd i-nodes in the i-node table. Each
time an indirect block is needed, it is written into the i-node slot
adjacent to the file's i-node, requiring only a single write operation.
When the adjacent i-node storing the indirect pointers is full, it is
flushed to the traditional indirect block (hence deferring all indirect
block I/O operations until this time).
In conclusion, Peacock reminded us that NFS is inherently disk bound
because of the synchronous write requirements. His group was able to
overcome this by using NVRAM storage to satisfy NFS semantics and
attain high throughput performance. On top of that, they were able to
make additional gains by modifying UFS to use the indirect block cache
and busy maps. The data gathered by Peacock's group seem to indicate
runtime and fsck performance above and beyond that of standard
UFS and the widely used Veritas File System. This modified filesystem
is in use on Sun's Netra NFS server products and may appear in a future
Solaris release.
The audience questions indicated some skepticism about the Veritas
benchmarks that were presented. An important question concerned NFS
version 2 versus version 3, for which Peacock said they found a smaller
performance gap between their Netra NFS implementation and NFS version 3.
General Purpose Operating System Support for Multiple Page Sizes
Narayanan Ganapathy and Curt Schimmel, Silicon Graphics Computer
Systems, Inc.
Narayanan Ganapathy gave an excellent presentation that outlined the
advantages of using virtual memory page sizes above the normal 4k and a
walk-through of how they implemented this idea in IRIX (v6.4 & 6.5).
Some applications could see performance improvements if they could use
large memory pages instead of small (4k) pages: much of the overhead
for an application that deals with large sets of data can come from TLB
misses. Ganapathy explained the motivation for, and the experience SGI
gained while, retrofitting the IRIX virtual memory system to allow
processes to use multiple page sizes.
One of the goals in designing this multi-sized paging system was
minimizing change to existing operating system code and maintaining
flexibility and compatibility with existing application binaries. The
implementation they chose makes changes at a high level in the virtual
memory subsystem, in the per-process page table entries (PTEs), which
map all of the memory that a process can access. To
support the large pages, each PTE has a field that states the size of
that page (4k-16M on the R10000). The memory area to which that page
refers may be handled at a lower level by a pfdat (page frame data)
structure, which they chose to keep as statically sized 4k pieces for
compatibility. A major advantage to doing things this way is that
multiple processes can still share memory, but the size of the area
that each of them sees in its page table does not have to be the same.
One process can map 16k pages while another maps 4k pages, both of them
ultimately referring to the same 4k pfdat structures (in effect the same physical memory).
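A rough sketch of that arrangement, with invented structures and sizes:

    #include <cstdio>
    #include <vector>

    // Each per-process PTE records its own page size, while the
    // machine-independent layer keeps fixed 4k pfdat entries, so two processes
    // can map the same frames with different page sizes.
    const unsigned kBase = 4 * 1024;                   // 4k base page / pfdat unit

    struct Pfdat { unsigned long paddr; };             // one entry per 4k frame

    struct Pte {
        unsigned long vaddr;
        unsigned page_size;                            // 4k .. 16M, per process
        const Pfdat* frames;                           // first of page_size/4k pfdats
    };

    unsigned long translate(const Pte& pte, unsigned long vaddr) {
        unsigned long off = vaddr - pte.vaddr;         // offset within the mapping (< page_size)
        return pte.frames[off / kBase].paddr + off % kBase;
    }

    int main() {
        std::vector<Pfdat> frames;                     // 16k of physical memory
        for (int i = 0; i < 4; ++i) frames.push_back({0x100000UL + i * kBase});

        Pte big   = {0x40000000UL, 16 * 1024, &frames[0]};  // one 16k mapping
        Pte small = {0x50002000UL,  4 * 1024, &frames[2]};  // 4k view of frame 2

        printf("big:   %#lx -> %#lx\n", 0x40002004UL, translate(big,   0x40002004UL));
        printf("small: %#lx -> %#lx\n", 0x50002004UL, translate(small, 0x50002004UL));
        // both resolve into the same 4k pfdat (frame 2), i.e., the same physical memory
        return 0;
    }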
Allowing processes to manipulate their PTEs in this way raised some
interesting problems: memory fragmentation, fast processing of TLB
misses, and the need for additional system calls and tools to manipulate
page sizes. Fragmentation is avoided by intelligent allocation of pages,
using maps of free memory segments of similar size, and by a "coalescing
daemon" that defragments memory in the background, using page migration
to rearrange it. To prevent all processes from going through
extra code even when they are using the default page size, IRIX
provides the ability to assign a TLB miss handler on a per process
basis. A system call has been provided to change the page sizes, plus
tools to allow normal binaries to be configured to use large pages
without recompilation.
In closing, Ganapathy mentioned the possibility of intelligent kernels
that could automatically choose page sizes for a process based upon TLB
misses.
Implementation of Multiple Pagesize Support in HP-UX
Indira Subramanian, Cliff Mather, Kurt Peterson, and Balakrishna
Raghunath, Hewlett-Packard Company
The final presentation in this session was given by Indira Subramanian.
Although this presentation was on the same subject as the previous one
by the SGI group, the two talks were well coordinated and did not seem
redundant.
As in the Silicon Graphics implementation, the HP group wanted to
minimize kernel VM subsystem changes. Their implementation also avoids
changes to the low-level VM data structures and implements variable
sized pages at the PTE level.
Allocation and fragmentation management is governed by an
implementation of the buddy system, with the pools representing free
memory regions from 4k to 64M in size. The page management subsystem
uses two different strategies to deal with requests for new pages. The
VM system will automatically combine pages to create larger pages as
soon as a page is freed. Next, pages not currently in the cache will be
evicted and coalesced into the free pool. The last resort is to evict
and coalesce pages currently in the cache. Using this algorithm should
give the greatest chance for pages to be retrieved out of the cache.
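As a reminder of how the buddy bookkeeping works, here is a tiny sketch
of the coalescing step (addresses in 4k page units); this is generic
buddy-system code, not HP-UX's.

    #include <cstdio>
    #include <map>
    #include <set>

    // When a range is freed, its buddy (found by flipping the size bit of its
    // address) is merged with it if that buddy is also free, moving the pair
    // up to the next pool.  Pool sizes double from 1 page upward.
    std::map<unsigned, std::set<unsigned>> pools;   // size (in pages) -> free addrs

    void free_range(unsigned addr, unsigned pages) {
        while (true) {
            unsigned buddy = addr ^ pages;          // buddy of the same size
            auto& pool = pools[pages];
            auto it = pool.find(buddy);
            if (it == pool.end()) { pool.insert(addr); return; }
            pool.erase(it);                         // merge with the free buddy
            addr = addr < buddy ? addr : buddy;
            pages *= 2;                             // try to merge again, one level up
        }
    }

    int main() {
        free_range(0, 1);
        free_range(2, 1);
        free_range(3, 1);     // 2+3 merge into a 2-page range at 2
        free_range(1, 1);     // 0+1 merge, then 0..3 merge into a 4-page range
        for (auto& p : pools)
            for (unsigned a : p.second)
                printf("free: %u pages at page %u\n", p.first, a);
        return 0;
    }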
A single modified HP-UX page fault handler is used for all page faults
that occur in the system. It is capable of dealing with copy-on-write,
copy-on-reference, zero filling, and retrieval of large pages when
necessary. It is possible to provide the page fault handler with a page
size hint either through the use of a utility program (chatr)
or through an intelligent memory region allocation routine. This "hint"
allows the page fault handler to bypass page size calculation and
allocation if it can determine that the default of 4k is going to be
used. The basic idea was that if the page size hint shows that the new
page was probably going to be greater than 4k, the fault handler would
take the following steps: (1) calculate the size, (2) allocate the
necessary region, (3) add the necessary translations, and (4) zero fill
the page if needed. The size calculation and large region allocation
can be completely skipped if the new page will be a simple 4k, hence
preserving performance in those cases.
Perhaps the most gratifying part of this presentation was the time
Subramanian spent on the graphs. By
experimenting with applications that use memory in different ways, the
data showed that one large page size was not suitable for all
situations. In one application, there was a very high TLB miss rate
while using 4k pages, but a much better hit rate with 4M pages. As you
would expect, the law of diminishing returns kicked in when excessive
page sizes were selected.
In conclusion, Subramanian reminded us that the page sizes are not
promoted (combined to make larger pages) except when the page fault
handler identifies regions experiencing high TLB miss rates. Reducing
TLB misses was the project goal, which they accomplished by adding the
ability to use large pages and by having the VM system dynamically
monitor memory usage and adjust page sizes at runtime.
One unfortunate factor in this presentation was a lack of time. Our
presenter was fielding some very interesting questions from the
audience and flipping at blinding speed between graphs of data right up
to the end.
Session: Performance II
Summary by Vikas Sinha
This was the second of the system performance-related sessions. The
papers presented focused on simulation for understanding otherwise
hard-to-observe program execution events, on cache design for servers,
and on messaging techniques for exploiting current networking
technology. The session was chaired by Mike Nelson of
Silicon Graphics.
SimICS/sun4m : A Virtual Workstation
Peter S. Magnusson, Fredrik Larsson, Andreas Moestedt, Bengt Werner,
Swedish Institute of Computer Science; Fredrik Dahlgren, Magnus
Karlsson, Fredrik Lundholm, Jim Nilsson, Per Stenström, Chalmers
University of Technology; Håkan Grahn, University of
Karlskrona/Ronneby
Peter Magnusson presented the paper describing the capabilities and the
current status of the instruction-set simulator SimICS/sun4m, which has
been developed by his research group at the Swedish Institute of
Computer Science (SICS) over the past several years.
Simulation essentially means running a program on a simulator on some
arbitrary host computer so that it behaves like the same program
running on a specific target computer. Simulation focuses on capturing
characteristics like hardware events induced on a target platform
during program execution and some details of the software running that
are otherwise difficult to gather. Gathering such detailed
characteristics using simulators does involve a slowdown of typically
two to three orders of magnitude in program execution compared to its
execution on native hardware.
System-level simulators facilitate understanding the intricacies of
program execution on a target system because of their capability to
re-create an accurate and complete replica of the program behavior.
Such simulators have thus been an indispensable tool for computer
architects and system software engineers for studying architectural
design alternatives, debugging, and system performance
tuning.
SimICS/sun4m is an instruction-set simulator that supports more than
one SPARC V8 processor and is fully compatible with the sun4m
architecture from Sun Microsystems. It is capable of booting unmodified
operating systems like Linux 2.0.30 and Solaris 2.6 directly from dumps
of the partitions that would boot a target machine. Binary compatible
simulators for devices like SCSI, console, interrupt, timers, EEPROM,
and Ethernet have been implemented by Magnusson's research group.
SimICS can extensively profile data and instruction cache misses,
translation look-aside buffer (TLB) misses, and instruction counts. It
can run realistic workloads like the database benchmark TPC-D or
interactive applications such as Mozilla.
A noteworthy application of the SimICS/sun4m platform is its use for
evaluating design alternatives for multiprocessors. The evaluation of
the memory hierarchy of a shared-memory
multiprocessor running a database application was presented as a case
study.
In the presentation the performance of the SimICS/sun4m simulator was
demonstrated by comparing the execution time of the SPECint95 programs
on the target and host, using the train dataset. The slowdown was a
factor of 26-75 over native execution for the test environment
chosen.
SimICS/sun4m is available for the research community at
<https://www.sics.se/simics>. The presentation slides are available at
<https://www.simics.se/simics/usenix98>. Magnusson also invited
those interested in knowing more
about his work to contact him at <psm@sics.se>.
A few interesting questions were asked after the presentation. To
demonstrate user and system mode debugging, evaluation of Mozilla
running on top of Solaris on SimICS had been presented in the talk. In
the presentation it was also noted that reloading a page required 214
million SPARC instructions, and about 25% of these were spent in the
idle loop. One questioner asked why so much time
was spent in the idle loop. Magnusson said that the answer wasn't clear
to them, and to get the answers to such questions, they were working on
adding high-end scripting tools to their simulator because the current
tools are not sufficient for detailed analysis.
In reply to the question of what was the hardest problem to solve in
the work, Magnusson said that from an engineering point of view it was
the modelling of devices at a level to run real SCSI devices, real
ethernet drivers, etc. From the research point of view it was the
design of memory fast enough and flexible enough to give one the
desired information. In reply to the question on use of interpreters as
against realtime code generation, Magnusson said that although the
"religious belief" of programmers that realtime code generation was
faster held true, he wasn't aware of any group that had actually
implemented it with the desired stability. He added that one of the
reasons they are going commercial (Virtutech is the new company their
group has founded) was the hope that they will have access to the
resources required to better address such issues, resources that are
often not available in an academic research environment.
High-Performance Caching With The Lava Hit-Server
Jochen Liedtke, Vsevolod Panteleenko, Trent Jaeger, and Nayeem Islam,
IBM T.J. Watson Research Center
Jochen Liedtke presented the results of an ongoing experiment at the
T.J. Watson Research Center on the architecture design for a
high-performance server capable of efficiently serving future local
clusters of network computers and other future thin clients (PDAs,
laptops, pagers, printers, etc.). The key component in their
architecture is a generic cache module designed to fully utilize the
available bandwidth.
Liedtke's group envisions future local networks serving thousands to
hundreds of thousands of resource-poor clients with little or no disk
storage. In such scenarios the clients will download a significant
amount of data from the server, whose performance can become the
bottleneck. They suggest that high-performance customizable servers,
capable of handling tens of thousands of transactions per second (Tps)
with bandwidths of the order of gigabytes per second will be required.
The basic goals of their research were to find the upper bounds, both
theoretical and practical, and to find a highly customizable, scalable
architecture for such a scenario.
They based their work on the well-established approach of increasing
server performance via efficient cache design. The fundamental idea
behind their work is
separating the hit-server from the miss-server. The hit-server is
connected to both the pool of clients and the miss-server using
different Ethernet connections. There could be several Ethernet cards
on the hit-server, each connecting several clients. If the desired
object is in the hit-server, it is accessed using standard get
and put commands; otherwise the miss-server is signalled.
Because the hit-server is vital for performance, they make it general
and policy-free, so that it can adapt to any application. The
hit-server allows get/put operations on entire as well as
partial objects, besides providing support for active objects. Miss
handling and the replacement policy are handled in the customizable
miss-servers. To achieve scalability, it is suggested that multiple
customized miss-servers, e.g., fileservers, Web proxy servers, etc.,
could be implemented. Or more hit-servers can be incorporated in the
design to increase the overall cache size. The paper describes the
mechanisms that allow the miss-servers to support the desired
consistency protocol per object.
Throughputs of up to 624 Mbps are possible using the 1 Gbps PCI bus.
But current commercial and research servers achieve rates of only up to
2,100 and 700 Tps, respectively, for moderately small 1K objects.
It was demonstrated that the problem was not with the network hardware,
but with the memory bus. Thus it is imperative to minimize memory bus
accesses, which slow down performance. The CPU should limit itself
to using the L1 and L2 caches as far as possible. Lazy evaluation
techniques, precompiled packets, and precomputed packet digests can
facilitate this. L2 misses can be minimized by proper data structuring.
Lava's get performance is 59,000 Tps and 8,000 Tps for 1K and
10K objects, respectively. Liedtke explained that the throughput limit
of 624 Mbps suggested in the paper was incorrect because they had based
their measurements on the time to transmit a single packet using the
PCI bus and had not considered the time interval between the "start
transmit" signal to the controller and the start of the transmission,
which could be used by some other packet in case of multiple packet
transmissions.
A simple application of multiple clients booting by accessing 5M-15M of
data over a short interval of five minutes was shown to have an average
latency of about 1.5 s for 1,000 clients.
Before concluding, Liedtke put up some open questions. They were
whether the hit-server could be used in the WAN environment where
different protocols are prevalent, how cache-friendly future
applications will be and whether the system can be customized for them,
and whether it will be possible to integrate dynamic applications like
databases into the design.
Liedtke concluded by saying that the lesson his group had learned from
the implementation was that designing from scratch pays. He also
suggested that it is a good strategy to separate the
generic-fast-simple from the customizable-complicated-slower and noted
that generality goes with simplicity. He also said that even though an
ideal case analysis might be wrong, it is essential, and that designing
before implementing should be done whenever possible.
Cheating the I/O Bottleneck: Network Storage with Trapeze/Myrinet
Darrell C. Anderson, Jeffrey S. Chase, Syam Gadde, Andrew J. Gallatin,
and Kenneth G. Yocum, Duke University; Michael J. Feeley, University of
British Columbia
Darrell Anderson presented a messaging system designed to deliver the
high-performance potential of current hardware for network storage
systems, including cluster filesystems and network memory.
They note that the I/O bottleneck arises because disks are inherently
slow due to mechanical factors. Very fast networks like Myrinet, on
which their work is based, offer point-to-point connections capable of
1 GB/s bandwidths for large file transfers and small latencies of 5-10
microseconds for small messages. The network can instead be viewed as
the primary I/O path in a cluster, with the goal of achieving I/O at
gigabit network speeds for unmodified applications. By directing all I/O
to or from network memory, the I/O bottleneck can be cheated. Also, by
pipelining the network with sequential read-ahead and write-behind,
high-bandwidth file access through the network can be achieved. They rely
on the Global Memory Service (GMS) developed at the University of
Washington, Seattle, to provide the I/O via the network. Myrinet
provides link speeds matching PCI bandwidth, link-level flow control,
and a programmable network interface card (NIC), which is vital in
their environment. Their firmware runs on the NIC, they modify the
kernel RPC, and they treat file and virtual memory systems as an
extension of the underlying gigabit network protocol stack. Their
firmware and Myrinet messaging system is called Trapeze. They have been
able to achieve sequential file access bandwidths of 96 MB/s using GMS
over Trapeze.
The GMS system that has been used in Anderson's research lets the
system see the I/O through the network. GMS is integrated with the file
and VM systems such that whenever a file block or virtual memory page
is discarded on a node, it is in fact pushed over the network to some
other node, where later cache-misses or page-faults can retrieve it
with a network transfer.
In the request-response model on which network storage systems are
based, a small request solicits a relatively large page or file block
in response. In their work they address the challenges in designing an
RPC for network storage and its requirements for low overhead, low
latency, and high bandwidth. Support for RPC variations, like multiway
RPC for directory lookup and request delegation, is provided.
Nonblocking RPC, used for implementing read-ahead and write-behind, is
also supported.
Their Trapeze messaging is reportedly the highest bandwidth Myrinet
messaging system. It consists of two parts, the firmware running in the
NIC and the messaging layer used for kernel or user communication. It
supports IP through sockets as well as kernel-to-kernel messaging and
is optimized for block and page transfers. It provides features for
zero-copy communication through unified buffering with the system page
frame pool and by using an Incoming Payload Table (IPT) to map specific
frames to receive into. The key Trapeze data structures reside in the
NIC, where they are used by the firmware, but are also accessible to
the messaging layer. The Send and Receive Rings in the NIC point to
aligned system page frames, which are used to send and receive pages
using DMA. These page frames can also be mapped into user space.
Particular incoming messages can also be tagged with a token that, when
used in conjunction with the Trapeze IPT, can deliver the message data
into a specific frame. This is used in implementing their zero-copy
RPC. Their zero-copy TCP/IP over Trapeze can deliver a highly
respectable bandwidth of 86 MB/s for 8 KB data transfers.
They short-circuit the IP layer, which is nevertheless available to
user applications over the socket layer, in their integration of RPC
with the network interface. This avoids the costly copying at the IP
layer in the standard page fetch using RPC over IP.
They report the highest bandwidths and lowest overheads when files are
accessed through the mmap file-mapping system call.
Anderson referred those interested in learning more about their work to
their Web site <https://www.cs.duke.edu/ari/trapeze>.
A question was asked as to how IP performance could be improved, which
caught Anderson by surprise; he handled it by saying that their MTU
size is 8 KB and that page remapping is done to avoid costly data
copying and improve overall performance. Answering questions on
reliability of their RPC and data corruption in the underlying
hardware, he said that because they were using Myrinet, which provides
a hardware checksum and also link-level flow control, messages are not
corrupted or dropped in the network.
Session: Neat Stuff
Summary by Kevin Fu
This session consisted of a collection of interesting utilities. Pei
Cao from the University of Wisconsin maintained order as the session
chair.
Mhz: Anatomy of a Micro-benchmark
Carl Staelin, Hewlett-Packard Laboratories; Larry McVoy, BitMover, Inc.
Carl Staelin talked about mhz, a utility to determine processor
clock speed in a platform independent way. Mhz takes several
measurements of simple C expressions, then finds the greatest common
divisor (GCD) to compute the duration of one clock tick.
Measuring a single clock tick is difficult because clock resolution is
often too coarse. One could measure the execution time of a simple
expression repeated many times, then divide by the number of
instructions. However, this too has complications. For instance, a
compiler may optimize away "a++" when it is run many times in a loop.
Moreover, interrupts muddle the measurements by randomly grabbing CPU time.
Staelin proposed a solution based on ideas from high school
chemistry and physics for determining atomic weights. Measure the time of
simple operations; then use the GCD to determine the duration of one
clock tick. Mhz uses nine C expressions for time measurements.
The expressions have inter- and intra-expression dependencies to
prevent the compiler from overlapping execution of expressions. The
operations must also be immune to optimization and be of varying
complexity.
Mhz requires the operations to have relatively prime execution
times. However, measurements will have variance and fluctuations.
Therefore, mhz must minimize noise and detect when a measurement
is incorrect. Mhz prunes out incorrect results by measuring many
executions of a particular expression. If any particular execution is
off by more than a factor of five when compared to other executions,
the result is disregarded. Mhz calculates the duration of one
clock tick using many subsets of the nine measurements. To produce a
final answer, mhz takes the mode of the calculations.
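The arithmetic at the heart of the trick can be shown in a few lines;
the timings below are made up (they correspond to a 200MHz clock), and
the real tool measures nine expressions and takes the mode over many
subsets to reject noisy results.

    #include <cstdio>

    // If loop bodies cost relatively prime numbers of cycles, the greatest
    // common divisor of their measured per-iteration times is one clock period.
    long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }

    int main() {
        // measured per-iteration times in picoseconds for ops costing 2, 3, 7 cycles
        long t[3] = {10000, 15000, 35000};
        long tick = t[0];
        for (int i = 1; i < 3; ++i) tick = gcd(tick, t[i]);
        double mhz = 1e6 / tick;            // ps per tick -> MHz
        printf("clock tick = %ld ps, clock = %.0f MHz\n", tick, mhz);
        return 0;
    }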
The mhz utility works on x86, Alpha, PowerPC, SPARC, and PA-RISC
processors under Linux, HP-UX, SunOS, AIX, and IRIX. Mhz is accurate and
OS/CPU independent. Mhz works in Windows NT, but NT does not offer the
gettimeofday() call. As a result, Staelin used NT's native timer, which
leaves something to be desired. Mhz produced correct
results, but Staelin did not report this because he does not want
to support NT. Porting is painful for a variety of reasons.
Staelin was also asked about loop overhead and interrupts. Mhz
was developed with a timing harness that performs a variety of
experiments to detect clock granularity. Mhz can remove the
overhead caused by the gettimeofday() call. Interrupts are
random and hence dealt with by using multiple experiments.
An audience member asked whether mhz could produce more accurate
results when given more time to compute. Staelin responded, "Good
benchmarking hygiene should be good. We wanted something that would
work in a second or so."
There may be other areas of computer performance where this method has
applicability. This is a trick that can go into your mental toolkit.
See <https://www.bitmover.com/> for the source code.
Automatic Program Transformation with JOIE
Geoff A. Cohen and Jeffrey S. Chase, Duke University; David L.
Kaminsky, IBM Application Development Technology Institute
Geoff Cohen, an IBM graduate fellow and doctoral student at Duke,
talked about load-time transformations in the Java Object
Instrumentation Environment (JOIE). Transportable code allows sharing
of code from multiple sources. Cohen used JOIE as an environment
toolkit to transform compiled Java bytecode.
There already exist binary transformation tools such as OM/ATOM, EEL,
and Purify. BIT and BCA allow transformations in Java. However, BCA
does not modify bytecodes, and BIT only inserts method calls into
bytecodes. Neither is a general transformation tool.
There are a few kinds of transformers. A symbolic transformation could
add interfaces or change names, types, or superclasses. A structural
transformation could add new fields, methods, or constructors. Bytecode
transformation allows for insertion, reordering, or replacement of
bytecodes within method implementations. This last transformation
significantly distinguishes JOIE from BCA.
Such transformers can extend Java to support generic types and new
primitives. For instance, transformers can work with caching, security,
and visualization for system integration. Moreover, transformers can
add functionality such as versioning or logging.
Load-time transformations with JOIE are incremental, automatic, and
universal. JOIE is an enabling technology that gives users more control
over programs. Related issues are performance, security, safety, and
ease of use.
An audience member asked why not perform transformations in the
JIT/JVM. Cohen responded that this approach is not platform independent
and is harder to write. In the JIT, symbols may have been lost.
JOIE is written in Java. Performance is on the order of single-digit
milliseconds. But once time allows for some tuning, Cohen expects JOIE
to run in hundreds of milliseconds.
Responding to a question, Cohen said that it is possible to debug
transformed code, but it is very hard. The JVM should prevent anything
unsafe created by JOIE at runtime (e.g., reading /etc/passwd).
Finally, an audience member asked about reversibility: could a
transformation be undone by another transformation? In theory, this is
possible, but some functions are one-way.
See <https://www.cs.duke.edu/ari/joie/>.
Deducing Similarities in Java Sources from Bytecodes
Brenda S. Baker, Bell Laboratories, Lucent Technologies; Udi Manber,
University of Arizona
Brenda Baker spoke about how to detect similarities in Java bytecode.
She is interested in string matching and Web-based proxy services. Java
is the juggernaut and is expected to be widespread and ubiquitous.
Typically, bytecode is distributed without the source code when
programmers want to keep the source secret. Baker's goal is, given a
set of bytecode files, to discover similarities among the bytecode
files that reflect similarities among their Java source files.
Furthermore, all this should happen without access to Java source
files.
Detecting similarities has application to plagiarism detection, program
management to find common sources, program reuse and reengineering,
uninstallation, and security (detecting code similar to known malicious code). For
instance, one could detect the incorporation of benchmarks into
programs or whether JOIE was applied. There is also a potential battle
against code obfuscators.
Baker adapted three tools: siff finds pairs of text files that
contain a significant number of common blocks (Manber), dup
finds all longest parameterized matches (Baker), and diff is a
dynamic programming tool to identify line-by-line changes (GNU).
Siff and diff are not too useful on raw bytecode,
even when the bytecode is disassembled. When changing a 4 to a 5 in
two lines of a 182-line Java file, diff generated 1,100 lines
of output on the disassembled bytecode, and siff found less
than 1% similarity.
Baker described three experiments. The first experiment involved random
changes to a Java source file (insertion, deletion, substitution within
statements). The modified source was compiled, and the bytecode
disassembled, then encoded. The average similarity in the disassembled
code using siff never differed by more than 9% from the same
measurement on the Java source. Averages stayed very close, making this
a promising approach.
In the second experiment, Baker's group tried to find similarities in
2,056 files from 38 collections. Thresholds were set as follows:
siff reported pairs with at least 50% similarity, dup
reported pairs matching at least 200 lines. Nine pairs of programs
across different collections were reported as similar by both
siff and dup. Eight of these had the same name. One
pair shared the same implementation of the MD5 hash algorithm.
One pair was reported only by dup, probably a false
positive. However, siff reported 23 pairs unreported by
dup. Some had similar names, while the other pairs consisted of
one very small file and one very large file. The small/large file pairs
are probably false positives.
Experiment three involved false negatives. Baker's group asked friends
to randomly pick 10 programs from a set of 765 Java programs. Each person
would make random changes, then compile the Java
code, even with different compiler versions. The bytecode was
then returned in random order.
Of the 12 pairs of similar code, siff found nine
with a threshold of 65%; dup found eight pairs with a
threshold of 100 lines. Together siff and dup found
10 pairs. There is a trade-off between false positives/negatives and
the threshold.
Baker found the offsets to be important for matching. Also,
siff can handle large amounts of code, but diff
requires the most intensive computation. When analyzing lots of files,
first use siff, then dup, then diff.
diff has a quadratic blowup with respect to the number of file
inputs.
An audience member asked whether Baker had tried comparing the output
of two different compilers. Baker doubts her group will find much
similarity. But if you have the code, you could compile in another
compiler to test for similarity. As for false positives, if you lower
the threshold too far, you could get hundreds of false positives.
Moving code around will not affect dup, but will affect
siff. This all depends on the threshold. Using siff,
dup, and diff in combination makes detection more
powerful.
In further research, Baker's group hopes to use additional information
in bytecode such as the constant pool.
Session: Networking
Summary by Jon Howell
The networking session was chaired by Elizabeth Zwicky of Silicon
Graphics.
Transformer Tunnels: A Framework for Providing Route-Specific
Adaptation
Pradeep Sudame and B. R. Badrinath, Rutgers University
Pradeep Sudame presented the concept of transformer tunnels as a way to
provide better service to mobile hosts that encounter diverse networks.
In a day, a mobile host might access the network at large over a modem,
a cellular phone, a wireless LAN in the office, and a high-speed wired
LAN. Each network has different properties, and transformer tunnels
provide a way to manipulate the traffic going over the mobile host's
link to minimize certain undesirable effects.
The mechanics of transformer tunnels are as follows: a routing table
entry at the source end of the tunnel indicates that packets bound for
a given link should be transformed by a certain function. The source
node transforms the packet payload, rewrites the header to point to the
tunnel destination, rewrites the protocol number to arrange for the
transformation to be inverted at the far end, and attaches the original
header information to the end of the packet so it can be reconstructed
at the other end.
When the packet arrives at the destination, its protocol number directs
it to the appropriate inverse transformation function. The
reconstructed packet is delivered to IP, where it is delivered in the
usual way to an application on the local host or forwarded on into the
network.
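A minimal sketch of that encapsulation path follows, assuming a toy XOR
transform and an invented header layout; neither is the authors'
implementation, and the protocol number is a made-up value.

    # Sketch of the transformer-tunnel mechanics described above (illustrative only).
    TUNNEL_PROTO = 253          # hypothetical protocol number selecting the inverse transform

    def xor_transform(payload: bytes, key: int = 0x5A) -> bytes:
        """Toy payload transformation (the talk's proof-of-concept used a simple XOR)."""
        return bytes(b ^ key for b in payload)

    def encapsulate(orig_header: bytes, payload: bytes, tunnel_dst: bytes) -> bytes:
        """Source end: transform the payload, point the header at the tunnel exit,
        mark the packet with the tunnel protocol number, and append the original
        header so the far end can reconstruct it."""
        new_header = tunnel_dst + bytes([TUNNEL_PROTO])   # invented, minimal header layout
        return new_header + xor_transform(payload) + orig_header

    def decapsulate(packet: bytes, header_len: int, orig_header_len: int):
        """Destination end: strip the tunnel header, invert the transform, and
        restore the original header before handing the packet back to IP."""
        body = packet[header_len:-orig_header_len]
        orig_header = packet[-orig_header_len:]
        return orig_header, xor_transform(body)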
Sudame gave interesting examples of how transformer tunnels can provide
useful trade-offs for mobile hosts on links with different
characteristics. A compression module is useful on slow links, trading
off host-processing overhead. A big packet buffer compensates well for
links with bursty losses (such as during a cell handoff), trading off
memory requirements. A tunnel that aggregates packets to reduce radio
transmitter on-time reduces energy consumption, trading off an increase
in the burstiness of the link.
Joe Pruett, in the Q/A period, asked how the transformer deals with a
maximally sized packet to which it needs to add overhead. Sudame
responded that, for optional optimizations, it would be passed on
unchanged; for mandatory transformations such as encryption, it would
be transformed and then fragmented.
Ian Vaughn asked if IPsec was used for encryption, to which Sudame
replied that they used only a simple exclusive-OR as a
proof-of-concept.
Elizabeth Zwicky asked how difficult it was for an unfamiliar
programmer to write a transformation function. Sudame replied that it
required the programmer be only somewhat aware of systems programming
concepts.
Sudame provided the following URLs for more information and indicated
that the group would like many people to try out the code and comment
on it. <https://www.cs.rutgers.edu/dataman> and
<https://www.cs.rutgers.edu/~sudame/xtunnel-dist.html>.
The Design and Implementation of an IPv6/IPv4 Network Address and
Protocol Translator
Marc E. Fiuczynski, Vincent K. Lam, and Brian N. Bershad, University of
Washington
Marc Fiuczynski discussed an IPv6/IPv4 Network Address and Protocol
Translator (NAPT). He identified three possible scenarios in which one
might configure a NAPT: use within an intranet, providing your shiny
new IPv6 systems with access to the existing IPv4 Internet, and
duct-taping your rusty old IPv4 systems to the emerging IPv6 Internet.
As he began his talk, Fiuczynski fumbled with the pointer, but then
fell back on his Jedi light saber training, muttering, "Luke, use the
laser pointer."
He outlined the project's goals for a translator: a translator should
be transparent so that the end host is oblivious of its presence. It
must scale with the size of the network it is serving. It should be
failure resilient, in that it can restore service after a reboot. It
should, of course, perform suitably. And finally, it should deploy
easily.
A translator must attend to several issues. It needs to preserve the
meaning of header bits across the header formats. It translates
addresses between the IPv4 space and the IPv6 space, which it can do
using a stateful or stateless server. And it also needs to translate
certain higher-level protocols such as ICMP and DNS that encode IP
addresses in the IP payload.
The group built two translators, one stateful and one stateless. The
stateful translator has a table of IPv4 to IPv6 address mappings. It
attempts to garbage-collect IPv4 addresses to reduce the number needed
to serve a site. This garbage collection was challenging because "you
might break someone's ongoing communication ... that would be bad ...
that's definitely not a goal of the translator." However, because the
translator is stateful, it is not scalable or fault resilient; because
it requires rewriting some transport protocol headers, it is not
transparent.
The stateless translator uses special IPv6 addresses that map
one-to-one with IPv4 addresses. It is scalable, fault resilient,
transparent, and has no need to garbage-collect IPv4 addresses.
However, using the special compatibility addresses means that routers
will still have the "stress" of routing IPv4-like addresses, a problem
IPv6 addresses are designed to relieve. Fiuczynski concluded that a
stateless translator is best suited to connecting an IPv6 site to the
IPv4 Internet or for translating within an intranet.
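For illustration only, here is a sketch of the stateless mapping idea using
Python's ipaddress module. The ::ffff:0:0/96 prefix below is a stand-in
assumption, not necessarily the compatibility prefix the Washington
translator relies on.

    # Embed the IPv4 address in a fixed IPv6 prefix so translation needs no
    # per-connection state (prefix chosen here only for illustration).
    import ipaddress

    PREFIX = ipaddress.IPv6Network("::ffff:0:0/96")

    def v4_to_v6(v4: str) -> ipaddress.IPv6Address:
        """Embed an IPv4 address in the compatibility prefix."""
        return ipaddress.IPv6Address(int(PREFIX.network_address) |
                                     int(ipaddress.IPv4Address(v4)))

    def v6_to_v4(v6: str) -> ipaddress.IPv4Address:
        """Recover the embedded IPv4 address (assumes the address lies in PREFIX)."""
        addr = ipaddress.IPv6Address(v6)
        assert addr in PREFIX
        return ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)

    print(v4_to_v6("192.0.2.1"))          # an IPv6 address inside ::ffff:0:0/96
    print(v6_to_v4("::ffff:192.0.2.1"))   # 192.0.2.1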
Joe Pruett asked about DNS translation and whether all internal IPv6
nodes could be reachable from the outside network using IPv4 addresses.
Fiuczynski replied that the stateful translator would have to
garbage-collect addresses to share them among internal hosts and
translate (or directly answer?) DNS queries according to the current
mapping.
Greg Minshall asked what the difference was between IPv4 to IPv4
translators and Washington's IPv6 to IPv4 translators. The reply was
that IPv4 NATs are stopgap measures with no clear replacement, but IPv6
translators are a transitional mechanism meant to be eventually
removed.
Dave Presotto asked whether the system was rule based, that is, whether
he could add new translation functions, other than IPv4 to IPv6
translation, in order to perform other functions using the same system.
Fiuczynski expressed confidence that such an extension would be
straightforward.
B.R. Badrinath asked if multicast address translation would be a
problem, to which Fiuczynski offered a succinct "yes."
The work is documented at
<https://www.cs.washington.edu/research/networking/napt>, and source
will be available there soon.
Increasing Effective Link Bandwidth by Suppressing Replicated Data
Jonathan Santos and David Wetherall, Massachusetts Institute of
Technology
Jonathan Santos spoke about his group's work in identifying and
suppressing replicated data crossing a given network link. The work
applies to any link that is a bottleneck due to cost or congestion
reasons. The novel approach of the project was to identify redundancy
in packet payloads traversing the link without using protocol-specific
knowledge.
Santos defined "replicated data" as a packet payload that is
byte-for-byte identical to a previously encountered packet. The
researchers studied a packet trace from an Internet gateway at MIT and
discovered that 20% of the outbound volume and 7% of the inbound volume
of data met their definition of replicated. HTTP traffic was
responsible for 87% of the replication found in the outbound trace, and
97% of the volume of replicated data was delivered in packets larger
than 500 bytes, indicating that per-packet compression savings would
dwarf any added overhead.
To identify whether the replication could be detected and exploited in
an online system, they graphed replicated volume against window size.
The graph had a knee at around 100-200MB, signifying that most of the
available redundancy could be exploited with a cache of that size.
Their technique for redundancy suppression involved caching payloads at
both ends of the link and transmitting a 128-bit MD5 fingerprint to
represent replicated payloads. One issue involved retransmitting the
payload when the initial packet (containing the real payload) is lost.
They also prevent corruption due to fingerprint collisions (the
unlikely possibility that two payloads share the same MD5 checksum) in
the absence of message loss. (Greg Rose from Qualcomm Australia pointed
out that RSA, Inc., issues a cash prize if you discover an MD5
collision. Hopefully, the software reports any collisions to the system
administrator.)
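A minimal sketch of the fingerprinting scheme, assuming a plain dictionary
cache at each end of the link; cache eviction at the 100-200MB knee and
recovery from a lost first packet are omitted, and the class below is an
illustration rather than the authors' code.

    # Cache payloads at both ends; send a 128-bit MD5 fingerprint for repeats.
    import hashlib

    class LinkEnd:
        def __init__(self):
            self.cache = {}                          # MD5 digest -> payload

        def send(self, payload: bytes):
            digest = hashlib.md5(payload).digest()   # 128-bit fingerprint
            if digest in self.cache:
                return ("FP", digest)                # seen before: send only the fingerprint
            self.cache[digest] = payload
            return ("DATA", payload)                 # first sighting: send the payload itself

        def receive(self, kind, body):
            if kind == "DATA":
                self.cache[hashlib.md5(body).digest()] = body
                return body
            return self.cache[body]                  # reconstruct payload from the cached copy

    sender, receiver = LinkEnd(), LinkEnd()
    for p in (b"GET /index.html", b"GET /index.html"):
        kind, body = sender.send(p)
        assert receiver.receive(kind, body) == p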
Santos concluded that their system was a cost-effective alternative to
purchasing link bandwidth and that it complements link-level
compression well.
Fred Douglis inquired whether they might be able to identify and
compress very similar but not identical packets in an online fashion.
Santos suggested using fingerprinting at a finer granularity (over
parts of packets).
Nick Christenson pointed out that most of the replication is due to
outbound HTTP traffic and asked whether it might have been nearly as
effective to simply use a Web cache on the outbound end of the link.
Santos said they assumed client-side and proxy caches were in use when
the traces were taken. [This does not account for the redundancy
available if all clients passed through the same Web cache at the
outbound end of the link, which appeared to be Christenson's point.]
Andy Chu pointed out that, to save bandwidth on a typically congested
link to an ISP, one must funnel all data through one link and put the
box at your ISP. [Also observe that the cost savings will apply only to
the link bandwidth; the ISP will surely still desire compensation for
the now-increased use of its upstream link.]
Session: Security
Summary by Kevin Fu
The papers in this session dealt with controlled execution of untrusted
code. Specifically, the papers discuss how to confine untrusted code to
a safe environment. Fred Douglis from AT&T Labs Research
served as the session chair.
Implementing Multiple Protection Domains in Java
Chris Hawblitzel, Chi-Chao Chang, Grzegorz Czajkowski, Deyu Hu, and
Thorsten von Eicken, Cornell University
Chris Hawblitzel gave a confident, well-paced presentation of the
J-Kernel, a portable protection system written completely in Java. The
J-Kernel allows programmers to launch multiple protection domains
within a single Java Virtual Machine (JVM) while maintaining
communication and sharing of objects and classes in a controlled way.
Hawblitzel listed three ways an applet security model can enforce
security:
-
restrict which classes an applet can access (type hiding)
-
restrict which objects an applet can access (capabilities)
-
perform additional checks (stack inspection)
However, a problem persists in that applets have no way to communicate
in a secure, controlled way. Therefore, the J-Kernel group decided on
three requirements for their protection system:
-
Revocation. Java provides no way to revoke references to
objects. Therefore, the J-Kernel must provide its own revocation
mechanism on top of Java.
-
Termination. If one merely stops a domain's threads, its objects may
still be reachable from other domains, so the domain will not be
garbage-collected. Therefore, the J-Kernel must free up objects when a
domain terminates.
-
Protection of threads. To maintain control over a thread, ownership
must change when the thread crosses a domain boundary during a
cross-domain call. Java lets you stop threads and change their
priority, which could allow malicious behavior, so the J-Kernel must
not allow outside changes to a thread while another domain is in control.
The J-Kernel distinguishes between objects and classes that can be
shared between domains and those that are private to a single domain.
This distinction means that a revocation mechanism is needed only for
shared objects; it also simplifies security analysis of communication
channels and allows the runtime system to know which objects are shared.
Hawblitzel noted that it can be hard to maintain the distinction between
shared and private information: private objects must not leak to other
domains through method invocations on shared objects. The J-Kernel
solves this by passing shared objects by reference and private objects
by copy.
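To make the revocation and copy-versus-reference ideas concrete, here is a
small sketch in Python rather than Java; the names are invented for
illustration and are not the J-Kernel API.

    # A revocable wrapper around a shared object, plus pass-by-copy for private
    # arguments crossing the "domain boundary" (illustrative, not J-Kernel code).
    import copy

    class Capability:
        """Indirect reference to a shared object that the owner can revoke."""
        def __init__(self, target):
            self._target = target

        def invoke(self, method, *args):
            if self._target is None:
                raise PermissionError("capability revoked")
            # Private (non-capability) arguments cross the boundary by deep copy,
            # so the callee can never alias the caller's private objects.
            safe_args = [a if isinstance(a, Capability) else copy.deepcopy(a)
                         for a in args]
            return getattr(self._target, method)(*safe_args)

        def revoke(self):
            self._target = None

    class Logger:
        def __init__(self): self.lines = []
        def log(self, line): self.lines.append(line)

    cap = Capability(Logger())
    cap.invoke("log", ["boot ok"])   # the list is copied across the "boundary"
    cap.revoke()
    # cap.invoke("log", ["again"])   # would now raise PermissionError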
Using Microsoft's JVM or Sun's JDK with Symantec's JIT compiler on a
300MHz Pentium II running Windows NT 4.0, a null J-Kernel local RMI
takes about 60x to 180x longer than a regular method invocation. This
result is mostly due to thread management and locking when entering a
call; synchronization accounts for 60-80% of the overhead. The J-Kernel
suffers some performance loss because it is written in Java. See the
paper for a table of performance results.
The J-Kernel group created a few sample applications as well. They
finished an extensible Web server and are working on a telephony server.
Private domains interface to the Web, PBX, and phone lines while user
domains run servlets to process requests and calls. New services can
then be uploaded safely. Related work includes Java sandbox extensions,
object references treated as capabilities (e.g., Spin, Odyssey, E),
safe language technology (e.g., Java), and capability systems (e.g.,
Hydra, Amoeba).
One audience member asked how the J-Kernel copies parameters and how it
handles data considered to be transient. Hawblitzel explained that the
J-Kernel can use serialization to copy objects (the objects are
serialized into a big byte array, and then the byte array is
deserialized to generate new copies of the objects), or it can generate
specialized copy routines that are faster than serialization because
they do not go through the intermediate byte array.
Source code and binaries are available for NT and Solaris. For more
information, see <https://www.cs.cornell.edu/jkernel/>.
The Safe-Tcl Security Model
Jacob Y. Levy, Laurent Demailly, Sun Microsystems Laboratories; John
Ousterhout and Brent B. Welch, Scriptics Inc.
Safe-Tcl allows execution of untrusted Tcl scripts while preventing
damage to the environment or leakage of private information. Safe-Tcl
uses a padded cell approach (as opposed to pure sandboxing or code
signing). Each script (applet) operates in a safe interpreter where it
cannot interact directly with the rest of the application. Safe-Tcl's
main claim to fame is its flexibility: the capabilities given to the
script can be increased or decreased based on the degree to which the
script is trusted.
Unfortunately, there was some confusion among the authors about who was
supposed to present, with the result that no one showed up at the
session. However, the paper is well written and worth the read. You can
find related material from Scriptics <https://www.scriptics.com>,
the Tcl Resource Center at Scriptics
<https://www.scriptics.com/resource>, the Tcl Consortium
<https://www.tclconsortium.org/>, or the Tcl plugin download page
<https://www.scriptics.com/resource/tools/plugin/>. The plugin is
the best example of an application using Safe-Tcl and is a good
starting point for people who want to learn more about Safe-Tcl.
Session: Work-in-Progress Reports
Summaries provided by the authors, and edited and compiled by session
chair Terri Watson Rashid
Jon Howell <jonh@cs.dartmouth.edu> described Snowflake, a project
to allow users to build single-system images that span administrative
domains. The central concept is that users must perform the aggregation
of their single-system image in order to freely aggregate resources
supplied by multiple administrators.
<https://www.cs.dartmouth.edu/~jonh/research/>
Ray Pereda <rpereda@cs.utsa.edu> talked about a new programming
language that he and Clint Jeffery developed at the University of
Texas, San Antonio. The language is called Godiva. It is a very
high-level dialect of Java.
<https://segfault.cs.utsa.edu/godiva>
Bradley Kuhn <bkuhn@ebb.org> discussed his work at the
University of Cincinnati on Java language optimization. The goal of
this research is to create a framework for implementation of both
static and dynamic optimizations for Java. Such a framework will allow
for testing and benchmarking of both new and old dynamic and static
optimizations proposed for object-oriented languages in the literature.
The framework will build upon existing pieces of the free Java
environment, such as Guavac and Japhar, to make implementations of new
optimizations for Java easy and accessible to all.
<https://www.ebb.org/gnu-spot/>
Jun-ichiro Itoh <itojun@kame.net> talked about his IPv6 and IPsec
effort in Japan. The project, called the KAME Project, is trying to
establish BSD-copyrighted, export-control-free reference network code
for Internet researchers as well as for commercial use. They also intend
to incorporate the best code for recent topics, such as QoS, IP over
satellite, etc.
<https://www.kame.net/project-overview.html>
Oleg Kiselyov <oleg@pobox.com> spoke about an HTTP Virtual File
System for Midnight Commander (MC). The VFS lets the MC treat remote
directories as if they were on a local filesystem. A user can then view
and copy files to and from a remote computer and even between remote
boxes of various kinds. A remote system can be an arbitrary
UNIX/Win95/WinNT box with an HTTP server capable of running a simple,
easily configurable sh or Perl script.
<https://pobox.com/~oleg/USENIX98/>
Ian Brown <I.Brown@cs.ucl.ac.uk> described ongoing work on
signatures using PGP that can be checked only by people designated by
the signer. Typical digital signatures on messages can be checked by
anyone. This is useful for contracts, but for confidential messages
senders may not want recipients to be able to prove to anyone what they
wrote. Comments during the WIP session pointed out that the current
approach did not check integrity of encrypted but unsigned data. The
authors noted that this is a general PGP problem and have since
augmented their design to fix this.
Tom M. Kroeger <tmk@cse.ucsc.edu> from the University of
California, Santa Cruz, presented some preliminary work on efficiently
modelling I/O reference patterns in order to improve caching decisions.
This work is attempting to use models from data compression to learn
the relationships that exist between file accesses. They have been
addressing the issues of model space and adapting to changing patterns
by partitioning the data model and limiting the size of each partition.
They are working to implement these techniques within the Linux
operating system.
<https://www.cse.ucsc.edu/~tmk/predictive.html>
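As a rough sketch of the kind of model this WIP describes (the details below
are assumptions for illustration, not Kroeger's design), a first-order
successor table partitioned per file and capped in size keeps the model's
memory bounded while still ranking likely next accesses:

    # Learn "last file -> likely next files" relationships, with each partition
    # limited in size so the overall model stays small.
    from collections import defaultdict, Counter

    MAX_SUCCESSORS = 8            # per-partition size limit (assumed value)

    class AccessPredictor:
        def __init__(self):
            self.model = defaultdict(Counter)   # previous file -> Counter of successors
            self.prev = None

        def observe(self, path: str):
            if self.prev is not None:
                part = self.model[self.prev]
                part[path] += 1
                if len(part) > MAX_SUCCESSORS:           # keep each partition small
                    part.pop(min(part, key=part.get))    # evict the rarest successor
            self.prev = path

        def predict(self, path: str, n: int = 2):
            """Files worth prefetching after an access to `path`."""
            return [p for p, _ in self.model[path].most_common(n)]

    pred = AccessPredictor()
    for f in ["a.c", "a.h", "a.c", "a.h", "a.c", "util.h"]:
        pred.observe(f)
    print(pred.predict("a.c"))    # ['a.h', 'util.h']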
Poul-Henning Kamp <phk@freebsd.org> talked about "timecounters,"
a new concept for tracking realtime in UNIX kernels. With this code NTP
can track any current or future possible time signal into the 1E-18
second regime, limited only by hardware issues. A couple of plots
showed how NTP tracked a GPS receiver with approximately 10nsec
noise.
<https://phk.freebsd.dk/rover.html>
James Armitage <jma@wpi.edu> and Bari Perelli-Minetti
<baripm@wpi.edu> spoke about their research with John Rulnick in
the Network Operations Research Lab at WPI concerning the causes of
soft (transient) errors in workstation and server memory and the
effects of these errors on system operation. The techniques being used
to explore the effects of soft errors were also briefly presented. One
member of the audience provided information on related experiments on
errors occurring in satellite circuits due to cosmic rays in space.
Kostas Magoutis <magoutis@eecs.harvard.edu> briefly talked about
his work on eVINO, an extensible embedded kernel for intelligent I/O
based on the VINO operating system for task management, extensibility,
and networking. He argued that I/O platforms (IOP) in the form of
add-on adapters with fast, programmable I/O processors are effective in
helping servers face the demands of today's gigabit networks and RAID
storage systems, offloading interrupt processing and allowing them to
better multiplex their resources to application programs. eVINO will
focus on I/O and provide extensibility on the IOP with applications
such as active networking and Web server caching.
INVITED TALKS TRACK
Repetitive Strain Injury: Causes, Treatment, and Prevention
Jeff Okamoto, Hewlett-Packard
Summary by Eileen Cohen
Jeff Okamoto, a lively speaker with a sense of humor that helped
lighten a grim topic, spoke about Repetitive Strain Injury (RSI) from
the depths of personal experience. He worked hard for ten years before
his injury, which he said may take as long as another ten years to go
away, started to occur. At that point he decided to educate
himself about RSI, and he used his talk to give the audience the
lessons he learned "the hard way and with ignorance and some
amount of fear."
RSI is a topic of serious import to computing professionals. Not only
is more work being done at desktop machines than ever before, but many
people are also working longer and longer hours at their computers. It
was estimated in 1994 that over 700,000 RSIs occur every year, with a
total annual cost of over $20 billion. RSI can be devastating,
affecting one's ability not only to do a job, but also to perform the
basic tasks, and enjoy the pleasures, of daily life.
Okamoto began with facts about human anatomy that explain why RSIs
occur, then moved on to discuss ergonomics. Companies spend a lot on
ergonomics, and even though they're not all doing it the right way, he
urges participating in any ergonomic assessment program your employer
offers: "it can be a lifesaver." He provided detailed tips about
hand position and motion, criteria for a good chair, monitor position,
and use of pointing devices.
After explaining the range of possible RSI diagnoses and treatments,
Okamoto emphasized that if an injury happens on the job, the only way
to get proper medical help without paying out of your own pocket is to
open a worker's compensation case. (Many people resist doing this.)
Employers are legally bound to provide the necessary paperwork you
need to file a claim. Unfortunately, said Okamoto, "having to deal with
the worker's comp system is the worst thing in the world for me." He
gave valuable advice, based on his experience in California, on
negotiating the system, in particular how to choose your own
physician instead of using one from the short list the state provides,
who is "likely to be biased against you."
"An RSI is something I wouldn't wish on my worst enemy," said Okamoto.
As a closing note, he raised the specter of what will happen if future
computer users, who are getting younger and younger, are not trained as
children to type and point properly. "By the time they get out of
college, they'll be 90% on the road to injury."
The slides from this talk are available at
<https://www.usenix.org/publications/library/proceedings/usenix98/invited_talks/okamoto_html/>.
They contain pointers to books, Web resources, and mailing lists on the topic.
Mixing UNIX and PC Operating Systems via Microkernels: Experiences
Using Rhapsody for Apple Environments and INTERIX for NT Systems
Stephen R. Walli, Softway Systems, Inc.; Brett R. Halle, Apple
Computer, Inc.
Summary by Kevin Fu
Stephen Walli, vice president of research and development at Softway
Systems, started the session by discussing INTERIX, a system that allows
UNIX application portability to Windows NT. For the second half of the
invited talk, Brett R. Halle, manager of CoreOS at Apple Computer,
talked about the Rhapsody architecture.
Walli first noted that INTERIX is the product formerly known as OpenNT.
Just this week the product was renamed to avoid confusion with
Microsoft products. Walli further noted, "This is not a talk about NT
being great."
Walli explained his "first law" of application portability: every
useful application outlives the platform on which it was developed and
deployed. There exist migration alternatives to rewriting applications
for Win32. For instance, one could use UNIX emulation, a common
portability library, the MS POSIX subsystem, or INTERIX. As a side
note, Walli exclaimed, "MS POSIX actually got a lot right.
Originally, the ttyname function just returned NULL! There were stupid
little things, but the signalling and forking were done right."
The problem with rewriting an application for Win32, however, is that the
cost of rewriting increases with the complexity of the application.
This led into Walli's discussion of the design goals of INTERIX:
-
Provide a complete porting and runtime environment for UNIX applications.
-
Provide true UNIX semantics for the system services.
-
Ensure that any changes to application code make it more, not less,
portable.
-
Maintain good performance.
-
Do not compromise the security of NT.
-
Integrate INTERIX cleanly into an NT world.
The first big step was implementing Berkeley sockets. This Walli called
"a big win." System V IPC was a big win, too. Other interesting tidbits
about INTERIX include:
-
ACLs that map to file permissions
-
no /etc/passwd or /etc/group
-
no superuser
Walli tried not to "mess with the plumbing," but the INTERIX
development team did have to make a /proc system for gdb.
Asked why INTERIX does not implement SUID capabilities, Walli explained
that this was because of the implications for the filesystem: if
INTERIX provided the interface, it would have to provide the complete
semantics. As an alternative, INTERIX created a SetUser environment.
Another audience member asked about memory requirements to run INTERIX.
Walli noted that NT itself requires more resources when moving from NT
3.51 to 4.0. INTERIX does not need much more space after getting enough
memory for NT. 32MB is sufficient.
The INTERIX group has ported SSH, but Walli's CEO got paranoid and said
"not in our Web site." SSH is available in Finland because of export
laws.
Walli concluded with his "second law" of application portability:
useful applications seldom live in a vacuum. After Berkeley sockets
were implemented, the Apache port required just hours. Early porting
experiences include 4.4BSD-Lite, GNU source, Perl 5, Apache, and xv.
See <https://www.interix.com/>.
For the second half of the session, Halle discussed the Rhapsody
architecture. He mainly summarized where Rhapsody came from and what it
includes. Rhapsody evolved from NextStep 4.2 (Mach 2.x, BSD 4.3,
OpenStep) and later became MacOS X (Mach 3.0, Carbon, Blue Box).
Portions of the code came from NetBSD, FreeBSD, and OpenBSD. CoreOS
provides preemption and protection; supports application environments,
processor, and hardware abstractions; and offers flexibility and
scalability. Rhapsody runs on Intel boxes as well as the PowerPC.
Rhapsody includes several filesystems, such as UFS, ISO 9660, and NFS.
However, it does not support hard links in HFS+ or UFS. The networking
is based on the BSD 4.4 TCP/IP stack and the ANS 700 Appletalk stack.
The audience then bombarded Halle with questions. Halle said that the
Yellow Box will be available for Win95 and WinNT.
An audience member noted that Halle hinted at the cathedral and bazaar
model. Apple could kickstart free software as far as GUIs go. When
asked if there are any rumblings about giving back to the community,
Halle replied that MKLinux is a prime example.
Another audience participant questioned Rhapsody's choice to omit some
of the most common tools found in UNIX. "If you want to leverage
applications, then don't make UNIX shell tools optional," said the
participant. Halle responded with an example where UNIX tools could
hurt acceptance by the general population. Your grandma or kid could be
using this operating system. Halle reasoned that full UNIX does not
make a lot of sense in all environments. You want the right minimal
set, but there should be a way to obtain the tools.
Rhapsody does include SSH and TCP wrappers. For more information on
Rhapsody, see Programming under Mach by Joseph Boykin et al.,
<https://www.mklinux.apple.com/>, or
<https://www.apple.com/>.
Random, quotable quotation:
Walli: "NT is the world's fattest microkernel with maybe 36 million
lines of code. Now that's a microkernel."
Succumbing to the Dark Side of the Force: The Internet as Seen from an
Adult Web Site
Daniel Klein, Erotika
Summary by David M. Edsall
The weather wasn't the only thing that was hot and steamy during the
USENIX conference in New Orleans. In one of the most popular invited
talks of the conference, Dan Klein entertained and educated a packed
auditorium with his discussion of what is necessary to carry the
world's oldest profession to the realm of the world's latest
technology.
Humorously introduced as a "purveyor of filth" and a "corrupter of our
nation's youth," Klein went on to show us he was anything but and why
everyone around him, including his mother, thinks it is OK. Klein has
given talks worldwide and is a freelance software developer. But he is
probably best known to the USENIX community as the tutorial
coordinator, a skill he used well in teaching all of us everything we
always wanted to know about Internet sex but were afraid to ask.
He began by reminding us of the stereotypes of the adult entertainment
industry. Images of Boogie Nights, dark alleys, and ladies of
the evening all come to mind. What we don't realize is that there are
"ordinary people" working in this industry as well. The
owner/administrators of two of the more popular Web sites, Persian
Kitty and Danni's Hard Drive, were each housewives before their online
business skyrocketed.
Klein then discussed the two tiers into which the industry can be
split. Tier 1 consists mainly of magazine publishers, filmmakers, and
prostitutes; while Tier 2 includes resellers such as Klein and phone
sex operators. In his explanation of the "product" purveyed by the
adult Web business he stated "If there is a philia, phobia, mania, or
pathia for it, it's out there. All the world's queer except thee and
me, and I'm not so sure about thee. It's not bad, just different." In
his opinion, "You can stick whatever you want wherever you want to
stick it so long as what you stick it in wants to get stuck."
How much money can be made from the online sex industry? Examples Klein
gave included Persian Kitty, which earned $900,000 in its first year
and now is pulling in $1.5 million selling only links. Another company,
UltraPics, has 14,000 members at $12 per member. Club Love had 20 times
more hits in one day than <www.whitehouse.gov> did in a month.
Klein himself is in a partnership of four and his share is up to $3,000
in a good month.
Where does one obtain the product? Some companies simply scan the
pictures from magazines in blatant violation of copyright law. Some use
Web mirroring to essentially steal their content from similar Web
sites. Others, such as Klein's company, download noncopyrighted images
from USENET newsgroups and repackage them. Klein's group also provides
original content, hiring models and photographers. This comes with its
own complications. Klein described the need
for qualified photographers, proper lighting, props, costumes, and a
director.
Running an adult Web site requires a variety of different technologies.
To conserve resources, it helps to use compressed images, and Klein is
convinced that the adult industry is one of the major influences
driving digital compression. It also helps to split the server load
among several machines. He elaborated on a number of ways to accomplish
this, including DNS round robin and front-end CGIs. In addition, good
programming is useful for automating your site, relieving you of the
tedious task of wading through USENET postings, dealing with the
administration of members, site updates, logging, reporting,
accounting, and checking for people who have shared their passwords, to
list a few examples.
Klein discussed, to the dismay of a few members of the audience, ways
in which you can boost the visibility of your Web site in top ten lists
and search engines. One method uses "click bots" to artificially
increase the number of hits your pages receive. Another well-known
trick is including a popular word, trademark, or brand in META tags for
broader exposure to search engines. Klein also described nastier
techniques, such as DNS cache poisoning attacks, misleading ads, and
domain name spoofing.
All of this does not come without a price. Klein described the
importance of making sure you abide by the law. He described his own
methods of USENET scanning as acting as a common carrier, ignorant of
whether or not the material is copyright protected, and making sure the
images they display carry no copyright notice. When photographing
models, his company goes to great lengths to make sure they are of
legal age and that they are videotaped to prevent lawsuits. He stressed
the importance of reporting all income and paying the IRS their due.
Most of all, Klein emphasized, "Don't tempt fate. If [they] look too
young, they probably are."
In the brief question-and-answer session which followed, one attendee
asked if the adult Web market has peaked. Klein responded by saying, "I
think it can still triple before that happens. It's like the Red Light
District in Amsterdam. Eventually, people stopped being really
interested, but it still is thriving after years." How will Klein deal
with it? "I'm honest and fair and not ashamed of it."
ADAPT: A Flexible Solution for Managing the DNS
Jim Reid and Anton Holleman,
ORIGIN TIS-INS
Summary by Jim Simpson
As more and more domains and networks come online, the amount of
DNS-involved management will only increase. Reid and Holleman
implemented a large-scale solution for DNS management for a network in
a production environment at Philips. Their presentation was given in
two parts: first, Holleman gave a demure explanation about the new DNS
architecture, and then Reid gave a sometimes amusing explanation of the
tool they developed and the problems they had with deploying it,
especially when describing an interaction with a particular NT client.
ORIGIN is a global IT service company created by the merger of two
parts of the Philips group. Philips is a large corporation, and with a
large, far-reaching, and networked corporation came the need for a
large DNS. They use a split DNS policy for security and billing
purposes, but the old-style DNS architecture in such an environment
grew to the point where zone transfers on the nameservers were failing
because of resource problems, making the DNS service erratic. This
created the need and desire to reimplement their DNS architecture; the
criteria that had to be met included:
-
The systems needed to be adaptable because Philips is diverse in
its DNS needs.
-
In providing service to Philips, ORIGIN cannot impose a solution; it
must fulfill a need.
-
Fast lookups are necessary, with minimal impact on the WAN.
-
The system must be scalable and robust yet simple, and cannot rely on
delegated nameservers being available 24 hours a day, 7 days a week.
They decided to create a centrally managed backbone. They used BSD/OS
as their platform because it was already used by large ISPs, the VM
subsystem is nameserver-friendly, and it's commercially supported,
though some system administrators still pined for Solaris. BIND8 was
their choice for software; despite poor documentation they were happy
with it during testing. Because BSD/OS runs on i386, there was no real
choice in hardware, but this also worked out well due to the low prices
associated with Intel-based machines and their ease of replacement.
Locations with good network connectivity to the global backbone were
identified, and nameservers were installed there in pairs for redundancy.
The actual setup consists of three parts: DNS server architecture, DNS
resolver architecture, and DNS contents architecture, parts of which
were designed to be centrally managed. In order to manage DNS and move
it along, they came up with a tool they ended up calling ADAPT. In
their scheme, ADAPT eliminates the need for local DNS
administration; the local admin controls local data, and the backbone
people control the "glue." If local admins update something, they send
it to the backbone using dnssend. If the data are good, they
are put into the DNS database.
There are still some unresolved problems, a few being:
-
A lot of sites still run brain-dead nameservers.
-
There are strange interactions with WINS.
-
There are bizarre nameserver configurations.
However, they met their design goals: costs are low and service is
stable.
The session ended with a myriad of questions: Q: How do you cope with
caching nameservers? A: Don't use dynamic templates; use notify
instead. Q: How do you deal with errors in the host file? A: Their
makefile creates backups of previous good data; servers are
installed in pairs. Q: How is that done, and what happens when one goes
down? A: They're installed manually, and people have to go in and
reboot them by hand.
Panel Discussion: Is a Clustered Computer In Your Future?
Moderator: Clem Cole, Digital Equipment Corporation
Panelists: Ron Minnich, Sarnoff Corporation; Fred Glover, Digital
Equipment Corporation; Bruce Walker, Tandem Computers, Inc.; Rick
Rashid, Microsoft, Inc.; Frank Ho, Hewlett-Packard Company.
Summary by Steve Hanson
This panel was interesting in that it presented clustered computing
from a number of different viewpoints. It was light on detailed
technical information, but made it clear that the panelists, all
representing major manufacturers or trends in clustered computing, were
looking at the topic from very different viewpoints and were solving
different problems. First each panelist presented information on the
clustering product produced or used by his company. These presentations
were followed by a roundtable discussion of clustering, including a
question-and-answer period.
Fred Glover explained that Digital introduced the TruCluster system in spring 1994.
TruCluster had limited capabilities at introduction, but now provides a
highly available and scalable computing environment. TruCluster supports up
to tens of nodes, which may be SMP systems. Normal single-host
applications run in this environment. TruCluster provides a
single-system view of the cluster. Standard Digital systems are used in
the cluster and are connected with a high-speed interconnect. A Cluster
Tool Kit adds distributed APIs so that applications may be coded to
better take advantage of the environment. Digital's emphasis is on
running commercial applications in this environment, essentially
providing a more scalable computing environment than is available in a
single machine as well as providing higher availability through
redundant resources.
The primary motivations for HP's cluster environments are to provide
higher availability and capacity while lowering the cost of management
and providing better disaster recovery. Frank Ho stated that 60% of
servers are now in mission-critical environments and that 24x7
operation is increasingly important. Downtime has a significant impact
on companies. HP's goal is to guarantee 99.999% uptime, which is
equivalent to about five minutes of downtime per year. Today's high
availability clusters guarantee 99.95% uptime and are generally taken
down only for major upgrades. HP clearly has emphasized high
availability, which is the primary thrust of its marketing to
commercial customers.
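As a quick sanity check on those figures (a back-of-the-envelope calculation,
not part of the talk), converting an availability percentage into allowed
downtime per year:

    # Availability-to-downtime arithmetic for the uptime levels quoted above.
    MINUTES_PER_YEAR = 365 * 24 * 60

    def downtime_minutes(availability_pct: float) -> float:
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    print(round(downtime_minutes(99.999), 1))  # ~5.3 minutes/year ("five nines")
    print(round(downtime_minutes(99.95), 0))   # ~263 minutes/year, about 4.4 hours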
Bruce Walker's talk was primarily about the comeback of servers and the
evolution of servers from single boxes to clusters. According to
Walker, clustering currently falls into categories of clusters
providing fail-over (high availability) or NUMA clusters, providing
higher cluster performance. Tandem claims that its full SSI (Single
System Image) clustering provides both parallel performance and
availability. Tandem currently ships 2-6 node clusters with up to 24
CPUs. Tandem is working with SCO on its operating system, which is
based on UnixWare.
The Microsoft agenda for clustered computing is to introduce clustered
computing technology in stages. Clustered computing currently is a very
small portion of the marketplace. According to Rick Rashid, the current
thrust of Microsoft's strategy is to work on high availability
solutions (which are available currently in NT) and to introduce
scalability of clustering in 1999.
Ron Minnich spoke about the implementation of high-powered computing
clusters on commodity PC systems. There is a history of implementing
high-powered computer clusters on commodity systems. FERMILAB and other
high-energy physics sites have for years done the bulk of their
computing on clusters of small UNIX workstations. The availability of
free, stable operating systems on very inexpensive hardware has allowed
the design of very high powered computing clusters at comparatively low
prices. Minnich made the disputable statement that PC/Open UNIX
computing is about 10 times as reliable as proprietary UNIX systems and
100 times as reliable as NT. Having formerly been an administrator for
large computer clusters on proprietary UNIX systems at FERMILAB, I
somehow doubt that, because the UNIX clusters we were using almost
always failed due to hardware failure, not OS failure. I find it
difficult to believe that $1,000 PC systems are more than 10 times more
reliable than entry-level UNIX desktops. However, I think the point is
well taken that this is a means of building reliable computer engines
of very high power at a very low price. Red Hat currently offers a
$29.95 Beowulf cluster CD that includes clustering software for Linux.
A question-and-answer period followed the presentations. The questions
indicated that many organizations in the real world are asking for more
from the clustering vendors than they are currently providing.
Questions were asked about a means of developing and debugging software
for a high-availability environment and about how to establish a
high-availability network across a large geographic area. The response
of the panel members seemed to indicate that they hadn't gotten that
far yet.
Other discussions involved recommendations of other approaches to
clustering, including use of Platform Computing's LSF software product
<https://www.platform.com> as well as the University of Wisconsin's Condor
software, which excels for use in environments where unused hours of
CPU on desktop systems can be harvested for serious compute cycles
<https://www.cs.wisc.edu/condor>.
Berry Kercheval asked whether it is important for a cluster to provide
a single system image. As an example he mentioned Condor computing,
which is not a single system image design, but provides an environment
that looks like a single system to the application software. He also
made the point that SSI clusters are likely to be more viable because
they are a simpler mechanism for replacing mainframes in a commercial
environment.
The Future of the Internet
John S. Quarterman, Matrix Information and Directory Services (MIDS)
Summary by David M. Edsall
Death of the Internet! Details at eleven! That may be the conclusion
you would have drawn had you read Bob Metcalfe's op-ed piece in the
December 4, 1995, issue of Infoworld
<https://www.infoworld.com> where he predicted the collapse of
the Internet in 1996. Fortunately, his dire predictions did not come
true. In his talk in New Orleans, Internet statistician John Quarterman
showed us why.
Quarterman is president of Matrix Information and Directory Services
(MIDS) in Austin, Texas, a company that studies the demographics and
performance of the Internet and other networks worldwide. Drawing upon
the large collection of resources and products available from MIDS,
Quarterman educated an attentive audience on the past and current
growth of the Internet and other networks before taking the risk of
making predictions of his own. (The slides presented by Quarterman,
including the graphs, are available on both the USENIX '98 conference
CD and at <https://www.mids.org/conferences/usenix>.)
Quarterman began his talk by discussing the current number of Internet
hosts worldwide. Not surprisingly, most of the hosts are located in the
industrialized countries, with a dense concentration in large urban
centers. What is exciting is the number of hosts popping up in some of
the more remote areas of the world. As Quarterman said,
"Geographically, the Internet is not a small thing anymore."
Next, he discussed the history of the growth of computer networking
from the humble beginnings of the ARPANET with two nodes at UCLA and
SRI, through the split of the ARPANET in 1983 and the subsequent
creation of the NSFNET, to the eventual domination of the Internet over
all of the other networks. Quarterman showed how many of the other
networks (UUCP, BITNET, and FIDONET) have reached a plateau and are now
declining in use while the Internet continues to increase in number of
hosts at an exponential rate. He similarly showed the parallel growth
in the number of Matrix users (users who use email regardless of the
underlying network protocol used for its transmission), with the Matrix
users increasing more rapidly due to the multiplicity of networks
available in the 1980s.
Quarterman next showed the growth of the Internet in countries
worldwide. As expected, the United States currently leads the rest of
the world in total number of hosts and has a growth rate similar to
that of the Internet as a whole. He attributed the slow growth of the
number of Japanese hosts to the difficulty that Japanese ISPs had in
obtaining licenses, a restriction that was eased in 1994, leading to
the large spurt in Japan's growth rate seen now.
Moving on to the present day, Quarterman displayed an interesting plot
that reflects the latency vs. time of various hosts on the Internet
<https://www3.mids.org/weather/pingstats/big/data.html>. It was this graphic that persuaded Bob Metcalfe that
large latencies do not remain constant, and hence there will be no
global breakdown. (But Metcalfe still maintains that there could be a
local crash; this is a much less controversial position, because few
people would disagree that there are often localized problems in the
Internet.) Even more interesting was an animated image of a rotating
globe with growing and shrinking circles
<https://www.mids.org/weather> representing latencies in various
parts of the world during the course of a day. This image showed that
the latencies undulate on a daily basis much like the circadian rhythm
obeyed by the human body. The image also shows different patterns,
depending on which country you study. Latencies appear to increase at
noon in Japan and decrease in the afternoon in Spain.
With the past and the present behind, what lies in the future for the
Internet? Using a deliberately naive extrapolation of the data
presented, Quarterman predicted that, by the year 2005, the number of
hosts in the world will nearly equal the world's population. He also
predicted that the US will continue to have the dominant share of the
world's Internet hosts, but will eventually reach a saturation point.
But it is difficult to say where any bends in the curve will come.
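For flavor, here is what such a deliberately naive extrapolation looks like;
the host counts below are round illustrative numbers, not MIDS data.

    # Fit exponential growth to two assumed data points and project forward.
    import math

    def project(year0, hosts0, year1, hosts1, target_year):
        rate = math.log(hosts1 / hosts0) / (year1 - year0)   # continuous growth rate
        return hosts0 * math.exp(rate * (target_year - year0))

    # e.g. roughly 16M hosts in 1997 growing to roughly 30M in 1998 (assumed figures):
    print(int(project(1997, 16e6, 1998, 30e6, 2005)))        # on the order of billions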
His confidence in his projected growth of the Internet is based partly
on comparisons he has made between the Internet and other technologies
of the past. He presented a plot of the number of telephones,
televisions, and radios in the United States vs. time alongside the
growth of the world's population and the growth of the Internet. The
growth of the Internet has been much faster than the growth of any of
these industries. All three of the older technologies eventually
levelled off and paralleled the world's population growth, whereas the
Internet shows no signs of doing so soon. He has yet to find any
technology whose growth compares with that of the Internet. At this
point, Quarterman asked the audience to suggest other technologies
whose growth could be compared to the Internet. Ideas, both serious and
humorous, included the sales of Viagra, production of CPU chips,
automobile purchases, and size of the UNIX kernel. A grateful
Quarterman stated that no other audience had ever given him so many suggestions.
Quarterman finished by discussing possible future problems that may
hinder the growth and use of the Internet. In a scatter plot of the
number of hosts per country per capita and per gross national product,
he showed the audience that the growth of the Internet is also
dependent on economic and political conditions around the world.
Countries with large numbers of hosts tend to have higher standards of
living and less internal strife. He also stressed the importance of the
social behavior of those using the Internet. In a long discussion of
spam, Quarterman revealed that he prefers that the governments of the
world adopt a hands-off approach, leaving the policing of the net to
those who have the ability to control mail relays. You can find more
information at <https://www.mids.org/spam/>.
Will the Internet eventually collapse and fold? Stay tuned.
FREENIX TRACK
Machine-Independent DMA Framework for NetBSD
Jason R. Thorpe, NASA Ames Research Center
Summary by Keith Parkins
Jason Thorpe spoke about the virtues of and reasons for developing a
machine-independent DMA framework in the NetBSD kernel. The inspiration
seems fairly obvious: if you were involved with porting a kernel to
different architectures, wouldn't you want to keep one DMA mapping
abstraction rather than one per architecture? Given the fact that many
modern machines and machine families share common architectural
features, the implementation of a single abstraction seemed the way to
go.
Thorpe walked through the different DMA mapping scenarios, bus details,
and design considerations before unveiling the bus access interface in
NetBSD. A couple of questions were asked during the session, but most
of the answers can be found in the paper.
Thorpe seems to have followed the philosophy of spending his time on
sharpening his axe before chopping at the tree. He noted that while the
implementation of the interface took a long time, it worked on
architectures with varying cache implementations such as mips and alpha
without hacking the code.
Examples of the front end of the interface can be found at
<ftp://ftp.netbsd.org/pub/NetBSD/NetBSD-current/src/sys/dev/> in the pci, isa, eisa, and tc
folders. For examples of the back end, look in the ic folder.
Linux Operating System
Theodore Ts'o, MIT; Linus Torvalds, Transmeta
Summary by Keith Parkins
They had to take down the walls that separated the Mardi Gras room into
three smaller cells for the Linux state of the union talk. Instead of
being flanked off into seats that would put them out of visual range,
audience members chose to position themselves against the back wall for
a closer view. This was not what the speakers had expected because they
had initially envisioned the talk being a BOF.
Theodore Ts'o began the aforementioned "state of the union." He started
out by citing a figure gathered by Bob Young, CEO of Red Hat Linux, 18
months earlier. The figure in question was the number of Linux users at
that time, which is not an easy figure to derive because one purchased
copy of Linux can sit on any number of machines and people are also
free to download it. The best results derived from various metrics
showed users at that time to number between three and five million.
Today, similar metrics show this number at six to seven million,
although Corel, in its announcement of a Linux port of Office Suite,
claimed the number to be eight million.
Because of these rising figures and subsequent rising interest by
commercial developers, Ts'o noted that the most exciting work in the
Linux universe was taking place in userland, not in the kernel, as had
historically been the case. Ts'o noted the development of the rival
desktop/office environments, KDE and GNOME, and new administration
tools that make it easy for "mere mortals" to maintain their systems.
Ts'o also talked briefly about the Linux Software Base, an attempt to
keep a standard set of shared libraries and filesystem layouts in every
Linux distribution so that developers don't lose interest because of
the burden of porting their software to each distribution of Linux.
Ts'o then spoke about the ext2 filesystem. The first thing he
emphasized was that although most distributions use ext2fs, it is not
the only filesystem used with Linux. He touched upon ideas such as
using b-trees everywhere to show the other work out there. While work
continues on ext2fs, Ts'o stated that their number one priority is to
ensure that any new versions do not endanger stored data. This means
extensive testing before placing the filesystem in a stable release,
probably not until 2.3 or 2.4. Features to come include metadata
logging to speed up recovery, storing capabilities in the
filesystem (a POSIX security feature that divides the setuid bit into
discrete capabilities), and a logical volume manager.
When Linus Torvalds took the floor, he too expressed his hope that the
presentation would be a BOF. In keeping with this hope, he kept his
portion of the state of the Linux union brief before opening the floor
to questions. Instead of focusing on what people can expect in future
releases, he concentrated on two differences between 2.2 and earlier
releases. He quickly noted that the drop in memory prices had
encouraged the Linux kernel developers to exploit the fact that many
machines have a lot more memory than they used to. He elaborated that
the kernel will still be frugal with memory resources, but that it
seemed poor form to not exploit this trend. He then noted that although
earlier releases had been developed for Intel and Alpha machines, 2.2
will add SPARC, PowerPC, ARM, and others to the list. As Torvalds put
it, "anything that is remotely interesting, and some [things] that aren't"
will be supported.
There were many good questions and answers when the floor was opened.
When asked if Torvalds and company would make it easier for other
flavors of UNIX to emulate Linux, Torvalds replied that although he was
not trying to make matters difficult for others, he was not going to
detract time from making clean and stable kernels to make Linux easier
to emulate. He also noted that the biggest stumbling block for others
in this task would be Linux's implementation of threads.
On a question concerning his stance on licensing issues, Torvalds
stated simply that he is developing kernels because it was what he
enjoys doing. He went on to state that he personally does not care what
people do with his end product or what they call it.
When asked about the Linux kernel running on top of Mach, Torvalds
stated that he feels the Mach kernel is a poor product and that he has
not heard a good argument for placing Linux on top of it. At one point,
Torvalds had thought about considering the Mach kernel as just another
hardware port. He said that he later changed his mind, when he saw that
the kernel running natively on an architecture runs much faster. This
initial question led to a question as to whether Linux will become a
true distributed operating system. Torvalds stated that he feels it
does not make sense to do all the distribution at such a low level and
that it makes more sense to make hard decisions at a higher level with some
kernel support.
The closing comment from the floor was a thank-you to Torvalds for
bringing the world Linux. Torvalds graciously responded by saying that
while he is happy to accept the thanks, he is no longer as involved in
the coding process, and the thanks should go out to all the people
involved in writing the kernel and applications as well as to himself.
Panel Discussion: Whither IPSec?
Moderator: Angelos D. Keromytis, University of Pennsylvania
Panelists: Hugh Daniel, Linux FreeS/WAN Project; John Ioannidis,
AT&T Labs Research; Theodore Ts'o, MIT; Craig Metz, Naval
Research Laboratory.
Summary by Kevin Fu
Angelos Keromytis moderated a lively discussion on IPSec's past,
present, and future. In particular, the panelists addressed problems of
IPSec deployment. The panel included four individuals intimately
involved with IPSec. IPSec is mandatory in all IPv6 implementations.
A jolly John Ioannidis claims to have been involved with IPSec "since
the fourteenth century." He became involved with IPSec before the name
IPSec existed. As a graduate student, Ioannidis spoke with other
security folks at an IETF BOF in 1992. Later, Matt Blaze, Phil Karn,
and Ioannidis talked about an encapsulating protocol. Finally, to win a
bet, Ioannidis distributed a floppy disk with a better implementation
of swIPe, a network-layer security protocol for the IP protocol suite.
Theodore Ts'o pioneered the ext2 filesystem for Linux, works on
Kerberos applications, and currently is the IPSec working group chair
at the IETF. He gave a short "this is your life" history of IPSec.
After the working group formed in late 1993, arguments broke out over
packet formats. However, the hard part became the management of all the
keys. Two rival solutions appeared: Photuris by Phil Karn and SKIP by
Sun Microsystems. SKIP had a zero round-trip setup time but made some
assumptions that were probably not applicable for wide-scale deployment
on the Internet. Then ISAKMP/Oakley developed as a third camp and was
adopted by the NSA. Ironically, the ISAKMP protocol was designed to be
modular, but the implementations are not so modular. Microsoft did not
implement modularity in order to make the software more easily
exportable. Ts'o describes the current status of IPSec as the "labor"
phase for key management and procedural administrivia. Looking to the
future, Ts'o notes there is some interest in multicast. But he worries
about the trust model of multicast: if 1,000 friends share a
secret, it can't be all that secret. Ts'o also stresses the difference
between host- and user-level security. Are we authenticating the host
or user? Will IPSec be used to secure VPNs and routing or the user?
Right now the answer is VPNs.
Hugh Daniel is the "mis-manager" for a free Linux version of the Secure
Wide Area Network (FreeS/WAN). Because of the lords in DC, foreigners
coded all of FreeS/WAN outside the US. There is a key management daemon
called Pluto for ISAKMP Oakley and a Linux kernel module for an IPSec
framework.
Craig Metz then gave a short slide presentation on NRL's IPSec
implementation. Conference attendees should note a change in slide 7 on
page 120 of the FREENIX proceedings: it now supports FreeBSD and
OpenBSD.
Keromytis opened with a question about deployment. People went through
lots of trouble to get IPSec standardized. What are the likely main
problems in deployment and use of IPSec? Metz answered that getting the
code in the hands of potential users is the hardest part. IPSec does
not have to be the greatest, but it has to be in the hands of the
users. IPSec does not equal VPN. IPSec can do more and solves
real-world problems. Ioannidis commented that the problem with IPSec is
that some people want perfection before releasing code. If only three
people have IPSec, it is not too useful. This is just like the fax:
you need a pool of users before IPSec becomes useful. Key
management is also a hard problem.
The next question involved patches. Are patches accepted from people in
the US? Daniel replied that you can whine on the bug mailing list, but
you cannot say what line the bug is on or what the bug is. Linux
FreeS/WAN will not take patches from US citizens. Ts'o explained that
MIT develops code in the US, but does not give permission to export.
When Ts'o receives a patch from Italy, he is not able to tell whether it
was exported under a legal export license. Besides, no one would commit such a
"violent, despicable act."
Metz was asked why the government is interested in IPSec. Metz answered
that many people in the government want the ability to buy IPSec off
the shelf. A lot of the custom-built stuff for the government leaves
something to be desired. Metz further explained, "If we can help get
IPSec on the street, the government can get higher quality network
security software for a lower cost."
Ts'o said that Microsoft NT 5.0 is shipping with IPSec. Microsoft wants
IPSec more than VPNs. Interoperability with the rest of the world will
be interesting. Microsoft has a lot of good people; UNIX people should
hurry up.
An audience member asked about the following situation: let a packet
require encryption to be sent over a link. Is there a defined ICMP
packet that says "oops, can't get encryption on this link"? What is the
kernel supposed to return when this happens? Daniel reported that
this is not properly defined yet. Metz expanded that there is a slight
flame war now. Some believe this would allow an active attacker to
discover encryption policies. Should such a mechanism exist? The answer
is likely to be "maybe."
Ts'o responded, "Think SYN flood. Renegotiating allows for denial of
service. This is not as simple as you might think." Daniel
substantiated this with figures for bulk encryption. When you deal with
PKCS and elliptic curves, encryption can bring a 500MHz Alpha to its
knees. It could take five minutes! Metz mentioned a hardware solution
for things such as modular exponentiation.
Dummynet and Forward Error Correction
Luigi Rizzo, Dip. di Ingegneria dell'Informazione
Università di Pisa
Summary by Jason Peel
Luigi Rizzo took the floor and immediately asked the audience: "how
[may we] study the behavior of a protocol or an application in a real
network?" His answer? A flexible and cost-effective evaluation tool
dubbed "dummynet," designed to help developers study the behavior of
software on networks.
When scrutinizing a particular protocol or application, simulation is
often not feasible: it would require building a simulated model
of a system whose features may not even be known. Alternatively,
experiments on an actual network can be problematic as well, perhaps due to
the proper equipment not being available, or difficulties in
configuration. The solution presented in dummynet combines the
advantages of both model simulation and actual network test beds.
With a real, physical network as a factor in the communication of
multiple processes, traffic can be affected through propagation delays,
queuing delays (due to limited-capacity network channels), packet
losses (caused by bounded-size queues, and possibly noise), and
reordering (because of multiple paths between hosts). These phenomena
are replicated in dummynet by passing packets coming in to or going out
of the protocol stack through size-, rate-, and delay-configurable
queues that simulate network latency, dropped packets, and packet
reordering. Dummynet has been implemented as an extension of the
ipfw firewall code so as to take advantage of its
packet-filtering capabilities and, as such, allows configuration that
the developer may already be acquainted with.
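To make the queue model concrete, here is a minimal sketch (in Python, and
purely illustrative; it is not dummynet's kernel code or ipfw syntax, and all
names are invented for the example) of a single dummynet-style pipe that
imposes a bandwidth limit, a fixed propagation delay, random loss, and a
bounded drop-tail queue on the packets offered to it:

    import random
    from collections import deque

    class Pipe:
        """Toy dummynet-style pipe: bandwidth limit, fixed delay, random
        loss, and a bounded drop-tail queue.  Illustrative only; the real
        dummynet does this inside the FreeBSD kernel via ipfw."""

        def __init__(self, bandwidth_bps, delay_s, loss_rate, queue_limit):
            self.bandwidth_bps = bandwidth_bps  # simulated link capacity
            self.delay_s = delay_s              # propagation delay
            self.loss_rate = loss_rate          # probability of random loss
            self.queue_limit = queue_limit      # max packets waiting for the link
            self.queue = deque()                # finish times of queued packets
            self.link_free_at = 0.0             # when the simulated link is idle

        def submit(self, now, packet_len):
            """Offer a packet_len-byte packet at time `now` (seconds); return
            the time it would reach the far end, or None if it is dropped."""
            while self.queue and self.queue[0] <= now:
                self.queue.popleft()            # these have left the queue
            if random.random() < self.loss_rate:
                return None                     # random loss (e.g., noise)
            if len(self.queue) >= self.queue_limit:
                return None                     # bounded queue: drop-tail
            start = max(now, self.link_free_at) # wait behind earlier packets
            self.link_free_at = start + packet_len * 8 / self.bandwidth_bps
            self.queue.append(self.link_free_at)
            return self.link_free_at + self.delay_s

    # Example: a 64 kbit/s link with 100 ms propagation delay and 1% loss.
    pipe = Pipe(bandwidth_bps=64000, delay_s=0.100, loss_rate=0.01, queue_limit=50)
    print(pipe.submit(0.0, 1500))   # about 0.2875 s, unless the packet is dropped

Passing a protocol's traffic through two such pipes, one per direction, is
essentially what dummynet does to let a developer study protocol behavior on a
single machine.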
The other tool Rizzo presented is an implementation of a particular
class of algorithm known as an erasure code. Erasure codes such as his
are used in a technique called Forward Error Correction (FEC) as an
attempt to eliminate the need for rebroadcasts caused by errors in
transmission. In certain situations, particularly asymmetric
communication channels or multicast applications with a large number of
receivers, FEC can be used to encode data redundantly, such that the
source data can be successfully reconstructed even if packets are lost.
As useful as this may seem, however, FEC has only rarely been
implemented in network protocols due to the perceived high
computational cost of encoding and decoding. With his implementation,
Rizzo demonstrated how FEC can be taken advantage of on commonly
available hardware without a significant performance hit.
To develop this linear algebra erasure code, known technically as a
"Vandermonde" code, Rizzo faced several obstacles. First, the
implementation of such a code requires highly precise arithmetic; this
was solved by splitting packets into smaller units. Second, operand
expansion results in a large number of bits; by performing computations
in a "Finite" or "Galois" field, this too was overcome. Lastly, a
systematic code (one in which the encoded packets include a
verbatim copy of the source packets) may at times be desired so
that no decoding effort is necessary in the absence of errors. By using
various algebraic manipulations, Rizzo was able to obtain a
systematic code.
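As a rough illustration of the underlying idea (not Rizzo's code, which uses
carefully tuned GF(2^8) routines and a systematic construction), the following
Python sketch encodes k source bytes into n > k coded bytes using a
Vandermonde matrix over GF(2^8) and recovers the originals from any k of the
coded bytes; a real encoder applies the same matrix to every byte position of
every packet:

    # Minimal Vandermonde erasure code over GF(2^8); illustrative only.
    EXP, LOG = [0] * 512, [0] * 256
    x = 1
    for i in range(255):                   # build antilog/log tables
        EXP[i], LOG[x] = x, i
        x <<= 1
        if x & 0x100:
            x ^= 0x11D                     # reduce by a primitive polynomial
    for i in range(255, 512):
        EXP[i] = EXP[i - 255]

    def gmul(a, b):                        # multiplication in GF(2^8)
        return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

    def ginv(a):                           # multiplicative inverse, a != 0
        return EXP[255 - LOG[a]]

    def vrow(x, k):                        # Vandermonde row [1, x, ..., x^(k-1)]
        row, p = [], 1
        for _ in range(k):
            row.append(p)
            p = gmul(p, x)
        return row

    def encode(data, n):
        """Encode k source bytes into n coded bytes; coded byte j uses x = j+1."""
        k = len(data)
        out = []
        for j in range(n):
            acc = 0
            for c, d in zip(vrow(j + 1, k), data):
                acc ^= gmul(c, d)          # addition in GF(2^8) is XOR
            out.append(acc)
        return out

    def decode(received, k):
        """Recover the k source bytes from any k (index, value) pairs."""
        A = [vrow(j + 1, k) + [y] for j, y in received]   # augmented matrix
        for col in range(k):               # Gauss-Jordan elimination over GF(2^8)
            piv = next(r for r in range(col, k) if A[r][col])
            A[col], A[piv] = A[piv], A[col]
            inv = ginv(A[col][col])
            A[col] = [gmul(inv, v) for v in A[col]]
            for r in range(k):
                if r != col and A[r][col]:
                    f = A[r][col]
                    A[r] = [a ^ gmul(f, b) for a, b in zip(A[r], A[col])]
        return [A[i][k] for i in range(k)]

    source = [0x55, 0x13, 0xA7, 0x42]                 # k = 4 source bytes
    coded = encode(source, 6)                         # n = 6 coded bytes
    keep = [(1, coded[1]), (2, coded[2]), (4, coded[4]), (5, coded[5])]
    assert decode(keep, 4) == source                  # any 4 of the 6 suffice

Because any k rows of the Vandermonde matrix are linearly independent, any k
surviving coded packets are enough to rebuild the k source packets, which is
exactly the property FEC exploits to avoid retransmissions.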
Nearing the close of his session, Rizzo utilized a FreeBSD-powered
palmtop to demonstrate the ease of use with which dummynet can simulate
various network scenarios. Then he used this virtual test bed network
as he showed RMDP (a multicast file-transfer application making use of
FEC) in action. The crowd was enthused, and Rizzo was let off the hook
with only one question to answer, "Where can we get this?"
<https://www.iet.unipi.it/~luigi/>
Arla: A Really Likable AFS-Client
Johan Danielsson, Parallelldatorcentrum KTH; Assar Westerlund, Swedish
Institute of Computer Science
Summary by Gus Shaffer
Johan and Assar gave a very exciting presentation about their new,
free, and portable AFS client, Arla, which is based on the Andrew File
System version 3.
A major part of their presentation explained the difference in
structure between Transarc's AFS and Arla. Arla exports most of its
internals to a highly portable user-land daemon, as opposed to
Transarc's massive kernel module. The presenters admitted that this
design does raise some performance issues, but it makes porting much
faster: they already support six platforms, with four
more on the way.
Johan and Assar also mentioned that students at the University of
Michigan's Center for Information Technology Integration
<https://www.citi.umich.edu> have incorporated disconnected client
modifications originally written for AFS, and they hope to eventually
incorporate this code into the main Arla source tree.
The most exciting announcement of the presentation concerned the other
half of the client-server architecture: the developers have an
almost-working fileserver! They still need to add the database servers
(volume server and protection server) to have a free, fully functional
AFS server.
The presentation drew questions from such noted people as CITI's Jim
Rees <rees@citi.umich.edu> and produced tongue-in-cheek comments
along the lines of, "Production use means someone bitches wildly if it
doesn't work."
<https://www.stacken.kth.se/projekt/arla/index.html>
ISC DHCP Distribution
Ted Lemon, Internet Software Consortium
Summary by Branson Matheson
Ted Lemon gave a good talk regarding his Dynamic Host Configuration
Protocol implementation. About 50 people attended the discussion. He
described the benefits of a DHCP server, which include plug-and-play
configuration for users, easier address assignment for network and
system administrators, and prevention of address conflicts. He also
discussed potential improvements for version 3, including
authentication, Dynamic DNS, and a failover protocol. Lemon also
mentioned that his implementation is part of the ISC, which is
assisting with standards work and some financing. The question-and-answer
session was full of good comments and discussion. There was quite a bit
of talk about the different ways people have implemented a dynamic DNS
setup and about how to identify the client requesting an IP address.
Heimdal: An independent implementation of Kerberos 5
Johan Danielsson, Parallelldatorcentrum KTH; Assar Westerlund, Swedish
Institute of Computer Science
Summary by Branson Matheson
Johan Danielsson and Assar Westerlund traveled from Sweden to discuss
their free implementation of Kerberos 5. Heimdal, which is
named after the watchman on the bridge to Asgard, was developed
independently and is internationally available. They described
Kerberos/Heimdal in general and then also discussed some of the
additions and improvements that they had made including 3DES, secure
X11, IPv6, and firewall support. They also discussed some of the
problems they had with the implementation and how they solved them,
including how to get secure packets across a firewall. During the
question-and-answer session, there was much discussion of S/Key
and OPIE, encryption, and proxy authentication. Although the language
barrier sometimes seemed to be a factor, this discussion went well and
was well received.
Samba as WNT Domain Controller
John Blair, University of Alabama
Summary by Steve Hanson
John Blair is the primary author of the excellent documentation for the
freeware Samba package. I find this interesting because the Samba
documentation is not only a fine example of how good the documentation
for a freeware package can be, but is also one of the best sources for
information on how Microsoft's Win95/NT file sharing works.
Unfortunately, Blair was so busy working on documentation for the Samba
project and his new book on Samba (Samba: Integrating UNIX and
Windows) that he did not complete a paper for the conference.
Therefore the talk didn't relate to a paper, but was a more general
discussion of Samba's progress toward having domain controller
capabilities, as well as Samba development in general. This was a very
interesting talk, more as a general viewpoint on the motivations of the
Samba project than anything else.
Due to the extra time created by the missing first presentation, Blair
chose to have a question-and-answer period before beginning his talk.
Many questions were asked, primarily about trust relationships with NT
servers and using Samba as a domain controller. Blair responded that
although the code for domain control exists in the current Samba
release, it isn't considered production quality. Many sites are
successfully using the current code, however. In addition to some
details on how to enable the domain controller code in Samba, Blair
presented some information on the difficulty of creating
interoperability between the NT world and UNIX. This includes mapping
32-bit IDs to UNIX IDs, having to develop by reverse engineering, bugs
in NT that cause unpredictable behavior, etc. The inevitable discussion
of NT security was interwoven into the talk, particularly in regard to
new potential security issues and possible exploits that were
discovered while reverse engineering the domain controller protocols.
The balance of Blair's talk was devoted to the Samba project in
general. Several issues about the need for Samba were raised. Although
there are a number of commercial software packages allowing SMB file
sharing from UNIX, Samba holds a unique place in the world because the
code is freely available and well documented. Samba code and
documentation are the best window into determining how Microsoft
networking actually works. In some cases bugs and potential exploits
have been discovered, some of which Microsoft has fixed. Public
scrutiny of the NT world is possible only through projects such as
this. It seems unlikely that corporate America will learn to say no to
the Windows juggernaut, but at least this sort of work stands some
chance of opening the Windows world of networking to review.
Samba is also important because the UNIX platforms on which it runs are
more scalable than the current NT platforms. The release dates for NT
5.0 and Merced processors seem to be continually moving over the
horizon, so interoperable UNIX platforms at least offer a scalable,
stable place to host services in the meantime. The work being done here
also raises the interesting prospect of being able to administer the NT
world from the UNIX platform.
Further information on Samba is available at
<https://samba.anu.edu.au/samba>.
Using FreeBSD for a Console Server
Branson Matheson, Ferguson Enterprises, Inc.
Summary by Branson Matheson
Branson Matheson gave a discussion of his implementation of a console
server using FreeBSD. He described the hardware and software
requirements and the problems associated with the installation of a
console server. The problems included security concerns, layout, and
planning. He went into the specifics of the software and hardware
implementation. There was some good discussion about other
implementations during the question-and-answer session. Security seemed
to be the central theme of the questions: maintaining the security of
the consoles while giving the system administrator the necessary
privileges and functionality.
CLOSING KEYNOTE SPEECH
Reconfigured Self as Basis for Humanistic Intelligence
Steve Mann, University of Toronto
Summary by Jim Simpson
As we spin and hurtle ourselves faster toward the future, we find that the
tools helping us get there can now be used against us in myriad ways.
Steve Mann offered a very sharp, pointed, and humorous presentation
about taking technology back, through Humanistic Intelligence.
Humanistic Intelligence is the interaction between a human and a
computer, and encapsulates the two into a single functioning unit. The
ideal is for the computer to augment the reality of the human working
with it. It sounds as though that goal is well on its way; Mann typed most
of his thesis while standing in line, noting his primary task was to
stand and wait in a line, but that WearComp, his implementation of
Humanistic Intelligence, allowed for a secondary task where he could be
creative. WearComp consists of a host computer on the person's body,
a pair of customized glasses, and connectivity; specifics about WearComp
are at <https://www.wearcomp.org/wearhow>. Note that WearComp runs
on an OS, and not a virus. Despite large evolutionary strides, Mann
commented about the setup, "The problem with wires is you get tangled
up in them."
A few of the more interesting scenarios and uses of WearComp include
visual mediation. Say you don't wish to see a particular ad. You have
your image processor map a texture over it. Imagine if you were about
to be mugged on the street. You could simply beam back an image of the
perpetrator. Finally, and perhaps most important, people could
better understand each other. Mann illustrated this
with a story about being late. Whoever is waiting could simply see the
world through your eyes and, instead of being suspicious or upset, know
that your explanation is genuine.
We then were treated to an excerpt from a video documentary Mann did,
called Shooting Back. It demonstrated the modern double standard
we're held to; as Mann asked about surveillance cameras in everyday
stores, he was bounced from person to person. Mann turned the tables,
and when those persons were asked how they felt being videotaped, they
had the same reaction that prompted Mann's deployment of a video
camera. What's more interesting is that while pretending to initiate
recording of the other party, he'd been surreptitiously recording
everything with WearComp.
Toward the end of the session, the chair began to check his watch
nervously; it seemed almost awkward when the chair had to tell Mann the
time because Mann was well aware of the time: it was happily
ticking away in the form of an xclock on the other side of his glasses.
Because this is a working product, Mann answered a few questions about
WearComp and how it has fared: What operating system do you use? Linux
2.0, RedHat, but Mann has written custom components such as DSP code. Has the system
ever cut out? Yes, there are dark crashes. The most common thing is for
the battery to die, but there is a 30-minute warning system. You don't
want to be in mediated reality, walk across the street, and then have the
system cut out the moment a truck is barreling toward you. Can you show
us what you're seeing? No, the video output is in a special format that
won't hook up to a standard VGA projector.
Free Stuff
Opinion by Peter H. Salus
[Editor's Note: While Peter is director pro tem of the Tcl/Tk
Consortium, he is not an employee of any of the companies mentioned in
this report.]
The Association held its June meeting in hot, steamy New Orleans. I
emerged from the hotel into the humid heat only twice in four days.
Inside the hotel it was cooler and there were lots of folks to talk to.
However, for the first time in a dozen years, I hardly attended any
mainline technical papers: I went to the parallel FREENIX track. I
learned about NetBSD, FreeBSD, Samba, and OpenBSD. I went to the
"Historical UNIX" session (it's 20 years since Dennis Ritchie and Steve
Johnson ported V7 to the Interdata 8 and Richard Miller and Juris
Reinfelds ported it to the Interdata 7) and to the 90-minute history
BOF that extended to nearly three hours. And I was present at the
awards, the Tcl BOF, Linus Torvalds's talk, and James Randi's
entertaining, but largely irrelevant, keynote.
There was also a session on Eric Raymond's "The Cathedral and the
Bazaar," which was largely a love-in conducted by Eric and Kirk
McKusick until the very last minutes, which were occupied by a lengthy
flame from Rob Kolstad. More heat was radiated than light was shed.
If you know me, you will see the connecting motif: ever since I saw
UNIX in the late 1970s, I have been interested in the way systems
develop: in Raymond's terms, I'm much more a bazaar person
than a cathedral architect. (And remember that although treasures can
be found in a bazaar, Microsoft products are misshapen in a cathedral
in Washington.)
UNIX was the first operating system to be successfully ported. And it
was ported to two different machines (an Interdata 7 and an Interdata
8; later Perkin-Elmer machines) virtually simultaneously and independently
by teams half the planet apart. Not only that, but V7 contained
awk, lint, make, uucp (from Bell
Labs); Bill Joy's changes in data block size; Ed Gould's movement of
the buffers out of kernel space; John Lions's (New South Wales!) new
procedure for directory pathnames; Bruce Borden's symorder
program; etc., etc. A bazaar stretching from Vienna through the US to
Australia. I have outlined the contributions to the development of UNIX
in several places, but the important thing is to recognize the
bazaarlike activity in the 1970s and 1980s. With Linux, we progress
into the 1990s.
NetBSD, OpenBSD, FreeBSD, BSDI, SVR5, and the various Linuxes are the
result of this bazaar, with AT&T, Bell Labs, Western Electric, UNIX
International, X/Open, OSF, and (now) the Open Group flailing about to
get the daemons back in the licensing box. No hope.
John Ousterhout (who received the annual Software Tools User Group
Award) nodded at both open development and Eric Raymond at the Tcl BOF,
saying that he was slightly toward the cathedral side of the middle. By
this he meant that he welcomed extensions and applications to Tcl/Tk,
but that he reserved the right to decide what was included in any
"official" distribution. Because Ousterhout is an intelligence whom I
would entrust with such a role, I foresee no problem. But what if Tcl
were usurped by an evil empire?
Cygnus, RedHat, Walnut Creek, Scriptics, etc., demonstrate that money
can be made from "free" source. (This is blasphemy to "pure" FSFers,
who think that the taint of, say, MetroX in RedHat's Linux distribution
poisons all of RedHat. They're extremists.) Integrating free software
with solid, useful proprietary software is a good thing: it tests the
proprietary software among the wider user community, and it spreads the
free software to the users of the proprietary stuff.
This aside, I thought the two papers on IPsec (by Angelos Keromytis and
Craig Metz) were quite good. Thorpe on NetBSD and de Raadt on OpenBSD
were quite lucid, as was Matheson on FreeBSD. Blair on Samba was as
good as I had hoped. Because the other author in
the session was a no-show, we had an open Q&A and discussion for
nearly 90 minutes.
Microsoft may control 90% of the world's desktops, but all the
important developments in OSes are clearly taking place in the bazaar.