Eleventh Systems Administration Conference (LISA '97)
SAN DIEGO, CA
October 26-31, 1997
KEYNOTE ADDRESS
Generation X in IT
Randy Johnson and Harris Kern, R&H Associates Inc.
Summary by Carolyn M. Hennings
Randy Johnson and Harris Kern spoke about the characteristics of a
portion of today's workforce referred to as Generation X and the impact
it has on traditional IT departments. The challenge to existing IT
departments is identifying the nature of the Generation X workforce,
clarifying why these characteristics are potentially an issue, and
determining how to manage the situation in the future.
In the early 1990s, industry labelled Generation X (persons born between 1964 and 1978) as "slackers"; however, most are
entrepreneurial, like change and diversity, and are technically
literate. In contrast, the traditional IT organization was built on
control and discipline.
As technology has moved away from a single machine to a networked
computing model, the nature of the IT business has changed. The
speakers noted that IT departments had historically relinquished
control of personal computers and local area networks. IT management
has come to the realization that these are essential elements of the
success of mission-critical applications. As a result, there must be
some control.
Johnson and Kern suggested IT management focus on the following areas:
- Teamwork. Encourage people to work together and rely on individuals to do their jobs.
- Communication. Improve communication within the organization and with the customer.
- Involvement. Rather than direction from above, involve the team in decisions and planning.
- People. Encourage a "can do, be smart" attitude with some discipline.
- Process. Institute the minimum and sufficient processes to support the organization.
They suggested that this could be considered "creating Generation Y."
These people and relationships will be needed to build successful IT
organizations. The IT department must become a true services
organization. To accomplish this, the department must win back the
responsibility for technology decisions, reculture the staff to support
diversity and change, market and sell the services, train staff
members, and focus on customer satisfaction.
The department must communicate within the IT organization and with
customers. Defining architectures, standards, support agreements, and
objectives will make great strides in this area. The definition and
support of the infrastructure from the desktop to the network, data
center, and operations is an essential step. Defining "production" and
what it means to the customer in terms of reliability, availability,
and serviceability goes a long way toward opening communication and setting expectations.
System management processes with standards and procedures modified from
the mainframe discipline are necessary steps. The speakers cautioned
organizations against bureaucracy and suggested focusing on producing
only "minimum and sufficient documentation." Implemen-ting deployment
methodologies and processes was strongly encouraged, as well as
developing tools for automating these processes.
REFEREED PAPERS TRACK
Session: Monitoring
Summaries by Bruce Alan Wynn
Implementing a Generalized Tool for Network Monitoring
Marcus J. Ranum, Kent Landfield, Mike Stolarchuk, Mark Sienkiewicz,
Andrew Lambeth, and Eric Wall, Network Flight Recorder Inc.
Most network administrators realize that it is impossible to make a
network unbreachable; the key to network security is to make your site more difficult to penetrate than others, so that would-be intruders find easier pickings elsewhere.
In this presentation, Ranum further postulated that when a network
break-in does occur, the best reaction (after repelling the invader) is
to determine how access was gained so you can block that hole in your
security. To do this, the authors present an architecture and toolkit for building network traffic analysis and event records: the
Network Flight Recorder (NFR). The name reflects the similarity of
purpose to that of an aircraft's flight recorder, or "black box," which
can be analyzed after an event to determine the root cause.
Further, he postulated that information about network traffic over time
may be used for trend analysis: identifying approaching bottlenecks as
traffic increases, monitoring the use of key applications, and even
monitoring the network traffic at peak usage periods in order to plan
the best time for network maintenance. Thus, this information would be
useful for network managers in planning their future growth.
The NFR monitors a promiscuous packet interface in order to pass
visible traffic to an internally programmed decision engine. This
engine uses filters, which are written in a high-level filter
description language, read into the engine, compiled, and preserved as
byte-code instructions for fast execution. Events that pass through the
filters are passed to a combination of statistical and logging back-end
programs. The output of these back-ends can be represented graphically
as histograms or as raw data.
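To make the data flow concrete, here is a minimal Python sketch of the packet-filter-to-back-end pipeline. It is only an illustration: NFR's real filters are written in its own filter description language and compiled to byte-code, and the packet fields and back-ends shown here are hypothetical.
    # Minimal sketch of the NFR-style data flow: packets -> filters -> back-ends.
    # Hypothetical record fields; NFR's real filters are byte-code compiled from
    # its own filter description language, not Python functions.
    from collections import Counter

    def tcp_port_filter(port):
        """Return a filter matching TCP packets to a given destination port."""
        return lambda pkt: pkt["proto"] == "tcp" and pkt["dport"] == port

    class HistogramBackend:
        """Statistical back-end: counts matching packets per source address."""
        def __init__(self):
            self.counts = Counter()
        def record(self, pkt):
            self.counts[pkt["src"]] += 1

    class LogBackend:
        """Logging back-end: appends raw matching events to a list."""
        def __init__(self):
            self.events = []
        def record(self, pkt):
            self.events.append(pkt)

    def run_engine(packets, filters_and_backends):
        """Pass every visible packet through each filter; on a match, hand the
        packet to that filter's back-ends."""
        for pkt in packets:
            for flt, backends in filters_and_backends:
                if flt(pkt):
                    for be in backends:
                        be.record(pkt)

    if __name__ == "__main__":
        packets = [
            {"proto": "tcp", "src": "10.0.0.1", "dport": 80},
            {"proto": "tcp", "src": "10.0.0.2", "dport": 80},
            {"proto": "udp", "src": "10.0.0.3", "dport": 53},
        ]
        hist, log = HistogramBackend(), LogBackend()
        run_engine(packets, [(tcp_port_filter(80), [hist, log])])
        print(hist.counts)   # Counter({'10.0.0.1': 1, '10.0.0.2': 1})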
Ranum can be reached at <mjr@clark.com>; the complete NFR source
code, including documentation, Java class source, decision engine, and
space manager, is currently available from <https://www.nfr.net> for noncommercial research use.
Extensible, Scalable Monitoring for Clusters of Computers
Eric Anderson, University of California, Berkeley
The Cluster Administration using Relational Databases (CARD) system is
capable of monitoring large clusters of cooperating computers. Using a
Java applet as its primary interface, CARD allows users to monitor the
cluster through their browser.
CARD monitors system statistics such as CPU utilization, disk usage,
and executing processes. These data are stored in a relational database
for ease and flexibility of retrieval. This allows new CARD subsystems
to access the data without modifying the old subsystems. CARD also
includes a Java applet that graphically displays information about the
data. This visualization tool utilizes statistical aggregation to
display increasing amounts of data without increasing the amount of
screen space used. The resulting information loss is reduced by varying
shades of the same color to display dispersion.
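The idea of pushing node statistics into a relational store so that any subsystem can query them later can be sketched as follows; the SQLite table and fields are hypothetical and are not CARD's actual schema.
    # Sketch: store per-node statistics in a relational table (hypothetical
    # schema; CARD's actual database and layout differ).
    import os, shutil, socket, sqlite3, time

    conn = sqlite3.connect("cluster_stats.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS node_stats (
                      node TEXT, ts REAL, load1 REAL, disk_used_pct REAL)""")

    def sample():
        load1 = os.getloadavg()[0]              # 1-minute load average (UNIX)
        du = shutil.disk_usage("/")
        disk_used_pct = 100.0 * du.used / du.total
        conn.execute("INSERT INTO node_stats VALUES (?, ?, ?, ?)",
                     (socket.gethostname(), time.time(), load1, disk_used_pct))
        conn.commit()

    sample()
    # Later, any subsystem can query the data without touching the collector:
    for row in conn.execute("SELECT node, load1, disk_used_pct FROM node_stats"):
        print(row)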
Anderson can be reached at <eanders@u98.cs.berkeley.edu>. CARD is
available from <https://now.cs.berkeley.edu/Sysadmin/esm/intro.html>.
Monitoring Application Use with License Server Logs
Jon Finke, Rensselaer Polytechnic Institute
Many companies purchase software licenses using their best estimate of
the number required. Often, the only time this number changes is when
users need additional licenses. A side effect of this is that many
companies pay for unused software licenses. In this presentation, Jon
Finke described a tool for monitoring the use of licensed software
applications by examining license server logs.
This tool evolved from one designed to track workstation usage by
monitoring entries in the wtmp files. Because most license
servers record similar information (albeit in often radically different
formats), the tool was modified to monitor license use.
Information can be displayed in a spreadsheet or as a series of linear
graphs. The graphs provide an easy visual estimate of the number of
software licenses actually in use at a given point in time, or over a
period of time. Analysis of this information can quickly uncover
unneeded licenses at your site.
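The underlying computation is a running count of checkouts minus checkins; a toy version (with a made-up OUT/IN log format, since real license servers each use their own) looks like this.
    # Sketch: derive concurrent-license usage over time from checkout/checkin
    # events. The "OUT"/"IN" log format here is hypothetical.
    def concurrent_usage(log_lines):
        in_use, usage = 0, []            # usage: list of (timestamp, count)
        for line in log_lines:
            action, feature, user, ts = line.split()
            if action == "OUT":
                in_use += 1
            elif action == "IN":
                in_use -= 1
            usage.append((ts, in_use))
        return usage

    log = ["OUT cad alice 09:00", "OUT cad bob 09:05",
           "IN  cad alice 10:30", "OUT cad carol 11:00"]
    for ts, count in concurrent_usage(log):
        print(ts, count)                 # shows a peak of 2 licenses in use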
Currently, the tool interfaces with Xess (a commercial spreadsheet
available from Applied Information Services), Oracle, and Simon
(available from <ftp://ftp.rpi.edu/pub/its-release/simon/README.simon>).
Finke can be contacted at <finkej@rpi.edu>.
Session: The Business of System Administration
Summaries by Brad C. Johnson
Automating 24x7 Support Response to Telephone Requests
Peter Scott, California Institute of Technology
Scott has designed a system, called helpline, that provides automated
answering of a help desk telephone during nonpeak hours and is used for
notifying on-call staff of emergencies within a short amount of time
(minutes or seconds) once a situation is logged in the system
(scheduler) database. This system was designed mainly to be cheap and
therefore mostly applicable to sites with low support budgets. The
system comprises source code written in Perl, the main scheduler information base written in SGML, and two dedicated modems: one for incoming calls (for problem reporting) and one for outgoing calls
(for notification).
The rationale for creating helpline is that most other available
software that was sufficient to provide automated support cost more
than $100,000. Several tools that cost less were discovered, but they
did not provide sufficient notification methods (such as voice, pager,
and email according to a schedule). Recent entries into this market
include a Telamon product called TelAlert, which requires proprietary hardware, and VoiceGuide from Katalina Technologies, which runs only on Windows. There is also a freeware package called tpage, but it concentrates on pagers, not on voice phones.
The key to the system is a voice-capable modem. When an incoming call
is answered by the modem daemon, it presents a standard hierarchical
phone menu: a series of prerecorded sound files that are linked
to the appropriate menu choice. Independent of the phone menu system is
the notifier and scheduler component. When an emergency notification
occurs, the scheduler parses a schedule file (written in SGML) to
determine who is on call at the time, determines what profile (i.e.,
action) is appropriate based on the time and situation, and takes the
action to contact the designated on-call person. Multiple actions can
be associated with an event, and if the primary notification method
fails, alternate methods can be invoked.
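The on-call lookup and escalation logic can be sketched roughly as below. The schedule appears as a plain Python structure rather than the SGML file helpline actually parses, and the contact actions are placeholders for the modem-driven voice, pager, and email notifications.
    # Rough sketch of the on-call/escalation logic (the real helpline parses an
    # SGML schedule file and drives a voice-capable modem; actions here are
    # placeholders).
    from datetime import datetime

    SCHEDULE = [  # (start_hour, end_hour, person, ordered contact actions)
        (18, 8,  "alice", ["voice", "pager", "email"]),
        (8,  18, "bob",   ["pager", "email"]),
    ]

    def on_call(now):
        h = now.hour
        for start, end, person, actions in SCHEDULE:
            in_shift = (start <= h < end) if start < end else (h >= start or h < end)
            if in_shift:
                return person, actions
        return None, []

    def attempt(action, person, message):
        return action == "pager"              # placeholder: pretend the pager works

    def notify(person, actions, message):
        for action in actions:                # try methods in order
            print(f"trying {action} for {person}: {message}")
            if attempt(action, person, message):
                return True                   # stop at the first success
        return False

    person, actions = on_call(datetime.now())
    if person:
        notify(person, actions, "disk full on mailhub")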
Unfortunately, this software may not be completed for a long time (if
ever) because funding and staff have been assigned to other projects,
although the current state of the source code is available for review.
Send email with contact information and the reason for your request to
<jks@jpl.nasa.gov>. Additionally, in its current state, there are
some significant well-known deficiencies such as data synchronization
problems (which require specialized modem software), (over)sensitivity
to the load of the local host (a host that is assumed to be reliably
available), and virtually no available hard copy documentation.
Turning the Corner: Upgrading Yourself from "System Clerk" to "System Advocate"
Tom Limoncelli, Bell Labs, Lucent Technologies
Limoncelli believes that many administrators can be classified in one
of two ways: as a system clerk or as a system advocate. A system clerk
takes orders, focuses mainly on clerical tasks, and performs many
duties manually. A system advocate is focused on making things better,
automates redundant tasks, works issues and plans from soup to nuts,
and treats users as customers to create respectful, understanding
partnerships for resolving problems. The key to job satisfaction,
feeling better, and getting better raises is to make the transition
from clerk to advocate.
Making a successful transition to system advocate requires converting
bad (subservient) habits into good (cooperative) ones, creating spare
time for better communication and quality time for planning and
research, and automating mundane and repetitive tasks.
Although changing habits is always hard, it's important to concentrate
on getting a single success. Follow that with another and another, and
over time these experiences will accumulate and become the foundation
for good habits.
To find spare time, people need to think outside of the box and be more
critical and selective about where their time is spent. Suggestions for
regaining time include stop reading USENET, get off mailing lists, take
a time management course, filter mail (and just delete the messages you can't get to in a reasonable time, e.g., by the end of the
week), and meet with your boss (or key customer) to prioritize your
tasks and remove extraneous activities from your workload.
Automating tasks, within the realm of a system administrator, requires
competency in programming languages such as Perl, awk, and make. These
are languages that have proven to be robust and provide the
functionality that is necessary to automate complex tasks.
Transforming the role of clerk to advocate is hard and requires a
change in attitude and working style to improve the quality of work
life, provide more value to customers, and create a more professional
and rewarding environment. However, the effort required to make this
transition is worth it. Simply put, vendors can automate the clerical
side of system administration, but no vendor can automate the value of
a system advocate.
How to Control and Manage Change in a Commercial Data Center Without
Losing Your Mind
Sally J. Howden and Frank B. Northrup, Distributed Computing
Consultants Inc.
Howden and Northrup presented a methodology to ensure rigor and control
over changes to a customer's computing environment. They (strongly)
believe that the vast majority of problems created today are caused by
change. When change occurs unsuccessfully, the result can range from
lost productivity to financial loss. Change is defined as any action
that has the potential to change the environment and must consider the
impact from software, hardware, and people. Using the rigorous method
that was outlined will lower the overall risk and time spent on
problems. They believe that this rigor is required for all changes, not
just for significant or complex ones.
There are eight main steps in this methodology:
1. Establish and document a baseline for the entire environment.
2. Understand the characteristics of the change.
3. Test the changes in both an informal test environment and a formal preproduction environment.
4. Fully document the change before, during, and after implementation.
5. Review the change with all involved parties before placing it into the production environment.
6. Define a detailed back-out strategy in case the change fails in the production environment.
7. Provide training and education for all parties involved in the change.
8. Periodically revisit the roles and responsibilities associated with the change.
The authors were quite firm about testing a change in three physically
distinct and separate environments. The first phase includes (unit)
testing of the change on the host(s) involved in development. The
second phase requires testing in a preproduction environment that, in
the best case, is an exact duplicate of the production environment. The
third phase is placing the change in the actual production environment.
When pressed on the suitability of using this (heavyweight) process on
all changes, the authors stated that the highest priority activities
are to fully document change logs and to create thorough work plans.
The paper notes, however, that although this process does generate a significant amount of work for the administrators before a given change, it has (over time) been shown to reduce the overall time spent, especially for repeated tasks, when transferring information to other staff, when secondary staff are on duty, and when diagnosing problems.
Session: System Design Perspectives
Summaries by Mark K. Mellis
Developing Interim Systems
Jennifer Caetta, NASA Jet Propulsion Laboratory
Caetta addressed the opportunities presented by building systems in the
real world and keeping them running in the face of budgetary
challenges.
She discussed the role of interim systems in a computing environment: systems that bridge the gap between today's operational
necessities and the upgrades that are due three years from now. She
presented the principles behind her system design philosophy, including
her extensions to the existing body of work in the area. Supporting the
more academic discourse are a number of cogent examples from her work
supporting the Radio Science Systems Group at JPL. I especially enjoyed
her description of interfacing a legacy stand-alone DSP to a
SPARCstation 5 via the DSP's console serial port, which exposed the original programmer's assumption that no one would type more than 1,024
commands at the console without rebooting.
Caetta described points to consider when evaluating potential interim
systems projects, leveraging projects to provide options when the
promised replacement system is delayed or canceled, and truly creative
strategies for financing system development.
A Large Scale Data Warehouse Application Case Study
Dan Pollack, America Online
Pollack described the design and implementation of a
greater-than-one-terabyte data warehouse used by his organization for
decision support. He addressed such issues as sizing, tuning, backups,
performance tradeoffs and day-to-day operations.
He presented in a straightforward manner the problems faced by truly
large computing systems: terabytes of disk, gigabytes of RAM,
double-digit numbers of CPUs, 50 Mbyte/sec backup rates, all in a
single system. America Online has more than nine million customers, and
when you keep even a little bit of data on each of them, it adds up
fast. When you manipulate that data, it is always computationally
expensive.
The bulk of the presentation discussed the design of the mass storage
IO subsystem, detailing various RAID configurations, controller
contention factors, backup issues, and nearline storage of "dormant"
data sets. It was a fascinating examination of how to balance the
requirements of data availability, raw throughput, and the state of the
art in UNIX computation systems. He also described the compromises made
in the system design to allow for manageable system administration. For
instance, if AOL strictly followed the database vendor's
recommendations, they would have needed to use several hundred file
systems to house their data set. Instead, by judiciously laying out the file systems to avoid disk and controller contention, they were able to use a few very large (!) file systems and stripe the two-gigabyte data files across multiple spindles, thereby preserving both system performance and their own sanity.
Shuse At Two: Multi-Host Account Administration
Henry Spencer, SP Systems
Spencer's presentation described his experiences in implementing and
maintaining the Shuse system he first described at LISA '96. He detailed
the adaptation of Shuse to support a wholesale ISP business and its
further evolution at its original home, Sheridan College, and imparted
further software engineering and system design wisdom.
Shuse is a multi-host administration system for managing user accounts
in large user communities, into the tens of thousands of users. It uses
a centralized architecture. It is written almost entirely in the Expect
language. (There are only about one hundred lines of C in the system.)
Shuse was initially deployed at Sheridan College in 1995.
Perhaps the most significant force acting on Shuse was its adaptation
for ISP use. Spencer described the changes needed, such as a
distributed account maintenance UI, and reflected that along with
exposing Sheridan-specific assumptions, the exercise also revealed
unanticipated synergy, with features requested by the ISP being adopted
by Sheridan.
A principal area of improvement has been in generalizing useful
facilities. Spencer observed in his paper, "Every time we've put effort
into cleaning up and generalizing Shuse's innards, we've regretted not
doing it sooner. Many things have become easier this way; many of the
remaining internal nuisances are concentrated in areas which haven't
had such an overhaul lately."
Other improvements have been in eliminating shared knowledge by making
data-transmission formats self-describing, and in the ripping out of
"bright ideas" that turned out to be dim and replacing them with
simpler approaches. These efforts have paid off handsomely by making
later changes easier.
Spencer went on to describe changes in the administrative interfaces of
Shuse, and in its error recovery and reporting.
Shuse is still not available to the general public, but Spencer
encourages those who might be interested in using Shuse to contact him
at <henry@zoo.toronto.edu>.
Spencer's paper is the second in what I hope will become a series on
Shuse. As a system designer and implementor myself, I look forward to
objective presentations of experiences with computing systems. It's a
real treat when I can follow the growth of a system and learn how it
has changed in response to real-world pressures and constraints. Often
papers describe a system that has just been deployed or is in the
process of being deployed; it is rare to see how that system has grown
and what the development team has learned from it.
Session: Works in Progress
Summaries by Bruce Alan Wynn
Service Level Monitoring
Jim Trocki, Transmeta Corp.
Many system and network administrators have developed their own simple
tools for automating system monitoring. The problem, proposes Jim
Trocki, is that these tools often evolve into something unlike the
original and in fact are not "designed" at all.
Instead, Jim presents us with mon: a Perl 5 utility, developed
on Linux and tested on Solaris. mon attempts to solve 85% of
the typical monitoring problems. The authors developed mon
based upon these guidelines:
- Simple works best.
- Separate testing code from alert generation code.
- Status must be tracked over time.
The mon tool accepts input from external events and "monitors"
(programs that test conditions and return a true/false value). The
mon processes then examine these data and decide which should
be presented directly to clients and which should trigger an alarm.
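The separation mon enforces between testing and alerting, with status tracked over time, amounts to a loop like the following sketch; the checks and alert function are hypothetical, and this is not mon's Perl code.
    # Sketch of the monitor/alert separation mon encourages: monitors only test,
    # alerts only notify, and state is tracked over time. Hypothetical checks.
    import socket, time

    def smtp_alive(host):
        """Monitor: return True if something answers on port 25."""
        try:
            socket.create_connection((host, 25), timeout=5).close()
            return True
        except OSError:
            return False

    def page_admin(service, host):
        """Alert: placeholder for a pager/email notification."""
        print(f"ALERT: {service} on {host} is down")

    def watch(checks, interval=60):
        state = {}                               # last known status per check
        while True:
            for (service, host, monitor, alert) in checks:
                ok = monitor(host)
                if not ok and state.get((service, host), True):
                    alert(service, host)         # alert only on a transition
                state[(service, host)] = ok
            time.sleep(interval)

    # watch([("smtp", "mailhub.example.com", smtp_alive, page_admin)])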
The authors are currently expanding the functionality of mon
to include dependency checking of events, severity escalation, alert
acknowledgments via the client, "I'm okay now" events, asynchronous
events, a Web interface, and a better name.
The current version of mon is available at
<https://consult.ml.org/~trockij/mon>.
Jim Trocki can be reached at <trockij@transmeta.com>.
License Management: LICCNTL, Control License-Protected Software Tools Conveniently
Wilfried Gaensheimer, Siemens AG
Gaensheimer presented an overview of a number of tools that can help
control and monitor the use of software licenses. The tools can also
generate reports of license use over time.
For additional information on these tools, contact Gaensheimer at
<wig@HL.Siemens.DE>.
Inventory Control
Todd Williams, MacNeal-Schwendler Corp.
One of the less exciting tasks that system and network administrators
are often faced with is that of taking a physical inventory. Typical
reasons for this requirement include:
- Maintenance contract renewal
- Charge backs for resource use
- Identifying the type of machinery
Williams began tracking his inventory by including comments in the
system's "hosts" files, but quickly outgrew this mechanism when devices
appeared that did not have an IP address, and when the amount of
information desired made the "hosts" table unwieldy.
Instead, Williams developed a database to track this information. He
developed procedures to keep this information up to date as machinery
moves in and out of the work site.
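The step up from hosts-file comments is simply a record per asset that does not depend on the device having an IP address; a minimal sketch with made-up fields might look like this.
    # Minimal sketch of an inventory record keyed by asset tag rather than IP
    # address, so devices without addresses can be tracked too. Fields are
    # invented for illustration.
    import csv
    from dataclasses import dataclass, asdict

    @dataclass
    class Asset:
        asset_tag: str
        description: str
        location: str
        owner: str
        maintenance_contract: str
        ip_address: str = ""      # optional: many devices have none

    assets = [
        Asset("A1001", "laser printer", "bldg 2 / rm 14", "pubs", "HW-1998-07"),
        Asset("A1002", "compute server", "machine room", "eng", "HW-1999-01",
              ip_address="192.0.2.10"),
    ]

    with open("inventory.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=asdict(assets[0]).keys())
        writer.writeheader()
        writer.writerows(asdict(a) for a in assets)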
For additional information on these software tools for tracking
inventory, contact Todd Williams at <todd.williams@macsch.com>.
Values Count
Steve Tylock, Kodak
Although it may initially seem a surprising topic for a technical
conference, Tylock reintroduced the basic values of a Fortune 500
company:
- respect for the dignity of the individual
- uncompromising integrity
- trust
- credibility
- continuous improvement and personal renewal
Instead of applying these to the company itself, Tylock suggested that
system and network administrators could increase their professionalism
and efficiency by applying these basic values to their daily work.
For more information on this topic, contact Steve Tylock at
<tylock@kodak.com>.
Extending a Problem-Tracking System with PDAs
Dave Barren, Convergent Group
Many system and network administrators use one type of problem-tracking
system or another. But because working on the typical system or network
problem often means working away from one's desk, administrators must
keep track of ticket status independently of the tracking system. When
administrators return to their desk, they must "dump" the information
into the tracking system, hoping that they don't mis-key data or get
interrupted by another crisis.
To help alleviate this problem, Barren suggests using a PDA to track
ticket status. Barren has developed a relatively simple program for his
Pilot that allows him to download the tickets, work the problems, track
ticket status on the Pilot, then return to his desk and upload the
changes in ticket status in one easy step.
This allows Barren to work on more tickets before returning to his desk
and increases the validity of the tracking system. Barren hopes to
encourage more users to implement this plan so that the increased
number of Pilots will allow him to upload ticket status information at
virtually any desk instead of returning to his own.
For additional information on this concept and the software tools
Barren has developed, contact him at <dcbarro@nppd.com>.
Survey of SMTP
Dave Parter, University of Wisconsin
One of the beautiful things about the Simple Mail Transfer Protocol is
that it allows people to use any number of transfer agents to deliver
electronic mail across the world. The down side is that there is a
hodgepodge of versions and "brands" of transfer agents in use, and
nobody really knows what is in use these days. Except, perhaps, Dave
Parter.
To examine this issue, Parter monitored the incoming mail at his site
for a short period of time. For each site that sent mail to his, he
tested the SMTP greeting and tried to identify the type and version of
the agent. His results: he was able to identify 140 distinct versions of sendmail in use in this small sampling.
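The measurement itself is easy to reproduce: connect to port 25 and read the greeting banner, where most MTAs announce their name and version. A rough sketch (host names hypothetical):
    # Sketch: read the SMTP greeting banner from a host's port 25 and look at
    # the software name/version most MTAs announce there.
    import socket

    def smtp_banner(host, timeout=10):
        with socket.create_connection((host, 25), timeout=timeout) as s:
            return s.recv(512).decode("ascii", "replace").strip()

    for host in ["mail.example.com", "mx.example.net"]:
        try:
            print(host, "->", smtp_banner(host))   # e.g. "220 host ESMTP Sendmail ..."
        except OSError as e:
            print(host, "-> unreachable:", e)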
Where, Parter asks, do we go from here with these data? He isn't sure.
If you would like to discuss these findings, or conduct your own
survey, contact Parter at <dparter@cs.wisc.edu>.
Session: Net Gains
Summaries by Mark K. Mellis
Creating a Network for Lucent Bell Labs Research South
Tom Limoncelli, Tom Reingold, Ravi Narayan, and Ralph Loura, Bell
Labs, Lucent Technologies
This presentation described how, as a result of the split of AT&T
Bell Labs Research into AT&T Labs and Lucent Bell Labs, the authors transitioned from an "organically grown" network consisting of four
main user communities and ten main IP nets (out of a total of 40 class
C IP nets) to a systematically designed network with two main user
communities on four main IP nets, renumbering, rewiring, cleaning up,
and "storming the hallways" as they went.
Unlike many projects of this scope, the authors planned the work as a
phased transition, using techniques such as running multiple IP
networks on the same media and operating the legacy NIS configuration
in parallel with the new config to transition slowly to the new
configuration, rather than make all the changes during an extended down
time and discover a critical error at the end. They relate their
experiences in detail, including a comprehensive set of lessons learned
about strategy, end-user communications, and morale maintenance. ("Yell
a loud chant before you storm the hallways. It psyches you up and makes
your users more willing to get out of the way.")
Faced with a network unfortunately typical in its complexity and with real-world constraints on system downtime, this group
described their thought processes and methodologies for solving one of
the problems of our time, corporate reorganization. In the face of
obstacles such as not having access to the union-run wiring closets and
"The Broken Network Conundrum," where one must decide between fixing
things and explaining to the users why they don't work, they divided
their networks, fixed the problems, and got a cool T-shirt with a
picture of a chainsaw on it, to boot.
Some of the tools constructed for this project are available at
<https://www.bell-labs.com/user/tal>.
Pinpointing System Performance Issues
Douglas L. Urner, BSDI
Urner gave us a well-structured presentation that, within the context of a case study on Web server performance optimization, presents a
systematic model for tuning services from the network connection,
through the application and operating system, all the way to the
hardware. His paper is a vest-pocket text on how to make it go faster,
regardless of what "it" might be.
Urner began the paper by describing an overview of system tuning:
methodology, architecture, configuration, application tuning, and
kernel tuning. He discussed the need to understand the specifics of the
problem at hand: protocol performance, application knowledge, and data collection and reduction. He then described tuning at the
subsystem level, including file system, network, kernel, and memory. He
presented a detailed explanation of disk subsystem performance, then
went on to examine CPU performance, kernel tuning, and profiling both
application and kernel code.
Urner's paper is about optimizing Web server performance, but it is
really about much more. He describes, in detail, how to look at
performance optimization in general. He encourages readers to develop
their intuition and to establish reasonable bounds on performance. By
estimating optimal performance, the system designer can determine which
of the many "knobs" in an application environment are worth "turning",
and help set reasonable expectations on what can be accomplished
through system tuning.
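The "estimate optimal performance first" advice boils down to back-of-envelope arithmetic such as the following; the numbers are invented for illustration and are not from Urner's paper.
    # Back-of-envelope upper bound on Web server throughput from link capacity
    # alone. All numbers are hypothetical illustrations.
    link_bits_per_sec = 10e6            # assume a 10 Mbit/s connection
    avg_response_bytes = 12 * 1024      # assume a 12 KB average response
    protocol_overhead = 1.1             # ~10% for TCP/IP headers and retransmits

    ceiling = link_bits_per_sec / (avg_response_bytes * 8 * protocol_overhead)
    print(f"network-bound ceiling: ~{ceiling:.0f} responses/sec")
    # If the server already delivers close to this, application or kernel tuning
    # cannot help much; the "knob" worth turning is the network connection.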
Session: Configuration Management
Summaries by Karl Buck
The first two papers deal with the actual implementations of tools written to handle specific problems. The third paper is an attempt to get a higher-level view of where configuration management is today and to make suggestions for improving existing CM models.
Automation of Site Configuration Management
Jon Finke, Rensselaer Polytechnic Institute
Finke presented his implementation of a system that not only tracks
interesting physical configuration aspects of UNIX servers, but also
stores and displays dependencies between the servers and the services
that they provide. The configuration management system has an Oracle
engine and outputs data to a Web tree, making for a very extensible,
useful tool. For instance, if a license server is to be updated, one
can find out not only all the other services that will be affected, but
also the severity of those outages and who to contact for those
services. Source code is available; see
<ftp://ftp.rpi.edu/pub/its-release/simon/README.simon>
for details.
Chaos Out of Order: A Simple, Scalable File Distribution Facility
for "Intentionally Heterogeneous" Networks
Alva L. Couch, Tufts University
The core of this paper is a file distribution tool written by Couch
called DISTR. Using DISTR, administrators of unrelated networks can use
the same file distribution system, yet retain control of their own
systems. DISTR can "export" and "import" files to and from systems
managed by other people. Frank discussion is given to the existing
limitations and potential. DISTR is available at
<ftp://ftp.eecs.tufts.edu/pub/distr>.
An Analysis of UNIX System Configuration
Remy Evard, Argonne National Laboratory
This paper is an attempt to step back and take a look at what is
available for use in UNIX configuration and file management, examine a
few case studies, and make some observations concerning the current
configuration process. Finally, Evard argues for a "stronger
abstraction" model in systems management, and makes some suggestions on
how this can be accomplished.
Session: Mail
Summaries by Mark K. Mellis
Tuning Sendmail for Large Mailing Lists
Rob Kolstad, BSDI
Kolstad delivered a paper that described the efforts to reduce delivery
latency in the <inet-access@earth.com> mailing list. This mailing
list bursts to up to 400,000 message deliveries per day. As a result of
the tuning process, latency was reduced to less than five minutes from
previous levels that reached five days.
Kolstad described himself as a member of Optimizers Anonymous, and he
shared his obsession with us. He described the process by which he and
his team analyzed the problem, gathered data on the specifics, and
iterated on solutions. He took us through several rounds of data
analysis and experimentation, and illustrated how establishing
realistic bounds on performance and pursuing those bounds can lead to
insights on the problem at hand.
Kolstad and his team eventually homed in on the approach of increasing
the parallelism to the extreme of using hundreds of concurrent sendmail
processes to deliver the list. They also reduced timeouts for
nonresponsive hosts. This, of course, required the creation of a number
of scripts to automate the parallel queue creation. These scripts are
available upon request from Kolstad, <kolstad@bsdi.com>.
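The gist of the approach, partitioning the recipient list and handing each partition to its own delivery process, can be sketched as below. This is not Kolstad's script; the relay, sender, and batch logic are placeholders.
    # Sketch of the parallel-delivery idea: split the recipient list into N
    # batches and deliver the batches concurrently. Not Kolstad's scripts;
    # names are placeholders.
    import smtplib
    from concurrent.futures import ThreadPoolExecutor

    RELAY = "localhost"
    SENDER = "list-owner@example.com"
    PARALLELISM = 50                  # Kolstad's team used hundreds of senders

    def deliver_batch(batch, message):
        with smtplib.SMTP(RELAY) as smtp:
            for rcpt in batch:
                try:
                    smtp.sendmail(SENDER, [rcpt], message)
                except smtplib.SMTPException:
                    pass              # a real script would requeue/log failures

    def deliver(recipients, message):
        batches = [recipients[i::PARALLELISM] for i in range(PARALLELISM)]
        with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
            pool.map(lambda b: deliver_batch(b, message), batches)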
Kolstad closed by noting that after the optimizations were made, the
biggest remaining problem was unavailability of recipients. He
expressed his amazement that in a mailing list dedicated to Internet
service providers, some one to three per cent of recipients were
unreachable at any point in time. Also, even with these improvements,
the mailing list traffic of mostly small messages doesn't tax even a
single T-1 to its limits.
Selectively Rejecting SPAM Using Sendmail
Robert Harker, Harker Systems
Harker offered a presentation that addressed one of the hottest topics
on the Internet today: unsolicited commercial email, otherwise
known as spam. He characterizes spam, examines the different
requirements for antispam processing at different classes of sites, and
offers concrete examples of sendmail configurations that address these
diverse needs.
After his initial discussion of the nature of spam, Harker outlined the
different criteria that can be used for accepting and rejecting email.
His approach differs from others in that he spends sendmail CPU cycles
to get finer granularity in the decision to reject a message. He goes
on to treat the problem of spammers sending their wares to internal
aliases and mailing lists.
The remainder of the presentation was devoted to detailed development
of the sendmail rulesets necessary to implement these policies. He
discussed the specific rulesets and databases needed, and how to test
the results. His discussion and code are available at
<https://www.harker.com/sendmail/anti-spam>.
A Better E-mail Bouncer
Richard J. Holland, Rockwell Collins
Holland presented work that was motivated by corporate reorganization:
how to handle email address namespace collisions in a constructive way.
As email usage becomes more accessible to a wider spectrum of our
society, fewer and fewer email users are able to parse the headers in a
bounced message. Holland talked about his bouncer, implemented as a
mail delivery agent, which provides a clearly written explanation of
what happened and why when an email message bounces due to an address
change. This helps the sender understand how to get a message through,
helps the recipient get a message, and helps the postmaster by
automating another portion of her workload.
The bouncer was originally implemented as a simple filter. Because of
the diversity in headers and issues related to envelope vs. header
addresses, especially in the case of bcc: addresses, the bouncer was
reimplemented as a delivery agent. The bouncer, written in Perl,
relinquishes its privilege and runs as "nobody." Many of the aspects of
bouncer operation are configurable, including the explanatory text to be returned. A particularly nice feature is the
ability to send a reminder message to the recipient's new address when
mail of bulk, list, or junk precedence is received, reminding them to
update their mailing list subscriptions with the new address.
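The heart of the idea is replacing the opaque MTA bounce with a plainly worded note. A minimal sketch of that step follows; it is not Holland's Perl, the forwarding table is made up, and a real delivery agent would hand the reply back to the MTA rather than print it.
    # Sketch: compose a human-readable bounce for an address that has changed.
    # Not Holland's delivery agent; forwarding table and addresses are made up.
    import sys
    from email import message_from_file
    from email.message import EmailMessage

    MOVED = {"jdoe@olddivision.example.com": "jdoe@newdivision.example.com"}

    def friendly_bounce(original, old_addr, new_addr):
        reply = EmailMessage()
        reply["To"] = original.get("From", "")
        reply["From"] = "postmaster@example.com"
        reply["Subject"] = "Your message could not be delivered (address changed)"
        reply.set_content(
            f"Your message to {old_addr} was not delivered because that address\n"
            f"is no longer in service. The person you are trying to reach can\n"
            f"now be contacted at {new_addr}. Please update your address book\n"
            f"and resend your message to the new address.\n")
        return reply

    if __name__ == "__main__":
        msg = message_from_file(sys.stdin)       # delivery agents read stdin
        old = "jdoe@olddivision.example.com"     # in practice, the recipient arg
        print(friendly_bounce(msg, old, MOVED[old]))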
Holland concluded by discussing alternatives to the chosen
implementation and future directions. Those interested in obtaining the
bouncer should contact Holland at <holland@pobox.com>.
INVITED TALKS TRACK
So Now You Are the Project Manager
William E. Howell, Glaxo Wellcome Inc.
Summary by Bruce Alan Wynn
Many technical experts find themselves gaining responsibility for
planning and implementing successively larger projects until one day
they realize that they have become a project manager.
In this presentation, Howell offered helpful advice on how you can
succeed in this new role without the benefit of formal training in
project management.
Howell's first suggestion is to find a mentor, someone who has
successfully managed projects for some time. Learn from that mentor not
only what the steps are in managing a project, but also the reasons why
those are the right steps.
But, as Howell points out, a mentor is not always available. What do
you do then? Howell presented a few tips on what you can do if you
can't find a mentor.
For copies of the presentation slides, contact Marie Sands at
<mms31901@glaxowellcome.com>; please include both your email and
postal addresses.
When UNIX Met Air Traffic Control
Jim Reid, RTFM Ltd.
Summary by Mike Wei
Every once in a while we see reports of mishaps of the rapidly aging
air traffic control (ATC) system in the United States. We have also
seen reports that some developing countries have ATC systems "several
generations newer" than the US system. For most of the flying public,
the ATC system is something near a total mystery on which our lives
depend. As a pilot and a system administrator, I hope I can lift the
shroud of mystery a little bit and help explain the ATC system Reid
talked about, how UNIX handles such a mission-critical system, and how
this system helps air traffic control.
The primary purpose of air traffic control is traffic separation,
although it occasionally helps pilots navigate out of trouble.
Government aviation authorities publish extensive and comprehensive
regulations on how aircraft should operate in the air and on the
ground. Air traffic control is a massively complex system of computers,
radar, controllers, and pilots that ensures proper traffic separation
and flow. Human participants (i.e., controllers and pilots) are as
essential as the computer and radar systems.
Naturally, air traffic congestion happens near major airports, in airspace called "terminal areas." In busy terminal areas, computer-connected radar
systems provide controllers with a realtime picture of the traffic in the
sky. Each aircraft has a device called a transponder that encodes its
identity in its radar replies, so the controllers know which aircraft
is which on the computer screen. Computer software, along with traffic controllers, ensures proper separation and traffic flow by
vectoring planes within the airspace to their destinations.
Outside terminal areas, large planes usually don't fly anywhere they
want. They follow certain routes, like highways in the sky. En-route
traffic control centers control traffic along those routes. Traffic
separation is usually ensured by altitude separation or fairly large
horizontal separation. Some en-route centers have radar to help track the traffic. For areas without radar coverage, en-route centers rely on
pilot position reports to track the traffic and usually give very large
separation margins.
This system worked fairly well for many years, until air travel reached
record levels. Two things happened. First, some terminal areas became
so congested that, during some parts of the day, the airspace just
couldn't hold any more traffic. Second, traffic among some terminal
areas reached such a level that these en-route airspaces became almost
as congested as terminal areas.
A new kind of system was developed to address the new problems. This
"slot allocation system" tries to predict what the sky will look like
in the future, based on the flight plans filed by airlines. Based on
the computer prediction, we can allocate "slots" in the sky for a
particular flight, from one terminal area to another, including the
en-route airspace in between. Every airline flight is required to file a flight plan, including departure time, estimated time en route, cruising
airspeed, planned route, destination, and alternate destination. With
the flight plan, an airplane's position in the sky is fairly
predictable.
This slot allocation system is very much like TCP congestion control in
computer networking: when the network is congested, the best way to
operate is to stop feeding new packets into it for a while. For the
same reason it's much better to delay some departures than to let
planes take off and wait in the sky if the computer system predicts
congestion sometime in the future.
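The analogy can be made concrete with a toy allocator that predicts the occupancy of each airspace sector per time window and delays a departure until every sector along the route has a free slot. This illustrates only the principle, not how TACT works; the capacities and routes are invented.
    # Toy slot allocator illustrating the principle (not TACT): delay a
    # departure until every sector along the route has spare predicted
    # capacity. All numbers are invented.
    from collections import defaultdict

    CAPACITY = 2                  # max flights per sector per one-hour window
    occupancy = defaultdict(int)  # (sector, hour) -> flights already planned

    def allocate(departure_hour, route):
        """route: list of (sector, hours_after_departure). Returns granted hour."""
        hour = departure_hour
        while True:
            windows = [(sector, hour + offset) for sector, offset in route]
            if all(occupancy[w] < CAPACITY for w in windows):
                for w in windows:
                    occupancy[w] += 1
                return hour          # slot granted, possibly after a delay
            hour += 1                # hold on the ground and try the next hour

    route = [("LONDON", 0), ("PARIS", 1)]
    for flight in range(5):
        print("flight", flight, "cleared to depart at hour", allocate(9, route))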
The Western European airspace, according to Reid, is the busiest
airspace in the world. Instead of a single controlling authority, like
the US Federal Aviation Authority, each country has its own aviation
authorities. Before "Eurocontrol," the agency Reid worked at last year,
each country managed its airspace separately, and an airliner had to
file a flight plan for each country it had to fly over along its route.
This led to a chaotic situation when traffic volume increased.
According to Reid, there was also a problem of ATC nepotism (i.e., a
country favoring its own airlines when congestion occurred).
The Eurocontrol agency has three UNIX-based systems that serve Western
Europe. IFPS is a centralized flight plan submission and distribution
system, TACT is the realtime slot allocation system, and RPL is the
repeat flight plan system.
IFPS provides a single point of contact for all the flight plans in
Western Europe. It eliminates the inconvenience of filing multiple
flight plans. This is basically a mission-critical data entry/retrieval
system.
The TACT system provides slot allocation based on the flight plan
information in the IFPS system. It provides slots that satisfy
separation standards in the airspace above Western Europe. It controls
when an airplane can take off and which slots in the sky it can fly
through to its destination. It keeps a "mental picture" of all the air
traffic in the sky for some time into the future. RPL is the repeat flight plan system.
Airlines tend to have the same flights repeatedly, and this system
simplifies filing those flight plans. The RPL system is connected with
the IFPS system and feeds it with those repeat flight plans.
This must be an awesomely impressive system with equally impressive
complexity. According to Reid, it actually works. Ever since the
adoption of the system, it has never failed. Furthermore, the increase
in traffic delay is much less than the increase in traffic volume.
Kudos to our European computer professionals!
The slot allocation system does not provide the actual traffic
separation. Realtime traffic separation must be based on actual
position data obtained from radar or pilot position reports, rather than projected position data based on flight plans. However, this slot allocation system is an invaluable aid to realtime traffic separation because it avoids congestion in the first place.
Using UNIX in such a mission-critical ATC system is quite pioneering; most ATC systems in the US are still mainframe-based. The
system is built on multiprocessor HP T90 servers, and the code is
written in Ada.
Like most mission-critical systems, operation of these UNIX systems has its idiosyncrasies. According to Reid, the system operation suffers from organizational and procedural inefficiencies. However, some of
them may well be the necessary price to pay for such a mission-critical
system. The whole system is highly redundant; almost all equipment has
a spare. The maintenance downtime is limited to one hour a month.
Change control on the system is the strictest I've ever heard of. For new code releases, there is a test environment fed with real data, and there's a dedicated test group that does nothing
but the testing. Any change to the production systems must be
documented as a change request and approved by a change board, which
meets once a week. Any kind of change, including fixing the sticky bit
on /tmp, needs change board approval. Reid said that it took the SAs six weeks to fix the /tmp permissions on six machines because each one needed a change request and only one change a week is allowed on the production system. To minimize the chance of system failure, all nonessential services on the system are turned off, including NFS, NIS, and all the other SA life-saving tools. This does add pain to the SAs' daily lives.
This kind of process sounds bureaucratic, and it's a far cry from a common UNIX SA's habits. However, for this kind of system, it might be
right to be overly conservative. At least when Reid flew to the LISA
conference this year, he knew nothing bad would likely happen to
Eurocontrol due to a system administrator's mistake.
Enterprise Backup and Recovery: Do You Need a Commercial Utility?
W. Curtis Preston, Collective Technologies
Summary by Bruce Alan Wynn
Nearly every system administrator has been asked to back up
filesystems. Even those who haven't have probably been asked to recover
a missing file that was inadvertently deleted or corrupted. How can a
system administrator determine the best solution for a backup strategy?
In this presentation, Preston presented an overview of standard
utilities available on UNIX operating systems: which ones are common,
which ones are OS-specific. He then explained the capabilities and
limitations of each. In many cases, claims Preston, these "standard"
utilities are sufficient for a site's backup needs.
For sites where these tools are insufficient, Preston discussed many of
the features available in commercial backup products. Because some
features require special hardware, Preston described some of the
current tape robots and media available. Once again, he iterated the
capabilities and limitations of each.
Copies of Preston's presentation are available upon request; Preston can
be reached at <curtis@colltech.com>.
A Technologist Looks at Management
Steve Johnson, Transmeta Corp.
Summary by Bruce Alan Wynn
Employees often view their management structure as a bad yet necessary
thing. Johnson has worked in the technical arena for years, but has
also had the opportunity to manage research and development teams in a
number of companies. In this presentation, he offered his insight into
methods that will smooth the relationship between employees and
managers.
Johnson began by postulating that both employees and managers have a
picture of what the manager-employee relationship should look like, but
it is seldom a shared picture. He further postulated that a great deal
of the disconnect is a result of personality and communication styles
rather than job title.
Johnson loosely categorized people as thinkers, feelers, or acters. A thinker focuses on analyzing and understanding; a feeler focuses on meeting the needs of others; an acter focuses on activity and accomplishment.
These differences in values, combined with our tendency to presume
others think as we do, cause a breakdown in communication that leads to
many of the traditional employee-manager relationship problems.
After making this point, Johnson suggested that technical people who
are given the opportunity to move into management first examine closely
what the job entails: it's not about power and authority; it's about
meeting business needs. He suggested a number of publications for
additional information on this topic.
Steve Johnson can be reached at <scj@transmeta.com>.
IPv6 Deployment on the 6bone
Bob Fink, Lawrence Berkeley National Laboratory
Summary by Mike Wei
We all know that IPv6 is the future of the Internet; there's simply no
alternative to support the explosive growth of the Internet. However,
despite years of talking, we see little IPv6 deployment. According to
Fink, the adoption and deployment of IPv6 are currently well under way and heading in the right direction.
An experimental IPv6 network, named 6bone, was created to link up early
IPv6 adopters. It also serves as a test bed for gaining operational experience with IPv6. Because most of the machines on the 6bone also run regular IPv4, it provides an environment for gaining experience in the IPv4-to-v6 transition.
The 6bone is truly a global network that links up 29 countries. Most of
the long haul links are actually IPv6 traffic tunnelled through the
existing IPv4 Internet. This strategy allows the 6bone to expand to anywhere that has an Internet connection at almost no cost. On the 6bone, there are some "islands" of networks that run IPv6 natively on top of the physical network.
An important milestone was achieved in IPv6 deployment when Cisco,
along with other major router companies, committed to IPv6. According
to Fink, IPv6 will be supported by routers in the very near future, if
it's not already supported. In addition, we will start to see IPv6
support in major OS releases.
A typical host on the 6bone runs two IP stacks, the traditional v4
stack and the IPv6 stack. The IPv6 stack can run natively on top of the
MAC layer if the local network supports v6, or it can tunnel through
IPv4. The v6 stack will be automatically used if the machine talks to
another v6 host. An important component of the 6bone network will be
the new DNS that supports IPv6 addresses. The new DNS supports the AAAA record (quad-A record, because a v6 address is four times the length of
a v4 address). If a v6 host queries the new DNS server for another v6
host, an AAAA record will be returned. Because the new DNS simply maps
a fully qualified domain name to an IP address (v4 or v6), the DNS
server itself doesn't have to sit on a v6 network. It will be perfectly normal for a dual-stack v6/v4 host to query a DNS server on the v4 network, get back a v6 address, and then talk to the v6 host over IPv6.
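On a dual-stack host, an application can simply ask the resolver for all addresses and prefer a v6 answer when one comes back; a short sketch (with a hypothetical host name):
    # Sketch: resolve a name, prefer an IPv6 (AAAA) answer if one comes back,
    # and fall back to IPv4 otherwise. The host name is hypothetical.
    import socket

    def connect_prefer_v6(host, port):
        infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
        # Put IPv6 results first, then IPv4.
        infos.sort(key=lambda info: 0 if info[0] == socket.AF_INET6 else 1)
        for family, socktype, proto, _name, addr in infos:
            try:
                s = socket.socket(family, socktype, proto)
                s.connect(addr)
                return s
            except OSError:
                continue
        raise OSError(f"could not connect to {host}:{port}")

    # sock = connect_prefer_v6("www.example.com", 80)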
The key to the success of IPv6 deployment is smooth transition. The
transition should be so smooth that a regular user should never know
when IPv6 has arrived. Given the fact that the IPv4 network is so
far reaching throughout the
world, IPv6 and v4 will coexist for a very long time; the transition to
IPv6 from v4 will be gradual. Routers will be the first ones that have
IPv6 capabilities. Just like the 6bone, an IPv6 backbone can be built
by tunnelling v6 traffic through the existing v4 network or by running v6 natively on the physical network when two neighboring routers both
support v6. Because v6 is just another network layer protocol, it can
run side by side with IPv4 on the same physical wire without conflict,
like IP and IPX can run together on the same Ethernet. This means that
we do not have to make a choice between v6 and v4; we can simply run
both of them during the transition period. IP hosts will gradually
become IPv6 capable when the new OS versions support it. During the
transition, those IPv6 hosts will have dual IP stacks so they can talk
to both v4 and v6 hosts. Nobody knows how long this coexistence will last, but it will surely last for years. When the majority of the hosts on the Internet are running v6, some of the hosts might choose to be v6
only. One by one, the v4 hosts will fade away from the Internet.
Will that ever happen? The answer is yes. In the next decade, IPv4 addresses will be so hard to obtain that IPv6 will be a very viable and
attractive choice. We haven't seen that yet, but based on the current
Internet growth, it will happen.
The IPv6 addressing scheme is another topic Fink talked about in the
seminar. IPv6 has a 128-bit address space, which allows thousands of addresses per square foot if evenly spread over the earth's surface. How to make use of this address space in a highly scalable way is a big challenge. IPv4 suffers from an explosive number of routing entries, a problem that arose years before the exhaustion of IPv4 addresses. To address this
problem and to allow decades of future expansions, IPv6 uses an
aggregation-based addressing scheme.
    3 bits    13 bits    32 bits    16 bits    64 bits
    001       TLA        NLA        SLA        Interface ID
    [------ public topology ------] [site]     [local machine]
The best analogy for this aggregation-based addressing is the telephone number system. We have ten-digit phone numbers in the US and Canada, with a
three-digit area code, three-digit exchange code, and the last four
digits for individual telephone lines.
The first three bits are 001. In the great tradition of TCP/IP, other
combinations are reserved for future use, in case one day we have an
interplanetary communication need that requires a different addressing
scheme. The 13-bit TLAs are top-level aggregators, designed to be given
to long-haul providers and big telcos that run backbone service. The
32-bit NLAs are next-level aggregators for various levels of ISPs; the NLA field can be further subdivided into several levels of NLAs. The 16-bit SLAs are for site topologies. (It's like getting a class A IPv4 address and using 16 bits for the network portion.) The machine interface ID is 64 bits.
An important feature of IPv6 is autoconfiguration, in which a host can
figure out its own IPv6 address automatically. The 64-bit interface ID
is designed so that the host can use its data link layer interface
address as the host portion of the IPv6 address. Ethernet uses a 48-bit
address, which seems adequate for globally unique addresses. Reserving 64 bits for the local machine should accommodate any future addressing method used by future physical networks.
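The derivation of the interface ID from a 48-bit Ethernet address, the scheme later standardized as modified EUI-64 (flip the universal/local bit and insert ff:fe in the middle), can be sketched as follows; the MAC address is just an example.
    # Sketch: derive a 64-bit IPv6 interface ID from a 48-bit Ethernet address
    # using the modified EUI-64 rule (flip the universal/local bit of the first
    # octet and insert ff:fe in the middle). Example MAC address only.
    def interface_id(mac: str) -> str:
        octets = [int(part, 16) for part in mac.split(":")]
        octets[0] ^= 0x02                              # flip universal/local bit
        eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
        groups = [f"{eui64[i]:02x}{eui64[i+1]:02x}" for i in range(0, 8, 2)]
        return ":".join(groups)

    print(interface_id("00:a0:24:8d:6c:3e"))   # -> 02a0:24ff:fe8d:6c3e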
Aggregation-based addressing is a big departure from the current IPv4
addressing. Although IPv4 has three classes of addresses, it's not a
hierarchical addressing scheme. In IPv4 (at least before the CIDR days), all network addresses were created equal, which means they could each be routed independently to any location. This caused the routing-entry explosion problem when the Internet grew.
Classless Inter Domain Routing (CIDR) was introduced as a stopgap
measure to address this urgent problem by introducing some hierarchy in
the IPv4 address space. IPv6 is designed from the beginning with a hierarchical scheme. By limiting the number of bits for each aggregator, there is an upper limit on the number of routing entries that a router needs to handle. For example, a router at a long-haul provider needs only to look at the 13-bit TLA portion of the address, limiting the possible number of routing entries to 2^13 (8,192).
Another advantage of a hierarchical-based addressing system is that
address allocation can be delegated in a hierarchical manner. The
success of DNS teaches us the important lesson that delegation of
address allocation authority is a key to scalability.
There's a price to pay for using a hierarchical addressing system: when a site changes providers, all of its IP addresses need to be changed. We already experience the same kind of issue in IPv4 when we use CIDR address blocks. IPv6 tries to make address changes as painless as possible by having each host autoconfigure itself. The host
will use its MAC layer address as the lower portion of its IPv6 address
and use the Neighbor Discovery protocol to find out the upper portion of
the address (routing prefixes). The whole site can be renumbered by
simply rebooting all the hosts without any human intervention.
There are still lots of problems to be discovered and addressed in
IPv6. That's exactly what the 6bone is built for. IPv6 is the future of
the Internet, and the transition to IPv6 will start in the near future.
More information on the 6bone can be found at <https://www.6bone.net>.
Joint Session
Panel: Is System Administration a Dead-End Career?
Moderator: Celeste Stokely, Stokely Consulting
Panelists: Ruth Milner, NRAO; Hal Pomeranz, Deer Run Associates; Wendy
Nather, Swiss Bank Warburg; Bill Howell, Glaxo Wellcome Inc.
Summary by Carolyn M. Hennings
Ruth Milner opened the discussion by responding to the question with,
"It depends." She went on to explain that it is necessary for everyone
to define "system administration" and "dead-end career" to answer this
question for themselves. In some organizations, "system administration"
leaves no room for growth. However, Ruth pointed out that if people
enjoy what they do, then maybe it should not be considered a
"dead-end."
Hal Pomeranz outlined the typical career progression for system
administrators. He described the first three years in the
career field as a time of learning while receiving significant
direction from more senior administrators. During the third through
fifth years of practicing system administration, Hal suggested that
even more learning takes place as the individual works with a greater
degree of autonomy. Hal observed that people with more than five years
of experience are not learning as much as they were, but are more
focused on producing results as well as mentoring and directing others.
Hal commented that many organizations move these senior people into
management positions and wondered how technical tracks might work.
Wendy Nather discussed the question from the angle of recruiting. Those
hiring system administrators are looking for people who have dealt with
a large number of problems as well as a variety of problems. She
pointed out that being a system administrator is a good springboard to
other career paths. Wendy outlined some of the characteristics of good
system administrators that are beneficial in other career areas: a
positive attitude, social skills, open-mindedness, and flexibility.
Bill Howell examined the financial prospects for system administrators.
He commented that there will always be a need for system
administrators. However, industry may be unable and unwilling to
continue to pay high salaries for them, and salary increases may begin
to be limited to "cost of living" increases. Bill suggested that growth
in personal income and increases in standard of living are the results
of career advancement. If salaries
do become more restricted in the future, system administration may
become a dead-end career.
Celeste then opened up the floor for questions and discussion. One
participant asked about other career options if one was not interested
in pursuing the managerial or consultant path. The panel suggested that
specializing in an area such as network or security administration
would be appropriate. Discussion ranged among topics such as motivation
for changing positions, how the size of the managed environment affects
opportunities and working relationships, the impact of Windows NT on
UNIX administrators' careers, how an administrator's relationship with
management changes with career advancement, and the importance of
promoting system administration as a profession.
BOFs
Summaries by Carolyn M. Hennings
Keeping a Local SAGE Group Active
This BOF at LISA '96 and SANS '97 inspired me to start a group in
Chicago. Chigrp's initial meeting was in early October, and I was
anxious to announce our existence. General suggestions for getting a
group started and keeping one alive were shared by attendees. If you
want more information on how to start a group, see
<https://www.usenix.org/sage/locals/>.
Documentation Sucks!
As system administrators, we all know how important documentation is,
but we hate to write it. This BOF explored some of the reasons we don't
like to write documentation, elements of good documentation, and what
we can do personally to improve our efforts in this area. About 50
people attended the BOF. Some professional technical writers
participated in the BOF and were interested in the approach sys admins
were taking in their struggle to write documentation.