USENIX Technical Program - Paper - Proceedings of the 12th Systems Administration Conference (LISA '98)   
[Technical Program]
Drinking from the Fire(walls) Hose: Another Approach to Very Large Mailing Lists
Strata Rose Chalup, Christine Hogan, Greg Kulosa, Bryan
McDonald, and Bryan Stansell Global Networking and Computing,
Inc.
Abstract
This paper describes a set of tools and procedures which allow
very large mailing lists to be managed with the freeware tool of the
administrator's choice. With the right approach scaling technology can
be applied to a list management tool transparently. In recent
years, many ingenious methods have been proposed for handling email
deliveries to mailing lists of several thousand subscribers.
Administration of a mailing list is not limited to message delivery,
however. Tasks such as managing subscribers, dealing with mail
bounces, and preventing list spamming also become more difficult when
applied to very large lists. As a case study, this paper
describes the process of moving the well-known "Firewalls" mailing
list from its original home at GreatCircle Associates to a new
infrastructure at GNAC. The process was thought to be straightforward
and obvious, and it soon became apparent that it was neither. We trust
that our discoveries will benefit other systems administrators
undertaking similar projects, either concerning large mailing lists or
moving complex "legacy systems."
Introduction
"And you may ask yourself, Well ... How did I get here?"
- Talking Heads
The Firewalls list began in 1992, at GreatCircle Associates. It
quickly evolved into an important forum for new ideas, in-depth
technical discussions, and impassioned flame wars. Eventually it would
encompass roughly 4500 real-time subscribers and 4900 digest
subscribers. A large number of total subscribers were "exploder" or
reflector lists passing Firewalls list traffic to unknown third
parties at companies and universities around the world. Daily
message counts ranging from a norm of 20 to peaks of 75 or more yield
message deliveries of 95,000 to 342,400. In addition, a growing
problem with spammers began raising both list traffic and the
collective blood pressure of subscribers and list administrators.
In the fall of 1997, list founder Brent Chapman joined a startup
company as a key player and realized that he would have little
attention left over for anything else. In his own words:
"Firewalls and Firewalls-Digest are very popular, high-volume
mailing lists, and they take a fair amount of time and effort to
maintain. Life in a high-profile Silicon Valley startup doesn't
leave much time for anything else, though, so Great Circle
Associates is going into hibernation. Therefore, after five and
a half years and 111+ Mbytes of discussions spanning 45,517
messages and 3018 digests, the Firewalls and Firewalls-Digest
mailing lists are moving to a new home at GNAC, which is a
consulting and managed services firm based here in the Silicon
Valley that I think highly of." [0]
Given the explosive growth of the list over the years, and the
demands on Brent's time, it is very much to his credit that the list
was still functional up to that point. Now it was GNAC's turn to re-examine the list and find out how to bring it back up to speed.
The Existing System
"Like a crystal cathedral afloat on the tide comes a mountain
of ice
on the course to collide, while
passengers sleep thinking
God's on their side.."
- Peter Schilling, "Terra Titanic"
The Firewalls list environment that GNAC inherited turned out to
be, as we expected, a complex system of many moving interdependent
parts. We quickly discovered that the core of the Firewalls list
was the expected Majordomo list manager [1] wrapped around a dual-sendmail queueing structure. To optimize the handoff between
majordomo and sendmail, Brent had set up a special outbound queue area
for list traffic. The sendmail_command and sendmail_command_flags in
majordomo.config were modified to implement a queue-only sendmail [2]
in a custom queue area. In addition to the basic Majordomo
processing of the lists, part of the "Firewalls list" functionality
was providing archives via FTP and HTTP. The Mhonarc [3] text to HTML
converter and some scripting glue took care of the web-accessible
archives, and scripts regularly copied standard Majordomo archives
into an FTP hierarchy. There were a number of watchdog scripts to
warn about majordomo list processing malfunctions (e.g., list
truncation), as well as some behind the scenes scripting that created
a meticulous "clean archive" of the list postings for paranoia's
sake. While there were certainly intricacies, we could see that
the basic structures were sound, and working well enough on the Great
Circle server.
Where Angels Fear to Tread
We had made some detailed queries about the Great Circle
server, looking for load patterns and other duties performed by the
machine. We'd chosen a similar (in fact, more heavy-duty) server for
our purpose and felt confident that it could handle anything that the
older Great Circle server had been handling. For the
configuration of the server, we decided to make minimal changes to the
operating systems and messaging configurations while the list moved.
We carefully prepared a tarball from Brent's server, installed it on
our host, and did the basic hostname customizations required to make
it run. The preliminary tests looked good. Mail queued into the right
places, appeared in archive directories, FTP storage, web pages.
Digest files grew. We were ready for prime-time. We arranged a
special "test mode" that would simulate list traffic [4] and, with
great anticipation, we turned the key. We knew that there was a heavy
spam load, and a lot of traffic, but we were on a faster server with
more disk spindles, greater memory, and a wider network pipe. Nothing
could go wrong. Go wrong. Go wrong. [5]
In the Cold Light of Day
Test messages would never make it out of the queue. The
server would chronically lock due to forking problems. There were
crashes and lockouts due to memory problems. Well, this is why we test
things in the first place. The dismal failure of the first
"production test" caused us to re-consider our stand on keeping the
list management machine and software as close to the original as
possible. We realized that we needed to go back to "square zero" and
examine the fundamental structure of how the message flow worked.
After a couple of false starts, detailed in subsequent sections, we
soon had messages turning around in record time. At that point we
turned our attention to the secondary challenges which we we had
inherited with the management of the list, bounce mail and spam.
These at least had been the focus of directed planning for future
enhancements. The amount of bounce mail that a list of the size
and volume of Firewalls can produce is phenomenal. Interspersed with
the bounce mail would be real requests from real people who needed
something from the list managers. We had to find some way to ensure
that we were responding in a timely manner to these requests without
getting overwhelmed by the bounce traffic. We also knew that the
list had become a favorite venue for spammers sending their useless
and annoying missives. Stopping the spam while still allowing
legitimate posting would pose its own challenges, due not only to the
sheer number of subscribers but to the percentage of "subscribers"
which were actually mailing lists rather than individuals. The
volume of messages to the list over time is shown in the graph below.
Shortly after GNAC took over the list the spam problem reached a
critical point, traffic to the list was high and consisted mostly of
spam and complaints about the spam. Immediately after we implemented
spam-control measures, list traffic dropped to an uncharacteristic low
for a while before resuming more normal levels, but without the spam.
Figure 1: Daily message deliveries.
Host Tuning
"Close to the middle of the network, It seems we're looking
for a center. What if it turns out to be hollow? We
could be fixing what is broken."
- S. Vega, "The Big Space"
The Great Circle server had been largely untuned, which surprised
us in light of the implied message delivery load of the Firewalls and
Firewalls-Digest lists. However, in the interest of minimizing
changes, we had gone with an "out of the box" BSD/OS config. We had
also slavishly copied the configurations of the Firewalls-specific
application services in our zeal for compatibility. As we
discovered, our server was doing much more than the original server.
As you will see below, the Great Circle server was not actually
delivering all the messages to the subscribers, but instead was using
relaying services of a large ISP for the actual delivery. Our machine
was consequently not tuned correctly to deal with the memory and
process usage profile that this task required. We first became
aware that the original machine was relatively un-tuned when we
discovered that the sticky bit was not set on the system copy of Perl.
From there we did further digging, and decided we needed to go over
the entire system and analyze it from scratch. [6,7]: Here are the
major changes we introduced. None of them is necessarily dramatic in
impact, but together they represent considerable improvement.
-
put operating system and application binaries on different disks
-
doubled our swap space
-
balanced swap between disks
-
set sticky bits on Perl, sendmail, other key system apps
-
set sticky bits on all Majordomo binaries & scripts
-
rebuilt kernel and upped syslimits.h variables (MAX_CHILD, NPROCS)
-
installed a cacheing named [8]
Later we would come to double the physical memory and
increase KMEMSIZE to handle some unusual custom processes that we will
be describing below.
Message Delivery
"I've been standing here waiting,
Mr. Postman, so-o-o patiently - for just
a card or just a letter ..."
- The Marvelettes, "Please, Mr. Postman"
Once we had the machine tuned, we turned to the list processing
itself. Message turnaround time had been in the order of days before
we took over the list. It was still at that order of magnitude, and
when the message volume was high, our outbound queue was growing
faster than mail was getting delivered. There was potential for
serious backlogs that would cripple the list. Mail basically
wasn't moving. We knew that historically the list was plagued by slow
mail, but that the queues didn't back up too badly. Why was our mail
backing up? We went back to the Great Circle server to find out. The
answer turned out to be our choice of "smart host" in the sendmail-lists.cf file, which we had blithely customized to work with the usual
GNAC environment. As we mentioned earlier, Majordomo was
configured to use a queue-only sendmail (designated "sendmail-lists") for message generation. A separate sendmail daemon would
process that queue and keep it moving. We discovered that for the sake
of expedient message processing by Majordomo, all recipients of the
message were packed into a single RCPT line. Those of you who have
dealt with sendmail extensively are wincing right now, aren't you?
We also discovered that, in a neat private arrangement dating back
several years, Brent had arranged for his servers at greatcircle.com
to have relaying capabilities through the UUNet mail servers. Thus the
Great Circle sendmail-lists.cf file merely specified "mail.uu.net"
as the smart host, causing everything to be forwarded to it for
processing. Due to the way UUNet round-robins its mail services, this
would effectively spread out the processing of these troublesome
messages with monstrous RCPT headers. Of course, GNAC did not
have the option of passing those messages to UUNET. We would have to
deliver these messages directly, 4K of RCPT addresses or not. While
GNAC has an excellent mail infrastructure, their core business does
not involve ISP-style mail for thousands of individual subscribers.
Thus GNAC did not have the quantity of dedicated mail-delivery
resources to simply toss the messages into the network and let them
go.
Chunk-Style, Just Like Home-Made
A message with over 4000 recipients can take literally days
to deliver. A sendmail instance processing the message will work its
way laboriously down the recipient list, pausing at every time-out. It
may run out of resources and die, causing a new sendmail to take up
the torch. No problem, you say, since the new sendmail instance can
use the xfNNNNN queue file to pick up where the old one left off. Yes,
but first it has to retry all the "deferred" hosts that timed out.
Even if the messages are being farmed out to multiple servers, each
individual message is going to reach individual subscribers in a
highly non-deterministic fashion. In analyzing our logs and
transfer status files, we found that message deferrals would typically
be due to remote name servers or mail servers failing to respond
before timeout. Due to the large and diverse population of the list,
we would see rather shocking ratios of failures to successful
deliveries. Many of those failures were multiple failures trying
repeatedly to deliver the same message to the same site. At an
architectural level, we knew that we had to get away from the multi-thousand RCPT lines business. We also knew that we couldn't simply
force one recipient per message without making the sendmail queue
directory so large that directory search time would became a
significant factor. Directories with over 10K nodes are generally
undesirable [6] and at over 4K subscribers we would quickly flood the
queue directory. Initially we assumed that we would get our
biggest "win" by employing a program such as mailcast [9] to batch
and sort the recipients by domain and MX record. Mailcast would simply
queue them up and send one nice copy off to the right host and we'd
shake each others' hands and go off to hoist a cold one or two.
Imagine our consternation when we discovered that in fact out of
approximately 5,000 individual subscription addresses, some 4,000 were
in fact unrelated by host, domain, or MX record. For the roughly 4,000
digest recipients, we found about 3,500 uniquely unrelated addresses.
Ouch. For us, this approach was largely indistinguishable from "one
recipient per message." Since there was not a strong natural
grouping between addresses, it seemed that arbitrary recipient
chunking would be the way to go. We immediately thought of bulkmail
[10], a mail-sending utility that can perform chunking on huge
recipient lists. As we looked into the specific configuration of
bulkmail, we found that bulkmail and majordomo were not trivially
compatible. Majordomo wants to invoke a mail command and send a
message to it. Bulkmail wants to read in files with a message and a
recipient list. We spent a couple of lunches wrangling over which of
them to hack to accept the other's view of the world, and how exactly
to structure the changes to minimize future-release porting issues.
Any way we looked at it, it looked ugly.
The Portable Queue
At this point, having gone far enough down the rathole to
smell cheese, we popped back up into the sunshine to re-examine the
original goal, namely chunking the messages into deliverable size. We
realized that we were already producing a clean, queued message with a
highly well-defined structure [11] in a place that we could control.
There was no reason that we had to perform the chunking at message
generation. We could do it right in the queue itself. Before
beginning the move of the Firewalls list, we had done some preliminary
mailflow architecture. Our original plan was to move the list without
structural changes, then to apply our idealized architecture in
careful stages. Based on the production testing, we clearly had to
accelerate things quite a bit. One of the original elements we'd
planned to introduce was time-based queuing, where messages are
recursively sifted among various queues based on how long they have
been pending [12]. Reading up on this approach, we were reminded once
again of what every postmaster knows: queue files are portable.
One of the standard postmaster rites of passage is dealing with a
major multi-day mail backlog on your bastion host. You eventually
realize that the most sane thing to do is to turn off incoming mail,
move all the queue files into a holding directory, then turn on mail
again. Meanwhile, you do a little quick scripting to sort things
based on which internal mailhubs are in the envelope headers, make a
few tarballs, and just FTP them down into the right mailhub's queue,
unpack, voila! We decided to go one step further and "MIRV" [13] the
queue files.
Split Personality
We turned off the "sendmail-lists" invocation of sendmail
and replaced it with a cron job called "qsplit." The qsplit Perl
script runs every 5 minutes and examines the sendmail-lists queue
directory. Each queue file is parsed. The unique portion of the name
(e.g., "ABC12345" in "qfABC12345") is stored as $ident and used to
generate new qf files. If the number of RCPT lines in the qf file
exceeds the qsplit variable $CHUNKSIZE, the message is processed into
multiple messages of $CHUNKSIZE recipients and zero or one messages of
less than $CHUNKSIZE. Each new qf file has a sequential number
appended to $ident. Thus the first split file from "qfABC12345"
would be "qfABC123451," then "qfABC123452" and so on. Since
sendmail will generate unique queue file identifiers within a given
sendmail queue area, using this method guarantees unique identifiers
for split queue files. Qsplit is also configured to know about an
arbitrary number of sendmail queue directories. If the number of
recipients in a parsed qf file is less than $CHUNKSIZE, qsplit will
move the message into one of the preconfigured queue directories. A
round-robin effect is achieved by keeping track of the last queue
directory into which a file has been placed and putting the next one
in the subsequent directory, wrapping around as necessary. For
efficiency, each "new" df file is merely an appropriately numbered
hard link. This is particularly important for the Firewalls-Digest
postings, where the df file can be quite large. Note that since hard
links will not work across partitions, the "sendmail-lists"
directory and the processing queue directories must be on the same
filesystem. Qsplit of course removes the original qf/df files after
the splitting is accomplished. This is why we use hard links rather
than symbolic links, since hard links have no concept of "the
original file." Splitting the qf files in this way had a
dramatic effect on message turnaround time, as the following graph
shows. For more than a week before we finally settled upon and
implemented our splitting solution, we had been manually splitting qf
files and distributing them as described. That period is seen in the
graph as a low before a final spike.
Figure 2: Message turnaround time.
Spawn 'til You Die
Each of the queue directories (10, at present) runs separate
instances of sendmail. All of the directories are managed by a shared
spawning daemon, called simply "spawn.pl". Some experimenting was
necessary to find the right timing to use within the configurations,
so various copies named things like "slow-spawn" were created for
test runs. The queue management daemon (spawn.pl) spawns as many
copies of itself as there are queue areas. It then watches its
children and re-starts any of the spawned processes that die. Each of
these child spawners is chartered with keeping ten sendmail daemons
running to process its queue area. The child spawners keep track of
their sendmail children, restarting a new sendmail whenever one dies.
There is logic built into the spawners to check configurable
variables for the load average on the system, and the amount of memory
available. If the load is too high, or memory too scarce, the child
waits until there are more resources available before starting a new
process. The initial spawning of the children processes themselves is
also subject to the same limitations. The load average limit in the
spawner is set lower than the sendmail threshold, since starting a
sendmail will cause a load average spike that might cause sendmail to
not do anything once started. In addition, a variable controls the
timing between each child spawner or new sendmail, so as to minimize
load disruption. To further speed things up, we also implemented
sendmail's host-status feature. This creates a directory structure
containing information about when a sendmail last tried to contact a
given host, and whether it was up or down. If it is down, sendmail
doesn't try that machine again unless the specified re-try timeout has
expired. We used the sendmail default of one hour. The trade-off
between the massive amount of disk access that this caused and the
saved processing and wait time has proven to be worthwhile. Setting
all of the sendmails across all of the spawn-managed queue areas to
use the same host-status caching has given us even greater efficiency.
Less than half a day after implementing this new queue processing
method, only 7,000 recipients out of the initial 300,000 were still in
the queue. All the recipients remaining in the queue were for
"problem" addresses. At some point in the future, we may implement
time-based queues as a subset of the spawn-managed queues.
Design Trade-offs
The astute reader will notice that it requires two passes of
qsplit for a message to go from its initial bloated qf file in a
holding directory to a chunked qf file in a live sendmail directory,
awaiting delivery. This means that there is guaranteed to be at least
five minutes, possibly as long as 10 minutes, between the generation
of a list message by Majordomo and the earliest possible outbound
opportunity for the message. This is quite deliberate, for two
reasons. The first, as you might guess, is simplicity of coding. Given
that we were under a deadline to announce the list changeover, and
that this level of rearchitecture had been slated for several weeks
down the road, it was an expedient choice. The second reason is
directly functional for list purposes. Historically the Firewalls list
has been plagued by flame wars of varying length and duration. We had
been told that introducing a slight delay into message propagation,
within reason, was the most expedient way to minimize the occurrence
of flamefests. Exchanging one-liners over a few hours rather than a
few minutes tends to spoil a bit of the fun and allow cooler heads to
prevail. Given the requirement for slightly delayed propagation, we
chose to retain our 5-10 minute granularity rather than try for a more
immediate delivery.
List Management Issues
"So when I dropped it in the mailbox, I sent it 'Special D'
Bright and early next morning it came right back to
me."
- E. Presley, "Return to Sender"
The outbound mail was only the tip of the iceberg. We discovered
that the Firewalls list generated an astounding 80M or more of bounce
mail daily. Keeping bounce mail from overrunning list traffic had been
the primary reason why Brent went to a dual-sendmail system for the
list years ago. The sendmail-lists was set up for outbound mail
only, with a conventional sendmail receiving incoming list traffic and
the plethora of bounces. Brent and his compatriot, Michael C. Berch,
had put together a series of scripts for winnowing the postmaster
wheat from the bouncing chaff. While the scripts identified many user
queries and passed them on for human action, the bulk of the mail was
discarded. This meant that secondary bounces would go uncorrected and
trigger new recursive bounces. Our plan from the beginning was to
isolate bounce traffic even further, putting in a third separate
sendmail structure solely for bounces. We would accomplish this by
defining a virtual interface on the Internet-facing ethernet port and
assigning it to "bounces.gnac.net". By tagging outgoing mail with
From and Reply-To addresses at this host, we could control bounce
traffic. Automation scripts have been implemented in Perl, and
have proven able to handle the formidable task of crunching through
the huge volume of mail. We originally explored queuing the bounces
via procmail as each message arrived, but quickly found that the
overhead of calling procmail for each inbound message made batch
processing of the bounce mail a better solution. The scripts, run out
of cron, are explored below.
Automation of Bounce Handling
Since we had made the decision to automate wherever
possible, we designed the script to identify and sort each message
according to its potential for automation. Thus we arrived at three
"bins" into which to toss processed bounce mail: "HUMAN,"
"AUTOMATABLE," and "JUNK." The first category, JUNK, is for
things which need throwing away. In particular, bounce messages to a
bad address frequently generate second-generation bounce messages. We
had planned on a scripting solution for these as well, but were
pleased to note the "confDOUBLE_BOUNCE" option in the new sendmail
8.9 configuration. [15] Designed to catch just such occurrences, this
option will let you specify an address, such as "| /dev/null," for
these. To save on general I/O and processing wear and tear, however,
it would be desirable to add a double-bounce ruleset and reject these
messages right at the check_compat stage. [2] The second category,
AUTOMATABLE, is for messages which will eventually be handled by a
programmatic response. A good example of this is the all-too-frequent
"I unsubscribed but am still getting mail, help, get me off this
list" query. A script will eventually be written which will pull out
the sender's name and search for it in the database, then send off a
canned reply describing message propagation and the results of the
search. Of particular interest in this category are routine
bounce messages. We are in the process of adding a bounce manager
which will extract addresses from a standard bounce message and
process them. By hashing on the address and updating a counter, the
script can quickly determine whether or not this is a repeat bounce
offender. If so, the address can be automatically removed after a
certain number of bounces. This represents a great improvement over
the old "bouncer" list functionality which required hand editing of
the list files to accomplish. The last category, HUMAN, is for
what is left. These are items that usually require human intervention,
either to answer the question posed, or to figure how this particular
item can be automated successfully as above.
Spam
"There is one thing you must be sure of, I can't take any
more!"
- Peter Gabriel, "Shock the Monkey"
Spam was a serious problem on the Firewalls list. The machine that
the list runs from does not relay spam, but spam is sent directly to
the list. When we took on the list, this was one of the issues we
intended to tackle. Based on anecdotal information, we did not expect
traditional solutions to scale to cope with the Firewalls list. As we
looked into our options, an increasing quantity of Firewalls list
traffic became discussions on how to make the list usable again from
the perspective of the subscriber community. One conventional
approach to cutting down on spam is to block certain domains or IP
address ranges from successfully sending mail through your sendmail
daemon. This can either be done through sendmail configuration, or at
the network layer using Vixie's black-hole BGP feed, or simple
filters. These approaches would not work in our case because much of
the Firewalls mail was still being forwarded from greatcircle.com, and
that mail was a mixture of real messages and spam. The simplest-sounding solution was to make the list a closed list, where only
members can post. However, there were two problems with this.
Firstly, a large number of the lists subscribers were local exploders
at remote sites, whose membership we had no way of knowing. We did not
want to prevent these people from posting, or to force them to all
subscribe directly to the list. Secondly, Brent had concerns about how
well the majordomo feature to do this would scale to a list this size,
which is the reason that he had never activated it on his version of
the list. He felt that the interlocking programs that make up
Majordomo's interpreted Perl core could not feasibly keep up with the
traffic.
Sendmail Database Approach
We also considered taking this same idea, restricting
posting to list members, to a lower level, and having sendmail do the
work via a database lookup mechanism. The members of both the
Firewalls and the Firewalls-digest list would be automatically added
to the database. In addition, a list called Firewalls-post could be
created for offsite list exploder members who wish to post. The
Firewalls-post list is maintained by majordomo so that subscribe and
unsubscribe requests can be handled automatically. All three lists
would be made into a generic key/value sendmail database at regular
intervals by a cron script. We could trigger a database lookup
only if the recipient was "firewalls" or "firewalls-digest."
Otherwise we would end up screening out routine majordomo requests or
postmaster mail. By positioning the lookup in the check_compat phase
of mail processing, we would be able to reject unauthorized postings
directly at the SMTP connect. Note that system addresses such as
"firewalls-owner" and "majordomo" need to be included to allow
normal majordomo operation. These must be qualified with the full host
and domain name in order to prevent spoofing, e.g.,
"majordomo@lists.gnac.net" rather than just "majordomo."
Using Majordomo
Clearly the Firewalls-post list that was suggested for the
sendmail solution to the "invisible subscribers" problem could also
be applied to majordomo. Just to make sure that we weren't re-implementing the wheel for no reason, we also ran some tests to
evaluate the overhead of using the majordomo "closed list" feature.
After running a set of tests using the majordomo restricted-posting lookups, we found that on our machine it took about three
seconds to perform the lookup. We decided to accept this as part of
the system overhead, and implemented this feature over the sendmail
based one. We have considered implementing the sendmail based variant
of this on general principle, however, and to evaluate its use as a
solution for other large lists not using majordomo. The final
component is the communication piece. We forewarned the list
membership that we were going to implement this feature, and gave the
message a couple of days to reach everyone. In addition, when
majordomo rejects a message due to this feature, directions on
subscribing to the Firewalls-post list are returned to the sender.
Potential Future Problems
Strictly speaking, a clever spammer could hand-set the
sender to be a legitimate sender such as "majordomo@lists.gnac.net".
It would be wise for us to include a "remote is identifying as me"
ruleset as part of this, so that this kind of spoofing would be caught
and detected with prejudice. [14] If spammers monitor the list and
start spamming under spoofed names of legitimate posters, we would
have to up the ante and turn on the sendmail features which do host
authentication via DNS. [1, 15] While this would impose more of a load
on the server, our split sendmail configurations would allow us to
implement this on the main inbound sendmail only, so that the
performance hit would not be too severe. We hope to avoid this, as
many legitimate sites have business reasons to aggregate traffic or
architect their mail infrastructure in such a way that they do not
comply with strict sendmail checking.
Futures
"All the way to Malibu
from the Land of the Talking Drum: Just look how far -
look how far we've come!"
- Don Henley, "Building the Perfect Beast"
The cutover day for the Firewalls list move was April 15th, a red-letter day in its own right. Other than Brent's dual-sendmail
structure, none of the facilities mentioned in this document existed
on that date, nor had they been planned. As of the writing of
this paper, we are processing an average of 8184 messages per day.
Turn around time for an individual message has dropped from pre-queue-split highs of 5-8 days to less than one day, and in many cases less
than half an hour:
Statistics from Wed Apr 15 17:48:17 1998
M msgsfr byt_from msgsto bytes_to Mailer
0 0 0K 504929 2241822K prog
1 0 0K 3499 25520K *file*
3 402707 1831829K 270 914K local
4 51340 268777K 4584 59086K smtp
5 74148 421120K 2122 5626K esmtp
9 64 246K 0 0K uucp-old
======================================
T 528259 2521972K 515404 2332968K
date: Wed Jun 17 18:35:30 PDT 1998
In order to have better tracking of email flow through the list,
we are intending to implement a script to take hourly snapshots of
sendmail.st, process them, and feed the data to MRTG [16]. We have to
do it that way since start/stop is impossible with so many sendmails
all the time. The script will need to aggregate the sendmail.st
files of the variously spawned sendmails. They are separate files
because sharing the same .st file could slow sendmail down
unnecessarily as it waits on locking on the sendmail.st file.
When this is implemented, we intend to make the MRTG graphs available
on the list website.
Further Processing
For lists with more "real-time" needs and less concern
about flame wars, qsplit could be rewritten to deposit split files
directly into processing queues at the time of splitting. To
improve message flow further, qsplit and spawner could be applied
recursively to create time-based queues working with the existing
spawn-managed queue directories. Messages over a certain age would be
moved to a time-based queue and then split to a smaller $CHUNKSIZE. By
employing progressively smaller chunks, one could force the qf files
down to one RCPT per message by the time they reached a particular
age. At this point, problem addresses would be identifiable
automatically by their queue position. This could enable management of
bad addresses completely outside of the traditional bounce/postmaster
processing used by most list admins.
List Exploders
We'd like to eventually add some special-case handling for
exploder lists. When we receive a generic individual user bounce via a
remote exploder, it is very difficult to find the origin exploder and
pass on the error to that list administrator. In fact, there is no
distinction made in the Firewalls or Firewalls-Digest lists between
individuals and exploders, so there are undoubtedly many exploders
which are completely opaque to us. One approach would be to
increase dramatically the number of spawn-managed processing queues
and set qsplit to always chunk RCPTs to one per message. We would
further modify qsplit to add an RFC-822 compliant custom header [11]
containing the envelope recipient to each qf file. This header line
would be preserved in any remote mailer bounces, enabling us to see to
which address the original message was sent. A bounce message whose
"user not found" error did not match the address in the X-Custom-Recipient header could trigger a custom message to the address in the
header, or be referred to a human administrator for hand-processing.
Availability
The scripts described in this paper will be made available
at https://www.lists.gnac.net after the publication of this paper in
December, 1998.
Conclusions
"Don't know much about history ..."
- Sam Cooke
Look before you leap!
Taking over the management of the Firewalls list seemed like
an attractive proposition. It should be easy - just copy over Brent's
setup on to faster equipment with better Internet connectivity and
you're done. As we soon found out, it was not that simple.
Follow First Principles
There was no magic involved in turning the list into
something that now runs smoothly. We stepped through a number of
system administration basics. When the machine was in trouble, we
looked at the hardware to see if more memory would help, and we looked
at the kernel parameters and tuned them appropriately. After that,
understanding the problems in detail and how the various solutions
would affect the system allowed us to choose the correct course of
action. Questioning our assumptions and gathering real data on which
to base our decisions also proved worthwhile. Experimenting with
different options off-line to get real data without affecting the list
is always the right approach for a production system.
Work smarter not harder
The machine was not CPU-bound, so throwing higher-end
equipment at it would not help. It was also not even coming close to
saturating our connectivity. What we needed to do was find a way to
make the system work smarter. To that end, having understood the
problems we looked at ways to split the recipient lists into smaller
chunks, and at how to get multiple sendmail processes to constantly
churn through the queues.
Communicate, communicate, communicate.
An important part of our work during the "hard times" when
we had just taken over the list was to communicate with the readership
and let everyone know what was going on, and that we were working on
fixing each of the problems that arose. There were many people
interested in helping out, and we got many interesting pointers from
folks on the list (thanks folks!). Letting people know the list of
problems that you are working on, and when you realistically expect to
have them fixed is something we all need to remember to do. The
implementor feels less pressured and the "customer" feels plugged in
and listened to.
References
[0] D. Brent Chapman, https://www.greatcircle.com/lists/firewalls
(and posted to the Firewalls list).
[1] "Majordomo: How I Manage 17 Mailing Lists Without Answering
'-request' Mail," D. Brent Chapman, USENIX, LISA VI
Proceedings, October 1992. ISBN 1-880446-47-2.
[2] Sendmail, 2nd Ed., Brian Costales with Eric Allman,
O'Reilly and Associates, 1997. ISBN 1-56592-222-0
[3] Mhonarc, a Perl successor to mail2html, https://www.oac.uci.edu/indiv/ehood/mhonarc.doc.html.
[4] Our "test mode" consisted of two parts. First, a parallel
feed of Firewalls traffic provided by Brent. Second was merely the
sending of a real message to the list recipients as part of a dry run.
The message would be a precursor to the official announcements already
drafted by Brent Chapman (Great Circle) and Christine Hogan (GNAC).
[5] Westworld, Metro-Goldwyn-Mayer, https://us.imdb.com/Title?Westworld+(1973).
[6] System Performance Tuning, Mike Loukides, O'Reilly and
Associates, 1992. ISBN 0-937175-60-9
[7] Sun Performance and Tuning, Adrian Cockcroft, Sun
Microsystems Inc., 1995. ISBN 0-13-149642-5
[8] Managing Internet Information Services, Cricket Liu,
Jerry Peek, Russ Jones, Bryan Buus and Adrian Nye, O'Reilly and
Associates, 1994. ISBN 1-56592-062-7
[9] Strata Rose, VirtualNet Consulting, Dave Ilstrup, WebAware;
unpublished work 1995.
[10] Debian bulkmail, https://molec2.dfis.ull.es/debian/Packages/stable/mail/bulkmail.html
[11] RFC-822: Standard for the Format of ARPA Internet Text
Messages, D. Crocker, August 13 1982.
[12] "Tuning Sendmail for Large Mailing Lists," Rob Kolstad,
USENIX, LISA XI Proceedings, October 1997. ISBN 1-880446-90-1.
[13] Multiple Independently Targetable Re-entry Vehicle,
https://www.janes.com/defence/resources/glossary/defres_glosmi-ml.html.
[14] Sendmail: Theory and Practice, Frederick M. Avolio
and Paul A. Vixie, Digital Press / Butterworth-Heinemann, 1995. ISBN
1-55558-127-7
[15] Eric Allman and Sendmail Inc staff, https://www.sendmail.org/ web site.
[16] MRTG, (Multi Router Traffic Grapher), Tobias Oetiker
and David Rand, https://ee-staff.ethz.ch/~oetiker/webtools/mrtg/mrtg.html.
|