|
11th Systems Administration Conference (LISA '97)
   
[Technical Program]
Automating 24x7 Support Response To Telephone RequestsAbstractDemands for uninterrupted availability of systems that were hitherto not viewed as critical to the success of the enterprise has placed a burden upon support infrastructures for those systems. The small-to-medium help desk for a system that comprises part of an enterprise information system is under increasing pressure to provide round-the-clock support, while staffing budgets may not have caught up with the fiscal realities of covering second- or third-shift on-site personnel. While the volume of calls outside normal hours may be small, a guaranteed turnaround and response to emergencies is essential for many IT operations to gain the trust of their customers. This paper describes a system for automated answering of the help desk telephone during non-peak hours, and for notifying on-call staff of emergencies within minutes. The system uses two voice-capable modems on a well-maintained computer for presenting to the caller a typical phone menu hierarchical menu which may include an option to record a message about some type of emergency. In this event, the system will notify the staff who are on call at that time for that type of emergency, contacting them by means they have specified in advance, such as voice telephone, numeric or alphanumeric pager, or other means. The system is currently undergoing testing in the Enterprise Information System File Service at JPL. MotivationAfter years of preaching the advantages of current technologies such as distributed systems, management acceptance comes with the price of making those technologies ubiquitous, easy-to-use, and reliable. The Enterprise Information System (EIS) Project at JPL was established in 1996 in an ambitious plan to establish a unified and coherent strategy and implementation of computing infrastructure to serve the laboratory's reengineering plans. Early adoption and roll-out of AFS (a distributed filesystem from Transarc), and a current migration path to DCE/DFS are some of the components of that implementation. Making this technology accessible and even transparent to a laboratory population highly diverse in its computer expertise and extremely heterogeneous in its platform usage demands a capable and responsive help mechanism. The File Service element of EIS, having rolled out its technology fairly soon, set up an e-mail alias and a mnemonic telephone number for support. The nature of space missions and international collaboration means that JPL is open 24 hours a day, every day of the year. If flight projects whose personnel might work during off-hours were to embrace the new technology (a key aim for the reduction of overall costs and achievement of stretch goals), these projects and their personnel needed to know that their emergencies during off-hours would receive prompt attention. However, the start-up nature of the File Service meant that the number of customers was too small to justify funding for continuous on-site assistance. In order to break this Catch-22, it was determined that on-call support would suffice, since response time could be fairly rapid. This led to a ``pass the pager'' scheme for the on-call person of the week (or month). This has a number of disadvantages:
It is not acceptable to require customers to memorize multiple numbers, or use different numbers at different times, or look up the current phone number over the web (since the very reason they may be calling is to report that the network is down!). SolutionWhat is needed is a mechanism for connecting the calls coming in to a single number with the multiple different and time- dependent methods of contacting on-call staff. A large enough organization could afford to staff the helpline with a person around the clock who need know no more than who to reach for a particular problem at any given time, and how. Some organizations cannot accommodate the budget necessary to pay for second- and third-shift personnel, and automation suggested itself in this case. We proposed and constructed a system for automated answering of the help desk telephone during non-peak hours, and for notifying on-call staff of emergencies within minutes or seconds. No on-site personnel are needed provided that the on-call staff at any time are within a suitable distance for emergencies that cannot be resolved from wherever they are at the time. The scheduling of the on-call staff can be done by the staff themselves, and reports generated from the on-call schedule highlight any gaps in coverage. Nothing in the design of the system is inherently specific to the particular project it serves. RequirementsThe requirements we imposed on the system were as follows (KEY: R: Required for initial system; RF: Required for some future system; D: Desirable for initial system; DF: Desirable for some future system): Schedule
Input
Output Methods
Paging - Alphanumeric
Paging - Voice
Menu System
Commercial ProductsThe first time we conducted a search for COTS products to perform this task, we were unable to find any costing under $100,000 - we did not investigate systems costing over $100,000 which might have had these capabilities. We found several inexpensive packages offering phone menu capabilities for personal computers, but not providing notification to voice, pager, and email according to a schedule. A more recent search conducted for this paper turned up the Tel Alert product from Telamon, Inc., [6] which performs the phone menu functions in addition to status monitoring and contacting by pager, and has some scheduling capability; although it requires proprietary hardware it is worth evaluation before implementing an in-house solution. VoiceGuide from Katalina Technologies [7] performs most required functions (without the scheduling sophistication), plus many others (such as caller ID detection), at a very low price, but on Windows only. It could be used to call out to an external program which implemented the scheduling algorithm (perhaps remotely on a Unix box), although if the helpline application requires a high degree of robustness, most people would not choose to run it on a Windows machine. Among freeware products, tpage [5] allows scheduling of people to be paged but not with the same flexibility and readability that this system provides, and only contacts pagers, not voice phones. ConstructionThe key to the system is a voice-capable modem such as the ZyXEL Omni 288. A daemon waits for an incoming call and when one arrives, presents a typical hierarchical phone menu to the caller (``Press 1 if you know the NAME of...''). In fact, the daemon is simply playing a series of sound files that have been placed in a particular directory and given names corresponding to the keys the user should press to select them. Subdirectories containing more sound files are used to represent deeper levels of the menu, and the traversal follows some simple rules for checking the existence of files or directories with names matching particular expressions. The directory hierarchy of the this menu looks something like Figure 1. The algorithm for processing this tree is very simple; see Figure 3. The directory tree for the phone menu is part of the hierarchy for the tool, which looks something like Figure 2 (not all files displayed). Figure 1: Menu directory hierarchy. Figure 2: Tool hierarchy. Figure 3: Tree processing algorithm. An advantage of this approach to constructing a phone menu system is that it is so easy to create additional levels that an entire support staff can be given write access, allowing them to create new sound files and menu levels if they wish. This distributes the job of maintaining the phone menu system among the people with the domain expertise appropriate to each particular part of the hierarchy. In addition, responses can be generated on the fly according to information derived at call time. The menu can call scripts that can execute arbitrary Perl commands, for instance, to ascertain the status of various institutional servers, and modify the playback content accordingly. The daemon runs each script with a Perl eval command so that a subroutine library is imported into the namespace of the script enabling it to perform certain phone menu- specific operations easily. The key point here is not that a phone menu system can respond with dynamic information, but that the design of this implementation allows it to be done so easily using only standard tools (simple Perl, and a microphone hooked up to a recording program). Listing 1: Scheduling examples. One operation a caller may wish to perform is recording a call for help. The phone menu interface for handling this is very simple: at the node where the corresponding sound file says, ``If you are experiencing trouble with...'' the associated script contains: $record_file = nextname(); record ($record_file); schedule (<situation>, $record_file, "FIL");where <situation> is a mnemonic for the type of emergency being reported (any number of situations is possible). The first line acquires a unique name for the incoming sound file, the second guides the caller through a phone menu recording menu which allows them to review, cancel, or send a message, and the third notifies the scheduler of an occurrence of that particular type of emergency. The spool directory contains files queued for processing by helplined, the daemon which handles outbound communications. A spool file looks like: SITUATION: AFSDOWN PROFILE: PETER CONDITION: NULLCONDITION COMMAND_TYPE: VOICE COMMAND_NUMBER: 42246 COMMAND_PIN: COMMAND_ADDRESS: COMMAND_PLAYFILES: ~peter/audio/ sounds/attention.au .. ./incoming/19970825113721 COMMAND_MESSAGE: On-call SchedulingThe on-call notifier and scheduler is an independent component which could be implemented entirely separately from the phone menu system; it is not necessary for emergency notifications to come from the phone menu system, since any process which executes the schedule() subroutine can make such a notification. Thus it is possible to cause on-call notification in the event of automatically- detected or triggered events such as a UPS alarm signalling power failure, a network health monitor alarm, or SNMP events. When an emergency notification occurs, the scheduler parses a schedule file to see who is on-call at that time for that type of situation. The schedule file contains statements describing four entities:
When it has determined which profiles are active for the given situation at the current time, the scheduler runs the profile to contact the on-call support person by pager, telephone, or whatever other method is specified in their profile. The schedule file is written in SGML; a DTD was created to describe schedules. This provides two advantages: firstly, that the input and editing of the file can be done through an SGML editor which guides the user through the creation of a syntactically valid entry, and can be customized to provide a high degree of user-friendliness; thus, entries can be made by the on-call support personnel themselves. Secondly, that enforcing of the required syntax by the SGML editor leaves almost no syntax checking required of the application. Whenever an emergency notification occurs, the schedule file is read through the nSGMLS [3] parser with the SGMLS.pm module [2] and rejected if a syntax error is found (of course, the file is also validated every time it is edited). The SGML schedule in Listing 1 contains examples of each of the four statements listed above. The date range definitions of the individual holidays and the personal date ranges have been elided to save space. The SGML is not designed for human parsing nor editing, and fortunately, neither is necessary. There exist a number of products which read the DTD and provided structured editing of the source, guaranteeing that the syntax is valid. A freeware example is the SGML module for EMACS. Some Commercial Off-The-Shelf (COTS) products also allow ``style sheets'' to be constructed which hide the tags completely from the user, providing WYSIWYG editing like modern HTML editors (e.g., Arbortext's Guided Editor). When formatted for human-readable output, the statements in the schedule can be more easily illustrated in some more examples: Date RangesDAY ChristmasEve: 12/24 DAY Christmas: FROM ChristmasEve TO BoxingDay DAY BobWork: WorkWeek EXCEPT BobVacationThe first example assigns a specific date (which can be qualified with a year if necessary) to the identifier ChristmasEve, the second example assigns a range of already-assigned dates to another identifier, and the third assigns a range of dates excluding a pre- defined subrange. The first two lines would have been input by an administrator to define certain well-known dates; since the year is they will not need to change fixed-date definitions such as Independence Day from year to year. The third line would have been input by Bob, a member of the support staff, defining a date range corresponding to his default work week: the administrator-defined workweek (weekdays except for holidays), minus his personal vacation (defined elsewhere). ProfilesPROFILE Bob1: VOICE 5557458 (MESSAGE ~bob/sounds/ oncall1.au, $current), IF FAILED PAGE 5553453 (PIN 3894634) PROFILE Jack3: ONLYIFALONE IXOPAGE 5550532 (MESSAGE "I've fallen and I can't get up") PROFILE Jim2: PAGE 5550974 (PIN 63459023) THEN EMAIL jim@tarkus (MESSAGE "It's dead, Jim") The first example tries to call Bob at his home, playing first an introductory sound file recorded by Bob which can both explain things to his wife if she answers, and is timed to go past the outgoing message on his answering machine if they're not in, and then plays the message let by the caller; if there is no answer at that number, it calls Bob's numeric pager. The second example calls Jack on his alphanumeric pager, but only if he's the only one on call at the time. The third example calls Jim on his numeric pager, and then, regardless of the success or failure of the page, sends him an email message. SchedulesSCHEDULE Bob1, Jim2 FOR PWRDOWN 20:00 - 05:00 DURING WorkWeek SCHEDULE Jack3 FOR NETDOWN 09:00 - 18:00, 20:00 - 22:00 DURING Weekends SCHEDULE John FOR ANYTHING 07:00 - 01:00 DURING AllWeekThe first example attaches the profiles Bob1 and Jim2 to a late-night/early morning slot during the work week (previously defined as being AllWeek EXCEPT Holidays, where Holidays was defined as the union of the defined holidays), in the case of power failures. The second example attaches the profile Jack3 to two time slots on weekends for network failures. The third example attaches John's profile to a punishing schedule in case of any situation. We can now do two things with this schedule file: we can produce a report showing the on-call : Toggling DTR to reset modem : Configuring port parameters.. (/dev/cua/a / 38400 bps) : Initializing Modem Modem Cmd: Command Mode? ATE1V1 -> OK configure: Modem was in commmand mode. Now in command mode. Modem Cmd: Initialization String AT&FS2=255S0=0S13.2=1N0&H4 -> OK Modem Cmd: Switch to VOICE mode AT+FCLASS=8 -> OK Modem Cmd: Default settings for voice mod AT+VIP -> OK Modem Cmd: Turn up DTMF sensitivity AT+VDD=3,8 -> OK Modem Cmd: Enable XON/XOFF Flow Control AT+FLO=1 -> OK : Modem Initialized Modem Cmd: Ping AT -> OK main_loop: Waiting for a call.. main_loop: MODEM: [\r][\n] main_loop: Waiting for a call.. main_loop: RING detected! Modem Cmd: Answering VOICE call AT+VLS=2 -> VCON handle_call: Call received. handle_call: Voice::init complete. menu: Menu /afs/jpl/group/ets/src/helpline/menu say_choices: Using .../cache//afs/jpl/group/ets/src/helpline/menu/begin.init.au.zyxel Modem Cmd: Start Playing AT+VTX -> CONNECT play: Playing voice file .../cache/.../menu/begin.init.au.zyxel play: Transmission Buffer Underrun Modem Cmd: Stop Playing <DLE><ETX> -> OK say_choices: Running script begin.init.cmd say_choices: Using .../cache//afs/jpl/group/ets/src/helpline/menu/1.au.zyxel say_choices: Using .../cache//afs/jpl/group/ets/src/helpline/menu/2.au.zyxel say_choices: Using .../cache//afs/jpl/group/ets/src/helpline/menu/star.au.zyxel Modem Cmd: Start Playing AT+VTX -> CONNECT play: .../cache//afs/jpl/group/ets/src/helpline/menu/1.au.zyxel .../cache//afs/jpl/group/ets/src/helpline/menu/2.au .../cache//afs/jpl/group/ets/src/helpline/ menu/star.au.zyxel Modem Cmd: Stop Recording <DLE>! -> OK say_choices: Using .../cache//afs/jpl/group/ets/src/helpline/default/record/1.au.zyxel say_choices: Using .../cache//afs/jpl/group/ets/src/helpline/default/record/2.au.zyxel say_choices: Using .../cache//afs/jpl/group/ets/src/helpline/default/record/3.au.zyxel say_choices: Using .../cache//afs/jpl/group/ets/src/helpline/default/record/pound.au.zyxel Modem Cmd: Command Mode? ATE1V1 -> OK play: Modem was in commmand mode. Now in command mode. Modem Cmd: Start Playing AT+VTX -> CONNECT play: Playing voice file .../cache/.../default/record/1.au.zyxel play: Playing voice file .../cache/.../default/record/2.au.zyxel play: Playing voice file .../cache/.../default/record/3.au.zyxel play: Playing voice file .../cache/.../default/record/pound.au.zyxel Modem Cmd: Stop Playing <DLE><ETX> -> OK process_keys Waiting 5 seconds.. MODEM:wait: DTMF digit: "#" menu: Ready to process keypress: # menu: Valid choice. process_keys Running script pound.cmd run_script: send message 33F4E5A6-11173-1 ( PETER ) Listing 2: Logfile excerpts. coverage for any period, and we can determine who is on call for a given situation at a given time. The second capability is used when an emergency call comes in or the scheduler is otherwise triggered; the first can be used by the facility manager for planning purposes and to demonstrate to senior management proven 24x7 coverage. The same module which determines which profiles are active at the current moment has routines for producing a report between two dates and rendering it in HTML; see Figure 4 for example. The horizontal bars on the GEORGE1 tracks are blue; the other horizontal bars are green. The vertical bars are red, indicating gaps in coverage when no profile is active. This report shows that there are three situations defined in the schedule file, labelled AFSDOWN, WWWDOWN, and URGENT. Figure 4: Web Page. Implementation DetailsThe system is implemented using custom Perl 5 code for answering the phone and calling out. We experimented heavily with vgetty from Gert Doering's mgetty+sendfax package [1] and liked the scripting capabilities, but ultimately were confined by problems with data synchronization and capabilities for extension in the version we had at the time. Our system nevertheless suffers from synchronization problems where sometimes keypresses are not detected or the modem goes into an unintended state, but we believe these are tractable problems. The schedule parsing is done with custom Perl 5 code using David Megginsons's SGMLS.pm SGML-parsing module [2] and James Clark's nsgmls SGML parser [3]. Turning the schedule file into a report or a list of profiles on-call at the time was made much easier with Perl 5's object-oriented features. Where sound files do not exist for messages to be played to the caller, voice synthesis is done with rsynth [4], by Nick Ing- Simmons (usually used for error messages). A logging capability allows the operator to view the progress of the system. Parts of a log of a session recording a call for help are reproduced in Listing 2. Experience in UseThe most difficult part was communicating with the modem; we have not yet figured out how to handle data overruns or why touch tones are sometimes missed. We heard at the Perl Conference that a module for modem control is in the works and that may solve this problem. The tests for failure of an outgoing call are incomplete; we would prefer more sophisticated checking for whether a call was successful. ConclusionsThe helpline system demonstrates that the tools exist to construct an apparently complex system in a domain hitherto considered the province of COTS-ware using freeware components. The use of SGML to specify the grammar of an input file provides a significant gain in user-friendly editing and syntax checking. AcknowledgmentsJosh Wilmes, of RPI, developed the modem-answering code while working as a co-op student at JPL. The work described was performed at the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration. Author InformationPeter Scott left Cambridge University with an MA in CS in 1983 and has been working at the Jet Propulsion Laboratory in Pasadena, California, since 1986. For the last two years he has been lead development engineer on the File Service Element of the Enterprise Information System. He can be reached at M/S 301-342, Jet Propulsion Laboratory, 4800 Oak Grove Drive, Pasadena, CA 91109 and at <pjscott@euclid.JPL.NASA.GOV>. References[1] mgetty+sendfax: https://www.leo.org/~doering/mgetty/index.html. [Broken link on 970925][2] SGMLS.pm: https://www.uottawa.ca/~dmeggins/SGMLSpm/sgmlspm.html. [Broken link on 970925] [3] NSGMLS: An SGML System Conforming to International Standard ISO 8879 - Standard Generalized Markup Language. https://www.jclark.com/sp/nsgmls.htm. [4] Rsynth: ftp://svr-ftp.eng.cam.ac.uk/comp.speech/sources/rsynth-2.0.tar.gz. [Broken link on 970925] [5] tpage: ftp://gumby.dsd.trw.com/pub/tpage.tar.Z. [Probably broken link on 970925] [6] Tel Alert: https://www.telamon.com/prodinfo/telalert.html. [7] VoiceGuide: https://www.ozemail.com.au/~katalina/vgidx.htm. |
This paper was originally published in the
Proceedings of the 11th Systems Administration Conference (LISA '97),
October 26-31, 1997,
San Diego, California, USA
Last changed: 12 April 2002 aw |
|