Proceedings of the 7th Tcl/Tk Conference, February 14-18, 2000, Austin, Texas, USA

Pp. 53–60 of the Proceedings

Tcl/Tk: A Strong Basis for Complex Load Testing Systems

This paper describes a Tcl/Tk-based load testing environment which was developed for Deutsche Telekom, Europe's largest telco carrier and online service provider. Deutsche Telekom uses the system to ensure a high quality of service and availability.

The Center for Internet and Data Network Platforms (ZID), a division of Deutsche Telekom, provides the network infrastructure over which the largest online service in Germany, T-Online, is operated. Deutsche Telekom is Europe's largest and the world's third largest telecommunications carrier, and its online service has more than 3.3 million customers, a number that grew by 42 percent during the last fiscal year.

ZID is responsible for ensuring that all hardware and software components of an access network function coherently and can withstand the heavy demands of hundreds of thousands of simultaneous users, both prior to deployment and during operation. ZID must also be able to deliver a guaranteed performance, bandwidth, and availability.

ZID has used load testing technology for many years to ensure that the online services it provides maintain the Quality of Service (QoS) level that Deutsche Telekom's customers have come to expect. The objectives for the early test systems were functional testing, throughput testing, and general performance testing for a BTX-based network. Recently, enhancements for testing the latest internet technologies, such as ISDN or ADSL, have been developed. Critical test factors were defect removal before deployment and a guaranteed highly responsive system.

Since each new development stage has been accompanied by load testing, Deutsche Telekom was able to meet these goals and deliver a top quality service to its customers. The load testing technology used has evolved along with the network technology it is required to test.

Load testing is the general term used to describe placing a network infrastructure under stress to test its behavior in a production environment. We developed a fully scriptable load testing system based on Tcl/Tk. The basic principle in the design of this system is the Automatic User, abbreviated to AT; the short form is derived from the German term "Automatischer Teilnehmer" [1].

An AT simulates the actions of a human user (e.g. hitting web pages). Every AT can generate a certain load on the network, depending on the connection type (modem, ISDN, ADSL). The load ranges from 28.8 Kbit/s (modem) through 64 Kbit/s (ISDN) to 768 Kbit/s (ADSL). The latest release of the AT system allows a maximum of 225 ATs to be run simultaneously, which are then managed and controlled with a graphical user interface, providing a single point of control and monitoring. The number of ATs used in a testing environment is virtually unlimited and only depends on the hardware configuration.

Both the control interface and the AT are implemented in Tcl/Tk. The scripting language Tcl proved to be an ideal platform for implementing the AT, which freed us from designing a new high level language for the specification of test scenarios and implementing a runtime environment to drive such scenarios. Tcl's mechanism of slave interpreters was sufficient to achieve these goals. Tcl/Tk also supported rapid development of the graphical control interface.

To support its online services nationwide, Deutsche Telekom operates more than 180 POPs all over Germany. In each of the POPs, networks with specific hardware and software systems have been installed, which ensure the user's access to the internet. The introduction of new online services like ADSL requires the development of a new infrastructure for the access network, which has to be installed nationwide in all of the POPs. The requirements for such an access network are:

Designing such a network can be a challenging task. For providers of online services like Deutsche Telekom with 120 million internet connections per month, it is important to detect and eliminate deficiencies and performance bottlenecks before they deploy an access network nationwide. This is especially important because additional improvements can be very expensive, since all the POPs have to be upgraded and the users might not be able to use the services.

Deutsche Telekom has established two operation centers in Darmstadt and Ulm in order to test and optimize the entire system in a production-like environment prior to deployment. These centers are fully equipped with web servers, connection hardware, and all equipment for the planned network installation in a regional node (POP). Load testing is then performed using the AT on the complete network. Network components, including software, which allow access to the internet platform, are tested for their load-bearing capacity and stability.

Test results are used to decide how best to optimize performance and deliver new functionality, for example whether to replace a router with a larger one. The test results are also used to verify whether the vendors' components live up to their claimed capabilities. The vendors providing equipment for the network often use the results produced by the AT (in conjunction with their own tools) to isolate defects and to verify changes and fixes.

Once the developers at ZID are satisfied with the resulting network, it is deployed in more than 180 regional nodes throughout Germany. In addition the AT is also used in these locations to ensure their individual QoS once operational on a specific hardware basis. ZID relies heavily on the AT analysis tools to compare different load scenarios at different times of day. One of the major benefits of the AT is that all testing and analysis work is concentrated in one location and controlled from a single computer.

Over the past years, ZID has performed load tests with various tools to develop and optimize its access networks. Experience has shown that load tests are most effective when the real conditions are simulated as closely as possible.

Let us consider a typical user, who wants to connect from his/her PC to the internet: He or she might be a subscriber to an Internet Service Provider performing transactions such as surfing the web, doing FTP file transfers, or sending e-mail via an SMTP client. Another typical user might be using an e-commerce site to make purchases or making queries to a web-enabled database application. Therefore, her or his essential activities are:

To meet the above requirements, we opted for an open platform that supports almost all available hardware: Linux. For a flexible scripting language, we looked at Perl, Python, and Tcl. Although Perl is popular in the "system administration community", it is not as easy to learn or as handy as the other languages, especially for end users. A second important point was the GUI component, an area where Tcl/Tk is a natural choice: the other two languages lack native (i.e. out-of-the-box) GUI support and typically use Tk themselves instead of a "native" solution.

Since the idea of the AT implies managing hundreds of different processes, including inter-process communication between them, easy-to-use and powerful network support was another important criterion. Last but not least, the "glue" argument was very important. Tcl is the perfect tool for tying different components together without creating monstrous extension libraries.

A load scenario describes a complete set of transactions carried out by a specific number of users connected in a particular way to a network over a given period of time. Load testing involves managing a potentially large number of ATs executing load scenarios. It is especially important to have a means to control all aspects of testing from a central location.

The AT controller component allows the creation of an interactive environment for defining, driving, and coordinating a load test scenario. The AT controller is designed using a set of building blocks which allow a complete definition and management of a scenario. These blocks include:

The script repository contains all the scripts that have been defined by the test designer. A given script can be assigned to one or more automated users for a given scenario. The AT provides example scripts which may be extended and/or parameterized to create unique automated users. The AT also provides libraries that implement basic behavior patterns of real online users, in order to reach a higher level of abstraction. A high level of parameterization allows a script to reflect the actual behavior of a typical user without a new script having to be created, even when the basic actions taken are not the same. For instance, a time-out can be specified to simulate a user cancelling a web page fetch after 5 or 10 seconds. Pauses (or sleeps) can also be defined. The specific URLs fetched or the username/password required as input for certain operations can all be parameterized. The parameterization is done by the test designer either programmatically through the AT controller or via a GUI.

An automated user simulates a real user performing typical operations. A typical user might be a subscriber to an Internet Service Provider carrying out transactions such as surfing the web, doing HTTP GET or POST operations, downloading files using FTP, or sending e-mail via an SMTP client. Another typical user might be using an e-commerce site to make purchases or make queries to a database application through a web front-end. The AT creates a run-time environment in which a session script is run performing a work session. The AT builds the connection to the network, carries out the operations specified in the session script and disconnects from the network at the end of the session. Since the AT can be configured to connect via an analog or ISDN modem, a LAN or WAN, amongst others, the throughput measured is what a real user could expect to see. The AT is also responsible for forwarding test results to the message server and the log manager for monitoring and analysis. It also provides status information to the Session Control upon request, including such things as whether a session is currently executing, whether a script is loaded, etc.

The log manager receives messages from the various ATs running on a session host [2]. It adds a time-stamp to each entry and, depending on any filtering of messages that has been set, writes the message to the designated logging facility. Messages may be filtered based on the message type. Current types of messages are: trace, log, debug, performance, error.
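The time-stamping and type-based filtering described above can be sketched in a few lines of Tcl. This is only an illustration under assumptions: the `logmgr` namespace, its procedure names, and the in-memory `sink` list (standing in for the designated logging facility) are hypothetical and not taken from the AT system; only the five message types come from the text. Tcl 8.5 or later is assumed.

```tcl
# Hypothetical sketch of a log manager; names are illustrative only.
namespace eval logmgr {
    # Message types listed in the text.
    variable types   {trace log debug performance error}
    # Types that currently pass the filter (all of them by default).
    variable enabled {trace log debug performance error}
    # In-memory stand-in for the designated logging facility.
    variable sink    {}
}

# Select which message types are written; unknown types are rejected.
proc logmgr::filter {typeList} {
    variable types
    variable enabled
    foreach t $typeList {
        if {$t ni $types} {
            error "unknown message type \"$t\""
        }
    }
    set enabled $typeList
}

# Receive one message from an AT. Returns 1 if it was written,
# 0 if the filter dropped it.
proc logmgr::receive {atId type text} {
    variable enabled
    variable sink
    if {$type ni $enabled} {
        return 0
    }
    set stamp [clock format [clock seconds] -format "%Y-%m-%d %H:%M:%S"]
    lappend sink "$stamp AT$atId ${type}: $text"
    return 1
}

# Usage: keep only performance and error messages.
logmgr::filter {performance error}
logmgr::receive 7 debug "entering run"     ;# dropped by the filter
logmgr::receive 7 error "connect failed"   ;# written with a time-stamp
```

Keeping the filter as a plain list of enabled types means it can be changed at run time without touching the ATs that produce the messages.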

Like the log manager, the message server receives messages from the various ATs running on a session host. It filters test results and passes them to the connected online monitors, where the tester can review the progress of the load scenario and take actions based on the results. Filtering is specified by the online monitors. For instance, an online monitor may specify that it is to receive trace messages only.

The online monitor makes it possible to take action on tests as they take place. For instance, it may be necessary to create additional load over a specific connection to stress the system in a new way, based on the results being produced by the currently running sessions. Multiple online monitors may be connected to a session host, which allows customized viewing of results. The number of monitors and the purpose of each are specified by the test designer. The monitors connect to the message server running on the session host. Each online monitor can filter the different types of messages, processing only those of interest to it. A monitor specifies to the message server what types of messages it wishes to receive.

Several online monitors can be created, each tailored to the specific needs of a test installation. One example is a specialized online monitor listening to trace messages only, sending an e-mail and/or SMS message when a URL specified is not reachable.

Another likely configuration would use an online monitor that views only debug messages which need to be handled by the creator of the load scenario and another online monitor that views only performance messages used by the network architect to identify bottlenecks and refine the network configuration.
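The monitor configurations above amount to a small publish/subscribe scheme: each monitor tells the message server which message types it wants, and the server forwards a message only to matching monitors. A minimal sketch follows, assuming hypothetical names (`msgsrv`, `subscribe`, `publish`) and using an in-memory inbox per monitor in place of the real network connection. Tcl 8.5 or later is assumed.

```tcl
# Hypothetical sketch of per-monitor filtering in a message server.
namespace eval msgsrv {
    # monitor name -> list of message types it asked for
    variable subs  [dict create]
    # monitor name -> messages delivered so far (stand-in for a socket)
    variable inbox [dict create]
}

# A monitor announces the message types it wishes to receive.
proc msgsrv::subscribe {monitor typeList} {
    variable subs
    variable inbox
    dict set subs $monitor $typeList
    if {![dict exists $inbox $monitor]} {
        dict set inbox $monitor {}
    }
}

# Forward a message to every monitor subscribed to its type.
# Returns the names of the monitors that received it.
proc msgsrv::publish {type text} {
    variable subs
    variable inbox
    set receivers {}
    dict for {monitor typeList} $subs {
        if {$type in $typeList} {
            dict lappend inbox $monitor [list $type $text]
            lappend receivers $monitor
        }
    }
    return $receivers
}

# Usage: one monitor for alerting on trace messages,
# one for the network architect's performance data.
msgsrv::subscribe alerting {trace}
msgsrv::subscribe tuning   {performance}
msgsrv::publish trace "GET http://www.example.com/ failed"
```

An alerting monitor like the one described in the text would then react to its inbox, for example by sending an e-mail; that part is left out here.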

All online monitors can be connected to a graphic interface providing dynamic graphic views on result data. The entire dynamic interface is built using the Tcl/Tk based product GIPSY [4], allowing easy customizability and creation of dynamic data visualization.

Since Tcl has easy-to-use communication commands, building a distributed client/server system is simple. On top of standard Tcl we implemented a communications layer as basis for RPC and simple data streaming (for fast transmission of status data.) Additionally an "Application Name Server" (ANS) provides host/port lookup services.
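The paper does not describe the ANS protocol, so the following is only a sketch of the idea of a host/port lookup service, kept in-process for brevity. The `ans` namespace, the application names, the host name `session01`, and the port numbers are all assumptions made for illustration. Tcl 8.5 or later is assumed.

```tcl
# Hypothetical in-process sketch of an application name service.
namespace eval ans {
    # application name -> {host port}
    variable registry [dict create]
}

# An application announces where it can be reached.
proc ans::register {name host port} {
    variable registry
    dict set registry $name [list $host $port]
}

# Resolve an application name to its {host port} pair.
proc ans::lookup {name} {
    variable registry
    if {![dict exists $registry $name]} {
        error "no application registered as \"$name\""
    }
    return [dict get $registry $name]
}

# Usage with made-up names and ports.
ans::register msgserver session01 9001
ans::register at-17     session01 9117
lassign [ans::lookup at-17] host port
# A client could now connect with the standard Tcl command:
#   set chan [socket $host $port]
```

With a lookup step like this, clients only need to know a symbolic name, so ATs and servers can be moved between session hosts without reconfiguring every component.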

A typical installation of the load testing system consists of one control host and a number of session hosts. The control and session hosts are connected through an ethernet LAN, which we call AT backbone. Several ATs run on each session host. The number of ATs per host depends on the connection to the ISP or the Internetwork (ISDN, Ethernet, ADSL, Modem etc.).

The current configuration at ZID uses four 4-port Ethernet network adapters per PC, emulating ADSL lines to an Access Concentrator (AC) and providing 15 session ports and one backbone port. The backbone connects 15 PCs running ATs to one AT controller, which acts as the single point of control for the entire system.
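The port arithmetic of this configuration can be checked quickly. The sketch below assumes one AT per session port, which the text does not state explicitly; under that assumption the total matches the maximum of 225 simultaneous ATs mentioned earlier. The 768 Kbit/s figure is the ADSL rate quoted earlier, so the last value is only a nominal upper bound on offered load.

```tcl
# Figures from the text.
set adaptersPerPc   4     ;# four network adapters per PC
set portsPerAdapter 4     ;# 4-port adapters
set backbonePorts   1     ;# one port per PC joins the AT backbone
set sessionPcs      15    ;# PCs running ATs

set portsPerPc   [expr {$adaptersPerPc * $portsPerAdapter}]   ;# 16
set sessionPorts [expr {$portsPerPc - $backbonePorts}]        ;# 15

# Assumption: exactly one AT per session port.
set totalAts     [expr {$sessionPcs * $sessionPorts}]         ;# 225

# Nominal aggregate load if every AT used a 768 Kbit/s ADSL line.
set peakKbit     [expr {$totalAts * 768}]                     ;# 172800

puts "$portsPerPc ports per PC, $sessionPorts session ports, $totalAts ATs"
puts "nominal peak load: $peakKbit Kbit/s"
```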

As stated above, the AT simulates a real user performing online actions like HTTP get or FTP file transfers. Implemented as a software agent, the AT is built upon four major modules: communication, connection, logging and script engine.

The communication module encapsulates all communication aspects, such as providing an RPC server socket, connecting to the controller or the message server, and sending them status and log messages. The AT receives and executes commands from the AT controller via the RPC server socket.

The connection module provides all functions needed to connect to a network and to authenticate a user with a name and password. The logging module implements functions to log all actions of an AT to a plain file or a database, or to send log messages to the syslog daemon or the log server. Finally, the script engine provides a runtime environment to drive session scripts.

In order to simplify the development of the session scripts and to reach a higher level of abstraction, Tcl libraries are provided. They already implement a basic behavior. This is illustrated in the example below. Listing 1 shows a sample session script. It is a simple example simulating a user who establishes a connection to the ISP via PPPoE, gets some URLs and disconnects.

If an action (e.g. access::connect) fails, it is repeated several times (::try(max)). If it still fails, an alert is sent to the online monitors and the log manager. Then the action is repeated after a defined number of seconds (here ::try(delay)) until it is successful.
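The retry policy just described can be sketched as follows. This is not the paper's Listing 1 or its behavior library: only the variable names `::try(max)` and `::try(delay)` come from the text, while the procedure names `persist`, `alert`, and `flakyConnect`, as well as the values used, are assumptions made for illustration.

```tcl
# Illustrative values; the real defaults are not given in the text.
set ::try(max)   3    ;# attempts before an alert is raised
set ::try(delay) 1    ;# seconds to wait after an alert

# Hypothetical stand-in for alerting the online monitors and log manager.
set ::alerts {}
proc alert {text} {
    lappend ::alerts $text
}

# Run a command (given as a list) until it succeeds. After every
# ::try(max) consecutive failures an alert is sent and the next
# round starts ::try(delay) seconds later.
proc persist {cmd} {
    while {1} {
        for {set i 0} {$i < $::try(max)} {incr i} {
            if {![catch {uplevel #0 $cmd} result]} {
                return $result
            }
        }
        alert "[lindex $cmd 0] failed $::try(max) times: $result"
        after [expr {$::try(delay) * 1000}]
    }
}

# Usage: a made-up action that fails four times, then succeeds.
set ::attempts 0
proc flakyConnect {} {
    if {[incr ::attempts] < 5} {
        error "no carrier"
    }
    return connected
}
set outcome [persist flakyConnect]
```

Note that `after` with only a time argument blocks the interpreter; a real AT that must keep answering RPC requests while waiting would more likely schedule the retry through the event loop.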

The script engine is a Tcl module providing a runtime environment for session scripts. To drive a session script, the script engine first reads the Tcl source of the session script from the script repository. Then a slave interpreter is created and initialized. In the initialization phase, local variables are declared in the slave interpreter, aliases to commands of the master interpreter are created, and protocol extensions are loaded. Finally, the session script is evaluated within the slave interpreter.
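The steps above map directly onto Tcl's standard `interp` command. The sketch below is a simplified illustration, not the AT's script engine: the procedure names `runSessionScript` and `logTrace`, the alias name `trace_msg`, and the `url` parameter are hypothetical, and the script source is passed as a string instead of being read from the script repository.

```tcl
# Hypothetical master-side command that session scripts reach via an alias.
set ::traceLog {}
proc logTrace {atId text} {
    lappend ::traceLog "AT$atId: $text"
}

# Drive one session script inside a fresh slave interpreter.
proc runSessionScript {atId source params} {
    set slave [interp create]

    # Initialization: declare variables in the slave.
    foreach {name value} $params {
        interp eval $slave [list set $name $value]
    }
    # Initialization: "trace_msg" in the slave calls logTrace in the
    # master, with the AT id already filled in.
    interp alias $slave trace_msg {} logTrace $atId
    # Protocol extensions would be loaded here, for example:
    #   interp eval $slave {package require http}

    # Evaluate the script, then call its run procedure.
    set code [catch {
        interp eval $slave $source
        interp eval $slave run
    } result]
    interp delete $slave
    return -code $code $result
}

# Usage: a tiny session script with a run procedure.
set script {
    proc run {} {
        global url
        trace_msg "fetching $url"
        return done
    }
}
set outcome [runSessionScript 3 $script {url http://www.example.com/}]
```

Because the script only sees the aliases it has been given, the master keeps control over logging and connection handling, and each session starts from a clean interpreter state.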

Session scripts are written in Tcl using behaviour libraries. Each script has a run procedure. The run procedure may be executed multiple times to simulate multiple consecutive sessions. All statements outside of the run procedure are executed once per session.

The script writer may define trace messages that will be sent to the message server and log server. Within the script any combination of transactions using the supported protocols can be performed as well as all standard Tcl operations.

It is also possible to set variables in the script prior to running it. This allows customization of a single script logic for different users who perform the same operations but on different targets including: number of run repetitions, transaction time-out intervals, and level of tracing.

For FTP and HTTP we use extensions implemented in Tcl. The HTTP extension is built on top of the standard http package that comes with Tcl. The FTP extension is built on the FTP package of Steffen Traeger [2]. Both packages had to be modified in order to measure the response time and throughput and to provide distinct time-out mechanisms for the connect phase and the data transfer phase.
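To illustrate the kind of measurement involved, the sketch below times a fetch made with the unmodified http package and derives a throughput figure. It is not the authors' modified package: stock `http::geturl` offers a single overall `-timeout`, so the separate connect and transfer time-outs described above are not reproduced here. The procedure names `throughputKbit` and `timedGet` are hypothetical, and Tcl 8.5 or later is assumed for `clock milliseconds`.

```tcl
package require http

# Throughput in Kbit/s for $bytes bytes transferred in $ms milliseconds.
# Bits per millisecond equals Kbit/s when 1 Kbit is taken as 1000 bits.
proc throughputKbit {bytes ms} {
    if {$ms <= 0} {
        return 0.0
    }
    return [expr {$bytes * 8.0 / $ms}]
}

# Fetch a URL and report status, elapsed time, size, and throughput.
# http::geturl raises an error if the connection cannot be opened,
# so callers may want to wrap this in catch.
proc timedGet {url timeoutMs} {
    set start   [clock milliseconds]
    set token   [http::geturl $url -timeout $timeoutMs]
    set elapsed [expr {[clock milliseconds] - $start}]
    set status  [http::status $token]   ;# e.g. ok, timeout, error
    set bytes   [http::size $token]
    http::cleanup $token
    return [dict create status $status ms $elapsed bytes $bytes \
                kbit [throughputKbit $bytes $elapsed]]
}

# Usage (requires network access, so it is left commented out):
#   puts [timedGet http://www.example.com/ 5000]
```

Measured this way, the elapsed time covers connection setup as well as the transfer itself, which is one reason a load tester might want to time the two phases separately.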

The management and control of both the ATs and the sessions are accomplished via a graphical user interface (Figure 2), which gives the user an overview of the ATs' states. The user may start any test scenario and monitor the load tests and their results.

The LED area displays three states for every AT: script status, connect state and a "runs left" counter. The script LED denotes whether a script has been loaded, started or stopped. The connect state indicates one of the following: if the AT is connecting to a network, if it is already connected, if it is about to disconnect or if it is already disconnected.

The LED area is mouse-sensitive: the user may click an AT to select it as the target of a command. Additionally, the user may re-activate commands from the history list. The execution of a command is shown in the status window. The controller is connected to the ATs and may send any command to them via RPC.

This paper described a fully scriptable load testing application for networks. All crucial components are implemented with Tcl/Tk. The use of Tcl/Tk allowed a development process that is best characterized as "development by prototyping", and thus enabled us to present a working version to the customer very early in the project. As the project progressed, we could build fully working prototypes and test them under "real world conditions". This helped us to better understand the requirements and, even more importantly, to find and eliminate errors in the design very early. Last but not least, we could implement changes which our customer requested after the design phase with an effort bearable for both sides.

Another important aspect of the "development by prototyping" was that the testing engineers were able to improve the design of the access network based on the test results of prototypes, even in an early development phase. A good example is the history of the ADSL modem access implementation. Any access method involving substantial user configuration of a modem or personal computer can hinder rapid, widespread consumer adoption of high-speed services such as ADSL. The first approach, which was provided by a well-known vendor of network components, was a combination of a DHCP client and an HTTP browser. The user was required to configure DHCP on a Windows client and open a URL in the browser to get a login page. After the user typed in a user name and password, the connection to the internet was established.

As the test engineers began load tests, they detected instabilities of the access network components. The implementation was neither stable nor would it allow more than 28 concurrent logons. These results were reproducible so ZID could reject this approach and force its network solution provider to develop an alternative solution.

The next, and so far final, client access method was the Point-to-Point Protocol over Ethernet (PPPoE). Because PPPoE was a rather new protocol, no clients were available at that time, especially not for the Linux operating system. So we had to implement a client of our own and integrate it into the load testing system. This kind of change was only possible due to Tcl's excellent "gluing" capabilities (and the flexible architecture, of course), which allowed us to keep most of the existing system unchanged for the new protocol.

During later prototyping phases the system was able to detect problems concerning the stability and the bandwidth of the whole access network solution. ZID's hardware vendor and solution provider had to admit that neither the ATM switching component nor the PPPoE server performed within specification (50 Mbit/s vs. 150 Mbit/s, 56 vs. 90 sessions).

Another important aspect for our customer was that we did not modify the Tcl kernel. All of the requirements have been solved with standard Tcl/Tk. C was only used in performance critical areas like the PPPoE protocol handling (i.e. handling Ethernet packets.) Using standards - like Tcl/Tk - without modifications proved to be an important issue, especially regarding the system's acceptance.

Tcl/Tk proved to be the ideal solution for the implementation of the AT, since we were able to meet the customer's requirements effectively and react in a timely manner to design changes during the project. Tcl gave us an easy-to-learn, flexible, full-featured language that is gaining more and more enthusiasm at ZID (all of the test engineers are non-programmers). All of this was possible without developing a language ourselves, simply by using what was already there.

Tcl/Tk helped us to develop a commercial product in a short time and, even more importantly, to release it to our customer on time. From our customer's point of view, Tcl/Tk helps in achieving its mission-critical goals with minimal effort.

The FTP Library Package ftp_lib provides the client side of the File Transfer Protocol (FTP). The package extends Tcl/Tk with commands to support the file transfer protocol like OPEN, CLOSE, LIST, PUT, GET, REGET etc. It's used either to add FTP ability to existing Tcl/Tk applications or to create small FTP scripts that perform tasks without user interaction. It allows automatic up/download processes even up to the mirroring of complete FTP sites.