Aaron B. Brown and Joseph L. Hellerstein
IBM Thomas J. Watson Research Center
Hawthorne, New York, 10532
{abbrown,hellers}@us.ibm.com
After working with corporate customers, service delivery personnel, and product development groups, we have come to question the widely held belief that automation of IT systems always reduces costs. In fact, our claim is that automation can increase cost if it is applied without a holistic view of the processes used to deliver IT services. This conclusion derives from the hidden costs of automation, costs that become apparent when automation is viewed holistically. While automation may reduce the cost of certain operational processes, it increases other costs, such as those for maintaining the automation infrastructure, adapting inputs to structured formats required by automation, and handling automation failures. When these extra costs outweigh the benefits of automation, we have a situation described by human factors experts as an irony of automation--a case where automation intended to reduce cost has ironically ended up increasing it [1].
To prevent these ironies of automation, we must take a holistic view when adding automation to an IT system. This requires a technique for methodically exposing the hidden costs of automation, and an analysis that weighs these costs against the benefits of automation. The approach proposed in this paper is based on explicit representations of IT operational processes and the changes to those processes induced by automation. We illustrate our process-based approach using a running example of automated software distribution. We draw on data collected from several real data centers to help illuminate the impact of automation and the corresponding costs, and to give an example of how a cost-benefit analysis can be used to determine when automation should and should not be applied. Finally, we broaden our analysis into a general discussion of the trade-offs between manual and automated processes and offer guidance on the best ways to apply automation.
Our approach is based on the explicit representation of the processes followed by system administrators (SAs). These processes may be formal, e.g. derived from ITIL best practices [13], or informal, representing the ad-hoc methods used in practice. Regardless of their source, the first step is to document the processes as they exist before automation. Our approach accomplishes this with ``swim-lane'' diagrams--annotated flowcharts that allocate process activities across roles (represented as rows) and phases (represented as columns). Roles are typically performed by people (and can be shared or consolidated); we include automation as its own role to reflect activities that have been handed over to an automated system.
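As a rough illustration of this representation, the following sketch encodes a swim-lane diagram as a list of steps, each tagged with a role (row) and a phase (column). The encoding, role names, and phase names are illustrative only and are not prescribed by our approach.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One box in a swim-lane diagram: an activity assigned to a role within a phase."""
    role: str      # row: a person's role, or "Automation"
    phase: str     # column: the phase of the process
    activity: str  # the work performed in this box

# Illustrative fragment of a process; role and phase names are hypothetical.
process = [
    Step("System Administrator", "Prepare", "Obtain software resources"),
    Step("Automation", "Per-target", "Check prerequisites and install"),
]

def lanes(steps):
    """Return the swim lanes (roles) that appear in a process."""
    return {s.role for s in steps}

print(lanes(process))  # {'System Administrator', 'Automation'}
```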
Figure 1: Swim-lane representations of (a) the manual and (b) the automated software distribution process. Boxes with heavy lines indicate process steps that contribute to variable (per-target) costs, as described in Section 3.
Figure 1(a) shows the ``swim-lane'' representation for the manual version of our example software distribution process. In the data centers we studied, the SA responds to a request to distribute software as follows: (1) the SA obtains the necessary software resources; (2) for each server, the SA repeatedly does the following--(2a) checks prerequisites such as the operating system release level, memory requirements, and dependencies on other packages; (2b) configures the installer, which requires that the SA determine the values of various parameters such as the server's IP address and features to be installed; and (2c) performs the install, verifies the result, and handles error conditions that arise. While Figure 1(a) abstracts heavily to illustrate similarities between software installs, we underscore that a particular software install process has many steps and checks that typically make it quite different from other seemingly similar software installs (e.g., which files are copied to what directories, pre-requisites, and the setting of configuration parameters).
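To make the per-target structure of Figure 1(a) concrete, here is a minimal procedural sketch of the manual flow; the function names, stub bodies, and example arguments are ours and stand in for the many install-specific steps and checks noted above.

```python
class InstallError(Exception):
    """Raised when an install or verification step fails on a target."""

# Placeholder steps; in practice each hides many install-specific checks.
def check_prerequisites(server): print(f"(2a) check OS level, memory, dependencies on {server}")
def configure_installer(server): return {"host": server, "features": ["default"]}  # (2b)
def install_and_verify(server, package, params): print(f"(2c) install {package} on {server}: {params}")
def recover(server, error): print(f"SA recovers {server} from {error}")

def manual_distribution(package, servers):
    """Manual flow of Figure 1(a): step (1) once, then steps (2a)-(2c) per target."""
    print(f"(1) obtain software resources for {package}")  # fixed cost, once per request
    for server in servers:                                  # per-target (variable) cost
        check_prerequisites(server)
        params = configure_installer(server)
        try:
            install_and_verify(server, package, params)
        except InstallError as err:
            recover(server, err)

manual_distribution("example-package", ["srv01", "srv02"])
```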
Now suppose that we automate the process in Figure 1(a) so as to reduce the work done by the SA. That is, in the normal case, the SA selects a software package, and the software distribution infrastructure handles the other parts of the process flow in Figure 1(a). Have we simplified IT operations?
No. In fact, we may have made IT operations more complicated. To understand why, we turn to our process-driven analysis, and update our process diagram with the changes introduced by the automation. In the software distribution case, the first update is simple: we move the automated parts of Figure 1(a) from the System Administrator role to a new Automation role. But that change is not the only impact of the automation. For one thing, the automation infrastructure is another software system that must itself be installed and maintained. (For simplicity, we assume throughout that the automation infrastructure has already been installed, but we do consider the need for periodic updates and maintenance.) Next, using the automated infrastructure requires that information be provided in a structured form. We use the term software package to refer to these structured inputs. These inputs are typically expressed in a formal structure, which means that their creation requires extra effort for package design, implementation, and testing. Last, when errors occur in the automated case, they happen on a much larger scale than for a manual approach, and hence additional processes and tools are required to recover from them.
These other impacts manifest as additional process changes, namely extra roles and extra operational processes to handle the additional tasks and activities induced by the automation. Figure 1(b) illustrates the end result for our software distribution example. We see that the automation (the bottom row) has a flow almost identical to that in Figure 1(a). However, additional roles are added for care and feeding of the automation. The responsibility of the System Administrator becomes the selection of the software package, the invocation of the automation, and responding to errors that arise. Since packages must be constructed according to the requirements of the automation, there is a new role of Software Packager. The responsibility of the packager is to generalize what the System Administrator does in the manual process so that it can be automated. There is also a role for an Infrastructure Maintainer who handles operational issues related to the software distribution system (e.g., ensuring that distribution agents are running on endpoints) and the maintenance of the automation infrastructure. From inspection, it is apparent that the collection of processes in Figure 1(b) is much more complicated than the single process in Figure 1(a). Clearly, such additional complexity is unjustified if we are installing a single package on a single server. This raises the following question--at what point does automation stop adding cost and instead start reducing cost?
A rule-of-thumb for answering the question above is that automation is desirable if the variable cost of the automated process is smaller than the variable cost of the manual process. But this is wrong.
One reason why this is wrong is that we cannot ignore fixed costs for automating processes with a limited lifetime. IT operations has many examples of such limited lifetime processes. Indeed, experience with trying to capture processes in ``correlation rules'' used to respond to events (e.g., [10,5]) has shown that rules (and hence processes) change frequently because of changes in data center policies and endpoint characteristics.
Figure 2: Cumulative distribution of the number of targets (servers) on which a software package is installed over its lifetime, across several data centers. A large fraction of the packages is installed on only a small number of targets.
Our running example of software distribution is another illustration of limited lifetime processes. As indicated before, a software package describes a process for a specific install; it is only useful as long as that install and its target configuration remain current. The fixed cost of building a package must be amortized across the number of targets to which it is distributed over its lifetime. Figure 2 plots the cumulative distribution of the number of targets per software package, based on data collected from several data centers. We see that a large fraction of the packages are distributed to a small number of targets, with 25% of the packages going to fewer than 15 targets over their lifetimes.
There is a second reason why the focus on variable costs is not sufficient: it considers only the variable costs of successful results. By taking the complete view of the automated processes in Figure 1(b), we see that more sophistication and more people are required to address error recovery for automated software distribution than for the manual process. Using the same data from which Figure 2 is extracted, we determined that 19% of the requested installs result in failure. Furthermore, at least 7% of the installs fail due to issues related to configuration of the automation infrastructure, a consideration that does not exist if a manual process is used. This back-of-the-envelope analysis underscores the importance of considering the entire set of process changes that occur when automation is deployed, particularly the extra operational processes created to handle automation failures. It also suggests the need for a quantitative model to determine when to automate a process.
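Figures of this kind can be computed from raw install records with a few lines of analysis. The sketch below uses a made-up record format and toy data (the actual data are not reproduced here) to show the two computations: the fraction of packages distributed to fewer than a given number of targets, and the fraction of install attempts that fail, overall and due to the automation infrastructure.

```python
from collections import Counter

# Hypothetical install records: (package, target, outcome). Toy data only.
records = [
    ("pkgA", "srv01", "ok"), ("pkgA", "srv02", "fail"),
    ("pkgB", "srv03", "ok"), ("pkgB", "srv04", "ok"), ("pkgB", "srv05", "ok"),
    ("pkgC", "srv06", "infra-fail"),
]

# Fraction of packages installed on fewer than `threshold` targets over their
# lifetime (the quantity plotted in Figure 2).
targets_per_package = Counter(pkg for pkg, _, _ in records)
threshold = 15
frac_small = sum(n < threshold for n in targets_per_package.values()) / len(targets_per_package)

# Fraction of install attempts that fail, and the share attributable to the
# configuration of the automation infrastructure itself.
frac_failed = sum(outcome != "ok" for _, _, outcome in records) / len(records)
frac_infra = sum(outcome == "infra-fail" for _, _, outcome in records) / len(records)

print(f"{frac_small:.0%} of packages reach fewer than {threshold} targets")
print(f"{frac_failed:.0%} of installs fail; {frac_infra:.0%} fail due to the infrastructure")
```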
Motivated by our software distribution example, we have developed a simple version of such a model. Let $F_M$ be the fixed cost for the manual process and $V_M$ be its variable cost. We use $N$ to denote the lifetime of the process (e.g., a package is distributed to $N$ targets). Then, the total cost of the manual process is $C_M = F_M + N \cdot V_M$. Similarly, the total cost of the automated process is $C_A = F_A + N \cdot V_A$, where $F_A$ and $V_A$ are its fixed and variable costs.

We can make some qualitative statements about these costs. In general, we expect that $V_A < V_M$; otherwise there is little point in considering automation. Also, we expect that $F_A > F_M$, since careful design and testing are required to build automation, which requires performing the manual process one or more times. Substituting into the above equations and solving for $N$, we can find the crossover point where automation becomes economical. That is, automation is preferred where $N > \frac{F_A - F_M}{V_M - V_A}$.
This inequality provides insights into the importance of considering when to automate a process. IBM internal studies of software distribution have found that $F_A$ can exceed 100 hours for complex packages. Our intuition based on a review of these data is that for complex installs, $V_M$ is in the range of 10 to 20 hours, and $V_A$ is in the range of 1 to 5 hours (mostly because of error recovery). Assuming that salaries are the same for all the staff involved, these numbers indicate that there should be approximately 5 to 20 targets for automated software distribution to be cost effective. In terms of the data in Figure 2, these numbers mean that from 15% to 30% of the installs should not have been automated.
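A minimal calculation of this crossover point, using the symbols defined above and the endpoints of the cost ranges just quoted (the specific values below are only illustrative):

```python
def crossover_targets(fixed_auto, fixed_manual, var_manual, var_auto):
    """Smallest lifetime N beyond which automation is cheaper:
    N > (F_A - F_M) / (V_M - V_A), assuming V_A < V_M."""
    return (fixed_auto - fixed_manual) / (var_manual - var_auto)

# Endpoints of the ranges quoted above, in hours; F_M is taken as negligible.
print(crossover_targets(100, 0, 20, 1))   # ~5.3 targets (favorable case)
print(crossover_targets(100, 0, 10, 5))   # 20 targets (unfavorable case)
```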
The foregoing cost models can be generalized further to obtain a broader understanding of the trade-off between manual and automated processes. In essence, this is a trade-off between the leverage provided by automation versus the difficulty of generalizing a manual process to an automated process.
Leverage $L$ describes the factor by which the variable costs are reduced by using automation. That is, $L = V_M / V_A$.

The generalization difficulty $G$ relates to the challenges involved with designing, implementing, and testing automated versions of manual processes. Quantitatively, $G$ is computed as the ratio between the fixed cost of automation and the variable cost of the manual process: $G = F_A / V_M$. The intuition behind $G$ is that, to construct an automated process, it is necessary to perform the manual process at least once. Any work beyond that test invocation of the manual process will result in a larger $G$. Substituting into the crossover condition (and neglecting the comparatively small $F_M$) and solving, we find that an automated process is preferred where $\frac{G}{N} < 1 - \frac{1}{L}$.
Figure 3: Preference regions for automated and manual processes. Automated processes are preferred if there is a larger leverage for automation ($L$) and/or if there is a smaller (amortized) difficulty of generalizing the manual procedure to an automated procedure ($G/N$).
Figure 3 plots $G/N$ versus $L$. We see that the vertical axis ($G/N$) ranges from 0 to 1 since $G \geq 1$ and $N \geq G$. The latter constraint arises because there is little point in constructing automation that is $G$ times more costly than a manual process if the process will only be invoked fewer than $G$ times. The figure identifies regions in the $(L, G/N)$ space in which manual and automated processes are preferred. We see that if automation leverage is large, then an automated process is cost effective even if amortized generalization difficulty is close to 1. Conversely, if amortized generalization difficulty is small (close to 0), then an automated process is cost effective even if automation leverage is only slightly more than 1. Last, having a longer process lifetime $N$ means that $G/N$ is smaller and hence makes an automated process more desirable.
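The decision boundary in Figure 3 follows directly from the inequality derived above; the sketch below restates it as a predicate (the parameter names are ours), consistent with the cost figures quoted earlier.

```python
def prefer_automation(L, G, N):
    """Automation is preferred when G/N < 1 - 1/L, i.e., when the amortized
    generalization difficulty falls below the fraction of variable cost saved."""
    assert G >= 1 and N >= G, "Figure 3 assumes G >= 1 and N >= G"
    return G / N < 1 - 1 / L

# Example using the costs quoted earlier: V_M = 10 h, V_A = 5 h (L = 2),
# F_A = 100 h (G = 10); the crossover is at 20 targets.
print(prefer_automation(L=2, G=10, N=15))  # False: 15 targets are too few
print(prefer_automation(L=2, G=10, N=25))  # True: 25 targets justify automation
```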
This analysis suggests three approaches to reducing the cost of IT operations through automation: reduce the generalization difficulty $G$, increase the automation leverage $L$, and increase the process lifetime $N$. In the case of software distribution, the most effective approaches are to increase $N$ and to reduce $G$. We can increase $N$ by making the IT environment more uniform in terms of the types of hardware and software, so that the same package can be distributed to more targets. However, two issues arise. First, increasing $N$ risks increasing the impact of automation failures, causing a commensurate decrease in $L$. Second, attempts to increase homogeneity may encounter resistance, a lesson learned from the transition from mainframes to client-server systems in the late 1980s, which was driven in large part by departments' desire for more control over their computing environments and hence for greater diversity.
To reduce cost by reducing $G$, one approach is to adopt the concept of mass customization developed in the manufacturing industry (e.g., [9]). This means designing components and processes so as to facilitate customization. In terms of software distribution, this might mean developing re-usable components for software packages. It also implies improving the reusability of process components--for example by standardizing the manual steps used in software package installations--so that a given automation technology can be directly applied to a broader set of situations. This concept of mass-customizable automated process components represents an important area of future research.
Mass customization can also be improved at the system level by having target systems that automatically discover their configuration parameters (e.g., from a registry at a well known address). This would mean that many differences between packages would be eliminated, reducing $G$ and potentially leading to consolidation of package versions, also increasing $N$.
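As a hypothetical illustration of such self-discovery, a target might query a registry at a well-known address for its own parameters; the URL and response format below are invented for illustration only.

```python
import json
from urllib.request import urlopen

def discover_configuration(hostname, registry="http://config-registry.example/targets/"):
    """Fetch this target's configuration parameters from a central registry,
    so that packages need not carry per-target values."""
    with urlopen(registry + hostname) as response:
        return json.load(response)  # e.g., {"ip": "...", "features": [...]}
```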
Thus far, we have discussed which automation should be done. Another consideration is the adoption of automation. Our belief is that SAs require a level of trust in the automation before the automation will be adopted. Just as with human relationships, trust is gained through a history of successful interactions. However, creating such a history is challenging because many of the technologies for IT automation are immature. As a result, care must be taken to provide incremental levels of automation that are relatively mature so that SA trust is earned. A further consideration is that the automation cannot be a "black box": trust depends in part on SAs having a clear understanding of how the automation works.
The history of the automobile provides insight into the progression we expect for IT automation. In the early twentieth century, driving an automobile required considerable mechanical knowledge because of the need to make frequent repairs. Today, however, automobiles are sufficiently reliable that most people need to know only that they frequently require gasoline and occasionally require oil. For the automation of IT operations, we are at a stage similar to the early days of the automobile, in that most computer users must also be system administrators (or have one close at hand). IT operations will have matured when operational details need not be surfaced to end users.