Live Monitoring: Using Adaptive Instrumentation and Analysis to Debug and Maintain Web Applications

Emre Kıcıman and Helen J. Wang
Microsoft Research
{emrek, helenw}@microsoft.com
Abstract
AJAX-based web applications are enabling the next generation of rich, client-side web
applications, but today’s web application developers do not have the end-to-end
visibility required to effectively build and maintain a reliable system. We
argue that a new capability of the web application environment—the ability for
a system to automatically create and serve different versions of an application
to each user—can be exploited for adaptive, cross-user monitoring of the
behavior of web applications on end-user desktops. In this paper, we propose a
live monitoring framework for building a new class of development and
maintenance techniques that use a continuous loop of automatic, adaptive
application rewriting, observation and analysis. We outline two such adaptive
techniques for localizing data corruption bugs and automatically placing
function result caching. The live monitoring framework requires only minor
changes to web application servers, no changes to application code and no
modifications to existing browsers.
1 Introduction
Over the last several years, AJAX (Asynchronous JavaScript and XML) programming
techniques have enabled a new generation of popular web-based applications,
marking a paradigm shift in web service development and provisioning [11].
Unlike traditional web services, these new web
applications combine the data preservation and
integrity, storage capacity and computational power of data center(s) with a
rich client-side experience, implemented as a JavaScript program shipped on
demand to users’ web browsers¹. This combination provides a compelling way to build new
applications while moving the burden of managing an application’s reliability
from end-users to the application’s own developers and operators.
Unfortunately, today’s web application developers and operators do not have the end-to-end
visibility they need to effectively build and maintain a dependable
system. Unlike traditional web services,
running exclusively in controlled, server-side environments, a web application
depends on many components outside the developer’s control, including the
client-side JavaScript engine and libraries and the third-party back-end web
services used by mash-up applications—web applications that combine functionality
from multiple back-end web services. Of
course, web application developers must also contend with the traditional bugs
that occur when writing any large, complex piece of software, including logic
errors, memory leaks and performance problems. When the inevitable problem does
occur, the web application developer’s lack of visibility into the heterogeneous
client environments and the dynamic behavior of third-party services can make
reproducing and debugging the problem practically impossible.
To address these challenges, we propose a live
monitoring framework that exploits a new
capability of the web application environment, instant
redeployability: Each time any client runs a web
application, the developers and operators of the application can automatically provide
the client a new, different version of the application. Our live monitoring
framework (1) exploits this capability to enable dynamic and adaptive
instrumentation strategies; and (2) integrates the resultant on-line
observations of an application’s end-to-end behavior into the development and
operations process.
Live monitoring enables a new class of techniques that use a continuous loop of
automatic application rewriting, observation and analysis to improve the
development and maintenance of web applications. Policy-based, automatic rewriting
of application code provides the necessary visibility into end-to-end
application behavior, and collecting observations on-line from live end-user
desktops provides visibility into the real problems affecting clients.
Distributing and sampling instrumentation across the many users of a web
application provides a low-overhead instrumentation platform. Finally, using
already-collected information to adapt instrumentation on-line enables
efficient drill-down with specialized diagnosis techniques as problems occur.
  Op          | IE 7 (ms) | Firefox 1.5 (ms)
  ------------+-----------+-----------------
  Array.join  |        35 |              120
  +           |      5100 |              120

Table 1: The performance of two simple methods for concatenating 10k strings varies across browsers.
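For concreteness, here is a minimal sketch of the two methods measured in Table 1; the 10k-iteration loop mirrors the benchmark described in the caption:

    // Method 1: repeated use of the "+" operator. On some engines each
    // append copies the accumulated string, which can be quadratic overall.
    var s1 = "";
    for (var i = 0; i < 10000; i++) {
      s1 += "x";
    }

    // Method 2: Array.join. Collect the pieces, then concatenate once.
    var parts = [];
    for (var j = 0; j < 10000; j++) {
      parts.push("x");
    }
    var s2 = parts.join("");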
2 Reliable Web Applications
The web application environment presents many of the same development and
operations challenges that confront any cross-platform, distributed system. In
this environment, however, there are also opportunities for a new approach to
addressing these challenges.
Challenges
The root challenge to building and maintaining a reliable client-side web application
is a lack of visibility into the end-to-end behavior of the program, brought about
by the fact that execution of the web application is now split across multiple
environments, including uncontrolled client-side and third-party environments, and exacerbated by their heterogeneity and dynamics.
Non-standard Execution Environments: While the core JavaScript language is standardized as ECMAScript [7], most pieces of a JavaScript environment are not. The result is that applications frequently have to work around subtle and not-so-subtle cross-browser incompatibilities. As a clear example, sending an XMLHTTP request requires instantiating an ActiveX object in IE6, but a native JavaScript object in Firefox. More subtle are issues such as event propagation: e.g., given multiple registered event handlers for a mouse click, in what order are they called? Moreover, even the standardized pieces of JavaScript can have implementation differences that cause serious performance problems (see Table 1 for examples of performance variance across browsers).
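The XMLHTTP example above is the canonical case; a minimal sketch of the standard cross-browser work-around looks like this:

    // IE6 exposes XMLHttpRequest functionality only through ActiveX,
    // while Firefox (and later IE versions) provide a native constructor.
    function createXhr() {
      if (window.XMLHttpRequest) {
        return new XMLHttpRequest();                    // Firefox, IE7+, etc.
      } else if (window.ActiveXObject) {
        return new ActiveXObject("Microsoft.XMLHTTP");  // IE6
      }
      throw new Error("No XMLHttpRequest support");
    }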
Third-Party Dependencies: All web applications have dependencies
on the reliability of back-end web services. While these back-end services
strive to maintain high availability, they can and do fail. Moreover, even
regular updates, such as bug fixes and feature enhancements, can break dependent
applications. Anecdotally, such breaking upgrades do occur: Live.com updated
their beta gadget API, breaking dependent developers’ code [13]; and, more recently, the popular social bookmarking website,
del.icio.us, moved
the URLs pointing to some of their public data streams, breaking dependent
applications [3].
  App                 | JavaScript (bytes) | JavaScript (LoC)
  --------------------+--------------------+-----------------
  Live Maps           | 1MB                | 54K
  Google Maps         | 200KB              | 20K
  HousingMaps         | 213KB              | 19K
  Amazon Book Reader  | 390KB              | 16K
  CNN.com             | 137KB              | 5K

Table 2: The amount of client-side code in a few major web applications, measured in bytes and lines of code (LoC).
Software Complexity: Of course, JavaScript also suffers from the traditional challenges of writing any nontrivial program². While JavaScript programs were once only simple scripts containing a few lines of code, they have grown dramatically, to the point where the client-side code of cutting-edge web applications easily exceeds 10k lines, as shown in Table 2. The
result is that web applications suffer from the same kinds of bugs as
traditional programs, including memory leaks, logic bugs, race conditions, and
performance problems.
The difficulties caused by heterogeneous execution environments and dynamic third-party behavior, as well as the challenge of writing correct software, can certainly be reduced through more complete standardization, better web service management and careful software engineering. But we would argue that, at a minimum, software bugs and human error will keep all of these challenges alive to frustrate web application developers for a long time to come.
Opportunities
While the above challenges are faced by almost any cross-platform distributed system, two technical features of web applications provide an opportunity for building new kinds of tools to deal with these problems:
Instant Deployability: Web applications are deployed and
updated by modifying the code stored on a central web server. Modulo caching
policies, clients download a fresh copy of the application each time they run
it, enabling instant deployability of updates. We
take advantage of this capability to serve different versions of a web application
(e.g.,
with varying instrumentation) over time and across users.
Dynamic extensions: During their execution, JavaScript-based web applications
can dynamically load and run new scripts, allowing late-binding of functionality
based on current requirements. We use this to download specialized fault
diagnosis routines when a web application encounters a problem.
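As an illustration, here is a sketch of how such a dynamic extension might be pulled in at runtime; the extension URL is hypothetical:

    // Injecting a <script> element causes the browser to fetch and
    // execute new code within the running application.
    function loadExtension(url) {
      var script = document.createElement("script");
      script.type = "text/javascript";
      script.src = url;
      document.getElementsByTagName("head")[0].appendChild(script);
    }

    // e.g., fetch a specialized diagnosis routine only when needed:
    loadExtension("/extensions/dom-corruption-probe.js");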
3 Live Monitoring
The goal of live monitoring techniques is to improve developer and operator
visibility into the end-to-end behavior of web applications by enabling
automatic, adaptive analysis of application behavior on real end-user desktops.
At the core of a live monitoring technique is a simple process:
1. Use automatic program rewriting together with instant redeployability to serve differently instrumented versions of applications over time and across users.
2. Continually gather observations of the on-line, end-to-end behavior of applications running under real workload on many end-users’ desktops.
3. As observations of application behavior are gathered and analyzed, use the results to guide the adaptation of the dynamic instrumentation policy.
4. In special cases, use the client’s ability to dynamically load scripts to enable just-in-time fault diagnosis handlers, tailored based on previously gleaned information about the specific symptoms encountered.
Our framework for live monitoring, shown in Figure 1, divides this process across
several key components. The Transformer is responsible for rewriting the JavaScript application as
it is sent from the web application’s servers to the end-user’s desktop. The
transformer contains both generic code, such as the JavaScript and HTML parsers
reusable across many live monitoring techniques, and technique-specific
rewriting rules. These rules are expressed in two steps: the first step
searches for target code-points matching some rule-specific filter, such as “all
function call expressions” or “all variable declarations”; and the second step
applies an arbitrary transformation on a target code-point by modifying the
abstract syntax tree of the JavaScript program. Each rewriting rule exposes a
set of discrete knobs for controlling the rewriting of target code points. For
example, a rule that adds performance profiling to function
calls might expose an on/off knob for each function that could be profiled.
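As a sketch of what such a rule might produce, consider the effect of a profiling rule at the JavaScript level; the knob table, function names, and log buffer below are illustrative, not part of any real framework API:

    // Per-function on/off knobs, set by the Controller via the Transformer.
    var knobs = { computeRoute: true, renderTiles: false };   // illustrative
    var log = [];                                             // observations

    // Wrap a function so its execution time is recorded when its knob is on.
    function profiled(name, fn) {
      if (!knobs[name]) return fn;            // knob off: call site unchanged
      return function () {
        var start = new Date().getTime();
        try {
          return fn.apply(this, arguments);
        } finally {
          log.push(name + "," + (new Date().getTime() - start));
        }
      };
    }

    // After rewriting, a call site such as computeRoute(a, b) becomes:
    //   profiled("computeRoute", computeRoute)(a, b);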
The Controller component is responsible for the core of
the technique-specific adaptation algorithm, analyzing the collected
observations of application behavior and using the results of the analysis to
modify the knobs exposed by the rewriting rules in the Transformer. The Log Collector is a simple component, responsible for
gathering observations returned by rewritten programs; and the Dynamic Extension
Generator creates
special-purpose fault diagnosis handlers, based on the application’s request and
configuration input from the Controller.
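A hedged sketch of the client side of the Log Collector, assuming the collector accepts plain-text batches at a hypothetical same-origin path (createXhr and log are from the earlier sketches):

    // Periodically flush buffered observations back to the Log Collector.
    function flushLog() {
      if (log.length === 0) return;
      var xhr = createXhr();
      xhr.open("POST", "/live-monitoring/collect", true);
      xhr.setRequestHeader("Content-Type", "text/plain");
      xhr.send(log.join("\n"));
      log = [];
    }
    setInterval(flushLog, 30000);   // report every 30 seconds

A relative URL keeps the report within the browser's same-origin policy, since the rewritten application is served from the application's own servers.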
While some parts of this process are generic and reusable across techniques, the
rest—what we call a live monitoring policy—is specific to each live monitoring technique.
This policy includes the rewriting rules in the Transformer, the analysis
policy in the Controller responsible for analyzing logs and modifying the knobs
of the rewriting rules, and the Dynamic Extension Generator.
Figure 1: Live Monitoring Framework.
4 Live Monitoring Policies
When developing a new policy to address a debugging or maintenance challenge, we consider several questions:
What are the appropriate rewriting rules? The first consideration when building a monitoring policy is what observations of application behavior need to be captured, and how a program can be modified to capture them efficiently. In particular, we ask what instrumentation is statically written into the code, and what functionality will be dynamically determined and downloaded as needed from the Dynamic Extension Generator.
How does the rewriting adapt over time? A second consideration is which code points in a program should initially be rewritten,
and how this choice changes over time as we gather more observations of
behavior. The policy should also consider whether a multi-stage approach is
appropriate, where completely different rewriting rules are applied to gather
different kinds of information over time.
How does the policy spread instrumentation across users? A third axis of consideration is how a policy can distribute instrumentation across many users (e.g., via sampling) and re-aggregate that information to reason
about the program’s overall behavior.
How do developers and operators interact with and use live monitoring policies? The final question when designing a policy is how people will use it. Some policies may be completely automated and continuously
running, whereas other live monitoring policies may only run occasionally and
on the explicit request of a developer. In particular, if the policy’s
application rewriting might affect the semantics of the program then human
interaction is likely necessary.
We have built a prototype of our live monitoring framework and implemented several policies for debugging errors, drilling down into performance problems, and analyzing runtime behavior to detect potentially correct cache optimizations; we are exploring answers to these questions as we do so. The rest of this section
describes two policies that use different styles of adaptation to address
different problems. In the first example, a single rewriting rule is applied to
different points in the code as we drill-down into data structure corruptions.
The second example uses different rewriting rules over time, and decides where
to place each rewriting rule based on observations gathered from previous
application runs.
Locating Data Structure Corruption Bugs
While it can be very difficult to reproduce the steps that trigger a bug in a controlled development environment, real users will run into the same problems again and again in a real, deployed application. We would like to capture the
again and again in a real, deployed application. We would like to capture the
relevant error information and debug problems in real conditions, but adding all the necessary debugging infrastructure to the entire
program can have too high an overhead. The solution is to adaptively enable the
debugging infrastructure when and where in the code it is needed.
Corruption of in-memory data structures is a clear sign of a bug in an application, and
can easily lead to serious problems in the application’s behavior. A
straightforward method for detecting data structure inconsistencies is to use
consistency checks at appropriate locations to ensure that data structures are
not corrupt. A consistency check is a small piece of data-structure-specific code
that tests for some invariant. E.g., a doubly-linked list data structure might be inspected
for unmatched forward and backward references. While today these checks are
commonly written manually, there has been recent work on automatically
inferring such checks [6].
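For example, a minimal hand-written check for the doubly-linked list invariant mentioned above might look like:

    // Verify that every forward reference in a doubly-linked list is
    // matched by the corresponding backward reference.
    function isListConsistent(head) {
      for (var node = head; node !== null; node = node.next) {
        if (node.next !== null && node.next.prev !== node) {
          return false;   // corruption: unmatched forward/backward links
        }
      }
      return true;
    }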
When a consistency check fails, we might suspect that a bug exists somewhere in the code executed after the last successful consistency check³. If we execute these consistency checks
infrequently, we will not have narrowed down the possible locations of a bug.
On the other hand, if we execute these checks too frequently, we can easily cause
a prohibitive performance overhead, as well as introduce false positives if we
check a data structure while it is being modified.
Using live monitoring, we can build an adaptive policy that adds and removes consistency checks to balance the need for localizing data structure corruption against the desire to avoid excessive overhead. Initially, the policy inserts consistency
checks only at the beginning and end of stand-alone script blocks and event
handlers (essentially, all the entry and exit points for the execution of a JavaScript
application). Assuming that any data structure that is corrupted during the
execution of a script block or event handler will remain corrupted at the end
of the block’s execution, we have high confidence of detecting corruptions as
they are caused by real workloads.
As these consistency checks notice data structure corruptions, the policy adds additional consistency checks in the suspect code path to “drill down” and help localize the problem. As clients download and execute fresh copies of the application and run into the same data structure consistency problems, they will report in more detail on any problems they encounter in this suspect code path, and our adaptive policy can then drill down again, as well as remove any checks that are now believed to be superfluous.
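A sketch of the Controller side of this drill-down, under the assumption that each candidate code point has an on/off knob; checkKnobs, midpointsOf and the report format are illustrative names, not a fixed interface:

    // When a client reports a failed check, enable finer-grained checks
    // inside the suspect region for subsequently served copies of the app.
    function onConsistencyFailure(report) {
      // Code between the last passing check and the failing one.
      var region = report.suspectRegion;
      var points = midpointsOf(region);    // candidate code points inside it
      for (var i = 0; i < points.length; i++) {
        checkKnobs[points[i]] = true;      // picked up by the Transformer
      }
    }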
Several simple extensions can make this example policy more powerful. For example, performance overhead can be reduced, at the expense of fidelity, by randomly sampling data structure consistency across many clients. Also, if the policy finds a function that only intermittently corrupts a data structure, we can explore the program’s state in more detail with an additional rewriting rule that captures the function’s input arguments and other key state to help the developer narrow down the cause of the problem.
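A sketch of the sampling extension: each invocation runs a check with small probability, so the aggregate client population still covers the checks while each desktop pays little overhead (the rate below is arbitrary):

    // Each invocation runs the check with small probability; across many
    // clients this yields broad coverage at low per-desktop cost.
    var SAMPLING_RATE = 0.01;
    function sampledCheck(check) {
      return (Math.random() < SAMPLING_RATE) ? check() : true;
    }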
Identifying Promising Cache Placements
Even simple features of web applications are often cut because of performance
problems, and the poor performance of overly ambitious AJAX applications is one
of the primary complaints of end-users. Some of the blame lies with
JavaScript’s nature as a scripting language not designed for building large
applications: given a lack of access scoping and the ability to dynamically
load arbitrary code, the scripting engine often cannot safely apply even simple
optimizations, such as caching variable dereferences and in-lining functions.
With live monitoring, however, we can use a multi-stage instrumentation policy to detect possibly valid optimizations and evaluate the potential benefit of applying them. Let us consider a simple optimization strategy: the insertion of function result caching. For this optimization to be correct, the function being cached must (1) return a value that is deterministic given only the function’s inputs and (2) have no side-effects. We monitor the dynamic behavior of the application to cull functions that empirically do not meet the first criterion. Then, we rely on a human developer to use their knowledge of the remaining functions’ semantics to double-check them for determinism and side-effects. Finally, we use another stage of instrumentation to check whether the benefit of caching outweighs the cost.
The first stage of such a policy injects test predicates to help identify when function caching is valid. To accomplish this, the rewriting rule essentially inserts a cache, but continues to call the original function and checks its return value against any previously cached result. If any client, across the entire real workload of an application, reports that a cached value did not match the function’s actual return value, we know that the function is not safe for this optimization and remove that code location from consideration. After gathering many observations
over a sufficient variety and number of user workloads, we provide a list of
potentially cache-able functions to the developer of the application and ask
them to use their knowledge of each function’s semantics to determine whether it
might have any side-effects or unseen non-determinism. The advantage of this
first stage of monitoring is that reviewing a potentially short list of
possibly valid cache-able code points should be easier than inspecting all the
functions for potential cache optimization.
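A sketch of such a test predicate for a single-argument function; the shadow cache never changes what the program computes, it only reports disagreement (reportMismatch is an assumed reporting helper, not a real API):

    // Keep calling the real function, but remember past results and flag
    // any input whose result changes across calls.
    function addCacheProbe(name, fn) {
      var seen = {};
      return function (arg) {
        var result = fn(arg);            // program behavior is unchanged
        if (seen.hasOwnProperty(arg)) {
          if (seen[arg] !== result) {
            reportMismatch(name, arg);   // empirically not cache-able
          }
        } else {
          seen[arg] = result;
        }
        return result;
      };
    }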
In the second stage of our policy, we use automatic rewriting to cache the results
of functions that the developer deemed to be free of side-effects. To test the
cost and benefit of each function’s caching, we distribute two versions of the
application: one with the optimization and one without, where both versions
have performance instrumentation added. Over time, we compare our observations of
the two versions and determine when and where the optimization has a benefit. For example, some caches might improve performance on one browser but not another. Other
caches might have a benefit when network latency between the user and the
central service is high, but not otherwise. Regardless, testing optimizations
in the context of a real-world deployment, as opposed to testing only in a
controlled pre-deployment environment, allows us to evaluate performance
improvement while avoiding any potential systematic biases of test workloads or
differences between real-world and test environments.
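A sketch of this second stage, assuming the Transformer serves each user one of two builds at random; memoize replaces the shadow cache with a real one, and formatListing stands in for a hypothetical developer-approved function:

    // Real result caching for a function the developer vetted as pure.
    function memoize(fn) {
      var cache = {};
      return function (arg) {
        if (!cache.hasOwnProperty(arg)) {
          cache[arg] = fn(arg);
        }
        return cache[arg];
      };
    }

    function formatListing(title) {   // stand-in for a real app function
      return "<li>" + title + "</li>";
    }

    // Variant assignment; a deployed policy might pin the choice per user.
    if (Math.random() < 0.5) {
      formatListing = memoize(formatListing);
    }
    // Both variants also carry timing instrumentation (as in Section 3),
    // tagged with their variant, so the Controller can compare them.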
5 Related Work
Several previous projects have worked on improved monitoring techniques for web services and other distributed systems [2, 1], but to our knowledge, live monitoring is the first to extend developers’ visibility into web application behavior on the end-user’s desktop. Others, including Tucek et al. [15], note that moving debugging capability to the end-user’s desktop benefits from leveraging information easily available only at the moment of failure—we strongly agree. In [9], Liblit et al. present an algorithm for isolating bugs in code by using randomly sampled predicates of program behavior from a large user base. We
believe that the adaptive instrumentation of live monitoring can improve on
such algorithms by enabling the use of active learning techniques [5] that use
global information about encountered problems to dynamically control predicate
sampling. Perhaps the closest in spirit to our work is Paradyn [10], which uses dynamic, adaptive instrumentation to find performance bottlenecks in parallel computing applications.
6 Challenges and Implications
In summary, we have presented live monitoring, a framework for improving
developers’ end-to-end visibility into web application behavior through a
continuous, adaptive loop of instrumentation, observation and analysis. As examples,
we have shown how live monitoring can be used to localize bugs and analyze
runtime behavior to detect and evaluate optimization opportunities.
We still face open challenges as we look toward building a practical and deployable live monitoring system, such as the privacy implications of added instrumentation. While we believe that the browser’s sandboxing of web applications, together with the explicit trust users already place in web services to store application-specific personal information (e-mails, purchasing habits, etc.), greatly reduces the potential privacy concerns of extra instrumentation, there may be corner cases where live monitoring would pose a risk. Another challenge
is to maintain the predictability—predictable behavior and performance—of web
applications as we dynamically adapt our instrumentation.
If successful, however, we believe the implications of instant redeployability may go beyond monitoring and also open the door to adaptive recovery techniques, including variations of failure-oblivious computing and Rx techniques [14, 12]. In these cases, the detection of a failure and the discovery of an appropriate mitigation technique in one user’s execution of an application could be immediately applied to help other users, before they experience problems. At the moment, web applications are the most widespread platform with the capability for instant redeployability. In the future, however, automatic update mechanisms and other centralized software management tools [4] might enable instant redeployability in broader domains.
References
1. M.K. Aguilera, J.C. Mogul, J.L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance Debugging for Distributed Systems of Black Boxes. In Proceedings of SOSP 2003.
2. P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for Request Extraction and Workload Modelling. In Proceedings of OSDI 2004.
3. A. Bosworth. How To Provide a Web API. http://www.sourcelabs.com/blogs/ajb/2006/08/how_to_provide_a_web_api.html, Aug 2006.
4. R. Chandra, N. Zeldovich, C. Sapuntzakis, and M.S. Lam. The Collective: A Cache-Based System Management Architecture. In Proceedings of NSDI 2005.
5. D.A. Cohn, Z. Ghahramani, and M.I. Jordan. Active Learning with Statistical Models. Journal of Artificial Intelligence Research, 4, 1996.
6. B. Demsky, M.D. Ernst, P.J. Guo, S. McCamant, J.H. Perkins, and M. Rinard. Inference and Enforcement of Data Structure Consistency Specifications. In Proceedings of ISSTA 2006.
7. ECMA. ECMAScript Language Specification, 3rd Ed. http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf, Dec 1999.
8. Google. Google Web Toolkit. http://code.google.com/webtoolkit/.
9. B. Liblit, M. Naik, A.X. Zheng, A. Aiken, and M.I. Jordan. Scalable Statistical Bug Isolation. In Proceedings of PLDI 2005.
10. B.P. Miller, M.D. Callaghan, J.M. Cargille, J.K. Hollingsworth, R.B. Irvin, K.L. Karavanic, K. Kunchithapadam, and T. Newhall. The Paradyn Parallel Performance Measurement Tool. IEEE Computer, 28(11), Nov 1995.
11. T. O’Reilly. What Is Web 2.0. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web20.html, Sep 2005.
12. F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou. Rx: Treating Bugs as Allergies—A Safe Method to Survive Software Failures. In Proceedings of SOSP 2005.
13. S. Rider. Recent Changes That May Break Your Gadgets. http://microsoftgadgets.com/forums/1438/ShowPost.aspx, Nov 2005.
14. M. Rinard, C. Cadar, D. Dumitran, D.M. Roy, T. Leu, and W.S. Beebee, Jr. Enhancing Server Availability and Security Through Failure-Oblivious Computing. In Proceedings of OSDI 2004.
15. J. Tucek, S. Lu, C. Huang, S. Xanthos, and Y. Zhou. Automatic On-line Failure Diagnosis at the End-User Site. In Proceedings of HotDep 2006.
End Notes
1. We make a distinction between a web service and a web application. The former includes only server-side components, while the latter also includes a significant client-side JavaScript component.
2. Coding in JavaScript today is also made more difficult by a lack of compile-time errors and warnings, static type checking, and private scoping. We do not consider these problems fundamental, however, as current and upcoming tools, such as Google’s Web Toolkit, are remedying these issues [8].
3. JavaScript programs are executed within a single thread, avoiding the possibility of a separate thread having corrupted the data structure.