Two pieces of related work on upgrading large-scale distributed systems are worth noting: work on upgrading the Internet routing infrastructure [9] and work on upgrading classes in object-oriented databases [5].
The Internet routing infrastructure can be viewed as a large distributed application. As many networking researchers have bemoaned, the difficulty of upgrading the Internet infrastructure, or of incorporating new functionality into it, has significantly limited the deployment of new techniques. This difficulty results both from a design that does not accommodate automatic deployment of new functionality and from the distributed ownership of the Internet, which makes it difficult to reach consensus about upgrades. The Active Network [9] community spent many years attempting to address these shortcomings, unfortunately with little success. Nevertheless, we believe that some of the important lessons from this work, and from the deployment of new protocols in the Internet, carry over to the area of upgrading distributed applications.
Liskov et al.'s work [5] focuses on upgrading classes in object-oriented databases, where it is essential to preserve object state across upgrades. Our problem differs significantly: we assume that the distributed application does not require persistent state. Under this assumption, which we believe will hold for many distributed applications, no object state needs to be carried across versions, so upgrade and rollback can be handled with significantly less developer effort.