An important capability that our distributed application life-cycle requires is the ability to upgrade to new software versions and, when necessary, to revert to a previous version rapidly and easily. We refer to this as the software upgrade/rollback problem. In this section, we explore the challenges in addressing this problem and describe part of the design space of possible solutions.
Broadly, upgrades can be classified as synchronous or asynchronous. A synchronous upgrade is one in which the software on all nodes must be upgraded near-simultaneously. An asynchronous upgrade, in contrast, does not require coordination among the nodes of the distributed application. Synchronous upgrades therefore place greater demands on the upgrade/rollback system than asynchronous upgrades do.
An example of a distributed application that might use asynchronous upgrades is a web server cluster. The prototypical architecture for a web server cluster might consist of a load-balancer, a number of identical web servers, and a back-end database. Because the web servers are, in some sense, replicas, the unavailability of a single web server node does not greatly impact the overall service. Also, the web servers do not need to run the same software release in order to maintain the correctness of the overall service. Accordingly, upgrade and rollback need not be coordinated amongst the web server nodes and can be done asynchronously.
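As a minimal sketch of such an asynchronous upgrade, the following Python fragment upgrades the web servers one at a time. The `WebServer` class, the `in_rotation` flag standing in for the load-balancer's state, and the `healthy` check are all hypothetical names introduced for illustration; they are not part of any system described here.

```python
from dataclasses import dataclass


@dataclass
class WebServer:
    name: str
    version: str
    in_rotation: bool = True  # hypothetical stand-in for load-balancer state


def rolling_upgrade(servers, new_version, healthy=lambda s: True):
    """Upgrade servers one at a time; no cross-node coordination is needed."""
    for server in servers:
        old_version = server.version
        server.in_rotation = False    # stop sending traffic to this node
        server.version = new_version  # install the new release
        if not healthy(server):
            # Roll back this node only; the other nodes are unaffected.
            server.version = old_version
        server.in_rotation = True     # rejoin the cluster
    return [s.version for s in servers]
```

Because each node is handled independently, the cluster may run a mix of releases during (and even after) the upgrade, which is acceptable only because the servers do not need to run the same release for the service to remain correct.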
Routing applications are examples of distributed applications that may require synchronous upgrades. In contrast to the web server cluster, the nodes are not replicas, so when a single node is unavailable, any resources unique to that node are unavailable as well. In addition, the routing nodes do not operate independently: they must cooperate in order to provide service to their clients. Thus, we must either design upgrades to be backwards compatible, or coordinate upgrade (and rollback) among the routing nodes.
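One way to picture the coordinated alternative is an all-or-nothing switch: every node moves to the new release, and if any node fails to do so, the nodes that already switched are rolled back. The sketch below is only an illustration under simplifying assumptions (a single coordinator, no network failures); the `RoutingNode` class and its `fails_on` parameter are hypothetical and exist only to simulate a failed activation.

```python
class RoutingNode:
    def __init__(self, name, version, fails_on=None):
        self.name = name
        self.version = version
        self.previous = None
        self.fails_on = fails_on  # hypothetical: a version this node cannot activate

    def activate(self, new_version):
        """Switch to new_version, remembering the old one for rollback."""
        if new_version == self.fails_on:
            return False
        self.previous, self.version = self.version, new_version
        return True

    def rollback(self):
        """Return to the version that was running before the last activate."""
        if self.previous is not None:
            self.version, self.previous = self.previous, None


def synchronous_upgrade(nodes, new_version):
    """Upgrade all nodes together; on any failure, roll back those already switched."""
    switched = []
    for node in nodes:
        if node.activate(new_version):
            switched.append(node)
        else:
            for done in reversed(switched):
                done.rollback()
            return False
    return True
```

In contrast to the per-node rollback that suffices for replicas, a failure on one routing node here undoes the upgrade everywhere, so the nodes never remain on mismatched releases once the operation completes.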