These emulation experiments demonstrate how the lease management and configuration services scale at saturation. Table 3 lists the parameters used in our experiment: for a given cluster size $n$ at a single site, one service manager injects lease requests to a broker for $n$ nodes (without lease extensions) evenly split across $\lambda$ leases (for $n/\lambda$ nodes per lease) every lease term $t$ (giving a request injection rate of $\lambda/t$). Every lease term the site must reallocate or ``flip'' all $n$ nodes. We measure the total overhead including lease state maintenance, network communication costs, actor database operations, and event polling costs. Given these parameter values we can derive the worst-case minimum lease term, in real time, that the system can support at saturation.
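In symbols, the per-term workload is:
\[
\frac{n}{\lambda} \ \text{nodes per lease}, \qquad \text{request injection rate} \;=\; \frac{\lambda}{t} \ \text{requests per virtual clock tick}.
\]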
As explained in Section 4.2, each actor's operations are driven by a virtual clock at an arbitrary rate. The prototype polls the status of pending lease operations (i.e., completion of join/leave and setup/teardown events) on each tick. Thus, the rate at which we advance the virtual clock has a direct impact on performance: a high tick rate improves responsiveness to events such as failures and completion of configuration actions, but generates higher overhead due to increased polling of lease and resource status. In this experiment we advance the virtual clock of each actor as fast as the server can process the clock ticks, and determine the amount of real time it takes to complete a pre-defined number of ticks. We measure an overhead factor $o$: the average lease management overhead in milliseconds per clock tick. Lower numbers are better.
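As an illustration, a minimal sketch of this measurement loop (Python; the \texttt{actor.tick()} interface is a hypothetical stand-in for the prototype's per-tick processing, not its actual API):

\begin{verbatim}
import time

def overhead_factor(actor, num_ticks):
    """Drive an actor's virtual clock as fast as the server allows
    and return o, the average lease management overhead in
    milliseconds per clock tick (lower is better)."""
    start = time.monotonic()
    for _ in range(num_ticks):
        # Hypothetical per-tick work: poll pending join/leave and
        # setup/teardown events; reassign nodes once per lease term.
        actor.tick()
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return elapsed_ms / num_ticks
\end{verbatim}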
Local communication. In this experiment, all actors run on a single x335 server and communicate with local method calls and an in-memory database (no LDAP). Figure 10 graphs $o$ as a function of lease term $t$ in virtual clock ticks; each line presents a different value of $\lambda$, keeping $n$ constant at 240. The graph shows that as $t$ increases, the average overhead per virtual clock tick decreases; this occurs because actors perform the most expensive operation, the reassignment of nodes, only once per lease term, leaving less expensive polling operations for the remainder of the term. Thus, as the number of polling operations increases, they begin to dominate $o$. Figure 10 also shows that $o$ increases with the number of leases $\lambda$ injected per term, reflecting the added overhead of managing more leases.
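One way to read both trends is a simple cost model (our gloss; the constants $C_{\mathit{flip}}$ and $C_{\mathit{poll}}$ are assumptions, not measured quantities): if reassigning a node costs $C_{\mathit{flip}}$ ms once per term and polling a lease costs $C_{\mathit{poll}}$ ms per tick, then
\[
o \;\approx\; \frac{C_{\mathit{flip}} \cdot n}{t} \;+\; C_{\mathit{poll}} \cdot \lambda,
\]
which falls toward the polling floor $C_{\mathit{poll}} \cdot \lambda$ as $t$ grows and rises with $\lambda$, matching Figure 10.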
At a clock rate of one tick per second, the overhead represents less than 1% of the latency to prime a node (i.e., to write a new OS image on local disk and boot it). As an example from Figure 10, given this tick rate, for a lease term of 1 hour (3,600 virtual clock ticks), the total overhead of our implementation is $o \cdot t$ milliseconds with $\lambda = 24$ leases per term. This overhead represents the minimum lease term we can support considering implementation overhead alone. For COD, these overheads are at least an order of magnitude less than the setup/teardown cost of nodes with local storage. From this we conclude that the setup/teardown cost, not overhead, is the limiting factor for determining the minimum lease term. However, overhead may have an effect on more fine-grained resource allocation, such as CPU scheduling, where reassignments occur at millisecond time scales.
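For concreteness, with an illustrative (hypothetical, not measured) overhead factor of $o = 2$ ms per tick:
\[
o \cdot t \;=\; 2\ \text{ms} \times 3{,}600\ \text{ticks} \;=\; 7.2\ \text{s}
\]
of lease management overhead per 1-hour term; the measured curve in Figure 10 supplies the actual value of $o$.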
Table 4 shows the effect of varying the cluster size $n$ on the overhead factor $o$. For each row of the table, the service manager requests one lease ($\lambda = 1$) for all nodes ($n/\lambda = n$ nodes per lease) with a lease term of 3,600 virtual clock ticks (corresponding to a 1-hour lease with a tick rate of 1 second). We report the average and one standard deviation $\sigma$ of $o$ across ten runs. As expected, $o$ and $\sigma$ increase with cluster size, but as before, $o$ remains an order of magnitude less than the setup/teardown costs of a node.
SOAP and LDAP. We repeat the same experiment with the service manager running on a separate x335 server, communicating with the broker and authority using SOAP/XML. The authority and broker write their state to a shared LDAP directory server. Table 5 shows the impact of the higher overhead on $o$ and $\sigma$, for $n = 240$. Using $o$, we calculate the maximum number of node flips per millisecond, $f = n/(o \cdot t)$, at saturation. The SOAP and LDAP overheads dominate all other lease management costs: with $n = 240$ nodes, an x335 can process 380 node flips per second, but SOAP and LDAP communication overheads reduce peak flip throughput to 1.9 nodes per second. Even so, neither value presents a limiting factor for today's cluster sizes (thousands of nodes). Using SOAP and LDAP at saturation requires a minimum lease term of 122 seconds, which approaches the setup/teardown latencies (Section 5.1).
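As a consistency check using the numbers above: at saturation every node flips once per term, so the minimum real-time lease term is the number of flips divided by the flip throughput,
\[
t_{\min} \;=\; \frac{n}{f} \;=\; \frac{240\ \text{nodes}}{1.9\ \text{flips/s}} \;\approx\; 126\ \text{s},
\]
in line with the reported 122-second minimum term (the gap reflects rounding of the 1.9 flips/s figure).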
From these scaling experiments, we conclude that lease overhead is quite modest, and that costs are dominated by per-tick resource polling, node reassignment, and network communication. In our prototype, the dominant costs are LDAP accesses, SOAP operations, and the cost for Ant to parse the XML configuration actions and log them.