Adaptivity to Changing Load

This section demonstrates the role of brokers in arbitrating resources under changing workload and in coordinating resource allocation across multiple sites. The experiment runs under emulation (as described in Section 4.2) with null resource drivers, virtual time, and lease state stored only in memory (no LDAP). In all other respects the emulation is identical to a real deployment. We use two emulated 70-node cluster sites with a shared broker. The broker implements a simple policy that balances the load evenly among the sites.
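The balancing policy is not spelled out in the text. As a rough sketch of one way such a policy could work, the following function places each requested node at the site with the most free nodes, which keeps the load on equally sized sites even. The function name `balance_grant` and the site names are hypothetical and are not taken from the prototype.

```python
def balance_grant(request, free):
    """Place `request` nodes across sites, always choosing the least-loaded site.

    `free` maps a (hypothetical) site name to its count of unallocated nodes.
    Returns a dict mapping each site to the number of nodes granted there.
    """
    remaining_free = dict(free)
    grant = {site: 0 for site in free}
    # Never place more nodes than the sites have available in total.
    to_place = min(request, sum(free.values()))
    for _ in range(to_place):
        # Pick the site with the most free nodes; ties go to the
        # alphabetically first site so the result is deterministic.
        site = max(sorted(remaining_free), key=lambda s: remaining_free[s])
        grant[site] += 1
        remaining_free[site] -= 1
    return grant


if __name__ == "__main__":
    print(balance_grant(50, {"site-a": 70, "site-b": 70}))
```

With two empty 70-node sites, a request for 50 nodes is split 25 and 25 under this assumed policy; a request larger than the 140 available nodes is truncated to what the sites can supply.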

**Figure:** Scaled resource demands for one-month traces from an e-commerce website and a production batch cluster. The e-commerce load signal is smoothed with a *flop-flip* filter for stable dynamic provisioning.

We implemented an adaptive service manager that requests resource leases at five-minute intervals to match a changing load signal. We derived sample input loads from traces of two production systems: a job trace from a production compute cluster at Duke, and a trace of CPU load from a major e-commerce website. We scaled the load signals to a common basis. Figure 6 shows scaled cluster resource demand (interpreted as the number of nodes to request) over a one-month segment for both traces (five-minute intervals). We smoothed the e-commerce demand curve with a "flop-flip" filter from [6]. This filter holds a stable estimate of demand $E_t = E_{t-1}$ until that estimate falls outside some tolerance of a moving average ($E_t = \beta E_{t-1} + (1-\beta)O_t$) of recent observations, then it switches the estimate to the current value of the moving average. The smoothed demand curve shown in Figure 6 uses a 150-minute sliding window moving average, a step threshold of one standard deviation, and a heavily damped average $\beta$ =
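The filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter implementation from [6]: the text does not specify exactly how the sliding window and the damped average combine, so the sketch assumes the damped average $\beta E_{t-1} + (1-\beta)O_t$ serves as the moving average and the standard deviation is taken over the sliding window of recent observations. The class name `FlopFlipFilter` and its parameter names are hypothetical, and `beta` is left as a required argument. The default window of 30 samples corresponds to 150 minutes of five-minute observations.

```python
from collections import deque
from statistics import pstdev


class FlopFlipFilter:
    """Hold a stable demand estimate until it drifts too far from a moving average."""

    def __init__(self, beta, window=30, threshold=1.0):
        self.beta = beta                    # damping factor of the moving average
        self.threshold = threshold          # step threshold, in standard deviations
        self.recent = deque(maxlen=window)  # sliding window of observations
        self.average = None                 # damped moving average
        self.estimate = None                # stable estimate reported to the caller

    def update(self, observation):
        """Feed one observation and return the current stable estimate."""
        self.recent.append(observation)
        if self.average is None:
            # First observation seeds both the average and the estimate.
            self.average = observation
            self.estimate = observation
            return self.estimate
        self.average = self.beta * self.average + (1 - self.beta) * observation
        # Assumption: tolerance is the standard deviation of the window.
        tolerance = self.threshold * pstdev(self.recent)
        if abs(self.estimate - self.average) > tolerance:
            # The estimate fell outside the tolerance: switch to the average.
            self.estimate = self.average
        return self.estimate
```

Feeding the filter one observation per five-minute interval yields a piecewise-constant signal: small fluctuations leave the estimate unchanged, while a sustained shift in load eventually moves it to the current moving average.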

**Figure:** The effect of longer lease terms on a broker's ability to match guest application resource demands. The website's service manager issues requests for machines, but as the lease term increases, the broker is less effective at matching the demand.

**Figure:** Lease term of 12 emulated hours with brokering of 140 machines from two sites between a low-priority computational batch cluster and a high-priority e-commerce website that compete for machines. Where there is contention for machines, the high-priority website receives its demand, causing the batch cluster to receive less. Short lease terms closely track resource demands, while long lease terms (below) cannot match short spikes in demand.

**Figure:** Lease term of 3 emulated days with brokering of 140 machines from two sites between a low-priority computational batch cluster and a high-priority e-commerce website that compete for machines. Where there is contention for machines, the high-priority website receives its demand, causing the batch cluster to receive less. Short lease terms closely track resource demands, while long lease terms, as here, cannot match short spikes in demand.

Figure 7 demonstrates the effect of varying lease terms on the broker's ability to match the e-commerce load curve. For a lease term of one day, the leased resources closely match the load; however, longer terms diminish the broker's ability to match demand. To quantify the effectiveness and efficiency of allocation over the one-month period, we compute the root mean squared error (RMSE) between the load signal and the requested resources. Numbers closer to zero are better: an RMSE of zero indicates that allocation exactly matches demand. For a lease term of 1 day, the RMSE is 22.17 and for a lease term of 7 days, the RMSE is 50.85. Figure 7 reflects a limitation of the pure brokered leasing model as prototyped: a lease holder can return unused resources to the authority, but it cannot return the ticket to the broker to allocate for other purposes.
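For reference, the RMSE metric can be computed as in the sketch below. The function name `rmse` and the argument names are illustrative; both sequences are assumed to be sampled at the same intervals (for example, one value per five-minute step).

```python
import math


def rmse(demand, allocated):
    """Root mean squared error between a load signal and the resources allocated.

    Both arguments are equal-length sequences of node counts, one per interval.
    """
    if len(demand) != len(allocated) or len(demand) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    squared_error = sum((d - a) ** 2 for d, a in zip(demand, allocated))
    return math.sqrt(squared_error / len(demand))
```

An allocation that matches demand at every interval gives an RMSE of zero. Because errors are squared before averaging, a few large mismatches (such as a missed demand spike) raise the RMSE more than many small ones.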

To illustrate adaptive provisioning between competing workloads, we introduce a second service manager that competes for resources according to the batch load signal. The broker uses FCFS priority scheduling to arbitrate resource requests; the interactive e-commerce service receives a higher priority. Figures 8 and 9 show the assigned slice sizes for lease terms of (a) 12 emulated hours and (b) 3 emulated days, respectively. As expected, the batch cluster receives fewer nodes during load surges in the e-commerce service. However, with longer lease terms, load matching becomes less accurate, and some short demand spikes are not served. In some instances, resources assigned to one guest are idle while the other guest saturates but cannot obtain more. This is seen in the RMSE calculated from Figures 8 and 9: the website has an RMSE of (a) 12.57 and (b) 30.70, and the batch cluster has an RMSE of (a) 23.20 and (b) 22.17. There is a trade-off in choosing the length of lease terms: longer terms are more stable and better able to amortize resource setup/teardown costs, improving fidelity (from Section 5.1), but they are less agile in responding to changing demand than shorter leases.
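The interaction between lease terms and priority arbitration can be illustrated with a small simulation. This is a simplified sketch under several assumptions rather than the emulation used in the experiments: time advances in discrete steps, guests are served strictly in priority order at each step, a guest requests only the shortfall between its demand and its current holdings, and a granted lease cannot be released before it expires. The function name `simulate` and the guest names are hypothetical.

```python
def simulate(demands, priorities, capacity, term):
    """Simulate priority brokering of `capacity` nodes with fixed-length leases.

    demands:    dict mapping guest name -> list of per-step node demand
    priorities: dict mapping guest name -> priority (higher is served first)
    term:       lease length in steps
    Returns a dict mapping guest name -> list of nodes held at each step.
    """
    order = sorted(demands, key=lambda g: priorities[g], reverse=True)
    steps = len(demands[order[0]])
    leases = {g: [] for g in order}  # per guest: list of (expiry_step, nodes)
    held_history = {g: [] for g in order}
    for t in range(steps):
        # Expire leases whose term has ended.
        for g in order:
            leases[g] = [(exp, n) for exp, n in leases[g] if exp > t]
        free = capacity - sum(n for g in order for _, n in leases[g])
        for g in order:
            held = sum(n for _, n in leases[g])
            # Request only the shortfall; excess holdings stay leased.
            want = max(0, demands[g][t] - held)
            grant = min(want, free)
            if grant > 0:
                leases[g].append((t + term, grant))
                free -= grant
            held_history[g].append(held + grant)
    return held_history


if __name__ == "__main__":
    demands = {"web": [8, 2, 2, 8], "batch": [5, 5, 5, 5]}
    priorities = {"web": 2, "batch": 1}
    print(simulate(demands, priorities, capacity=10, term=1))
    print(simulate(demands, priorities, capacity=10, term=3))
```

In this toy example with 10 nodes, a one-step term lets the batch guest reclaim nodes as soon as the website's demand drops, whereas a three-step term leaves the website holding 8 nodes through its lull while the batch guest remains at 2. This mirrors the idle-while-saturated behavior described above.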