The PARDA algorithm detects overload at the array based on average IO latency measured over a fixed time period, and adjusts the host's issue queue length (i.e., window size) in response. A separate instance of the PARDA control algorithm executes on each host.
There are two main components: latency estimation and window
size computation. For latency estimation, each host maintains an
exponentially-weighted moving average of IO latency at time
, denoted by
,
to smooth out short-term variations. The weight given to past
values is determined by a smoothing parameter
. Given a new latency observation
,
![]() |
(1) |
The window size computation uses a control mechanism
shown to exhibit stable behavior for FAST TCP:
Whenever the average latency
, PARDA decreases the window
size. When the overload subsides and
, PARDA increases
the window size. Window size adjustments are based on latency
measurements, which indicate load at the array, as well as per-host
values, which specify relative host IO share allocations.
To avoid extreme behavior from the control algorithm,
is
bounded by
. The lower bound
prevents
starvation for hosts with very few IO shares. The upper bound
avoids very long queues at the array, limiting the latency
seen by hosts
that start issuing requests after a period of inactivity. A
reasonable upper bound can be based on typical queue length values in
uncontrolled systems,
as well as the array
configuration and number of hosts.
The latency threshold
corresponds to the response time that
is considered acceptable in the system, and the control algorithm
tries to maintain the overall cluster-wide latency close to this
value.
Testing confirmed our expectation that increasing the array queue
length beyond a certain value doesn't lead to increased throughput.
Thus,
can be set to a value which is high enough
to ensure that a
sufficiently large number of requests can always be pending at the
array. We are also exploring automatic techniques for setting this
parameter based on long-term observations of latency and throughput.
Administrators may alternatively specify
explicitly, based
on cluster-wide requirements, such as supporting
latency-sensitive applications, perhaps at the cost of under-utilizing the
array in some cases.
|
Finally,
is set based on the IO shares associated with the
host, proportional to the sum of its per-VM shares. It has been shown
theoretically in the context of FAST TCP that the equilibrium window
size value for different hosts will be proportional to their
parameters [15].
We highlight two properties of the control equation, again relying on
formal models and proofs from FAST TCP. First, at
equilibrium, the throughput of host
is proportional to
, where
is the per-host allocation parameter,
and
is the queuing delay observed by the host. Second,
for a single array with capacity
and latency threshold
, the window size at equilibrium will be:
![]() |
(3) |
To illustrate the behavior of the control algorithm, we simulated a
simple distributed system consisting of a single array and multiple
hosts using Yacsim [17]. Each host runs an instance of the
algorithm in a distributed manner, and the array services requests
with latency based on an exponential distribution with a mean of
. We conducted a series of experiments with various capacities,
workloads, and parameter values.
To test the algorithm's adaptability, we experimented with three hosts
using a
share ratio,
= 200 ms, and an array capacity
that changes from 400 req/s to 100 req/s halfway through the
experiment. Figure 2 plots the throughput, window
size and average latency observed by the hosts for a period of
seconds. As expected, the control algorithm drives the system to
operate close to the desired latency threshold
. We also used
the simulator to verify that as
is varied (100 ms, 200 ms
and 300 ms), the system latencies operate close to
, and that
windows sizes increase while maintaining their proportional ratio.
Ajay Gulati 2009-01-14