Our general approach maps the problem of distributed storage management to flow control in networks. TCP running at a host implements flow control based on two signals from the network: round trip time (RTT) and packet loss probability. RTT is essentially the same as IO request latency observed by the IO scheduler, so this signal can be used without modification.
However, there is no useful analog of network packet loss in storage systems. While networking applications expect dropped packets and handle them using retransmission, typical storage applications do not expect dropped IO requests, which are rare enough to be treated as hard failures.
Thus, we use IO latency as our only indicator of congestion at the array. To detect congestion, we must be able to distinguish underloaded and overloaded states. This is accomplished by introducing a latency threshold parameter, denoted by L. Observed latencies greater than L may trigger a reduction in queue length. FAST TCP, a recently proposed variant of TCP, uses packet latency instead of packet loss probability, because loss probability is difficult to estimate accurately in networks with high bandwidth-delay products [15]. This choice also helps in high-bandwidth SANs, where packet loss is unlikely and TCP-like AIMD (additive increase, multiplicative decrease) mechanisms can cause inefficiencies. We use a similar adaptive approach based on average latency to detect congestion at the array.
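For concreteness, the following is a minimal sketch of latency-based congestion detection, not the control algorithm developed later in this paper: each host maintains an exponentially weighted moving average of observed IO latency and scales its issue-queue window in proportion to L divided by that average, in the spirit of FAST TCP. The constants LAT_THRESHOLD_MS, EWMA_ALPHA, and GAMMA, and the specific update rule, are illustrative assumptions only.

/*
 * Illustrative sketch of per-host, latency-driven window adjustment.
 * All constants and the update rule are assumptions for this example.
 */
#include <stdio.h>

#define LAT_THRESHOLD_MS 30.0   /* latency threshold L (assumed value) */
#define EWMA_ALPHA        0.2   /* smoothing weight for average latency */
#define GAMMA             0.25  /* smoothing weight for window updates */
#define MIN_WINDOW        4.0
#define MAX_WINDOW       64.0

static double avg_latency = LAT_THRESHOLD_MS;  /* warm start at the threshold */
static double window      = 32.0;              /* current issue-queue depth */

/* Fold one completed IO's latency into the running average. */
static void record_latency(double latency_ms)
{
    avg_latency = (1.0 - EWMA_ALPHA) * avg_latency + EWMA_ALPHA * latency_ms;
}

/* Move the window toward (L / avg_latency) * window, smoothed by GAMMA:
 * latencies above L shrink the window, latencies below L grow it. */
static void adjust_window(void)
{
    window = (1.0 - GAMMA) * window
           + GAMMA * (LAT_THRESHOLD_MS / avg_latency) * window;

    if (window < MIN_WINDOW) window = MIN_WINDOW;
    if (window > MAX_WINDOW) window = MAX_WINDOW;
}

int main(void)
{
    /* Replay a few observed latencies (ms) to show the window reacting. */
    double samples[] = { 12.0, 18.0, 45.0, 52.0, 40.0, 20.0, 15.0 };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        record_latency(samples[i]);
        adjust_window();
        printf("latency=%5.1f ms  avg=%5.1f ms  window=%4.1f\n",
               samples[i], avg_latency, window);
    }
    return 0;
}

Note that when the average latency equals L the window is left unchanged, and sustained latencies above L shrink it gradually rather than halving it, which avoids the oscillations associated with multiplicative decrease.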
Other networking proposals such as RED [9] are based on early detection of congestion using information from routers, before a packet is lost. In networks, this has the added advantage of avoiding retransmissions. However, most proposed networking techniques that require router support have not been adopted widely, due to overhead and complexity concerns; this is analogous to the limited QoS support in current storage arrays.