Ioannis Avramopoulos1, Jennifer Rexford2, and Robert Schapire2
Deutsche Telekom Laboratories1 and Princeton University2
In this paper, we propose a different framework for adaptive routing decisions based on regret-minimizing online learning algorithms. These algorithms, as applied to routing, are appealing because adopters can independently improve their own performance while being robust to adversarial behavior. However, in contrast to approaches based on optimization theory that provide guarantees from the outset about network-wide behavior, the network-wide behavior that emerges when online learning algorithms interact with each other is less well understood. In this paper, we study this interaction in a realistic Internet environment, and find that the outcome is a stable state and that the optimality gap with respect to the network-wide optimum is small. Our findings suggest that online learning may be a suitable framework for adaptive routing decisions in the Internet.
The global flow of Internet traffic depends on the interaction of independent networks that interconnect through routing protocols to deliver the traffic. The routing decisions are based for the most part on static information (such as the number of hops to the destination or business relationships with neighboring networks), ignoring dynamic performance metrics. Making routing decisions adaptive would improve not only the performance but also the security of the global routing system, by enabling end systems to route around adversaries [21].
Previous approaches to add adaptivity to routing decisions have been mostly based on optimization theory. These efforts date back to the work of Gallager [14] and have received significant attention since then (see, for example, [9,18,16]). However, such optimal routing algorithms rely on trust (assuming, for example, that routers provide feedback about performance in a truthful manner) and seek to optimize network-wide objectives possibly at the expense of the performance of individual flows. In the Internet, an adversary can exploit assumptions of trust to disrupt communication, and, therefore, Internet routing should be robust to such adversarial behavior. Furthermore, in the general interdomain setting, individual networks are not likely to seek to optimize a network-wide objective but rather their own performance.
In this paper, we propose a different framework for adding adaptivity to routing decisions, one based on the substantial body of work in learning theory and game theory on algorithms for making repeated decisions that aim to minimize regret. The regret of an algorithm is the difference between the performance of the sequence of decisions generated by the algorithm and the performance of the best fixed decision in hindsight. Several decision-making algorithms have been proposed that approach zero regret even against a fully adaptive adversary (e.g., [19,2]). The problem of routing data traffic between a source and a destination node over a set of paths can be cast as a problem of repeated decision making in which the routing algorithm must decide in a repeated fashion over which paths to forward the traffic.
Casting the problem of routing as a decision-making problem enables us to leverage recent theoretical results on regret minimization to develop a framework for making routing decisions adaptive. This framework is able to address the disadvantages of optimization-theory-based approaches. The reason is that a zero-regret routing algorithm as employed by a single source-destination pair is able to match the performance of the best path between source and destination irrespective of how the remaining pairs behave. This property is compatible with the incentives of rational adopters that prioritize optimizing their own performance over centrally designed network-wide objectives. Furthermore, the performance of the best path is matched even against a fully adaptive adversary that controls routers in a subset of the paths and behaves maliciously to disrupt communication. Using regret minimization, this disruption is prevented as long as an adversary-free path exists.
However, although previous work on regret minimization has developed algorithms that make these guarantees possible, it has largely neglected the question of what the network-wide behavior of the system would be if zero-regret routing algorithms were to interact with each other. Note that although it is certainly true that each source-destination pair is able to match the performance of the best path, it is nevertheless important to demonstrate that good performing paths do in fact exist. This is important for deployment in Internet environments where achieving the ability to counter adversaries should not affect the efficiency during normal operation. In this paper we seek to answer the question of what the network-wide behavior of a system of interacting zero-regret algorithms is through a realistic simulation study using data collected from the Internet2 backbone network.
Our findings can be summarized along two dimensions. First, we find that the outcome of the interaction is a stable state, which agrees with previous theoretical results derived in the Wardrop model of infinite traffic sources controlling infinitesimal amounts of traffic [6]. This model differs, however, from ours in that we consider finite traffic demands. Second, we find that performance at the equilibrium is close to optimal. This result is in contrast to previous theoretical results on the price of total anarchy that predict a large optimality gap in the worst case [7]. Furthermore, our findings assume that the routing algorithms only have access to end-to-end measurements and do not rely on feedback from intermediate nodes. Our results suggest that regret minimization is aptly positioned to drive Internet routing decisions.
An online decision problem can be formulated as a repeated game between a decision maker and the environment [8]. The game proceeds in rounds: at each round (or time step) $t = 1, \ldots, T$ of the time horizon $T$, the decision maker must choose a probability distribution $p_t$ over a set of $K$ actions. Then the environment (which is possibly controlled by an adversary) chooses one reward $x_t(i)$ for each action $i$. The action $i_t$ of the decision maker is drawn according to distribution $p_t$, and the decision maker receives reward $x_t(i_t)$. The gain of action $i$ is the sum of the action's rewards over the time horizon, i.e., $G_i = \sum_{t=1}^{T} x_t(i)$, and the gain of the decision maker is the sum of the received rewards over the time horizon, i.e., $G = \sum_{t=1}^{T} x_t(i_t)$. The regret of the decision maker is defined as $\max_i G_i - G$ (often normalized by dividing by $T$). The goal of the decision maker is to minimize the regret and thereby approach the gain of the best action.
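As a concrete illustration of these definitions, the following self-contained sketch computes the gain of each action, the gain of the decision maker, and the resulting normalized regret; the synthetic uniform rewards and the uniformly random decision maker are purely illustrative assumptions:

```python
import random

random.seed(0)
T, K = 1000, 3                                                     # horizon, actions
rewards = [[random.random() for _ in range(K)] for _ in range(T)]  # x_t(i)
chosen = [random.randrange(K) for _ in range(T)]                   # i_t (random player)

gain = [sum(rewards[t][i] for t in range(T)) for i in range(K)]    # G_i
player_gain = sum(rewards[t][chosen[t]] for t in range(T))         # G

regret = max(gain) - player_gain
print(regret / T)  # normalized regret of the uniformly random player
```

A zero-regret algorithm would drive this normalized quantity toward zero as $T$ grows, whereas the random player here leaves a persistent gap.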
In the course of the game, the decision maker gathers and uses as input information about the environment. Performance depends on the amount of information that can be gathered. In the full information setting, after a decision is made at each time step the decision maker observes the entire reward vector $(x_t(1), \ldots, x_t(K))$. That is, access is given not only to the rewards that were received because of the actions that were taken but also to the rewards that would have been received if alternate actions had been taken. In the multi-armed bandit setting, at each time step the decision maker only observes the reward $x_t(i_t)$ for the action that was actually taken. In both settings, there exist algorithms whose normalized regret approaches zero as the time horizon approaches infinity, even if the rewards are generated by a fully adaptive adversary who controls the environment and is able to observe the decisions of the decision maker. In the full information setting, the regret is lower by roughly a factor proportional to $\sqrt{K}$.
The problem of routing data traffic between a source node and a destination node in a network can be cast as a decision-making problem as follows. The decision maker is an independent instance of the routing algorithm (making routing decisions for the traffic between, say, a given pair of source and destination nodes), and the actions available correspond to the paths along which data packets can be forwarded. The probability distribution $p_t$ chosen at each time step determines the routing decisions for the data packets and essentially corresponds to how incoming traffic is split over the outgoing paths. We call this probability vector the distribution vector. The routing algorithm must decide at each time step how to adjust the distribution vector; the time steps correspond to the instants at which the algorithm can revise its decision. The reward for each decision corresponds to some performance metric such as packet loss, delay, or throughput. Such performance metrics can be estimated in practice through measurements (whether based on simple active probes or more secure measurement techniques [3,15]).
In modeling the decision process of the routing algorithm, we furthermore take into account the following in a realistic Internet environment. First, the decisions made by the routing algorithm have an impact on future rewards. This is true not only because of the time required to deliver packets from source to destination but also because of the interaction of the decisions made by different source-destination pairs. Because of this interaction, the response of the environment to the actions of the decision maker should not be considered independent of, or oblivious to, those actions (as has been assumed by a significant body of work on regret minimization) but rather dependent on them, i.e., adaptive.
Second, we make the following observations about the amount of information that is available to the routing algorithm. The full information setting assumes that the decision maker has access to the rewards that would have been obtained if alternate decisions had been made. However, standard measurement techniques cannot provide this information; i.e., they are not able to predict the performance that would have been observed if the data traffic had been forwarded over an alternate path. Therefore, in a realistic environment, the routing algorithm may only learn the rewards for the specific actions that were taken, which corresponds to the bandit setting. In the remainder of this paper, we only consider this latter setting and, furthermore, assume that performance estimates are obtained through end-to-end measurements.
It is worth noting that because in practice the number of paths $K$ available to the routing algorithm is limited, and because the regret in the bandit setting is only roughly $\sqrt{K}$ times worse than the regret in the full information setting, performance under both settings is comparable.
The learning algorithm we consider in this paper is algorithm Exp3 [2] (Figure 1). The algorithm proceeds in rounds. At each round, a probability distribution is selected over the paths and a routing decision is made for one unit of traffic demand (e.g., a packet). The outcome of the decision is a reward that is used to update labels according to formula (2). In the next round, these labels are used to recompute the distribution according to formula (1). Note that the probability of selecting a path depends not only on past performance but also on a random component determined by parameter $\gamma$. This parameter controls a tradeoff between exploration and exploitation. If $\gamma = 1$, the algorithm only explores (and does not exploit). If $\gamma = 0$, the algorithm only exploits. If $\gamma$ is chosen appropriately between these extremes, the normalized regret of Exp3 provably approaches 0 as the time horizon $T$ approaches $\infty$.
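Since Figure 1 is not reproduced here, the following minimal Python sketch shows the structure of Exp3 as described in [2]; the class and method names are ours, the mixing in `distribution` plays the role of formula (1), and the importance-weighted update in `update` plays the role of formula (2):

```python
import math
import random

class Exp3:
    """Minimal sketch of the Exp3 algorithm [2] applied to path selection.

    gamma in (0, 1] trades off exploration against exploitation:
    gamma = 1 explores uniformly at random; gamma = 0 only exploits.
    """

    def __init__(self, num_paths, gamma):
        self.k = num_paths
        self.gamma = gamma
        self.weights = [1.0] * num_paths  # the per-path "labels"

    def distribution(self):
        # Mix the normalized weights with the uniform distribution
        # (the role of formula (1)).
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def choose(self):
        # Draw a path according to the current distribution vector.
        return random.choices(range(self.k), weights=self.distribution())[0]

    def update(self, path, reward):
        # Importance-weighted reward estimate for the chosen path only
        # (the role of formula (2)); reward is assumed to lie in [0, 1].
        p = self.distribution()
        estimate = reward / p[path]
        self.weights[path] *= math.exp(self.gamma * estimate / self.k)

# Illustrative run: path 2 consistently performs best.
random.seed(1)
alg = Exp3(num_paths=3, gamma=0.1)
for _ in range(2000):
    path = alg.choose()
    alg.update(path, 1.0 if path == 2 else 0.2)
print(alg.distribution())  # probability mass concentrates on path 2
```

Note that every path retains probability at least $\gamma / K$, which is the persistent exploration discussed in the text.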
To study the optimality and stability of zero-regret routing algorithms, our evaluation is based on an artificial and a realistic setting. The artificial setting provides a minimal environment in which to study the interaction of adaptive routing decisions, whereas the realistic setting is based on a more accurate model of an Internet environment. We start with the artificial setting in which we are able to show a positive result that motivates our investigation in the realistic setting.
Network topology and traffic demands:
In the artificial topology, shown on the right of Figure 2, three source nodes must simultaneously send one unit of traffic each to a common destination. Each source is able to reach the destination through three alternate paths, each crossing one of three intermediate nodes. We assume that all links have unit capacity; the links from the intermediate nodes to the destination are therefore the bottleneck links.
Simulation setup: The simulation proceeds in steps (or rounds). At the beginning of a round, each source selects a path according to its distribution vector and forwards its unit of traffic along that path; the reward of the decision reflects whether the traffic unit is successfully delivered over the bottleneck links.
Performance metrics: In this artificial setting, we measure performance by counting the number of traffic units that are successfully delivered to the destination. Given the small size of the network, to study the stability of the interaction, we simply plot the distribution vectors as a function of the simulation step.
Figure 3 shows the probability of selecting each path in this artificial setting as a function of the simulation step. After a transient period of exploration, the routing processes converge to an outcome in which each source selects a distinct bottleneck path with very high probability (and the remaining paths with very low probability). This corresponds to a close-to-optimal network-wide outcome in which the probability of delivering three traffic units to the destination is very high. Furthermore, this outcome is stable in that the distribution vectors converge to a fixed distribution. This behavior was observed for a wide range of values of parameter $\gamma$.

It is worth noting that the routing processes do not converge to fixed routing decisions--every combination of decisions has a non-zero probability of arising. This is typical of online learning algorithms, which always explore alternate possibilities.
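The interaction in this artificial setting can be reproduced with a compact, self-contained sketch. The reward model below (a shared unit-capacity bottleneck delivers an equal fraction of each competing source's traffic) and the parameter values are illustrative assumptions, not the paper's exact setup:

```python
import math
import random

random.seed(0)
K, GAMMA, ROUNDS = 3, 0.05, 5000
# One Exp3 weight vector per source; action i means "route via bottleneck i".
weights = [[1.0] * K for _ in range(3)]

def dist(w):
    # Exp3 distribution: normalized weights mixed with uniform exploration.
    s = sum(w)
    return [(1 - GAMMA) * x / s + GAMMA / K for x in w]

delivered_late = 0.0  # total fraction delivered over the last 500 rounds
for t in range(ROUNDS):
    probs = [dist(w) for w in weights]
    picks = [random.choices(range(K), weights=p)[0] for p in probs]
    load = [picks.count(i) for i in range(K)]  # sources per bottleneck link
    for s in range(3):
        p, i = probs[s], picks[s]
        reward = 1.0 / load[i]      # unit-capacity bottleneck shared equally
        estimate = reward / p[i]    # importance-weighted bandit estimate
        weights[s][i] *= math.exp(GAMMA * estimate / K)
        if t >= ROUNDS - 500:
            delivered_late += reward

# Modal path of each source after learning (typically three distinct paths).
print([max(range(K), key=lambda i: dist(w)[i]) for w in weights])
```

Runs of this sketch mirror the behavior described above: after an exploratory transient, the sources tend to settle on distinct bottlenecks and the delivered fraction approaches one.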
Network topology and traffic demands:
To study adaptive routing in a realistic environment, we use the Internet2 network (Figure 2), which provides Internet connectivity for research institutions in the US. In Internet2, the topology, routing, and traffic data are publicly available, enabling us to reproduce a realistic intradomain environment. Using these data, we computed hourly traffic matrices during one week in May 2008.

To a certain extent, this environment also approximates an interdomain setting. An ideal approximation of an interdomain environment would correspond to each customer of Internet2 independently controlling its routing decisions. However, we found that the data are not sufficient to calculate a traffic matrix at this granularity. Instead, we compute router-to-router traffic matrices and assume that the traffic between each pair of routers is controlled by an independent process. In this way, we are able to study the interaction of independent flows. We leave a more thorough evaluation in general interdomain environments as future work.
We found that the network utilization from the traffic matrices calculated by our direct measurements was low. To also study adaptivity under conditions of high load, we generated synthetic traffic matrices by scaling the real ones to achieve a given maximum link utilization. Multiple paths between a source and destination were computed by successive shortest-path computations in which at most one link was removed from the topology in an iterative fashion, yielding several alternate paths per source-destination pair on average.
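This path-computation procedure can be sketched as follows; the sketch is one plausible reading of the description (plain Dijkstra, re-run with one link of the base shortest path removed at a time), and the toy topology, weights, and function names are hypothetical:

```python
import heapq

def shortest_path(adj, src, dst):
    """Dijkstra over adj: node -> {neighbor: weight}. Returns a node list or None."""
    dist, prev, seen = {src: 0.0}, {}, set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == dst:
            break
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path, u = [dst], dst
    while u != src:
        u = prev[u]
        path.append(u)
    return path[::-1]

def alternate_paths(adj, src, dst):
    """Successive shortest paths: re-run Dijkstra with at most one link
    of the base shortest path removed at a time."""
    base = shortest_path(adj, src, dst)
    if base is None:
        return []
    paths = [base]
    for u, v in zip(base, base[1:]):
        pruned = {n: dict(nbrs) for n, nbrs in adj.items()}
        pruned[u].pop(v, None)
        pruned.get(v, {}).pop(u, None)  # remove both directions of the link
        alt = shortest_path(pruned, src, dst)
        if alt is not None and alt not in paths:
            paths.append(alt)
    return paths

# Toy four-node topology (hypothetical, for illustration only).
adj = {
    "A": {"B": 1, "C": 2},
    "B": {"A": 1, "D": 1},
    "C": {"A": 2, "D": 1},
    "D": {"B": 1, "C": 1},
}
print(alternate_paths(adj, "A", "D"))  # [['A', 'B', 'D'], ['A', 'C', 'D']]
```

Each removal exposes at most one new path, so the number of alternate paths per pair stays small, matching the limited choice set assumed for the bandit setting.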
Simulation setup: The simulation proceeds in rounds. At the beginning of each round, each routing process splits its traffic demand according to its distribution vector and forwards it along the corresponding paths. That is, if the traffic demand for a source-destination pair is $d$ and its distribution vector is $p$, then the flow placed on path $i$ is $d \cdot p_i$.
After computing the total flow on each link, the queuing delay is determined. If utilization does not exceed a threshold, the queuing delay is determined by the M/M/1 formula. However, similarly to [12], if utilization exceeds the threshold, the delay grows linearly in the utilization with a large slope. In this way, we can assign finite delay even to links whose utilization exceeds one.
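The piecewise delay model can be sketched as follows; the threshold and slope values are illustrative placeholders, since the paper's specific constants are not given in the text:

```python
def link_delay(flow, capacity, threshold=0.95, slope=1000.0):
    """Queuing delay on a link: M/M/1 below a utilization threshold,
    linear in utilization (with a large slope) above it, so that links
    loaded beyond capacity still receive a finite delay.

    threshold and slope are illustrative values, not those of the paper.
    """
    util = flow / capacity
    if util < threshold:
        return 1.0 / (capacity - flow)  # M/M/1 queuing delay: 1 / (c - f)
    # Continue from the threshold point with a steep linear segment.
    base = 1.0 / (capacity * (1.0 - threshold))
    return base + slope * (util - threshold)

print(link_delay(0.5, 1.0), link_delay(1.2, 1.0))
```

The key property is that delay remains finite and strictly increasing past full utilization, which keeps the learning signal well defined even under overload.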
Following the computation of the link delays, each routing process learns the end-to-end delay of each path to which it forwarded data traffic. Then each process updates its distribution vector by simulating algorithm Exp3 as if the demand consisted of discrete traffic units.
Performance metrics: The network-wide metric we use to evaluate performance is the average queuing delay over all links. We compare the average queuing delay incurred by interacting instances of the learning algorithm to the optimal average queuing delay (assuming multipath routing) that we computed using the cvx convex optimization software [1].
In the Internet2 network, evaluating stability by plotting the distribution vectors is unwieldy. Instead, we evaluate stability by observing over time the ``differences'' between successive distribution vectors (and summing such differences over all source-destination pairs). The measure of difference we use is the Kullback-Leibler divergence, a standard information-theoretic measure of the difference between probability distributions. The KL-divergence between two probability vectors $p$ and $q$ is given by the following formula:

$$D(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}.$$
Let $\mathcal{P}$ be the set of all source-destination pairs. For source-destination pair $\pi \in \mathcal{P}$, let $p_t^{\pi}$ be its distribution vector at round $t$. Then our stability metric is the following sum: $\sum_{\pi \in \mathcal{P}} D(p_t^{\pi} \,\|\, p_{t+1}^{\pi})$.
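The stability metric follows directly from these definitions; the function names below are ours:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stability_metric(prev_vectors, next_vectors):
    """Sum of KL divergences between successive distribution vectors,
    taken over all source-destination pairs."""
    return sum(kl_divergence(p, q) for p, q in zip(prev_vectors, next_vectors))

# Two source-destination pairs; only the first changed its distribution.
prev = [[0.5, 0.5], [0.9, 0.1]]
curr = [[0.6, 0.4], [0.9, 0.1]]
print(stability_metric(prev, curr))
```

The metric is zero exactly when no pair changes its distribution between rounds, so a decay toward zero over time indicates convergence to a fixed distribution.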
Optimality:
In Figures 4 and 5, we compare the network queuing delay incurred by interacting instances of the learning algorithm with that of optimal routing by plotting the empirical cumulative distribution function (CDF) of the optimality gap. We define the optimality gap as the ratio of the average queuing delay of Exp3 over the optimal average queuing delay. In Figure 4, the ratios correspond to the traffic matrices from our measurements in the Internet2 network. In Figure 5, the traffic matrices from our measurements have been scaled so that the maximum link utilization under static shortest-path routing according to operator-selected IS-IS weights reaches the target value. We plot the CDF of the optimality gap at three successive iteration counts. Observe that the optimality gap decreases with the iterations and almost vanishes at the largest iteration count. In fact, the optimality gap is a strictly decreasing function of the simulation step. This is shown in Figure 6, which plots the optimality gap on a semilogarithmic scale as a function of the simulation step for one particular traffic matrix (noting that the behavior in the figure is typical).
Translating simulation steps into real time depends on the measurement methodology. For example, if measurements are performed by randomly sampling inbound packets, each iteration corresponds to the time between successive samples, implying that the time lapse between iterations is short. Under such an iteration interval, the optimality gap shrinks quickly and becomes negligible after a modest number of iterations, as shown in Figure 4.
In summary, the performance of Exp3 is close to optimal under normal operation while at the same time Exp3 is able to counter malicious attacks, a property that previous optimal routing algorithms do not have.
It is worth noting that although Exp3 is designed to match the performance of the best path between source and destination, in our simulation, its performance approaches that of the best distribution vector. We have verified that in our setting the gap between optimal single-path and multipath routing is small. It is an interesting question whether Exp3 can compete against the best distribution vector in other Internet environments.
Stability: Figure 6 plots the previously defined stability metric on a logarithmic scale as a function of the iteration step for one traffic matrix (noting that the behavior in the figure is typical). The figure shows that the stability metric is a strictly decreasing function of the simulation step (that, in fact, drops about four orders of magnitude in the duration of the simulation), implying that the distribution vectors converge to a fixed distribution. Observe that the changes in the distribution vector from one iteration to the next have a diminishing effect on the network cost that becomes negligible after the cost reaches a plateau.
Traffic engineering: Traffic engineering refers to the process by which routing adapts to the traffic demands and can be performed either offline (e.g., see [12] for a survey) or online (e.g., [18,11]). Related to this paper are online traffic engineering protocols. However, these protocols are not resilient to attacks by compromised routers. Furthermore, as they are designed for deployment inside a routing domain, they are not general enough to be applicable to interdomain settings.
Regret minimization: From the perspective of an individual player, regret minimization has been studied in the general repeated game setting (e.g., [19,2,8]), and in more specific routing games. In such games, an algorithm must repeatedly choose a path between source and destination assuming an adversary controls the edge costs. Zero-regret algorithms in this setting appear in [17,5]. These studies laid the foundation for the approach proposed in this paper, but are not concerned with the stability or optimality of the interaction of zero-regret algorithms. A routing algorithm that minimizes regret from the perspective of a single source-destination pair is presented in [4], but this algorithm is not studied when multiple source-destination pairs interact.
Interaction of regret-minimizing algorithms: Previous work has studied whether zero-regret algorithms in a repeated game setting converge to a Nash equilibrium. In two-player zero-sum games, the outcome of the interaction is a minimax solution [13]. However, in general games, a Nash equilibrium may not be approachable in polynomial time by any polynomial-time algorithm [10]. Positive stability results are known for routing games in the Wardrop model of an infinite number of traffic sources each controlling infinitesimal traffic, in which convergence to a Nash equilibrium is shown in [6]. Different from this work, we consider finite traffic demands.
The optimality of the interaction of zero-regret algorithms is studied in [7] where it is shown that in the worst-case there is a high price in moving from centralized to independent routing decisions. Our findings suggest that, in practice, the optimality gap is small. Small optimality gap under selfish routing was previously observed in a related study on the price of anarchy in realistic Internet environments [20]. This study measured the optimality gap of Nash equilibria whereas we study the interaction of regret minimization algorithms.
In this paper, we proposed online learning algorithms as a framework for adding adaptivity to routing decisions and studied how such algorithms interact in a realistic Internet environment. We found that the outcome of the interaction is a stable state and that the optimality gap with respect to the network-wide optimum is small. We conclude that online learning may be a suitable framework for routing in the Internet.
This project has benefited from data collected in the Internet2 Observatory Project. We would like to thank Elliott Karpilovsky and the anonymous reviewers for their insightful comments. Ioannis Avramopoulos was with Princeton University while performing part of this work.