9.1.1 LSAG in Day-to-day Operations
As mentioned earlier, the LSAG provides two data sources in real time:
messages about topology changes and anomalous behavior,
and network topology snapshots.
Both sources provide valuable insight into the health of a network.
We have developed a web-site for viewing LSAG messages, interacting with
network operators to make the site as simple and
user-friendly as possible. The web-site allows the operators to query
the LSAG message logs, generate statistics about the messages, and
navigate past archives. The web-site makes use of a configuration
management tool to map IP addresses into names. This web-site is now
used extensively by network support and operations on a regular basis,
and has proved invaluable during network maintenance to validate
maintenance steps as well as to monitor the impact of maintenance on
the network-wide behavior of OSPF.
Network operation groups also use the
LSAG messages for generating alarms by feeding them
into higher layer alerting systems.
This in turn allows correlation and grouping with other monitoring tools.
To prevent a deluge of alerts generated due to a high frequency of
LSAG messages, we have taken two steps.
First, we prioritize messages to help operators in the event of
``too many flashing lights''.
For example, the alerting system assigns a ``RTR DOWN'' message
a higher priority than a ``RTR UP'' message.
Second, we group multiple messages into a single alarm.
For example, a fiber cut can bring down a
number of adjacencies prompting the LSAG to generate several
``ADJACENCY DOWN'' messages.
We group all these messages into a single alert to prevent
a flurry of alerts for a single underlying event.
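The two steps above can be illustrated with a minimal sketch. It is not the alerting system's actual logic: the priority values, the 30-second grouping window, and the `Message` fields are assumptions made for illustration, and only the message names come from the text.

```python
from dataclasses import dataclass

# Hypothetical priorities (lower is more urgent). The text only states that
# ``RTR DOWN'' outranks ``RTR UP''; the other values are assumptions.
PRIORITY = {"RTR DOWN": 0, "ADJACENCY DOWN": 1, "RTR UP": 2, "ADJACENCY UP": 3}
DEFAULT_PRIORITY = 9


@dataclass
class Message:
    timestamp: float  # seconds since some epoch
    kind: str         # e.g. "ADJACENCY DOWN"
    subject: str      # router or link the message refers to


def group_into_alarms(messages, window=30.0):
    """Merge same-kind messages that arrive within `window` seconds of the
    previous message of that kind, then order the alarms by priority."""
    alarms = []
    open_alarm = {}  # kind -> most recent alarm of that kind
    for msg in sorted(messages, key=lambda m: m.timestamp):
        alarm = open_alarm.get(msg.kind)
        if alarm is not None and msg.timestamp - alarm["last"] <= window:
            # Same burst: fold the message into the existing alarm.
            alarm["subjects"].append(msg.subject)
            alarm["last"] = msg.timestamp
        else:
            alarm = {
                "kind": msg.kind,
                "priority": PRIORITY.get(msg.kind, DEFAULT_PRIORITY),
                "first": msg.timestamp,
                "last": msg.timestamp,
                "subjects": [msg.subject],
            }
            alarms.append(alarm)
            open_alarm[msg.kind] = alarm
    alarms.sort(key=lambda a: (a["priority"], a["first"]))
    return alarms
```

With this sketch, three ``ADJACENCY DOWN'' messages arriving within a few seconds of each other collapse into one alarm listing three subjects, and a ``RTR DOWN'' alarm sorts ahead of it.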
Network operators may change OSPF link weights from their design
values to carry out maintenance tasks. We have designed a
``link-audit'' web-site that allows operators to keep track of such
link weight changes. The web-site makes use of the topology snapshots
to display the set of links whose weights differ from the design weights.
This allows operators to validate the steps carried out for
maintenance. At the end of the maintenance interval, the web-site
also allows operators to verify that the weights of the affected links
have been reverted to their original values.
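The core comparison behind such a link audit can be sketched as follows. This is an illustration under assumptions, not the web-site's implementation: links are represented here as hypothetical `(router, router)` tuples mapped to integer weights, and links missing from the snapshot are simply skipped.

```python
def audit_link_weights(design, snapshot):
    """Return {link: (design_weight, current_weight)} for every link whose
    weight in the topology snapshot differs from its design weight.

    Both arguments map a link identifier to an OSPF weight. Links that are
    absent from the snapshot (for example, links that are down) are ignored
    in this sketch.
    """
    deviations = {}
    for link, design_weight in design.items():
        if link not in snapshot:
            continue
        current_weight = snapshot[link]
        if current_weight != design_weight:
            deviations[link] = (design_weight, current_weight)
    return deviations
```

During maintenance the result lists the links that were deliberately re-weighted; an empty result at the end of the maintenance interval indicates that all weights are back at their design values.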
Below we describe a few specific cases where the LSAG served to identify
network problems.
1. Internal problem in a crucial router:
The LSAG identified an intermittent hardware problem in
a crucial router in area 0 of
the enterprise network [7].
This problem resulted in episodes during which
the problematic router would drop and re-establish adjacencies
with other routers on the LAN.
Each episode lasted only a few minutes, and there were only a
few episodes each day.
The data suggest that during the episodes
the network was at risk of partitioning or was in
fact partitioned.
During these episodes, a second router failure could have
resulted in a catastrophic loss of connectivity.
Fortunately, a flurry of ``ADJACENCY UP'' and ``ADJACENCY DOWN''
messages recorded by the LSAG during each episode
helped operators identify the problem,
and perform preventative maintenance.
It is worth noting here that
this problem did not manifest in other network
management tools being used by the enterprise network.
2. External link flaps:
The LSAG helped identify a flapping external link in the enterprise
network [7].
One of the enterprise network routers (call it A)
maintains a link to a customer
premise router (call it B) over which it runs EIGRP.
Router A imports EIGRP routes into OSPF as external LSAs.
LSAG messages led to a
closer inspection of network conditions, which revealed that
the EIGRP session between A and B started flapping when the
link between A and B became overloaded.
This led to router A repeatedly
announcing and withdrawing EIGRP prefixes via external LSAs.
The flapping of the link between A and B recurred
nearly every day for months, between 9 PM and 3 AM.
The LSAG messages (``TYPE-5 ROUTE ANNOUNCED'' and ``TYPE-5 ROUTE
WITHDRAWN'') helped network operators identify and mitigate
the problem, though they could not eliminate it completely
because they did not have access to the
customer-premise router.
3. Router configuration problem:
In another case, the LSAG helped operators of the enterprise network
identify a configuration problem:
assignment of the same router-id to two routers.
This error caused both routers to repeatedly re-originate
their router LSAs, which showed up as a series of ``ADJACENCY
UP/DOWN'' LSAG messages.
4. Refresh LSA bug:
The LSAG helped identify
a bug in the refresh algorithm of the routers
from a particular vendor in the ISP network.
Under certain circumstances, the bug caused summary LSAs to be
refreshed much more often than the RFC-mandated [1]
interval of 30 minutes. The bug was identified through the
``LSA STORM'' messages generated by the LSAG.
At the time of writing this paper,
the vendor is investigating the bug.
It is worth noting that it would be impossible
to catch such a bug with any other class of available
network management tools.
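As an illustration of the kind of check that exposes the refresh bug in case 4, the sketch below flags LSAs that are refreshed much more often than the 30-minute interval. It does not reproduce how the LSAG decides to emit ``LSA STORM'' messages: the input format, the `tolerance` parameter, and the function name are assumptions, and the input is assumed to contain only refresh instances (re-originations with unchanged contents).

```python
LS_REFRESH_TIME = 1800.0  # seconds; the 30-minute refresh interval cited in the text


def fast_refreshers(arrivals, tolerance=0.5):
    """Return {lsa_id: shortest_gap_in_seconds} for every LSA whose shortest
    gap between consecutive refreshes is below tolerance * LS_REFRESH_TIME.

    `arrivals` is an iterable of (timestamp_in_seconds, lsa_id) pairs.
    """
    last_seen = {}  # lsa_id -> timestamp of the previous refresh
    shortest = {}   # lsa_id -> shortest gap observed so far
    for timestamp, lsa_id in sorted(arrivals, key=lambda a: a[0]):
        if lsa_id in last_seen:
            gap = timestamp - last_seen[lsa_id]
            if gap < shortest.get(lsa_id, float("inf")):
                shortest[lsa_id] = gap
        last_seen[lsa_id] = timestamp
    threshold = tolerance * LS_REFRESH_TIME
    return {lsa_id: gap for lsa_id, gap in shortest.items() if gap < threshold}
```

An LSA refreshed every 1800 seconds is not reported, whereas one refreshed every 60 seconds is reported with a shortest gap of 60.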
aman shaikh
2004-02-07