5th Annual Linux Showcase & Conference Abstract
Supermon: High-performance monitoring for Linux clusters
Ronald G. Minnich, University of Toronto
Abstract
At the ACL we are building tools that monitor a cluster,
anticipate cluster node failure, and take action before the
node fails. Actions include migrating the processes away
from the failing node, deconfiguring the node, and initiating
diagnostics on the node.
A key requirement of these tools is efficient, frequent
data collection from the nodes. Current
tools for monitoring Linux systems do not provide enough
information, and what information they do provide comes
slowly and at a high cost in CPU resource consumption
and network bandwidth. The resource consumption is an
even more serious problem in a cluster, where consumption
of network bandwidth is multiplied by the number of
network nodes.
In this paper we describe Supermon, a new monitoring
system for Linux clusters. Supermon functions as a
server, and hence can supply data over many connections
to many clients simultaneously. Supermon also replaces
the SunRPC interface used by Linux status daemons with
a very simple text-based command and response format.
Supermon has proven to be very effective for many different
types of clients, including Perl and Java programs.
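To make the idea of a server with a simple text-based command and response
format concrete, here is a minimal sketch in Python. It is not Supermon's
actual protocol or implementation: the LIST and GET commands, the sample
values, and the names handle_command, MonitorHandler, MonitorServer, and
query are all hypothetical, chosen only to show one line of text in, one
line of text out, with a separate thread per client connection.

```python
import socket
import socketserver
import threading

# Hypothetical sample data; a real monitor would read kernel counters instead.
SAMPLE = {"cpu_user": 12, "mem_free_kb": 204800}


def handle_command(line: str) -> str:
    """Map one text command to one text response (hypothetical commands)."""
    cmd = line.strip()
    if cmd == "LIST":
        # Names of all available variables, space-separated.
        return " ".join(sorted(SAMPLE))
    if cmd.startswith("GET "):
        name = cmd[4:].strip()
        if name in SAMPLE:
            return f"{name} {SAMPLE[name]}"
        return f"ERR unknown variable {name}"
    return "ERR unknown command"


class MonitorHandler(socketserver.StreamRequestHandler):
    def handle(self) -> None:
        # Answer newline-terminated commands until the client disconnects.
        for raw in self.rfile:
            reply = handle_command(raw.decode("ascii", errors="replace"))
            self.wfile.write((reply + "\n").encode("ascii", errors="replace"))


class MonitorServer(socketserver.ThreadingTCPServer):
    # One thread per connection, so many clients can be served at once.
    allow_reuse_address = True
    daemon_threads = True


def query(address, command: str) -> str:
    """Send one command line and return the one-line response."""
    with socket.create_connection(address) as sock:
        with sock.makefile("rwb") as stream:
            stream.write((command + "\n").encode("ascii"))
            stream.flush()
            return stream.readline().decode("ascii").rstrip("\n")


if __name__ == "__main__":
    # Port 0 asks the operating system for any free port.
    with MonitorServer(("127.0.0.1", 0), MonitorHandler) as server:
        threading.Thread(target=server.serve_forever, daemon=True).start()
        print(query(server.server_address, "GET cpu_user"))
        server.shutdown()
```

Because both requests and responses are plain lines of text, a client in any
language that can open a TCP socket and read a line could talk to a server of
this shape, which is consistent with the range of client types mentioned above.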
Our new system allows programs to gather cluster performance
data at up to 1000 samples/second for all the
nodes in a 128-node cluster, with no measurable impact
on cluster node performance.