5th Annual Linux Showcase & Conference — Abstract

Supermon: High-performance monitoring for Linux clusters

Ronald G. Minnich, University of Toronto

Abstract

At the ACL we are building tools that monitor a cluster, anticipate cluster node failure, and take action before the node fails. Actions include migrating the processes away from the failing node, deconfiguring the node, and initiating diagnostics on the node.

A key requirement for these tools is efficient, frequent data collection from the nodes. Current tools for monitoring Linux systems do not provide enough information, and what information they do provide comes slowly and at a high cost in CPU time and network bandwidth. The resource consumption is an even more serious problem in a cluster, where consumption of network bandwidth is multiplied by the number of nodes.

In this paper we describe Supermon, a new monitoring system for Linux clusters. Supermon functions as a server, and hence can supply data over many connections to many clients simultaneously. Supermon also replaces the SunRPC interface used by Linux status daemons with a very simple text-based command and response format. Supermon has proven to be very effective for many different types of clients, including Perl and Java programs. Our new system allows programs to gather cluster performance data at up to 1000 samples/second for all the nodes in a 128-node cluster, with no measurable impact on cluster node performance.
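
To illustrate the general idea of a monitoring server that answers simple text commands from many clients, here is a minimal Python sketch. It is not Supermon's actual protocol: the `SAMPLE` command, the `OK`/`ERR` reply prefixes, the `key=value` field layout, and the metric names are all assumptions made up for this example, and the metric values are hard-coded rather than read from the kernel.

```python
import socket
import socketserver
import threading


def read_metrics():
    # Hypothetical, hard-coded values; a real monitor would read kernel data.
    return {"load1": 0.42, "mem_free_kb": 123456}


class MonitorHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One text command per line; answer until the client disconnects.
        for raw in self.rfile:
            command = raw.decode("ascii", "replace").strip()
            if command == "SAMPLE":  # assumed command name
                fields = " ".join(
                    f"{key}={value}"
                    for key, value in sorted(read_metrics().items())
                )
                reply = "OK " + fields
            else:
                reply = "ERR unknown command"
            self.wfile.write((reply + "\n").encode("ascii"))


class MonitorServer(socketserver.ThreadingTCPServer):
    # One thread per connection, so many clients can be served at once.
    allow_reuse_address = True
    daemon_threads = True


def start_server(host="127.0.0.1", port=0):
    # Port 0 asks the OS for any free port; see server.server_address.
    server = MonitorServer((host, port), MonitorHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def sample(host, port, command="SAMPLE"):
    # Send one command and parse the single-line text reply into a dict.
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall((command + "\n").encode("ascii"))
        with conn.makefile("r", encoding="ascii") as reader:
            line = reader.readline().strip()
    status, _, rest = line.partition(" ")
    if status != "OK":
        raise RuntimeError(line)
    return dict(pair.split("=", 1) for pair in rest.split())
```

A client would call `start_server()`, read the chosen address from `server.server_address`, and then call `sample(host, port)` as often as needed. Because the format is plain text, a comparable client could be written in a few lines of Perl or Java without any RPC stubs.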

Last changed: 22 Aug. 2003 ch