 Published on O'Reilly (http://oreilly.com/)


Enterprise-Wide Network Management with OpenNMS

by Tarus Balog
09/08/2005

Network management is hard. Because the average end user considers "the network" to be anything on the other side of the keyboard, a good network manager needs to understand networking hardware, such as routers and switches, as well as server hardware, operating systems, and applications.

The work involved in keeping a network running increases exponentially with its size. While it is somewhat easy to manage 10 devices, it is much harder to manage 100 or 1,000. Enterprises often consist of tens of thousands of devices, and in the past only expensive commercial applications were up to the task. From its conception, OpenNMS has been an enterprise-grade network management solution developed under the open source model. Its aim is to provide a viable alternative to products such as Hewlett-Packard's OpenView, CA Unicenter, and Micromuse Netcool.

Why Open Source?

Because the scope of network management is so large, it is impossible for one product to do it all. Usually an enterprise must buy several applications and then pay a consulting company to glue them all together with scripts, web pages, and the like. The problem arises when one or more of those applications changes and the customizations no longer work. Rarely is there any code management for the glue, and companies must either hold off on upgrades or repeatedly pay for consulting.

Enter OpenNMS, a platform that allows users to add network management features over time. Anyone with a useful piece of management code is welcome to submit it to the project; the application design itself allows for expansion.

Its open source nature is also useful when problems arise. Anyone who has spent time managing a network knows that vendors don't always follow specifications properly. In one example, a switch vendor was sending an improperly formatted Simple Network Management Protocol (SNMP) trap. Basically, a zero had been left off an Object ID (OID), and this caused OpenNMS to discard the trap. Someone opened a support ticket with the vendor, including the Ethereal trace and the RFC showing the required information--but when the vendor did not respond, the OpenNMS developers wrote a quick modification to work around the problem. A fix that would have taken a commercial application vendor weeks to deliver took only hours for OpenNMS.

Two versions of OpenNMS are always available--a production (stable) release and a development (unstable) release. Because networks differ so much, having OpenNMS freely available means that it gets tested in more scenarios than a commercial vendor could undertake in a lab. By the time the development release is ready to become production, the application has become very robust.

What Does OpenNMS Actually Do?

Currently OpenNMS focuses on three areas: service polling, data collection, and event management.

When OpenNMS began, the management buzzword was SLA (service-level agreement). People wanted to understand how available their network services were. The accepted method was to take data from their management system, such as whether the device responded to ping and perhaps some information gathered from SNMP, and then try to perform an SLA calculation. It was kind of like trying to determine the fuel economy of your car by measuring the temperature of the exhaust manifold and the speed of your fuel pump.

OpenNMS took a simpler and more accurate route. The application simulates a user, so if OpenNMS is testing the availability of a web server, it retrieves a web page. For a DNS server, it performs a lookup. In other words, it determines how far the car went and how much fuel it used. Due to the low cost of OpenNMS (that is, free), it is possible to have multiple instances in remote locations measuring service levels. This means that the Geneva office can measure its service level from Geneva's point of view, not that of the data center in New York.
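A minimal sketch of this user-simulating style of check, written in Python rather than the Java that OpenNMS itself uses. The function names check_http and check_dns are hypothetical and are not part of OpenNMS. Note that check_dns goes through the operating system's resolver; querying one specific DNS server, as a real DNS poller would, needs a DNS library and is left out here.

```python
import socket
import time
import urllib.error
import urllib.request


def check_http(url, timeout=5.0):
    """Fetch a page the way a browser would; return (is_up, seconds_taken)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            is_up = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, name resolution failure,
        # or an HTTP error status such as 404 or 500.
        is_up = False
    return is_up, time.monotonic() - start


def check_dns(hostname):
    """Resolve a name via the system resolver; return (is_up, seconds_taken)."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(hostname, None)
        is_up = True
    except socket.gaierror:
        is_up = False
    return is_up, time.monotonic() - start
```

Each check reports both the result and the elapsed time, so the same call can feed an availability calculation and a response-time graph.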

Out of the box, OpenNMS monitors more than 25 services, including HTTP and HTTPS, DNS and DHCP, and even Citrix and Radius. Even a ping is the ICMP service. By default, it tests each service every 5 minutes, which is similar to other network management products. OpenNMS also has an interesting feature called a downtime model, in which it can change its polling frequency when it detects an outage.

Suppose you need the ability to measure a service level of 99.99 percent availability over a month. This equates to about 4 minutes and 20 seconds of allowable downtime. However, if your software polls the service only once every 5 minutes, even a single detected outage will register as an SLA violation (because the shortest outage measurable will be 5 minutes long). OpenNMS addresses this by temporarily shortening the polling interval to 30 seconds when it detects an outage. However, after 5 minutes it goes back to a 5-minute cycle, and it backs off even further the longer the outage lasts. (All of this is configurable.) Thus, OpenNMS could detect multiple 30-second outages that would fall within the SLA for the month.
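The arithmetic and the downtime model can be sketched in a few lines of Python. This is an illustration only, not OpenNMS code: the 30-day month, the 12-hour boundary, and the 10-minute back-off interval are assumptions chosen for the example, and the names are hypothetical.

```python
def allowed_downtime_seconds(availability_percent, period_days=30):
    """Downtime budget for a given availability target over the period."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)


# (outage age from, outage age to, polling interval), all in seconds.
# The last two rows use illustrative values; real settings are configurable.
DOWNTIME_MODEL = [
    (0, 300, 30),        # first 5 minutes of an outage: poll every 30 s
    (300, 43200, 300),   # then back to the normal 5-minute cycle
    (43200, None, 600),  # long outages: back off further
]

NORMAL_INTERVAL = 300  # service is up: poll every 5 minutes


def poll_interval(outage_age_seconds):
    """Return the polling interval; pass None when the service is up."""
    if outage_age_seconds is None:
        return NORMAL_INTERVAL
    for begin, end, interval in DOWNTIME_MODEL:
        if outage_age_seconds >= begin and (end is None or outage_age_seconds < end):
            return interval
    return NORMAL_INTERVAL


budget = allowed_downtime_seconds(99.99)  # about 259.2 seconds
```

With a 30-day month the budget is about 259 seconds, which is less than one 300-second polling interval but leaves room for eight outages measured at 30 seconds each.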

OpenNMS can generate events when it detects outages and their resolutions, as well as availability reports on the entire network or specific subgroups of services.

In addition to service polling, OpenNMS can collect SNMP data from network devices running SNMP agents. It stores the data using RRDTool or JRobin and can display it as reports in the web-based user interface (webUI). Configurable thresholds (for example, on disk space and CPU utilization) generate events when the collected values cross them.

One important aspect of data collection on the scale of an enterprise is the need to automate as much of it as possible. It is very difficult to configure data collection on 20,000 devices manually. OpenNMS has the concept of a "system," defined by a particular System Object ID (systemOID), which matches devices with the data to collect from them. Thus, when OpenNMS discovers a Cisco router or a Windows server, it automatically begins data collection without operator intervention.
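One way to picture this matching is as a longest-prefix lookup on the systemOID. The sketch below is an assumption about the idea, not the OpenNMS implementation. The enterprise prefixes for Cisco (.1.3.6.1.4.1.9) and Microsoft (.1.3.6.1.4.1.311) are the registered ones, but the collection group names and the function name are hypothetical.

```python
# Hypothetical mapping from systemOID prefix to the data groups to collect.
GROUPS_BY_SYSOID_PREFIX = {
    ".1.3.6.1.4.1.9.": ["mib2-interfaces", "cisco-cpu", "cisco-memory"],
    ".1.3.6.1.4.1.311.": ["mib2-interfaces", "windows-host"],
}
DEFAULT_GROUPS = ["mib2-interfaces"]


def groups_for(sys_object_id):
    """Pick collection groups using the longest matching systemOID prefix."""
    # Normalize to ".a.b.c." so that ".9." cannot match ".99."
    oid = "." + sys_object_id.strip(".") + "."
    matches = [p for p in GROUPS_BY_SYSOID_PREFIX if oid.startswith(p)]
    if not matches:
        return DEFAULT_GROUPS
    return GROUPS_BY_SYSOID_PREFIX[max(matches, key=len)]
```

A newly discovered device that reports a systemOID under the Cisco prefix would pick up the Cisco groups with no operator involvement, while an unknown vendor would fall back to the default set.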

Currently, OpenNMS can collect more than 200,000 data points from 22,000 devices once every 5 minutes--a rate of approximately 2.4 million data points per hour. This limit is due to the speed at which it can write the data to disk; the collection itself takes under 2 minutes.
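The hourly figure follows directly from the numbers above. As a quick check in Python (the function name is hypothetical, and because the article says "more than" 200,000 data points, the results are lower bounds):

```python
def collection_rates(data_points, devices, cycle_minutes):
    """Derive throughput figures from one collection cycle."""
    cycles_per_hour = 60 / cycle_minutes
    points_per_hour = data_points * cycles_per_hour
    return {
        "points_per_hour": points_per_hour,
        "points_per_second": points_per_hour / 3600,
        "points_per_device": data_points / devices,
    }


rates = collection_rates(data_points=200_000, devices=22_000, cycle_minutes=5)
```

This gives 2.4 million data points per hour, a sustained average of roughly 667 writes per second, and about 9 data points per device.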

The last main functional area is event management and notifications. OpenNMS generates events for conditions such as detected outages and exceeded thresholds. In addition, it can receive and display external events such as SNMP traps. There are also numerous other ways of getting events into OpenNMS. An included Perl script, send-event.pl, gives even novices an easy way to start using OpenNMS as their main event manager.

One user configured a special emergency email address via procmail: the contents of any message sent to that address were turned into an OpenNMS event and passed to the application via send-event.pl, which then generated a notification.

Notifications trigger when OpenNMS detects specific events. Each one performs a notification action such as sending an email, page, or SMS. You can use anything that can run from the command line to send an OpenNMS notification. Notifications walk a "path" that will escalate the issue until someone acknowledges it. Thus the first step can be to send an email, and if no one acknowledges it within 5 minutes, the second step is to send a page. If no one acknowledges that page within, say, 10 minutes, the third step might be to page a manager, and so on. Notifications can also be auto-acknowledged.

For example, if a web server experiences an outage and OpenNMS later detects its resolution, it will auto-acknowledge all notifications based on that event. In addition, it is possible to set an initial delay to suppress notifications for the first few minutes of an outage, in order to give the network a chance to correct itself. For those who carry a pager 24/7, this can mean the difference between a bad and a good night's sleep.
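The escalation path, acknowledgment, and initial delay described above can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not how OpenNMS implements notifications: the Step class, the steps_fired function, and the targets are all hypothetical, and times are whole minutes measured from the start of the outage.

```python
from dataclasses import dataclass


@dataclass
class Step:
    delay_minutes: int  # minutes after the notification starts
    action: str         # for example "email" or "page"
    target: str


# Email at once, page after 5 minutes, page a manager 10 minutes after that.
PATH = [
    Step(0, "email", "on-call"),
    Step(5, "page", "on-call"),
    Step(15, "page", "manager"),
]


def steps_fired(path, acknowledged_after=None, initial_delay=0):
    """Return the steps sent before the notification is acknowledged.

    acknowledged_after is the minute of a manual or automatic
    acknowledgment, or None if nobody ever acknowledges.
    """
    fired = []
    for step in path:
        send_time = initial_delay + step.delay_minutes
        if acknowledged_after is not None and acknowledged_after <= send_time:
            break  # acknowledged before this step was due
        fired.append(step)
    return fired
```

For instance, steps_fired(PATH, acknowledged_after=7) returns the email and the first page but not the manager page, and an outage that resolves after 2 minutes with initial_delay=3 sends nothing at all.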

Installing OpenNMS

OpenNMS currently supports most Linux distributions, Solaris, and Mac OS X. It does have several dependencies, so the best thing to do is to read the detailed installation guide. OpenNMS is written mostly in Java and requires a Java 1.4 SDK.

Because Java 1.4 does not have an ICMP API, OpenNMS uses JNI to access a small portion of code written in C. Java 1.5 addresses this issue. Moving OpenNMS to 100 percent Java is on the road map.

Now What?

The enterprise nature of OpenNMS means that it must be highly configurable, and the sheer number of configuration options is often overwhelming to the new user. To try out OpenNMS, you need only to modify two files.

OpenNMS stores its configuration files in the $OPENNMS_HOME/etc directory, usually /opt/OpenNMS/etc or /etc/opennms. In that directory is a file called discovery-configuration.xml and another called snmp-config.xml.

The first file controls what OpenNMS will discover, and it looks like this:

<discovery-configuration threads="1" packets-per-second="1"
        initial-sleep-time="300000" restart-sleep-time="86400000"
        retries="3" timeout="800">
        <include-range retries="2" timeout="3000">
                <begin>192.168.0.1</begin>
                <end>192.168.0.254</end>
        </include-range>
        <include-url>/opt/OpenNMS/etc/include</include-url>
</discovery-configuration>

This file shows some examples of what OpenNMS can use for discovery. An OpenNMS How-To guide covers all of the options. Basically, OpenNMS performs a ping sweep of the addresses and address ranges specified in this file. For each reply it receives, it will generate a newSuspect event to pass to the capabilities daemon for service scanning.

In this case, OpenNMS will scan from 192.168.0.1 to 192.168.0.254, as well as all of the IP addresses listed in the /opt/OpenNMS/etc/include file.

If you need to monitor two subnets, use multiple include-range tags. The following configuration will scan the 192.168.1.0 subnet and 172.20.2.0 subnet:

<discovery-configuration threads="1" packets-per-second="1"
        initial-sleep-time="300000" restart-sleep-time="86400000"
        retries="3" timeout="800">
        <include-range>
                <begin>192.168.1.1</begin>
                <end>192.168.1.254</end>
        </include-range>
        <include-range>
                <begin>172.20.2.1</begin>
                <end>172.20.2.254</end>
        </include-range>
</discovery-configuration>
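To see how many addresses such a configuration covers, the include-range elements can be read with a short Python script. This is a sketch for checking a file by hand, not part of OpenNMS; the function names are hypothetical, and it only handles the include-range elements shown in this article.

```python
import ipaddress
import xml.etree.ElementTree as ET

# The two-subnet configuration from above.
CONFIG = """
<discovery-configuration threads="1" packets-per-second="1"
        initial-sleep-time="300000" restart-sleep-time="86400000"
        retries="3" timeout="800">
        <include-range>
                <begin>192.168.1.1</begin>
                <end>192.168.1.254</end>
        </include-range>
        <include-range>
                <begin>172.20.2.1</begin>
                <end>172.20.2.254</end>
        </include-range>
</discovery-configuration>
"""


def include_ranges(xml_text):
    """Return (begin, end) address pairs for every include-range element."""
    root = ET.fromstring(xml_text)
    ranges = []
    for node in root.findall("include-range"):
        begin = ipaddress.ip_address(node.findtext("begin").strip())
        end = ipaddress.ip_address(node.findtext("end").strip())
        ranges.append((begin, end))
    return ranges


def address_count(ranges):
    """Count the addresses covered, with both endpoints included."""
    return sum(int(end) - int(begin) + 1 for begin, end in ranges)


total = address_count(include_ranges(CONFIG))  # 254 + 254 addresses
```

If packets-per-second="1" is read as a simple rate limit (an assumption here), a first pass over these 508 addresses would take at least eight and a half minutes, before counting retries.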

Note that once it discovers a machine, OpenNMS will perform a directed port scan on the device to discover what services it provides. If you have portsentry or other detection software installed, make sure to exempt the OpenNMS server from any blocks.

The second file to modify is snmp-config.xml. It controls the community string to use for SNMP queries.

<snmp-config retry="3" timeout="800"
        read-community="public" write-community="private">
        <definition version="v1">
                <specific>192.168.0.5</specific>
        </definition>
        <definition read-community="topsecret">
                <range begin="192.168.1.1" end="192.168.1.254"/>
                <range begin="172.20.2.1" end="172.20.2.254"/>
                <specific>172.20.3.1</specific>
        </definition>
</snmp-config>

The first line lists the default read community string as public and the default write community string as private. At the moment, OpenNMS does not support the SNMP SET command, so it does not use the write community string and you do not have to configure it.

Override the default values via <definition> tags. The first tag forces OpenNMS to use SNMP version 1 for 192.168.0.5. (This is sometimes necessary when the device improperly supports SNMPv2c.) OpenNMS will use version 2c if available but will also work if your hardware supports only version 1.

The second definition overrides the default read community string of public with topsecret for the 192.168.1.0 and 172.20.2.0 subnets and the specific device at 172.20.3.1.

If only one community name is in use on the network, placing it in the first line in place of public is sufficient. You can also leave it as public if you haven't changed it on your agents.
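The override logic can be checked against the example file with a small Python script. This is a sketch of how the defaults and <definition> tags combine as described above, not the OpenNMS implementation; the function names are hypothetical, and using the first matching definition is an assumption.

```python
import ipaddress
import xml.etree.ElementTree as ET

# The snmp-config.xml example from above.
SNMP_CONFIG = """
<snmp-config retry="3" timeout="800"
        read-community="public" write-community="private">
        <definition version="v1">
                <specific>192.168.0.5</specific>
        </definition>
        <definition read-community="topsecret">
                <range begin="192.168.1.1" end="192.168.1.254"/>
                <range begin="172.20.2.1" end="172.20.2.254"/>
                <specific>172.20.3.1</specific>
        </definition>
</snmp-config>
"""


def _matches(definition, ip):
    """True if the address is named by a specific or range element."""
    for specific in definition.findall("specific"):
        if ipaddress.ip_address(specific.text.strip()) == ip:
            return True
    for rng in definition.findall("range"):
        begin = ipaddress.ip_address(rng.get("begin"))
        end = ipaddress.ip_address(rng.get("end"))
        if begin <= ip <= end:
            return True
    return False


def snmp_settings(xml_text, address):
    """Start from the top-level defaults, then apply the first matching definition."""
    root = ET.fromstring(xml_text)
    ip = ipaddress.ip_address(address)
    settings = {key: root.get(key) for key in ("read-community", "version")}
    for definition in root.findall("definition"):
        if _matches(definition, ip):
            for key in settings:
                if definition.get(key) is not None:
                    settings[key] = definition.get(key)
            break
    return settings
```

Here snmp_settings(SNMP_CONFIG, "192.168.0.5") keeps the public community but forces version v1, while any address in the two listed ranges gets topsecret. A version of None means no version was forced for that address.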

After you have modified these two files, start OpenNMS with the command:

$OPENNMS_HOME/bin/opennms.sh start

where $OPENNMS_HOME is usually /opt/OpenNMS. For Linux distributions that support /sbin/service, use the command:

service opennms start

Similarly for Debian, use:

invoke-rc.d opennms start

Finally, ensure that the Tomcat servlet container is running, and then point a browser to:

http://[OpenNMS Server]:8080/opennms (or http://[OpenNMS Server]:8180/opennms for Debian)

and log in with a username of admin and a password of admin. Within 5 minutes, OpenNMS should have started to discover the network.

What's Next?

Like many open source projects, OpenNMS is a little light on documentation. The OpenNMS How-To guides are a good place to start, and the OpenNMS Wiki provides additional information. There is a large community supporting the project, and the best way to contact it is via the OpenNMS mailing lists. Be sure to read the OpenNMS mailing list FAQ first.

The current development version of OpenNMS adds support for SNMP version 3, as well as a new JMX data collector for gathering information from sources such as JBoss and Tomcat. It also introduces a new "alarms" subsystem, which improves event management by reducing duplicate events and adding some automated event manipulation (such as increasing severity over time). Also planned is support for the Nagios Remote Plugin Executor (NRPE) via a 100 percent Java-based NRPE client.

Tarus Balog has more than 15 years of network management experience in the telecom and datacom industries.


Copyright © 2009 O'Reilly Media, Inc.