Effective Interface Monitoring with iSPI for Metrics

While iSPI for Metrics is useful for graphing, arguably its more critical use is generating alerts when things are not working as expected.

I found today that Juniper does not include “FCS Lan Errors” in the ifInErrors counter, which made tracking down an issue take a bit longer than usual. This is what I’m talking about:

Interface: ge-1/1/0, Enabled, Link is Up
Encapsulation: Ethernet, Speed: 1000mbps
Traffic statistics:                                           Current delta
  Input bytes:            42120054608900 (6704304 bps)           [16457880]
  Output bytes:           33903229061248 (11640616 bps)          [38429868]
  Input packets:             60739200100 (3676 pps)                 [82119]
  Output packets:            51621757514 (5316 pps)                [125843]
Error statistics:
  Input errors:                        0                                [0]
  Input drops:                         0                                [0]
  Input framing errors:          1435325                            [12231]
  Policed discards:                    0                                [0]
  L3 incompletes:                      0                                [0]
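
The framing/FCS errors never roll up into ifInErrors here, so a threshold on the standard error counter alone would have missed this. If you want to cross-check what the standard counter is actually reporting over SNMP, something like the following works. This is a rough sketch using net-snmp; the hostname “router1”, the community string “public”, and the ifIndex 512 are all placeholders:

# Find the ifIndex for ge-1/1/0 (hostname and community are placeholders)
snmpwalk -v2c -c public router1 IF-MIB::ifDescr | grep ge-1/1/0

# Read the standard error counter for that ifIndex (512 is assumed from the walk above)
# and compare it with the "Input framing errors" counter shown in the Juniper CLI output
snmpget -v2c -c public router1 IF-MIB::ifInErrors.512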

For effective monitoring of critical WAN interfaces, you can create an interface group using, for example, a filter on interface name (or on ifAlias, i.e. the description, if you have a naming standard), and then reference that group in a monitoring policy with the thresholds you want. Alternatively, you can use a node-group-based policy, bearing in mind that you’ll be alerted for all interfaces on the nodes in that group.

Example:
[Screenshot: interface threshold settings]

You’ll need to enable Interface Fault Polling and Interface Performance Polling for this to work, and you may also want to consider lowering the default 5-minute polling interval for critical infrastructure.

I tend to set the trigger count to 1 and the rearm count higher (at least 2), so that when the error condition clears the incident doesn’t just disappear before someone has had a chance to look at it.

As a side note, you’ll also want to set generic thresholds. For example, for “Routers”, at least the following:

[Screenshot: generic router threshold settings]

NNMi 9: Enabling Node Component Threshold Alerts (iSPI for Metrics)

NNMi 9 (with iSPI for Metrics) allows threshold alerting on node components as well as interface statistics; however, this needs to be set up at the interface or node level within the Monitoring Configuration menus.

Example 1: Enabling CPU Threshold Alerting
Configuration > Monitoring Configuration

[Screenshot: Monitoring Configuration, Node Settings tab]

Edit Routers (or whichever monitoring entry refers to the node group you want) and, in the Threshold Settings tab, add a new count-based threshold setting.

[Screenshot: new threshold setting form]

Select CPU 5Min Utilization and set the thresholds as desired. It’s best to disable the low value completely by setting Low Value and Low Value Rearm to 0, as we’re only concerned with a high value. You may want to set the High Trigger Count to 2 or more rather than 1, since config writes and other normal activity can cause CPU spikes.

[Screenshot: CPU 5Min Utilization threshold settings]

Now Save and Close down to the Monitoring Configuration menu, then either Save or Save and Close from there.
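
As a side note, if you want to confirm from the command line that the new threshold actually landed in the active configuration, the config export tool can dump the monitoring settings to XML. A rough sketch, assuming the monitoring configuration area is named “monitoring” on your version (check the nnmconfigexport.ovpl reference page for the valid -c values):

/opt/OV/bin/nnmconfigexport.ovpl -c monitoring -f /tmp/monitoring-config.xml
# then eyeball the export for the new CPU threshold
grep -i "cpu" /tmp/monitoring-config.xml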

Example 2: Interface Error Thresholds with iSPI for Metrics
Configuration > Monitoring Configuration > Interface Settings
Edit the desired interface configuration:

[Screenshot: Interface Settings configuration form]

Add a new count-based threshold, select Input Error Rate, and set it as follows to alert on any number of errors in any given poll.

[Screenshot: Input Error Rate threshold settings]

Save and Close back to Monitoring Configuration, then Save again. Repeat this for Output Error Rate, then Save and Close from the Monitoring Configuration menu.

To verify that Interface Error threshold monitoring is working, you can go to Monitoring > Interface Performance like so:

[Screenshot: Monitoring > Interface Performance view]

Now you get alerts like the following in the alarm browser!

[Screenshot: threshold incidents in the alarm browser]

NNMi Node “No Status” Despite Monitoring Settings

NNMi 9.22: On occasion, a node in NNMi will absolutely refuse to come out of “No Status” despite belonging to a valid monitoring configuration group. A config poll and status poll both complete successfully, but this is never reflected in the map view or node inventory.

Note that after trying these steps you should give NNMi at least 3x your polling interval (e.g. 15 minutes) to sort itself out. Also, take a backup BEFORE doing any of this, to be on the safe side.
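
For the backup, the bundled nnmbackup.ovpl is the easiest option. A hedged example: the flags shown here (-type, -scope, -target) are what I recall from the 9.x reference pages, so check them on your install, and /var/tmp/nnm-backups is just a placeholder directory:

# Online backup of everything to a placeholder target directory
/opt/OV/bin/nnmbackup.ovpl -type online -scope all -target /var/tmp/nnm-backups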

Solution:

1) Create a new Node Group called “Monitoring Policy Temp Bucket” or something similar, and add the problem node(s) to this group.

2) Add this node group as the first entry in Monitoring Configuration > Node Settings. Ensure that you enable SNMP polling, Management Address Polling, and all of the SNMP Fault Monitoring checkboxes. Changing the Fault Polling interval to 1 minute will also help.

3) Check the node’s Configuration Details > Monitoring Settings to confirm it has picked up the temp bucket, then do a config poll, followed by a status poll once that finishes.

4) Move the node(s) out of the temporary “bucket”, check the monitoring configuration again, and retry the config/status polls.

If this doesn’t work (and you’ve waited at least 15 minutes), delete the node(s) and reload them from the command line with nnmloadseeds.ovpl -n [ip address of node]. Wait another 15 minutes.
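
For reference, the reseed step looks like this. The address 10.1.2.3 is a placeholder management address, and if you have several nodes to reload, the -f option with a seed file should also work, though check the reference page on your version:

# Reload a single node by management address (placeholder IP)
/opt/OV/bin/nnmloadseeds.ovpl -n 10.1.2.3

# Or reload several nodes from a seed file, one address/hostname per line (placeholder path)
/opt/OV/bin/nnmloadseeds.ovpl -f /tmp/reseed-nodes.txt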

If that hasn’t fixed it, there is one more drastic thing that has worked in the past: click on the node in the map view and, in the status panel, use the Device Profile link to edit the device profile for that device. Change it (e.g. from Device Category Switch to Device Category Router, Author: Customer), save, run a config/status poll, then change the device profile settings back again. This will almost certainly fix the problem. I have no idea why, given that when this occurs it doesn’t happen to all devices with the same profile!

HP NNMi Connection Editor

As there’s no easy way to add or delete links from the NNMi GUI, and some people are averse to editing XML files by hand, here is a simple connection editor that generates the required XML. If you’re logged in as a normal user, run nnmsetcmduserpw.ovpl first; it saves a lot of hassle typing the username and password each time, or having to run things as root via sudo.

From 9.22 onwards, it seems that nnmconnedit.ovpl doesn’t like reading files in folders that aren’t owned by root (such as home dirs) – weird!

/tmp works fine though, which is why the generated files are saved there.

This will also allow you to deal with problematic “cloud” connections (usually from FDB discovery) by specifying more than 2 endpoints.

#!/bin/sh
#
# XML Corrections file builder for NNMI
#
# sol@subnetzero.org
#
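# defineconn: prompt for one connection (operation, endpoint count, and node/interface
# pairs) and append it to the XML string being built up in $XML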
defineconn() {
OPER=""
CONNS=0
while [ -z "$OPER" ];
do
   printf "[a]dd or [d]elete? "
   read RESP
   case $RESP in
        a)      OPER="add";;
        d)      OPER="delete";;
        *)      echo "Unknown option.";;
   esac
   echo "operation: $OPER"
done

while [ "$CONNS" -eq "0" ];
do
   printf "Number of endpoints (default 2)? "
   read CONNS
   case $CONNS in
        [2-9])  echo "Endpoints set to $CONNS";;
        *)      CONNS=2 ;
                echo "Endpoints set to $CONNS";;
   esac
   echo "operation: $OPER"
done

XML="$XML
   <connection>
       <operation>$OPER</operation>"
ELEM=1
while [ "$ELEM" -le "$CONNS" ];
do
   printf "     Node$ELEM:"; read NODE
   printf "Interface$ELEM:"; read INTF

   XML="$XML
       <node>$NODE</node>
       <interface>$INTF</interface>"

ELEM=$(( $ELEM + 1 ))
done

XML="$XML
    </connection>"
}

##############
#
# Starts here

OUTFILE=/tmp/connections_$USER.xml
XML="<connectionedits>"

echo " *** NNMI Connection Edit XML Generator *** "
defineconn

while [ -z "$FINISHED" ];
do
printf "define another? (y/n): "; read YESNO
case $YESNO in
        y|Y)    defineconn;;
        n|N)    FINISHED=true;;
        *)      echo "Aborting";
                exit;;
esac
done

printf "Closing connectionedits tag\n"
XML="$XML
</connectionedits>"

echo "$XML" > $OUTFILE
echo "Completed. XML is written to $OUTFILE"
echo "Run /opt/OV/bin/nnmconnedit.ovpl -f $OUTFILE"

Example output, deleting a connection that was decommissioned.

[sol@nnmi-server ~]$ nnmiconntool
 *** NNMI Connection Edit XML Generator ***
[a]dd or [d]elete? d
operation: delete
Number of endpoints (default 2)?
Endpoints set to 2
operation: delete
     Node1:NYC04A01
Interface1:Gi2/47
     Node2:NYC04B01
Interface2:Gi5/2
define another? (y/n): n
Closing connectionedits tag
Completed. XML is written to /tmp/connections_sol.xml
Run /opt/OV/bin/nnmconnedit.ovpl -f /tmp/connections_sol.xml
[sol@nnmi-server ~]$  /opt/OV/bin/nnmconnedit.ovpl -f /tmp/connections_sol.xml
Connection 1 was successfully deleted.

[sol@nnmi-server ~]$
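
For reference, the file produced by that session should look like the following; it is simply the corrections format that nnmconnedit.ovpl reads, with the node and interface names taken from the example above:

<connectionedits>
   <connection>
       <operation>delete</operation>
       <node>NYC04A01</node>
       <interface>Gi2/47</interface>
       <node>NYC04B01</node>
       <interface>Gi5/2</interface>
    </connection>
</connectionedits>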

nnmcluster failure and issues

NNMi nnmcluster sometimes does not behave as expected. The following gotchas and procedures may be helpful.

Note that in ALL cases, you should have a recent backup in place in case of unexpected results.

First Scenario: nnmcluster is effectively “disconnected” on the primary, showing no status whatsoever.

The secondary has become active, yet the primary still functions as normal (ovstatus -c shows everything fine and it works as usual). The nnmcluster command suggests that only the remote member is active, and you are now unable to shut down cleanly.
Solution:

As nnmcluster -shutdown and nnmcluster -halt won’t respond on the primary (the system thinks it is not in cluster mode), run nnmcluster -shutdown on the secondary, then do the following on the primary:

vi /var/opt/OV/shared/nnm/conf/props/nms-cluster.properties

Comment out the cluster name with #, then run

ovstop

BE PATIENT as it may take a few minutes to shut down.

Still on the primary, UNcomment the cluster name in the nms-cluster.properties file.

Move the sentinel file

mv /var/opt/OV/shared/nnm/node-cluster.sentinel /var/opt/OV/shared/nnm/node-cluster.sentinel.orig

Check whether nnmcluster is still running and holding the required port (it will essentially be a “detached” process at this point):

lsof -i :7810

KILL the PID with kill -9 [PID]

Then run

nnmcluster -daemon

Run nnmcluster, wait for it to reach the active state, then run nnmcluster -daemon on the standby.
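
Condensed into one place, the recovery sequence looks roughly like this. It is just the steps above collected into one listing; the paths are the standard Linux install locations, and <PID> is whatever lsof reports:

# On the SECONDARY first:
nnmcluster -shutdown

# Then on the PRIMARY:
vi /var/opt/OV/shared/nnm/conf/props/nms-cluster.properties   # comment out the cluster name
ovstop                                                        # be patient, this can take several minutes
vi /var/opt/OV/shared/nnm/conf/props/nms-cluster.properties   # uncomment the cluster name again
mv /var/opt/OV/shared/nnm/node-cluster.sentinel \
   /var/opt/OV/shared/nnm/node-cluster.sentinel.orig
lsof -i :7810                                                 # look for a detached nnmcluster process
kill -9 <PID>                                                 # only if something is still holding the port
nnmcluster -daemon                                            # rejoin the cluster as a daemon
nnmcluster                                                    # wait here until the primary shows as active

# Finally, on the STANDBY:
nnmcluster -daemon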

Second Scenario: Unable to get both members up in the cluster.

i) Check that the cluster name is the same in both config files, and ensure that NO whitespace is trailing the cluster name in either file (see the quick check at the end of this section).

ii) Run nnmcluster -shutdown on both members.

iii) Follow the steps above to find the PID bound to port 7810:

lsof -i :7810

Kill this PID if it exists, then restart both sides. If this still doesn’t work, repeat the process, deleting the /var/opt/OV/shared/nnm/node-cluster.sentinel file on both servers before the restart.
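
For the check in step i), a quick way to do it from the shell on both servers is below. The property name com.hp.ov.nms.cluster.name is what I’ve seen in nms-cluster.properties, so verify it against your own file:

# Show the cluster name line on each server and compare them
grep -n "com.hp.ov.nms.cluster.name" /var/opt/OV/shared/nnm/conf/props/nms-cluster.properties

# Make trailing whitespace visible (anything between the name and the $ end-of-line marker is a problem)
grep "cluster.name" /var/opt/OV/shared/nnm/conf/props/nms-cluster.properties | cat -A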