HP NNMi9 False Redundancy Group alerts/Cisco Nexus Duplicate Discoveries

It seems that HP are getting around to fixing the issue of duplicate Cisco Nexus nodes being discovered (the fix is not in patch 5), but until then it’s possible to work around it. Duplicate discoveries play havoc with Router Redundancy Group (RRG) alerting, which isn’t funny when someone’s woken up in the middle of the night for it.

To stop NNMi discovering duplicate nodes once you have the devices in your topology, do the following:

1) Create an Auto-Discovery rule with the lowest ordering in the list (e.g. No-AutoDiscover-Rule)
2) Uncheck all options in the left-hand pane and uncheck Enable Ping Sweep
3) Add the IP addresses of all the mgmt0 interfaces in the management VRF (Type: Include in rule)

High CPU on Nexus 5K and no SNMP response – snmpd

Strange issue today with SNMP not responding on a Nexus 5K. I tried removing and re-adding the SNMP config, and removing altogether the ACL that we use to control access, but still no joy.

Upon checking CPU usage, it seemed quite high. The output of show proc cpu sort showed that snmpd was quite busy:

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
 4559           59  991518226      0   44.5%  snmpd
 4605          179        87   2065    9.0%  netstack
 1178         2091  1733135010      0    1.0%  kirqd
    1          157  25653477      0    0.0%  init
    2          837   3474116      0    0.0%  migration/0
    3          600  3970856252      0    0.0%  ksoftirqd/0

I was sure I’d dealt with this before and it seems that I was hitting a bug.

The official word is that there is a memory leak in libcmd, one of the processes used by SNMP. The workaround is to enter the hidden command:

no snmp-server load-mib dot1dbridgesnmp

The best solution, however, would be to perform a software upgrade to 5.0(3)N2(2) or later where this is fixed.
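
If you save the output of show proc cpu sort to a file, a few lines of Python can flag the busy processes for you. This is only a minimal sketch: the function name busy_processes and the 10% threshold are my own choices for illustration, and it assumes the column layout shown above.

```python
import re

# Sample rows copied from the show proc cpu sort output above.
SAMPLE = """\
PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
 4559           59  991518226      0   44.5%  snmpd
 4605          179        87   2065    9.0%  netstack
 1178         2091  1733135010      0    1.0%  kirqd
    1          157  25653477      0    0.0%  init
"""

# PID, Runtime(ms), Invoked, uSecs, 1Sec (as a percentage), Process
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)%\s+(\S+)\s*$")


def busy_processes(output, threshold=10.0):
    """Return (process, pid, cpu_percent) for rows at or above threshold.

    The threshold is a hypothetical value, not something NX-OS defines.
    """
    hits = []
    for line in output.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skip the header, separator and blank lines
        pid, cpu, name = int(m.group(1)), float(m.group(5)), m.group(6)
        if cpu >= threshold:
            hits.append((name, pid, cpu))
    # Busiest process first
    return sorted(hits, key=lambda h: h[2], reverse=True)


if __name__ == "__main__":
    for name, pid, cpu in busy_processes(SAMPLE):
        print(f"{name} (pid {pid}) is using {cpu}% CPU")
```

With the sample above, only snmpd crosses the 10% line.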

HP NNMi LDAP user account Can’t log in

I had a strange one today where a user set as a Directory service account couldn’t log in. No matter which computer or browser the user was coming from, he couldn’t log in and got the odd message below. The only thing that had happened is that he’d changed his password recently:

[Screenshot: LDAP login error message]

I’ve had issues in the past where the AD password has been changed to one that’s been used previously – NNMi must do some sort of caching – but this was different. I was assured that the password hadn’t been used before.

Solution:

1) Delete the user account
2) Recreate the user account and mappings but create it as a local user (uncheck Directory service account and put in a temporary password)
3) Log on with the user name/password from a PC that the user hasn’t used before (i.e. your PC)
4) Change the account back to a Directory service account

The problem should now be rectified. Clearing temporary internet files, cookies, etc. didn’t seem to help at all, so again I think there’s some caching going on within NNMi that breaks things.

Very annoying and I’ll be raising a bug report.

Effective Interface Monitoring with iSPI for Metrics

While iSPI for Metrics is useful for graphing, arguably the more critical use for it is in generating alerts when things are not working as expected.

I found today that Juniper does not include “FCS Lan Errors” in the ifInErrors counters which made finding an issue take a bit longer than usual. This is what I’m talking about:

Interface: ge-1/1/0, Enabled, Link is Up
Encapsulation: Ethernet, Speed: 1000mbps
Traffic statistics:                                           Current delta
  Input bytes:            42120054608900 (6704304 bps)           [16457880]
  Output bytes:           33903229061248 (11640616 bps)          [38429868]
  Input packets:             60739200100 (3676 pps)                 [82119]
  Output packets:            51621757514 (5316 pps)                [125843]
Error statistics:
  Input errors:                        0                                [0]
  Input drops:                         0                                [0]
  Input framing errors:          1435325                            [12231]
  Policed discards:                    0                                [0]
  L3 incompletes:                      0                                [0]
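
Since the framing errors only show up in the CLI output, it can help to scan that output for any error counter that is still incrementing. The following is a rough sketch under my own assumptions: the name rising_counters is hypothetical, and it expects the "name: total [delta]" layout of the error section shown above.

```python
import re

# Error section copied from the interface output above.
SAMPLE = """\
Error statistics:
  Input errors:                        0                                [0]
  Input drops:                         0                                [0]
  Input framing errors:          1435325                            [12231]
  Policed discards:                    0                                [0]
  L3 incompletes:                      0                                [0]
"""

# Matches lines of the form "  name:   total   [delta]"
COUNTER = re.compile(r"^\s*(.+?):\s+(\d+)\s+\[(\d+)\]\s*$")


def rising_counters(output):
    """Return {counter name: (total, delta)} for counters with a non-zero delta."""
    rising = {}
    for line in output.splitlines():
        m = COUNTER.match(line)
        if m and int(m.group(3)) > 0:
            rising[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return rising


if __name__ == "__main__":
    for name, (total, delta) in rising_counters(SAMPLE).items():
        print(f"{name}: total {total}, current delta {delta}")
```

On the sample it reports only "Input framing errors", while "Input errors" stays at zero, which is exactly the mismatch described above.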

For effective monitoring of critical WAN interfaces, you can create an interface group using, for example, a filter on interface name (or on ifAlias, i.e. the description, if you have a naming standard) and then attach it to a monitoring policy with the thresholds you want. Alternatively, you can use a node group based policy, bearing in mind that you’ll be alerted for all interfaces on the nodes in that group.

Example:
[Screenshot: example interface threshold settings]

You need to enable Interface Fault Polling and Interface Performance Polling to make this work, and you may also want to consider lowering the default 5-minute poll period for critical infrastructure.

I tend to set the trigger count at 1, and the rearm count higher (at least 2) so that the incident doesn’t just disappear before someone’s had a chance to have a look at it when the error condition clears.
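
To illustrate why a higher rearm count keeps the incident around, here is a small simulation. It is a sketch of the general idea only, assuming both counts mean "consecutive polls"; it is not NNMi’s actual evaluation logic, and the function name incident_states and the sample values are made up.

```python
def incident_states(samples, threshold, trigger_count=1, rearm_count=2):
    """Return, for each poll, whether the incident is open after that poll.

    Assumption: the incident opens after trigger_count consecutive polls
    above the threshold and closes after rearm_count consecutive polls at
    or below it.
    """
    active = False
    over = under = 0
    states = []
    for value in samples:
        if value > threshold:
            over += 1
            under = 0
        else:
            under += 1
            over = 0
        if not active and over >= trigger_count:
            active = True
        elif active and under >= rearm_count:
            active = False
        states.append(active)
    return states


if __name__ == "__main__":
    polls = [0, 5, 0, 0, 0]  # one bad poll followed by clean ones
    print(incident_states(polls, threshold=1, rearm_count=1))
    print(incident_states(polls, threshold=1, rearm_count=2))
```

With a rearm count of 1 the incident closes on the very next clean poll; with 2 it stays open for one more poll period.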

As a side note, you’ll also want to set generic thresholds, e.g. for “Routers” at least the following:

[Screenshot: generic threshold settings for routers]

Nexus 5K SNMP Config

Below are a few examples of setting up SNMP on Cisco Nexus 5Ks. Unfortunately, SNMPv3 is still missing some functionality, such as the ability to restrict access to a defined set of subnets or hosts via an ACL. I won’t go into another rant about how much of a headache SNMPv3 can be with various management systems; I’ll just provide (non-exhaustive) examples:

SNMPv2 with access control

! Create an ACL for allowed sources
ip access-list SNMPMGMT
 permit ip 10.243.100.0/28 192.168.100.1/32

! Create community - default access is ro
snmp-server community MYSTRING group network-operator
snmp-server community MYSTRING use-acl SNMPMGMT

! Where to send traps to and which VRF to use
snmp-server host 10.243.100.4 traps version 2c MYSTRING
snmp-server host 10.243.100.4 use-vrf management

! Enable some traps
snmp-server enable traps config ccmCLIRunningConfigChanged
snmp-server enable traps link cisco-xcvr-mon-status-chg
snmp-server enable traps bridge newroot
snmp-server enable traps bridge topologychange
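
Before blaming the switch for a silent poller, it’s worth checking that the management station actually falls inside the ACL’s source range. A minimal sketch using Python’s standard ipaddress module follows; the function name permitted is my own, and treating the trap receiver 10.243.100.4 as the poller is an assumption.

```python
import ipaddress

# Source range from the SNMPMGMT ACL above.
ACL_SOURCE = ipaddress.ip_network("10.243.100.0/28")


def permitted(manager_ip):
    """Return True if manager_ip falls inside the ACL source range."""
    return ipaddress.ip_address(manager_ip) in ACL_SOURCE


if __name__ == "__main__":
    # 10.243.100.4 is assumed to be the poller; 10.243.100.20 is a made-up
    # address just outside the /28 (which covers .0 to .15).
    for ip in ("10.243.100.4", "10.243.100.20"):
        print(ip, "permitted" if permitted(ip) else "blocked")
```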

SNMPv3 AuthPriv – SHA/AES – Unable to use an ACL!

! Enable privacy for all SNMP users
snmp-server globalEnforcePriv

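! Create user. The localizedkey keyword means the keys are supplied in
! localized (hashed) form; omit it when entering plain-text passphrases.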
snmp-server user MYUSER network-operator auth sha MYAUTHKEY priv aes-128 MYPRIVKEY localizedkey

! Where to send traps and which VRF to use
snmp-server host 10.243.100.4 traps version 3 priv MYUSER
snmp-server host 10.243.100.4 use-vrf management

! Enable traps as per the SNMPv2 example above.