HP appear to be getting around to fixing the duplicate discovery of Cisco Nexus nodes (the fix is not in patch 5), but until then there is a workaround. Duplicate discoveries play havoc with RRG alerting, which isn’t funny when someone’s woken up in the middle of the night for it.
To stop NNMi discovering duplicate nodes once you have the devices in your topology, do the following:
1) Create an Auto-Discovery rule with the lowest ordering in the list (e.g. No-AutoDiscover-Rule)
2) Uncheck all options in the left-hand pane and uncheck Enable Ping Sweep
3) Add the IP addresses of all the mgmt0 interfaces in the management VRF (Type: Include in rule)
Strange issue today with SNMP not responding on a Nexus 5K. I tried removing and re-adding the SNMP config, and even removing the ACL we use to control access altogether, and still no joy.
CPU usage turned out to be quite high, and the output of show proc cpu sort showed that snmpd was very busy:
  PID  Runtime(ms)     Invoked  uSecs    1Sec  Process
-----  -----------  ----------  -----  ------  -----------
 4559           59   991518226      0   44.5%  snmpd
 4605          179          87   2065    9.0%  netstack
 1178         2091  1733135010      0    1.0%  kirqd
    1          157    25653477      0    0.0%  init
    2          837     3474116      0    0.0%  migration/0
    3          600  3970856252      0    0.0%  ksoftirqd/0
I was sure I’d dealt with this before and it seems that I was hitting a bug.
The official word is that there is a memory leak in one of the processes, libcmd, that is used by SNMP. The workaround is to enter the hidden command:
no snmp-server load-mib dot1dbridgesnmp
The best solution, however, would be to perform a software upgrade to 5.0(3)N2(2) or later where this is fixed.
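To spot the same symptom on other switches before SNMP stops answering, the CPU table can be checked by script. The following is a minimal sketch that assumes the column layout shown in the output above; the function name and the 20% threshold are arbitrary choices for illustration, not anything Cisco recommends.

```python
import re

# Shortened copy of the "show proc cpu sort" output from this post.
SAMPLE = """\
PID Runtime(ms) Invoked uSecs 1Sec Process
----- ----------- -------- ----- ------ -----------
4559 59 991518226 0 44.5% snmpd
4605 179 87 2065 9.0% netstack
1178 2091 1733135010 0 1.0% kirqd
"""

# PID, runtime, invoked, uSecs, 1-second CPU %, process name
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)%\s+(\S+)\s*$")


def busy_processes(output, threshold_pct=20.0):
    """Return (process, cpu_pct) pairs at or above threshold_pct."""
    busy = []
    for line in output.splitlines():
        match = ROW.match(line)  # header and separator lines do not match
        if match and float(match.group(5)) >= threshold_pct:
            busy.append((match.group(6), float(match.group(5))))
    return busy


print(busy_processes(SAMPLE))  # [('snmpd', 44.5)]
```

How the output is collected (SSH, a saved file, and so on) is left out on purpose; the function only needs the text.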
I had a strange one today where a user set up as a Directory Service Account couldn’t log in. No matter which computer or browser he used, he couldn’t log in and got the odd message below. The only thing that had changed was that he’d recently changed his password:
I’ve had issues in the past where the AD password has been changed to one that’s been used previously (NNMi must do some sort of caching), but this was different. I was assured that the password hadn’t been used before. What fixed it was:
1) Delete the user account
2) Recreate the user account and mappings, but create it as a local user (uncheck Directory Service Account and put in a temporary password)
3) Log on with the user name/password from a PC that the user hasn’t used before (i.e. your PC)
4) Change the account back to a Directory Service Account
The problem should be rectified. Clearing temporary internet files, cookies, etc. didn’t seem to help at all, so again, I think there’s some caching going on within NNMi that breaks things.
Very annoying and I’ll be raising a bug report.
While iSPI for Metrics is useful for graphing, arguably the more critical use for it is in generating alerts when things are not working as expected.
I found today that Juniper does not include “FCS Lan Errors” in the ifInErrors counters which made finding an issue take a bit longer than usual. This is what I’m talking about:
Interface: ge-1/1/0, Enabled, Link is Up
Encapsulation: Ethernet, Speed: 1000mbps
Traffic statistics: Current delta
Input bytes: 42120054608900 (6704304 bps) 
Output bytes: 33903229061248 (11640616 bps) 
Input packets: 60739200100 (3676 pps) 
Output packets: 51621757514 (5316 pps) 
Input errors: 0 
Input drops: 0 
Input framing errors: 1435325 
Policed discards: 0 
L3 incompletes: 0 
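A quick way to catch this kind of mismatch is to compare the headline error counter against the more specific ones. The sketch below parses the counter lines from output like the above; the function names and the list of counters to watch are my own assumptions, so adjust them to whatever your platform prints.

```python
import re

# Counter lines copied from the interface output in this post.
SAMPLE = """\
Input errors: 0
Input drops: 0
Input framing errors: 1435325
Policed discards: 0
L3 incompletes: 0
"""

# Assumed list of counters worth checking when "Input errors" reads zero.
WATCHED = ("Input drops", "Input framing errors", "Policed discards", "L3 incompletes")


def parse_counters(output):
    """Map 'Name: 123' lines to {name: int}; other lines are ignored."""
    counters = {}
    for line in output.splitlines():
        match = re.match(r"^\s*([^:]+):\s+(\d+)\s*$", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def hidden_errors(counters):
    """Return nonzero watched counters when 'Input errors' itself is zero."""
    if counters.get("Input errors", 0) != 0:
        return {}
    return {name: counters[name] for name in WATCHED if counters.get(name, 0) > 0}


print(hidden_errors(parse_counters(SAMPLE)))  # {'Input framing errors': 1435325}
```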
For effective monitoring of critical WAN interfaces, you can create an interface group, for example using a filter on interface name (or on ifAlias, i.e. the description, if you have a naming standard), and then attach a monitoring policy to it with the thresholds you want. Alternatively, you can use a node-group-based policy, bearing in mind that you’ll be alerted for all interfaces on the nodes in that group.
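Before committing a filter, it can help to preview which interfaces a name or description pattern would catch. This is only an offline sketch: the inventory below is hypothetical, and shell-style wildcards stand in for whatever matching operator your interface group filter actually uses.

```python
from fnmatch import fnmatchcase

# Hypothetical inventory; the descriptions are made up for illustration.
INTERFACES = [
    {"ifName": "ge-1/1/0", "ifAlias": "WAN: link to hypothetical site A"},
    {"ifName": "ge-1/1/1", "ifAlias": "LAN: access switch uplink"},
    {"ifName": "xe-0/0/0", "ifAlias": "WAN: link to hypothetical site B"},
]


def preview_group(interfaces, name_pattern=None, alias_pattern=None):
    """Return ifNames matching either pattern (a pattern of None is ignored)."""
    matched = []
    for intf in interfaces:
        by_name = name_pattern is not None and fnmatchcase(intf["ifName"], name_pattern)
        by_alias = alias_pattern is not None and fnmatchcase(intf["ifAlias"], alias_pattern)
        if by_name or by_alias:
            matched.append(intf["ifName"])
    return matched


print(preview_group(INTERFACES, alias_pattern="WAN:*"))     # ['ge-1/1/0', 'xe-0/0/0']
print(preview_group(INTERFACES, name_pattern="ge-1/1/*"))   # ['ge-1/1/0', 'ge-1/1/1']
```

The second call shows why a description-based filter tends to be safer than a name-based one: the name pattern also picks up the LAN port.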
You’ll need to enable Interface Fault Polling and Interface Performance Polling to make this work, and may also want to consider lowering the default 5-minute poll period for critical infrastructure.
I tend to set the trigger count at 1 and the rearm count higher (at least 2) so that, when the error condition clears, the incident doesn’t just disappear before someone’s had a chance to have a look at it.
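To see why a higher rearm count keeps the incident around, here is a simplified model of consecutive-sample counting. It is an assumption about how trigger and rearm counts behave in general, not NNMi’s actual implementation; the function name and sample values are made up.

```python
def incident_states(samples, threshold, trigger_count=1, rearm_count=2):
    """Return, for each poll, whether the incident is open after that poll.

    The incident opens after trigger_count consecutive samples above the
    threshold and closes after rearm_count consecutive samples at or below it.
    """
    is_open = False
    over = under = 0
    states = []
    for value in samples:
        if value > threshold:
            over, under = over + 1, 0
        else:
            over, under = 0, under + 1
        if not is_open and over >= trigger_count:
            is_open = True
        elif is_open and under >= rearm_count:
            is_open = False
        states.append(is_open)
    return states


polls = [0, 12, 0, 0, 0]  # a single bad poll followed by clean ones
print(incident_states(polls, threshold=5, rearm_count=2))  # [False, True, True, False, False]
print(incident_states(polls, threshold=5, rearm_count=1))  # [False, True, False, False, False]
```

With a rearm count of 1 the incident closes on the very next clean poll; with 2 it survives one extra poll cycle, which is the breathing room described above.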
As a side note, you’ll also want to set generic thresholds… Eg: For “Routers” at least the following:
Below are a few examples of setting up SNMP on Cisco Nexus 5Ks. Unfortunately, SNMPv3 is still missing some functionality, such as the ability to restrict access to a defined set of subnets or hosts via an ACL. I won’t go into another rant about how much of a headache SNMPv3 can be with various management systems; I’ll just provide (non-exhaustive) examples:
SNMPv2 with access control
! Create an ACL for allowed sources
ip access-list SNMPMGMT
permit ip 10.243.100.0/28 192.168.100.1/32
! Create community - default access is ro
snmp-server community MYSTRING group network-operator
snmp-server community MYSTRING use-acl SNMPMGMT
! Where to send traps to, and which VRF to use
snmp-server host 10.243.100.4 traps version 2c MYSTRING
snmp-server host 10.243.100.4 use-vrf management
! Enable some traps
snmp-server enable traps config ccmCLIRunningConfigChanged
snmp-server enable traps link cisco-xcvr-mon-status-chg
snmp-server enable traps bridge newroot
snmp-server enable traps bridge topologychange
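Since the ACL in the SNMPv2 example only admits sources in 10.243.100.0/28, it is worth confirming that every poller address actually falls inside that range before applying it. A minimal sketch using Python’s standard ipaddress module follows; 10.243.100.20 is a hypothetical out-of-range address added for contrast.

```python
import ipaddress

# Source range permitted by the SNMPMGMT ACL in the example above.
ACL_SOURCE = ipaddress.ip_network("10.243.100.0/28")


def permitted(source_ip, network=ACL_SOURCE):
    """Return True if source_ip falls inside the permitted source range."""
    return ipaddress.ip_address(source_ip) in network


print(permitted("10.243.100.4"))   # True: inside 10.243.100.0 to 10.243.100.15
print(permitted("10.243.100.20"))  # False: hypothetical address outside the /28
```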
SNMPv3 AuthPriv – SHA/AES – Unable to use an ACL!
! Enable privacy for all SNMP users
snmp-server globalEnforcePriv
! Create User
snmp-server user MYUSER network-operator auth sha MYAUTHKEY priv aes-128 MYPRIVKEY localizedkey
! Where to send traps and which VRF to use
snmp-server host 10.243.100.4 traps version 3 priv MYUSER
snmp-server host 10.243.100.4 use-vrf management
! Enable traps below as per above example.