NNMi 9.2x AF Cluster Reformation DB Start Failure

I was trying to reform an application failover cluster after having to restore a backup, and was running into issues. The primary came up absolutely fine, but the secondary kept coming up with the status “STANDBY_DB_FAILED”.

Looking into the log files under /var/opt/OV/shared/nnm/databases/Postgresql/pg_log, I initially had this after the DBZIP was received and extracted and the DB start was attempted:

2017-09-01 13:14:48.980 GMT: :4875:0LOG:  database system was interrupted; last known up at 2017-09-01 12:18:23 GMT
2017-09-01 13:14:49.129 GMT: :4875:0LOG:  entering standby mode
2017-09-01 13:14:49.497 GMT: :4875:0LOG:  restored log file "0000000D000005F700000095" from archive
2017-09-01 13:14:49.589 GMT: postgres:4971:0FATAL:  the database system is starting up
2017-09-01 13:14:49.855 GMT: :4875:0LOG:  restored log file "0000000D000005F700000093" from archive
2017-09-01 13:14:49.878 GMT: :4875:0FATAL:  could not access status of transaction 168395881
2017-09-01 13:14:49.878 GMT: :4875:0DETAIL:  Could not read from file "pg_clog/00A0" at offset 155648: Success.
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  startup process (PID 4875) exited with exit code 1
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  aborting startup due to startup process failure
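
As an aside, to watch these messages live while the standby tries to start, tailing the newest file in that directory works (the path is the one above):

# cd /var/opt/OV/shared/nnm/databases/Postgresql/pg_log
# tail -f "$(ls -t | head -1)"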

Shutting down and starting standalone, then stopping again, made a bit more progress – I saw some cleanup activity before it started – but I still got this:

2017-09-01 13:28:08.503 GMT: :21253:0LOG:  database system was shut down at 2017-09-01 13:26:27 GMT
2017-09-01 13:28:08.649 GMT: :21253:0LOG:  entering standby mode
2017-09-01 13:28:08.799 GMT: :21253:0WARNING:  WAL was generated with wal_level=minimal, data may be missing
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
2017-09-01 13:28:08.799 GMT: :21253:0FATAL:  hot standby is not possible because wal_level was not set to "hot_standby" on the master server
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  startup process (PID 21253) exited with exit code 1
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  aborting startup due to startup process failure
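
For what it’s worth, the standalone bounce went roughly like this – a sketch only, as the exact flags depend on your version and failover setup (check nnmcluster -help), but -shutdown/-disable/-enable and ovstart/ovstop are all standard NNMi commands:

# nnmcluster -shutdown
# nnmcluster -disable
# ovstart
# ovstop
# nnmcluster -enable

The idea being that the -disable/-enable pair takes the node out of application failover so that ovstart/ovstop behave normally while it runs standalone.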

In the end, what worked was the following (sketched as commands below):

1) On the Active member, within the nnmcluster console, trigger a manual DB backup with dbsync
2) On the Standby member, remove the /var/opt/OV/shared/nnm/node-cluster.sentinel file – this may or may not have helped
3) On the Standby server, restart nnmcluster (nnmcluster -daemon)
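
For reference, a minimal sketch of those steps as shell commands – dbsync is typed at the interactive nnmcluster console prompt, and the sentinel path is the one above:

On the Active member:

# nnmcluster
dbsync

On the Standby member:

# rm /var/opt/OV/shared/nnm/node-cluster.sentinel
# nnmcluster -daemon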

In summary, it looks like there was some corruption or other stale data in the initial backup, because the errors I saw on the standby were never present on the primary. Once the above was done, the cluster status (nnmcluster -display) finally looked healthy:

  Local?    NodeType  State                     OvStatus     Hostname/Address
  ------    --------  -----                     --------     ----------------------------
* LOCAL     DAEMON    ACTIVE_NNM_RUNNING        RUNNING      servera/servera-22547
  (SELF)    ADMIN     n/a                       n/a          servera/servera-64198
  REMOTE    DAEMON    STANDBY_READY             DB_RUNNING   serverb/serverb-40214

Do NOT install HP OMI agent v12 on NNMi9 Servers!

As per the title, do NOT install or upgrade the OMI agent to v12 on NNMi boxes, as this will overwrite files in the /opt/OV/nonOV/perl directory. NNMi9 relies on Perl 5.8.8, but Perl 5.16.0 files will be installed over the top when the OMI agent is installed.

The first issues you will likely see are NNMi self-monitoring alerts, and then you’ll notice that you can’t run any of the .ovpl commands any more, getting a message like this:

Can't locate OVNNMvars.pm in @INC (@INC contains: /opt/OV/nonOV/perl/a/lib/site_perl/5.16.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/site_perl/5.16.0 /opt/OV/nonOV/perl/a/lib/5.16.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/5.16.0 /opt/OV/nonOV/perl/a/lib/site_perl/5.16.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/site_perl/5.16.0 /opt/OV/nonOV/perl/a/lib/site_perl .) at /opt/OV/bin/nnmversion.ovpl line 19.
 BEGIN failed--compilation aborted at /opt/OV/bin/nnmversion.ovpl line 19.
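
A quick way to spot this (or to check a box before anyone installs an agent on it) is to ask NNMi’s bundled interpreter for its version – the binary path here is an assumption based on the library paths in the error above:

# /opt/OV/nonOV/perl/a/bin/perl -v

On a healthy NNMi9 server this should report 5.8.8; if it reports 5.16.0, the files have already been overwritten.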

Getting back to a known state wasn’t fun, but we managed in the end by working out which files the agent had installed or modified and restoring them from backups. Avoid the pain by not getting into the situation in the first place – and make sure you’re backing up not just data files but binaries as well.
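
For example, a simple way to capture the binaries alongside the usual data backups – the destination path is just an illustration:

# tar czf /backup/opt_OV_$(date +%Y%m%d).tar.gz /opt/OV

The point being that /opt/OV (binaries) needs to be recoverable as well as /var/opt/OV (data).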

I believe the latest version of NNM10 uses Perl 5.16.0, so this shouldn’t be an issue there. I’d personally still avoid installing any product that uses the same file structure, though.

NNMi 9.24 Custom Poller Bus Adapter Errors

I was getting the following alerts regarding the Custom Poller on NNMi, which was annoying users with constant on/off status alerts.

“The Performance SPI Custom Poller Bus Adapter has status Critical because the average input queue duration…..”
“The CustomPoller Export Bus Adapter has status Minor because file space limit (2,000 MB) has been reached and older export data files are being removed to make room for new files.”

After some digging, it seems that the older files weren’t being deleted so this issue had actually crept up over time.

Apparently this is fixed by patch 5, but a workaround until then is to delete older files manually, as follows. I strongly suggest running the find command on its own first (without the removal part) to verify that you are indeed only finding regular files within the correct directory; the -print0/-0 pairing below is a small safety tweak so that unusual filenames can’t trip up xargs.

# cd /var/opt/OV/shared/nnm/databases/custompoller/export/final
# find . -mtime +365 -type f
# find . -mtime +365 -type f -print0 | xargs -0 rm

Then check that usage is back under the size limit:

# du -hs .
649M    .
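
Until the patch is applied, the cleanup could be automated – a hypothetical root crontab entry along these lines (weekly, Sundays at 02:00), using the same find as above:

0 2 * * 0 cd /var/opt/OV/shared/nnm/databases/custompoller/export/final && find . -mtime +365 -type f -print0 | xargs -0 -r rm

The -r flag simply stops rm from being run when nothing matches.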

HP NNMi9 False Redundancy Group alerts/Cisco Nexus Duplicate Discoveries

It seems that HP are getting around to fixing the issue of duplicate Cisco Nexus nodes being discovered (the fix isn’t in patch 5), but until then it’s possible to work around it. Duplicate discoveries play havoc with Router Redundancy Group (RRG) alerting, which isn’t funny when someone’s woken up in the middle of the night for it.

To stop NNMi discovering duplicate nodes once you have the devices in your topology, do the following:

1) Create an Auto-Discovery rule with the lowest ordering in the list (e.g. No-AutoDiscover-Rule)
2) Uncheck all options on the left-hand pane and uncheck Enable Ping Sweep
3) Add the IP addresses of all the mgmt0 interfaces in the management VRF (Type: Include in rule)

HP NNMi LDAP User Account Can’t Log In

I had a strange one today where a user set up as a Directory Service account couldn’t log in. No matter which computer or browser the user was coming from, he couldn’t log in and got the odd message below. The only thing that had happened was that he’d changed his password recently:

[screenshot: LDAP login error message]

I’ve had issues in the past where the AD password has been changed back to one that’s been used previously – NNMi must do some sort of caching – but this was different: I was assured that the password hadn’t been used before.

Solution:

1) Delete the User account
2) Recreate the user account and mappings but create it as a local user (uncheck Directory service account and put in a temporary password)
3) Log on with the user name/password from a PC that the user hasn’t used before (i.e. your own PC)
4) Change account back to a Directory Service Account

The problem should then be rectified. Clearing temporary internet files, cookies etc. didn’t seem to help at all, so again I think there’s some caching going on within NNMi that breaks things.

Very annoying and I’ll be raising a bug report.