NNMi 9.2x AF Cluster Reformation DB Start Failure

I was trying to reform an application failover cluster after having to restore a backup and ran into problems. The primary came up absolutely fine, but the secondary kept coming up with status “STANDBY_DB_FAILED”.

Looking in the /var/opt/OV/shared/nnm/databases/Postgresql/pg_log files, this is what I saw initially after the DBZIP was received and extracted and the DB start was attempted:

2017-09-01 13:14:48.980 GMT: :4875:0LOG:  database system was interrupted; last known up at 2017-09-01 12:18:23 GMT
2017-09-01 13:14:49.129 GMT: :4875:0LOG:  entering standby mode
2017-09-01 13:14:49.497 GMT: :4875:0LOG:  restored log file "0000000D000005F700000095" from archive
2017-09-01 13:14:49.589 GMT: postgres:4971:0FATAL:  the database system is starting up
2017-09-01 13:14:49.855 GMT: :4875:0LOG:  restored log file "0000000D000005F700000093" from archive
2017-09-01 13:14:49.878 GMT: :4875:0FATAL:  could not access status of transaction 168395881
2017-09-01 13:14:49.878 GMT: :4875:0DETAIL:  Could not read from file "pg_clog/00A0" at offset 155648: Success.
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  startup process (PID 4875) exited with exit code 1
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  aborting startup due to startup process failure

Shutting the standby down, starting it standalone and then stopping it again made a bit more progress (I saw some cleanup activity before it started), but it still failed with this:

2017-09-01 13:28:08.503 GMT: :21253:0LOG:  database system was shut down at 2017-09-01 13:26:27 GMT
2017-09-01 13:28:08.649 GMT: :21253:0LOG:  entering standby mode
2017-09-01 13:28:08.799 GMT: :21253:0WARNING:  WAL was generated with wal_level=minimal, data may be missing
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
2017-09-01 13:28:08.799 GMT: :21253:0FATAL:  hot standby is not possible because wal_level was not set to "hot_standby" on the master server
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  startup process (PID 21253) exited with exit code 1
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  aborting startup due to startup process failure

In the end, what worked was the following (rough commands are sketched after the list):

1) On the Active member, within nnmcluster, trigger a manual DB backup with dbsync
2) On the Standby member, remove the /var/opt/OV/shared/nnm/node-cluster.sentinel file (this may or may not have helped)
3) On the Standby server, restart nnmcluster (nnmcluster -daemon)
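
Roughly, that looks like this (a sketch only: the nnmcluster console prompt is illustrative, and you will need to stop nnmcluster on the standby before removing the sentinel file):

# On the Active member: open the nnmcluster console and trigger a manual DB backup
nnmcluster
nnmcluster> dbsync

# On the Standby member: remove the sentinel file, then restart the cluster process
rm /var/opt/OV/shared/nnm/node-cluster.sentinel
nnmcluster -daemon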

In summary, it looks like there was perhaps some corruption or other stale data in the initial backup, because the errors I saw on the standby were never present on the primary. Once the standby came back up, the nnmcluster status display showed both members healthy again:

  Local?    NodeType  State                     OvStatus     Hostname/Address
  ------    --------  -----                     --------     ----------------------------
* LOCAL     DAEMON    ACTIVE_NNM_RUNNING        RUNNING      servera/servera-22547
  (SELF)    ADMIN     n/a                       n/a          servera/servera-64198
  REMOTE    DAEMON    STANDBY_READY             DB_RUNNING   serverb/serverb-40214

Do NOT install HP OMI agent v12 on NNMi9 Servers!

As per the title, do NOT install or upgrade the OMI agent to v12 on NNMi boxes, as this will overwrite files in the /opt/OV/nonOV/perl directory. NNMi9 relies on Perl 5.8.8, but installing the v12 agent drops Perl 5.16.0 files over the top.

The first issues you are likely to see will be NNMi self-monitoring alerts, and then you’ll notice that you can’t run any of the .ovpl commands any more, each failing with a message like this:

Can't locate OVNNMvars.pm in @INC (@INC contains: /opt/OV/nonOV/perl/a/lib/site_perl/5.16.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/site_perl/5.16.0 /opt/OV/nonOV/perl/a/lib/5.16.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/5.16.0 /opt/OV/nonOV/perl/a/lib/site_perl/5.16.0/x86_64-linux-thread-multi /opt/OV/nonOV/perl/a/lib/site_perl/5.16.0 /opt/OV/nonOV/perl/a/lib/site_perl .) at /opt/OV/bin/nnmversion.ovpl line 19.
 BEGIN failed--compilation aborted at /opt/OV/bin/nnmversion.ovpl line 19.

It wasn’t fun getting back to a known state, but we managed in the end by working out which files the agent had installed or modified and restoring them from backups. Avoid the pain by not getting into the situation in the first place, and make sure you’re backing up binaries as well as data files.
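
As a quick check that things are back to normal (assuming the bundled interpreter lives under the /opt/OV/nonOV/perl/a path shown in the @INC error above), you can ask NNMi’s own Perl which version it is:

# Should report 5.8.8 on NNMi9, not 5.16.0
/opt/OV/nonOV/perl/a/bin/perl -v | head -2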

I believe the latest version of NNMi 10 uses Perl 5.16.0, so this shouldn’t be an issue there, but I’d personally still avoid installing any product that shares the same file structure.

Updating TACACS on an older WildPackets OmniPliance

I had to update the TACACS server details on an old OmniPliance, which proved to be quite confusing. I found the location of the settings, but each time we restarted the service the file reverted back to the way it was; it seems the running daemon writes its configuration back out to the file, so any edits made while it is running are lost.

Quite a simple solution in the end: stop the service first, make the edit, then start it again. The procedure is below.

$ ssh root@omnipliance1
# service omnid stop
# vi /etc/omni/engineconfig.xml

Edit the TACACS server entry in engineconfig.xml so it points at the new server details.

Quit with :wq! and restart the service.

# service omnid start
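
To confirm the change has survived the restart (assuming the relevant entry in engineconfig.xml contains the string “tacacs”), a quick grep after starting the service works:

# grep -i tacacs /etc/omni/engineconfig.xml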

NNMi 9.24 Custom Poller Bus Adapter Errors

I was getting the following alerts from NNMi regarding the Custom Poller, which were annoying users with the constant on/off status notifications.

“The Performance SPI Custom Poller Bus Adapter has status Critical because the average input queue duration…..”
“The CustomPoller Export Bus Adapter has status Minor because file space limit (2,000 MB) has been reached and older export data files are being removed to make room for new files.”

After some digging, it seemed that the older export files weren’t being deleted, so this issue had actually crept up over time.

Apparently this was fixed in patch 5, but a workaround until you can apply it is to delete the older files manually as follows. I strongly suggest running the find command without the “| xargs rm” part first, to verify that you are indeed only finding regular files within the correct directory.

# cd /var/opt/OV/shared/nnm/databases/custompoller/export/final
# find . -mtime +365 -type f | xargs rm
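
For example, a dry run along those lines (counting and sampling what matches, rather than deleting anything) might look like this:

# find . -mtime +365 -type f | wc -l
# find . -mtime +365 -type f | head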

Check the usage is under the size limit:

# du -hs .
649M    .

Spoofing SNMP Traps for testing

Somehow I missed the fact that JunOS allows you to spoof SNMP traps. I discovered this recently and must say it’s very handy, especially when testing new NNMi or other NMS incident configurations. It helpfully populates the varbinds for you with preset values (although you can specify them if desired).

This is done as follows from the JunOS command line:

user@SRX> request snmp spoof-trap ospfNbrStateChange variable-bindings "ospfNbrState = 8"

You can look up varbind names in the appropriate MIB.

A bit easier than the traditional equivalent in shell:

/usr/bin/snmptrap -v 2c -c mycommunity nms.mydomain.com:162 '' .1.3.6.1.2.1.14.16.2.0.2 \
.1.3.6.1.2.1.14.1.1 a 0.0.0.0 \
.1.3.6.1.2.1.14.10.1.1 a "1.2.3.4" \
.1.3.6.1.2.1.14.10.1.2 i "0" \
.1.3.6.1.2.1.14.10.1.3 a "2.3.4.5" \
.1.3.6.1.2.1.14.10.1.6 i "8"
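
For reference, the OIDs in that command map to the following OSPF-MIB objects (from memory, so worth verifying against the MIB itself):

.1.3.6.1.2.1.14.16.2.0.2    ospfNbrStateChange (the trap OID, in its SNMPv1-converted form)
.1.3.6.1.2.1.14.1.1         ospfRouterId
.1.3.6.1.2.1.14.10.1.1      ospfNbrIpAddr
.1.3.6.1.2.1.14.10.1.2      ospfNbrAddressLessIndex
.1.3.6.1.2.1.14.10.1.3      ospfNbrRtrId
.1.3.6.1.2.1.14.10.1.6      ospfNbrState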