NNMi 9.2x AF Cluster Reformation DB Start Failure

I was trying to reform an application failover cluster after having to restore a backup and was running into issues. The primary came up absolutely fine but the secondary kept coming up with status “STANDBY_DB_FAILED”.

Looking in the /var/opt/OV/shared/nnm/databases/Postgresql/pg_log files on the standby, this is what I saw initially, after the DBZIP had been received and extracted and the DB start was attempted:

2017-09-01 13:14:48.980 GMT: :4875:0LOG:  database system was interrupted; last known up at 2017-09-01 12:18:23 GMT
2017-09-01 13:14:49.129 GMT: :4875:0LOG:  entering standby mode
2017-09-01 13:14:49.497 GMT: :4875:0LOG:  restored log file "0000000D000005F700000095" from archive
2017-09-01 13:14:49.589 GMT: postgres:4971:0FATAL:  the database system is starting up
2017-09-01 13:14:49.855 GMT: :4875:0LOG:  restored log file "0000000D000005F700000093" from archive
2017-09-01 13:14:49.878 GMT: :4875:0FATAL:  could not access status of transaction 168395881
2017-09-01 13:14:49.878 GMT: :4875:0DETAIL:  Could not read from file "pg_clog/00A0" at offset 155648: Success.
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  startup process (PID 4875) exited with exit code 1
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  aborting startup due to startup process failure

Shutting the standby down, starting it standalone and then stopping it again made a bit more progress: there was some cleanup activity before it started. Rejoining the cluster still produced this, though:

2017-09-01 13:28:08.503 GMT: :21253:0LOG:  database system was shut down at 2017-09-01 13:26:27 GMT
2017-09-01 13:28:08.649 GMT: :21253:0LOG:  entering standby mode
2017-09-01 13:28:08.799 GMT: :21253:0WARNING:  WAL was generated with wal_level=minimal, data may be missing
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
2017-09-01 13:28:08.799 GMT: :21253:0FATAL:  hot standby is not possible because wal_level was not set to "hot_standby" on the master server
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  startup process (PID 21253) exited with exit code 1
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  aborting startup due to startup process failure
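
Out of curiosity, you can also ask the primary's embedded Postgres what wal_level it is actually running with. A minimal sketch follows; the psql path, port, user and database name are assumptions from memory rather than anything I checked here, so adjust them for your install:

# Run on the ACTIVE member; path, port, user and database name are assumptions
/opt/OV/nonOV/Postgres/bin/psql -U postgres -p 5432 -d nnm -c "SHOW wal_level;"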

In the end, what worked was the following (rough commands are sketched after the list):

1) On the Active member, I triggered a manual DB backup with the dbsync command in the nnmcluster console
2) On the Standby member, I removed the /var/opt/OV/shared/nnm/node-cluster.sentinel file (this may or may not have helped)
3) On the Standby member, I restarted nnmcluster (nnmcluster -daemon)
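
For reference, the rough commands for the above (the nnmcluster console prompt is approximate, and the sentinel path is as on my Linux install):

# On the Active member, from the nnmcluster console:
nnmcluster
nnmcluster> dbsync

# On the Standby member, with its local cluster process stopped:
rm /var/opt/OV/shared/nnm/node-cluster.sentinel
nnmcluster -daemon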

In summary, it looks like there was some corruption or other stale data in the initial backup, because the errors I saw on the standby were never present on the primary. After the steps above, the standby came back up cleanly and the cluster status display looked healthy again:


  Local?    NodeType  State                     OvStatus     Hostname/Address
  ------    --------  -----                     --------     ----------------------------
* LOCAL     DAEMON    ACTIVE_NNM_RUNNING        RUNNING      servera/servera-22547
  (SELF)    ADMIN     n/a                       n/a          servera/servera-64198
  REMOTE    DAEMON    STANDBY_READY             DB_RUNNING   serverb/serverb-40214

Juniper MX480 PSU Gripes

This has happened three times in a row to me. It seems that Juniper MX480 PSUs are not very graceful when they blow. I now think it’s unwise to ask anyone to try “re-seating the PSU and power cable” once a PSU has failed because this has a tendency to make the PDU trip, knocking out everything on that power strip. Ouch! Better to just RMA the thing.

Not so much of a problem when all the kit in the rack has dual PSUs, but it’s worth bearing in mind in case there is any single-PSU kit in there that isn’t on a static transfer switch! Luckily the environment I deal with doesn’t involve the latter. :)

Faulty PSU shows as:

PEM 0:
  State:     Present
  AC input:  Out of range (1 feed expected, 1 feed connected)
  Capacity:  2050 W (maximum 2050 W)
  DC output: 0 W (zone 0, 0 A at 0 V, 0% of capacity)
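
For the record, that sort of output comes from the Junos PEM environment display, i.e. something along these lines from operational mode:

user@mx480> show chassis environment pem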

ASA stuck on “Booting system, please wait…” after power cycle

A Cisco ASA5550 was stuck on “Booting system, please wait…” after a power cycle (physically turning off and on, not just a reload). It was impossible to break into ROMMON from here.

After taking the cover off and doing some experimentation, the issue turned out to be a faulty DIMM slot. We removed all of the DIMMs and re-inserted them one by one. As soon as a DIMM went into slot 1, the firewall failed to boot again. We swapped DIMMs 1 and 2: still no joy.

With the DIMM removed from slot 1 again, the firewall came back up, which pointed at the slot itself rather than the memory.

Solution: If an RMA is going to take time, bring the firewall back up with less memory. Otherwise, swap the thing out straight away as you don’t know what else the power cycle has fried!

You can make your life easier during the replacement by moving the old ASA’s compact flash card into the external (disk1) slot of the new one. That way you can at least get the right OS onto the replacement quickly:

NEWASA# copy disk1:/[IOS-image-name.bin] disk0:/[IOS-image-name.bin]
NEWASA# configure terminal
NEWASA(config)# boot system disk0:/[IOS-image-name.bin]
NEWASA(config)# end
NEWASA# wr mem

Note: you can copy the running-config on the old ASA to a visible file on CF (e.g. copy running-config disk0:/myconfig.cfg), but copying that from disk1 into the running-config on the replacement tends not to work very well if you have TACACS config in place. Good old copy and paste from a backup is the way forward.
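
To spell that out, the round trip looks roughly like this (OLDASA/NEWASA and myconfig.cfg are just illustrative names); it’s the second command that tends to choke once TACACS/AAA lines start merging in, hence the copy-and-paste advice:

On the old ASA, before pulling the CF card:
OLDASA# copy running-config disk0:/myconfig.cfg

On the replacement, with that card in the external disk1 slot:
NEWASA# copy disk1:/myconfig.cfg running-config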

Cisco 4900M Upgrade Failure

So you’ve got a Cisco 4900M switch to upgrade. You’ve uploaded the new image to flash, and set the boot statement correctly. However, your switch still won’t come up on the new version of code you’ve uploaded, no matter how many times you reload. This makes no sense at all and you begin to tear your hair out.

4900M-SW-1#dir
Directory of bootflash:/
 
    6  -rw-    25442405   Sep 7 2012 12:55:14 +01:00  cat4500e-entservicesk9-mz.122-53.SG1.bin
    7  -rw-    25646261   Sep 8 2012 13:31:20 +01:00  cat4500e-entservicesk9-mz.122-53.SG2.bin
    8  -rw-       41218  Jan 14 2013 11:10:26 +01:00  extra_logs.txt
    9  -rw-    25936915  Mar 18 2013 10:00:40 +01:00  cat4500e-entservicesk9-mz.122-54.SG1.bin

4900M-SW-1(config)# boot system flash bootflash:cat4500e-entservicesk9-mz.122-54.SG1.bin

The issue here is easily missed: the 4900M platform seems to ship with its config register set to 0x2101 by default, which means “boot the first image found in flash” and ignore the boot statement. A show version will confirm this. Fix it with:

4900M-SW-1(config)# config-register 0x2102
4900M-SW-1(config)#^Z

“show bootvar” will verify that the config register has been changed. After a reload, the switch should come up on the correct image (whatever is specified in your boot statement), and “show version” will show that the register is now correct:

cisco WS-C4900M (MPC8548) processor (revision 2) with 524288K bytes of memory.
Processor board ID XXXXXXXXX
MPC8548 CPU at 1.33GHz, Cisco Catalyst 4900M
Last reset from Reload
10 Virtual Ethernet interfaces
36 Gigabit Ethernet interfaces
16 Ten Gigabit Ethernet interfaces
511K bytes of non-volatile configuration memory.
 
Configuration register is 0x2102
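
For completeness, the whole fix-up sequence on these boxes is roughly as follows (reusing the image name from the listing above; the write memory is there so the boot statement survives the reload):

4900M-SW-1# configure terminal
4900M-SW-1(config)# boot system flash bootflash:cat4500e-entservicesk9-mz.122-54.SG1.bin
4900M-SW-1(config)# config-register 0x2102
4900M-SW-1(config)# end
4900M-SW-1# write memory
4900M-SW-1# show bootvar
4900M-SW-1# reload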

I should say here that 4900Ms are very strange beasts and have caused me a few issues in the past, though more code-related than hardware-related. An older version of code caused an outage while uploading an IOS image via FTP, thanks to a process deciding to hog 100% of the CPU (a confirmed bug from Cisco). Call me paranoid, but I now avoid uploading new images during the day if I can help it!