NNMi 9.2x AF Cluster Reformation DB Start Failure

I was trying to re-form an application failover cluster after having to restore a backup and was running into issues: the primary came up absolutely fine, but the secondary kept coming up with status “STANDBY_DB_FAILED”.

Looking in the /var/opt/OV/shared/nnm/databases/Postgresql/pg_log files, I initially saw this after the DBZIP was received and extracted and the DB start was attempted:

2017-09-01 13:14:48.980 GMT: :4875:0LOG:  database system was interrupted; last known up at 2017-09-01 12:18:23 GMT
2017-09-01 13:14:49.129 GMT: :4875:0LOG:  entering standby mode
2017-09-01 13:14:49.497 GMT: :4875:0LOG:  restored log file "0000000D000005F700000095" from archive
2017-09-01 13:14:49.589 GMT: postgres:4971:0FATAL:  the database system is starting up
2017-09-01 13:14:49.855 GMT: :4875:0LOG:  restored log file "0000000D000005F700000093" from archive
2017-09-01 13:14:49.878 GMT: :4875:0FATAL:  could not access status of transaction 168395881
2017-09-01 13:14:49.878 GMT: :4875:0DETAIL:  Could not read from file "pg_clog/00A0" at offset 155648: Success.
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  startup process (PID 4875) exited with exit code 1
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  aborting startup due to startup process failure
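As a sanity check, the segment name and offset in the FATAL lines do line up with that transaction ID, assuming the stock PostgreSQL CLOG layout (2 status bits per transaction, 8 kB pages, 32 pages per 256 kB segment) – i.e. the standby really was missing or unable to read that transaction's commit-status page, not misreading a path:

```shell
# Map the failing transaction ID to its pg_clog segment and page offset.
xid=168395881
xids_per_page=$((8192 * 4))              # 4 xids per byte -> 32768 per 8 kB page
xids_per_seg=$((32 * xids_per_page))     # 32 pages -> 1048576 xids per segment
printf 'segment %04X\n' $((xid / xids_per_seg))                       # 00A0
echo "page offset $(( xid % xids_per_seg / xids_per_page * 8192 ))"   # 155648
```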

Shutting the standby down, starting it in standalone mode and then stopping it again made a bit more progress – some cleanup activity ran before startup – but it still failed with this:

2017-09-01 13:28:08.503 GMT: :21253:0LOG:  database system was shut down at 2017-09-01 13:26:27 GMT
2017-09-01 13:28:08.649 GMT: :21253:0LOG:  entering standby mode
2017-09-01 13:28:08.799 GMT: :21253:0WARNING:  WAL was generated with wal_level=minimal, data may be missing
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
2017-09-01 13:28:08.799 GMT: :21253:0FATAL:  hot standby is not possible because wal_level was not set to "hot_standby" on the master server
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  startup process (PID 21253) exited with exit code 1
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  aborting startup due to startup process failure
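Before resorting to a full resync, it can be worth confirming what the master's embedded Postgres actually has configured. A sketch – the postgresql.conf path is assumed from the pg_log directory above, and NNMi normally manages this file itself, so treat it as read-only:

```shell
# Check the replication-related settings on the master's embedded Postgres
PGDATA=/var/opt/OV/shared/nnm/databases/Postgresql
grep -En '^[[:space:]]*(wal_level|hot_standby)' "$PGDATA/postgresql.conf"
```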

In the end, what worked was the following:

1) On the Active member, within nnmcluster, trigger a manual DB backup with dbsync
2) On the Standby member, remove the /var/opt/OV/shared/nnm/node-cluster.sentinel file – this may or may not have helped
3) On the Standby member, restart nnmcluster (nnmcluster -daemon)
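The steps above as shell commands – a sketch, assuming the nnmcluster command-line options (-dbsync, -shutdown, -daemon) behave as documented for NNMi 9.2x and the sentinel path from this Linux installation:

```shell
# 1) On the ACTIVE member: trigger a manual DB backup/sync to the standby
#    (equivalent to typing "dbsync" at the nnmcluster console)
nnmcluster -dbsync

# 2) On the STANDBY member: stop the local cluster process, then remove the
#    sentinel file (this may or may not have been necessary)
nnmcluster -shutdown
rm -f /var/opt/OV/shared/nnm/node-cluster.sentinel

# 3) On the STANDBY member: restart nnmcluster in daemon mode
nnmcluster -daemon
```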

In summary, it looks like there was some corruption or other stale data in the initial backup, because the errors I saw on the standby weren’t present on the primary. After the dbsync and restart, nnmcluster -display on the active member showed the standby back in STANDBY_READY:

  Local?    NodeType  State                     OvStatus     Hostname/Address
  ------    --------  -----                     --------     ----------------------------
* LOCAL     DAEMON    ACTIVE_NNM_RUNNING        RUNNING      servera/servera-22547
  (SELF)    ADMIN     n/a                       n/a          servera/servera-64198
  REMOTE    DAEMON    STANDBY_READY             DB_RUNNING   serverb/serverb-40214