F5 HA Device Group Woes

HA Device Groups. Great when they work, but I’m not sure why F5 have to make setting this up such a headache. There’s something lacking in the QA department when setting up a device group for HA causes so many problems for so many people.

This can be immensely frustrating and can lead to hair loss, so here’s my checklist to get things working.

Prerequisites:

- NTP must be in sync on both boxes or things won't work (verify with ntpstat from the command line).
- Make backups of the current configs before trying to fix things!
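For what it's worth, here's a rough sketch of both prerequisite checks from the BIG-IP bash prompt. A few assumptions of mine: ntpq -np is shown next to ntpstat as a second way to look at NTP, the archive name pre-ha-fix is just a placeholder, and the little run wrapper (not an F5 tool, just a shell function) only prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/bash
# Prints each command by default; set DRY_RUN=0 to actually run them on the box.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

UCS_NAME="pre-ha-fix"   # placeholder archive name, pick your own

# NTP: ntpstat gives a one-line verdict; in ntpq output, the peer marked
# with '*' is the one the box is currently synced to.
run ntpstat
run ntpq -np

# Backup: save a UCS archive of the current config before touching anything.
run tmsh save sys ucs "$UCS_NAME"
```

Do this on both boxes, and copy the UCS archive somewhere off the box if you can.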

Caveats:

- When things aren't right, the steps below can cause unexpected failovers, so arrange an outage window!

Procedure:

i)   Add HA (Failover) VLAN (both boxes) and assign to interface(s)
ii)  Add HA (Failover) Self IPs (both boxes) - Make sure Port Lockdown is "Allow Default"
iii) Add peer with Device Trust -> Peer List -> Add (both boxes)
iv)  On each box, under Devices (self):
          - Set Device Connectivity -> ConfigSync -> Failover LAN IP
          - Set Device Connectivity -> Network Failover -> Failover LAN IP (and also MGMT if desired)
          - Set Device Connectivity -> Connection Mirroring -> Failover LAN IP (and secondary if desired)
v)   Create a Device group with Device Groups -> Create (eg: FAILOVER-GROUP) (both boxes)
     with type sync-failover (not automatic sync)
vi)  Add peer to device group (Select automatic sync if required)
vii) Sync to peer(s) - select “overwrite config”
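If you'd rather do this from the command line, the GUI steps above map roughly onto the tmsh commands below. Treat this as a sketch from memory rather than gospel: every name, IP, interface and the /24 mask is a placeholder I made up, the run wrapper is just a shell function that prints the commands unless DRY_RUN=0, and it's worth checking each command with tmsh tab completion on your version before running it for real.

```shell
#!/bin/bash
# Prints each command by default; set DRY_RUN=0 to actually run them on the box.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Placeholders for illustration only; substitute your own values.
HA_VLAN="HA";  HA_IF="1.3"
SELF_IP="10.10.10.1"                 # this box's failover LAN IP
PEER_ADDR="10.10.10.2"               # address used to reach the peer for trust
SELF_NAME="bigip1.example.com";  PEER_NAME="bigip2.example.com"
GROUP="FAILOVER-GROUP"
PEER_PASS="${PEER_PASS:-CHANGEME}"   # note: dry-run output will show this

# HA VLAN on an untagged interface
run tmsh create net vlan "$HA_VLAN" interfaces add { "$HA_IF" { untagged } }
# HA self IP; allow-service default is the CLI name for Port Lockdown "Allow Default"
run tmsh create net self "${HA_VLAN}-self" address "${SELF_IP}/24" vlan "$HA_VLAN" allow-service default
# Add the peer to device trust
run tmsh modify cm trust-domain Root ca-devices add { "$PEER_ADDR" } name "$PEER_NAME" username admin password "$PEER_PASS"
# Device connectivity: ConfigSync, Network Failover, Connection Mirroring
run tmsh modify cm device "$SELF_NAME" configsync-ip "$SELF_IP"
run tmsh modify cm device "$SELF_NAME" unicast-address { { ip "$SELF_IP" } }
run tmsh modify cm device "$SELF_NAME" mirror-ip "$SELF_IP"
# Device group of type sync-failover with automatic sync off, then add the peer
run tmsh create cm device-group "$GROUP" devices add { "$SELF_NAME" } type sync-failover auto-sync disabled
run tmsh modify cm device-group "$GROUP" devices add { "$PEER_NAME" }
# Push the full config to the group (the "overwrite config" option)
run tmsh run cm config-sync force-full-load-push to-group "$GROUP"
# Check the result
run tmsh show cm sync-status
```

Run it once with the default DRY_RUN=1 and read the output before committing to anything; the wrapper is there precisely because a typo here can cause a failover.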

If this fails and devices still say disconnected, then:

i)   Reset Device Trust on both boxes 
           - Device Management -> Device Trust -> Reset Device Trust [retain current authority]
ii)  REBOOT the box with full_box_reboot at the command line (obviously needs an outage window if live)
iii) Delete the peers on both boxes, and if necessary, clear out any membership of 
        device groups. Also delete the device group.
iv)  Start again from step iii (adding the peer) in the first section above, verifying each step as you go.
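Before and after the cleanup above, it helps to see what each box currently thinks of the world. A minimal sketch using read-only tmsh commands; as before, the run wrapper is my own shell function and just prints the commands unless you set DRY_RUN=0.

```shell
#!/bin/bash
# Prints each command by default; set DRY_RUN=0 to actually run them on the box.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Overall sync state as this box sees it
run tmsh show cm sync-status
# Active/standby state
run tmsh show cm failover-status
# Who is still in the trust domain, and which device groups still exist
run tmsh list cm trust-domain
run tmsh list cm device-group
```

After deleting the peers and the device group, the last two commands are a quick way to confirm nothing stale is left behind before you start adding the peer again.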

Don’t forget to set up your routes before setting up nodes that aren’t locally attached to your external LAN. ;)

As an aside, the most recent problem I hit was with the secondary, which had gone a bit nuts and threw a weird error message when trying to sync, something along the lines of “couldn’t load node 71….etc”. We decided to reset it to defaults with tmsh load /sys config default before working through the steps above.

F5 KB article for restoring default config:

https://support.f5.com/kb/en-us/solutions/public/13000/100/sol13127.html