I had a strange issue where one member of an SRX cluster dropped out unexpectedly. No changes had been made and nothing had been touched physically.
On the console of the affected node, the cluster status showed it as primary, but none of the physical interfaces existed, the control links were down and fxp0 was down too – so basically zero network connectivity.
adminuser@JCLFWL02> show chassis cluster status
Cluster ID: 1
Node   Priority Status         Preempt Manual failover

Redundancy group: 0 , Failover count: 1
node0  0        lost           n/a     n/a
node1  100      primary        no      no

Redundancy group: 1 , Failover count: 1
node0  0        lost           n/a     n/a
node1  0        primary        no      no
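If you collect this output regularly, a "lost" peer is easy to spot programmatically. The following is only a rough Python sketch that works on the text as captured above; the function names are my own and it assumes the column layout shown here, which may differ between Junos versions.

```python
import re

# Output of "show chassis cluster status" as captured in this post.
SAMPLE = """\
Cluster ID: 1
Node   Priority Status         Preempt Manual failover

Redundancy group: 0 , Failover count: 1
node0  0        lost           n/a     n/a
node1  100      primary        no      no

Redundancy group: 1 , Failover count: 1
node0  0        lost           n/a     n/a
node1  0        primary        no      no
"""


def parse_cluster_status(text):
    """Return {redundancy_group: {node: (priority, status)}}."""
    groups = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"Redundancy group:\s*(\d+)", line)
        if m:
            current = int(m.group(1))
            groups[current] = {}
            continue
        m = re.match(r"(node\d+)\s+(\d+)\s+(\S+)", line)
        if m and current is not None:
            groups[current][m.group(1)] = (int(m.group(2)), m.group(3))
    return groups


def nodes_with_status(groups, status):
    """List (redundancy_group, node) pairs whose status equals `status`."""
    return [
        (rg, node)
        for rg, nodes in groups.items()
        for node, (_priority, st) in nodes.items()
        if st == status
    ]


if __name__ == "__main__":
    for rg, node in nodes_with_status(parse_cluster_status(SAMPLE), "lost"):
        print(f"RG{rg}: {node} is lost")
```

Run against the capture above, this reports node0 as lost in both redundancy groups.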
The logs unearthed some nasty-looking messages, starting with this:
Mar 15 08:31:54 JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: kld_map_v: 0xffffffff8c000000, kld_map_p: 0xc000000
Mar 15 08:31:54 JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: dog: ERROR - reset of uninitialized watchdog
Mar 15 08:31:54 JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: Copyright (c) 1996-2014, Juniper Networks, Inc.
The show chassis hardware output indicated that the FPCs were present but no PICs were detected! Serial numbers have been removed for confidentiality.
adminuser@JCLFWL02> show chassis hardware
node1:
--------------------------------------------------------------------------
Hardware inventory:
Item             Version  Part number  Serial number     Description
Chassis                                XXXXXXXXXXXX      SRX 1400
Midplane         REV 03   711-031012   XXXXXXXX          SRX1k Backplane
PEM 0            rev 03   740-032015   XXXXXXXXXXXXX     AC Power Supply
CB 0             REV 13   750-032544   XXXXXXXX          SRX1K-RE-12-10
  Routing Engine          BUILTIN      BUILTIN           Routing Engine
  CPP                     BUILTIN      BUILTIN           Central PFE Processor
  Mezz           REV 09   710-021035   XXXXXXXX          SRX HD Mezzanine Card
FPC 0            REV 17   750-032536   XXXXXXXX          SRX1k 1GE SYSIO
FPC 1            REV 12   750-032543   XXXXXXXX          SRX1k Dual Wide NPC+SPC Support Card
FPC 3            REV 19   710-017865   XXXXXXXX          BUILTIN NPC
Fan Tray         -N/A-    -N/A-        -N/A-             SRX 1400 Fan Tray
show chassis fpc pic-status indicated much the same.
adminuser@JCLFWL02> show chassis fpc pic-status
node1:
--------------------------------------------------------------------------
Slot 0   Offline      SRX1k 1GE SYSIO
Slot 1   Offline      SRX1k Dual Wide NPC+SPC Support Card
Slot 3   Offline      BUILTIN NPC
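The same check can be scripted. This is a small, hedged Python sketch that picks out any slot not reported as Online from the pic-status text above; the function name and regular expression are my own and assume the "Slot N  State  Description" layout seen in this capture.

```python
import re

# Output of "show chassis fpc pic-status" as captured in this post.
SAMPLE = """\
node1:
--------------------------------------------------------------------------
Slot 0   Offline      SRX1k 1GE SYSIO
Slot 1   Offline      SRX1k Dual Wide NPC+SPC Support Card
Slot 3   Offline      BUILTIN NPC
"""

SLOT_RE = re.compile(r"^Slot\s+(\d+)\s+(\S+)\s+(.*)$")


def offline_slots(text):
    """Return (slot, state, description) for every slot that is not Online."""
    result = []
    for line in text.splitlines():
        m = SLOT_RE.match(line.strip())
        if m and m.group(2).lower() != "online":
            result.append((int(m.group(1)), m.group(2), m.group(3).strip()))
    return result


if __name__ == "__main__":
    for slot, state, desc in offline_slots(SAMPLE):
        print(f"Slot {slot} is {state}: {desc}")
```

On a healthy node the list would be empty; here all three slots come back.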
The short story is that we tried powering off and re-seating the SYSIOC, which brought everything back for a few hours before everything died again.
Replacement of the SYSIOC was required – not an issue as the config is stored on the RE, although Juniper do have a caveat article mentioning that the control links may not come back up once this is done. They recommend reapplying the cluster node member config,
e.g. (change the cluster ID and node number as appropriate):
set chassis cluster cluster-id 1 node 1 reboot
As an additional note, it seems this card takes care of all sorts of internal communications, so its failure caused some odd alarms!
adminuser@JCLFWL02> show chassis alarms
node1:
--------------------------------------------------------------------------
9 alarms currently active
Alarm time               Class  Description
2016-03-15 14:18:22 UTC  Major  FPC 3 misconfig
2016-03-15 14:18:22 UTC  Major  FPC 1 misconfig
2016-03-15 14:18:22 UTC  Major  FPC 0 misconfig
2016-03-15 14:06:32 UTC  Major  Fan Tray Failure
2016-03-15 14:06:22 UTC  Major  Muliple FANs Stuck
2016-03-15 14:06:11 UTC  Major  FPC 3 offline due to CPP disconnect
2016-03-15 14:06:11 UTC  Major  FPC 1 offline due to CPP disconnect
2016-03-15 14:06:11 UTC  Major  FPC 0 offline due to CPP disconnect
2016-03-15 14:06:07 UTC  Major  Host 0 fxp0 : Ethernet Link Down

adminuser@JCLFWL02> show chassis environment
node1:
--------------------------------------------------------------------------
Class Item                 Status     Measurement
Temp  PEM 0                Absent
      PEM 1                Absent
      Routing Engine 0     OK
      Routing Engine 1     Absent
      CB 0 Intake          OK         32 degrees C / 89 degrees F
      CB 0 Exhaust A       OK         37 degrees C / 98 degrees F
      CB 0 Mezz            OK         34 degrees C / 93 degrees F
      FPC 0 Intake         OK         32 degrees C / 89 degrees F
      FPC 0 Exhaust A      OK         31 degrees C / 87 degrees F
      FPC 1 Intake         OK         28 degrees C / 82 degrees F
      FPC 1 Exhaust A      OK         28 degrees C / 82 degrees F
      FPC 1 XLR            Testing
      FPC 3 Intake         OK         28 degrees C / 82 degrees F
      FPC 3 Exhaust A      OK         29 degrees C / 84 degrees F
Fans  Fan 1                Check
      Fan 2                Check
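Junos lists alarms newest first, so reading them bottom-up gives the order in which things fell over. As a rough illustration, this Python sketch sorts the alarm lines from the capture above into a timeline; the parser and its names are my own and assume the "timestamp UTC, class, description" layout shown here.

```python
import re
from datetime import datetime

# Alarm lines from "show chassis alarms" as captured in this post.
SAMPLE = """\
2016-03-15 14:18:22 UTC  Major  FPC 3 misconfig
2016-03-15 14:18:22 UTC  Major  FPC 1 misconfig
2016-03-15 14:18:22 UTC  Major  FPC 0 misconfig
2016-03-15 14:06:32 UTC  Major  Fan Tray Failure
2016-03-15 14:06:22 UTC  Major  Muliple FANs Stuck
2016-03-15 14:06:11 UTC  Major  FPC 3 offline due to CPP disconnect
2016-03-15 14:06:11 UTC  Major  FPC 1 offline due to CPP disconnect
2016-03-15 14:06:11 UTC  Major  FPC 0 offline due to CPP disconnect
2016-03-15 14:06:07 UTC  Major  Host 0 fxp0 : Ethernet Link Down
"""

ALARM_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\s+(\S+)\s+(.*)$"
)


def alarm_timeline(text):
    """Return (timestamp, class, description) tuples, oldest first."""
    alarms = []
    for line in text.splitlines():
        m = ALARM_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            alarms.append((ts, m.group(2), m.group(3).strip()))
    # sorted() is stable, so alarms sharing a timestamp keep their input order.
    return sorted(alarms, key=lambda a: a[0])


if __name__ == "__main__":
    for ts, cls, desc in alarm_timeline(SAMPLE):
        print(ts.strftime("%H:%M:%S"), cls, desc)
```

Sorted this way, the fxp0 link going down comes first, followed by the FPCs going offline due to CPP disconnect, then the fan alarms, and finally the misconfig alarms.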
Another side note – I wondered why one control link had been installed as fibre and another had been installed as copper, and came across this from Juniper:
NOTE: When you use ge-0/0/11 as a control port, you must use a fiber SFP transceiver, but you can use copper or fiber SFP transceiver on ge-0/0/10.
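That rule is simple enough to encode as a pre-install sanity check. The Python sketch below is purely illustrative: the table and function name are my own, and it only captures the two control ports mentioned in the note above.

```python
# Allowed SFP media per control port, taken from the Juniper note quoted above.
ALLOWED_MEDIA = {
    "ge-0/0/10": {"copper", "fiber"},
    "ge-0/0/11": {"fiber"},
}


def control_port_ok(port, media):
    """Return True if `media` ("copper" or "fiber") is allowed on `port`."""
    allowed = ALLOWED_MEDIA.get(port)
    if allowed is None:
        raise ValueError(f"{port} is not a control port covered by this check")
    return media.lower() in allowed


if __name__ == "__main__":
    for port, media in [("ge-0/0/10", "copper"), ("ge-0/0/11", "copper")]:
        verdict = "ok" if control_port_ok(port, media) else "not allowed"
        print(f"{port} with {media} SFP: {verdict}")
```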