Dead SYSIOC card in SRX1400

I had a strange issue where one of the members of an SRX cluster dropped out unexpectedly. No changes made and nothing was touched physically.

When looking on the console, the cluster status was primary but none of the physical interfaces existed, control links were down and the fxp was down too – so basically zero network connectivity.

adminuser@JCLFWL02> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   0           lost           n/a      n/a
    node1                   100         primary        no       no

Redundancy group: 1 , Failover count: 1
    node0                   0           lost           n/a      n/a
    node1                   0           primary        no       no

The logs unearthed some nasty looking messages starting with this:

Mar 15 08:31:54  JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: kld_map_v: 0xffffffff8c000000, kld_map_p: 0xc000000
Mar 15 08:31:54  JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: dog: ERROR - reset of uninitialized watchdog
Mar 15 08:31:54  JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: Copyright (c) 1996-2014, Juniper Networks, Inc.

show chassis hardware output indicated that FPCs were there but no PICs detected! Serial numbers removed for confidentiality.

adminuser@JCLFWL02> show chassis hardware
node1:
--------------------------------------------------------------------------
Hardware inventory:
Item             Version  Part number  Serial number     Description
Chassis                                XXXXXXXXXXXX      SRX 1400
Midplane         REV 03   711-031012   XXXXXXXX          SRX1k Backplane
PEM 0            rev 03   740-032015   XXXXXXXXXXXXX     AC Power Supply
CB 0             REV 13   750-032544   XXXXXXXX          SRX1K-RE-12-10
  Routing Engine          BUILTIN      BUILTIN           Routing Engine
  CPP                     BUILTIN      BUILTIN           Central PFE Processor
  Mezz           REV 09   710-021035   XXXXXXXX          SRX HD Mezzanine Card
FPC 0            REV 17   750-032536   XXXXXXXX          SRX1k 1GE SYSIO
FPC 1            REV 12   750-032543   XXXXXXXX          SRX1k Dual Wide NPC+SPC Support Card
FPC 3            REV 19   710-017865   XXXXXXXX          BUILTIN NPC
Fan Tray         -N/A-    -N/A-        -N/A-             SRX 1400 Fan Tray

show chassis fpc pic-status indicated much the same.

adminuser@JCLFWL02> show chassis fpc pic-status
node1:
--------------------------------------------------------------------------
Slot 0   Offline      SRX1k 1GE SYSIO
Slot 1   Offline      SRX1k Dual Wide NPC+SPC Support Card
Slot 3   Offline      BUILTIN NPC

The short story is that we tried power-off, re-seating the SYSIOC and this brought everything back for a few hours before everything died again.

Replacement of the SYSIOC was required – not an issue as the config is stored on the RE, although Juniper do have a caveat article mentioning that the control links may not come back up once this is done. They recommend to reapply the cluster node member config

Juniper KB Article Here

eg: (change cluster ID and node number as appropriate)

set chassis cluster cluster-id 1 node 1 reboot

As an additional note, it seems this card takes care of all sorts of internal communications causing some odd alarms!

adminuser@JCLFWL02> show chassis alarms
node1:
--------------------------------------------------------------------------
9 alarms currently active
Alarm time               Class  Description
2016-03-15 14:18:22 UTC  Major  FPC 3 misconfig
2016-03-15 14:18:22 UTC  Major  FPC 1 misconfig
2016-03-15 14:18:22 UTC  Major  FPC 0 misconfig
2016-03-15 14:06:32 UTC  Major  Fan Tray Failure
2016-03-15 14:06:22 UTC  Major  Muliple FANs Stuck
2016-03-15 14:06:11 UTC  Major  FPC 3 offline due to CPP disconnect
2016-03-15 14:06:11 UTC  Major  FPC 1 offline due to CPP disconnect
2016-03-15 14:06:11 UTC  Major  FPC 0 offline due to CPP disconnect
2016-03-15 14:06:07 UTC  Major  Host 0 fxp0 : Ethernet Link Down

adminuser@JCLFWL02> show chassis environment
node1:
--------------------------------------------------------------------------
Class Item                           Status     Measurement
Temp  PEM 0                          Absent
      PEM 1                          Absent
      Routing Engine 0               OK
      Routing Engine 1               Absent
      CB 0 Intake                    OK         32 degrees C / 89 degrees F
      CB 0 Exhaust A                 OK         37 degrees C / 98 degrees F
      CB 0 Mezz                      OK         34 degrees C / 93 degrees F
      FPC 0 Intake                   OK         32 degrees C / 89 degrees F
      FPC 0 Exhaust A                OK         31 degrees C / 87 degrees F
      FPC 1 Intake                   OK         28 degrees C / 82 degrees F
      FPC 1 Exhaust A                OK         28 degrees C / 82 degrees F
      FPC 1 XLR                      Testing
      FPC 3 Intake                   OK         28 degrees C / 82 degrees F
      FPC 3 Exhaust A                OK         29 degrees C / 84 degrees F
Fans  Fan 1                          Check
      Fan 2                          Check

Another side note – I wondered why one control link had been installed as fibre and another had been installed as copper, and came across this from Juniper:

NOTE: When you use ge-0/0/11 as a control port, you must use a fiber SFP transceiver, but you can use copper or fiber SFP transceiver on ge-0/0/10.