Dead SYSIOC card in SRX1400

I had a strange issue where one of the members of an SRX cluster dropped out unexpectedly. No changes made and nothing was touched physically.

When looking on the console, the cluster status was primary but none of the physical interfaces existed, control links were down and the fxp was down too – so basically zero network connectivity.

adminuser@JCLFWL02> show chassis cluster status
Cluster ID: 1
Node                  Priority          Status    Preempt  Manual failover

Redundancy group: 0 , Failover count: 1
    node0                   0           lost           n/a      n/a
    node1                   100         primary        no       no

Redundancy group: 1 , Failover count: 1
    node0                   0           lost           n/a      n/a
    node1                   0           primary        no       no

The logs unearthed some nasty looking messages starting with this:

Mar 15 08:31:54  JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: kld_map_v: 0xffffffff8c000000, kld_map_p: 0xc000000
Mar 15 08:31:54  JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: dog: ERROR - reset of uninitialized watchdog
Mar 15 08:31:54  JCLFWL02 (FPC Slot 1, PIC Slot 0) SPC1_PIC0 kernel: Copyright (c) 1996-2014, Juniper Networks, Inc.

show chassis hardware output indicated that FPCs were there but no PICs detected! Serial numbers removed for confidentiality.

adminuser@JCLFWL02> show chassis hardware
node1:
--------------------------------------------------------------------------
Hardware inventory:
Item             Version  Part number  Serial number     Description
Chassis                                XXXXXXXXXXXX      SRX 1400
Midplane         REV 03   711-031012   XXXXXXXX          SRX1k Backplane
PEM 0            rev 03   740-032015   XXXXXXXXXXXXX     AC Power Supply
CB 0             REV 13   750-032544   XXXXXXXX          SRX1K-RE-12-10
  Routing Engine          BUILTIN      BUILTIN           Routing Engine
  CPP                     BUILTIN      BUILTIN           Central PFE Processor
  Mezz           REV 09   710-021035   XXXXXXXX          SRX HD Mezzanine Card
FPC 0            REV 17   750-032536   XXXXXXXX          SRX1k 1GE SYSIO
FPC 1            REV 12   750-032543   XXXXXXXX          SRX1k Dual Wide NPC+SPC Support Card
FPC 3            REV 19   710-017865   XXXXXXXX          BUILTIN NPC
Fan Tray         -N/A-    -N/A-        -N/A-             SRX 1400 Fan Tray

show chassis fpc pic-status indicated much the same.

adminuser@JCLFWL02> show chassis fpc pic-status
node1:
--------------------------------------------------------------------------
Slot 0   Offline      SRX1k 1GE SYSIO
Slot 1   Offline      SRX1k Dual Wide NPC+SPC Support Card
Slot 3   Offline      BUILTIN NPC

The short story is that we tried power-off, re-seating the SYSIOC and this brought everything back for a few hours before everything died again.

Replacement of the SYSIOC was required – not an issue as the config is stored on the RE, although Juniper do have a caveat article mentioning that the control links may not come back up once this is done. They recommend to reapply the cluster node member config

Juniper KB Article Here

eg: (change cluster ID and node number as appropriate)

set chassis cluster cluster-id 1 node 1 reboot

As an additional note, it seems this card takes care of all sorts of internal communications causing some odd alarms!

adminuser@JCLFWL02> show chassis alarms
node1:
--------------------------------------------------------------------------
9 alarms currently active
Alarm time               Class  Description
2016-03-15 14:18:22 UTC  Major  FPC 3 misconfig
2016-03-15 14:18:22 UTC  Major  FPC 1 misconfig
2016-03-15 14:18:22 UTC  Major  FPC 0 misconfig
2016-03-15 14:06:32 UTC  Major  Fan Tray Failure
2016-03-15 14:06:22 UTC  Major  Muliple FANs Stuck
2016-03-15 14:06:11 UTC  Major  FPC 3 offline due to CPP disconnect
2016-03-15 14:06:11 UTC  Major  FPC 1 offline due to CPP disconnect
2016-03-15 14:06:11 UTC  Major  FPC 0 offline due to CPP disconnect
2016-03-15 14:06:07 UTC  Major  Host 0 fxp0 : Ethernet Link Down

adminuser@JCLFWL02> show chassis environment
node1:
--------------------------------------------------------------------------
Class Item                           Status     Measurement
Temp  PEM 0                          Absent
      PEM 1                          Absent
      Routing Engine 0               OK
      Routing Engine 1               Absent
      CB 0 Intake                    OK         32 degrees C / 89 degrees F
      CB 0 Exhaust A                 OK         37 degrees C / 98 degrees F
      CB 0 Mezz                      OK         34 degrees C / 93 degrees F
      FPC 0 Intake                   OK         32 degrees C / 89 degrees F
      FPC 0 Exhaust A                OK         31 degrees C / 87 degrees F
      FPC 1 Intake                   OK         28 degrees C / 82 degrees F
      FPC 1 Exhaust A                OK         28 degrees C / 82 degrees F
      FPC 1 XLR                      Testing
      FPC 3 Intake                   OK         28 degrees C / 82 degrees F
      FPC 3 Exhaust A                OK         29 degrees C / 84 degrees F
Fans  Fan 1                          Check
      Fan 2                          Check

Another side note – I wondered why one control link had been installed as fibre and another had been installed as copper, and came across this from Juniper:

NOTE: When you use ge-0/0/11 as a control port, you must use a fiber SFP transceiver, but you can use copper or fiber SFP transceiver on ge-0/0/10.

Splunk Field Extractions for Juniper SRX

The first two were found elsewhere on the web but I noticed there was no deny event extraction so made my own.

To make your SRX send syslogs, the following example can be modified. You might find it easier to use local facilities to split out your logs by type using syslog-ng. Be sure to monitor performance as you enable logging – lots of logging on a extremely busy firewall may generate a fair bit of extra CPU overhead.

SRX Config

    syslog {
         host 192.168.1.100 {
            any any;
            match RT_FLOW_SESSION;
            facility-override local5;
            source-address 10.0.1.254;
        }
    }

Set your desired policies to log… eg:

edit security policies from-zone trust to-zone untrust
set policy web-traffic-outbound then log session-init session-close
set policy default-drop-trust-untrust then log session-init session-close

Splunk Extractions

Create events

RT_FLOW_SESSION_CREATE:\ssession\screated\s(?P<srx_src_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_src_port>\d+)\D+(?P<srx_dst_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_dst_port>\d+)\s(?P<srx_svc_name>\S+)\s(?P<srx_nat_src_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_nat_src_port>\d+)\D+(?P<srx_nat_dst_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_nat_dst_port>\d+)\s(?P<srx_src_nat_rule_name>\S+)\s(?P<srx_dst_nat_rule_name>\S+)\s(?P<srx_protocol_id>\d+)\s(?P<srx_policy_name>\S+)\s(?P<srx_src_zone>\S+)\s(?P<srx_dst_zone>\S+)\s(?P<srx_sess_id>\d+) 

Close events

RT_FLOW_SESSION_CLOSE:\ssession\sclosed\s(?P<srx_closed_reason>[^:]+)\D+(?P<srx_src_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_src_port>\d+)\D+(?P<srx_dst_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_dst_port>\d+)\s(?P<srx_svc_name>\S+)\s(?P<srx_nat_src_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_nat_src_port>\d+)\D+(?P<srx_nat_dst_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_nat_dst_port>\d+)\s(?P<srx_src_nat_rule_name>\S+)\s(?P<srx_dst_nat_rule_name>\S+)\s(?P<srx_protocol_id>\d+)\s(?P<srx_policy_name>\S+)\s(?P<srx_src_zone>\S+)\s(?P<srx_dst_zone>\S+)\s(?P<srx_sess_id>\d+)\s(?P<srx_pkts_from_client>\d+)\((?P<srx_bytes_from_client>\d+)\)\s(?P<srx_pkts_from_server>\d+)\((?P<srx_bytes_from_server>\d+)\)\s(?P<srx_sess_elapsed_time>\d+) 

Deny Events

RT_FLOW_SESSION_DENY:\ssession\sdenied\s(?P<srx_src_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_src_port>\d+)\D+(?P<srx_dst_ip>\d+\.\d+\.\d+\.\d+)\/(?P<srx_dst_port>\d+)\s(?P<srx_svc_name>\S+)\s(?P<srx_protocol_id>\d+)\((?P<srx_icmp_type>\d+)\)\s(?P<srx_policy_name>\S+)\s(?P<srx_src_zone>\S+)\s(?P<srx_dst_zone>\S+) 

Policy Action Field (common)

RT_FLOW:\s\S+:\ssession\s(?P<srx_policy_action>\S+) 

Next personal project: Write a decent Splunk app for SRX!