Packet loss across Nexus 7Ks – Faulty Module

Due to the load balancing algorithms involved in Nexus VPC/Port-channel setups, reports of packet loss or poor performance can be difficult to nail down. One customer may report issues, while others on the same kit seem to be absolutely fine. In one case I experienced, the issue turned out to be a faulty module running links down to the 5Ks; it had passed its self-test on boot but had started erroring.

If you’re not logging at debug level to your messages file/syslog server, it’s possible to miss the following error:

Event:E_DEBUG, length:126, at 954366 usecs after Thu Mar  6 09:15:00 2014
    [103] module 7 reported warning on ports 7/3-7/3 due to Metro FCS/CRC/ECC interrupt in  device 79  device error code 0xc4f03201

You will be able to check these with:

LON-N7K-C-A# show module internal event-history errors

You can verify whether the errors are in fact increasing by running the following command twice and comparing the counters:

LON-N7K-C-A# attach module 7
module-7# show hardware internal statistics device mac errors port 1 | i Egress
3631 Egress Port-3 VSL Dropped Packet Count        0000000358199089   1 -
4610 Egress Source Forwarding Check Drop           0000000063023737   1,3,5,7 -
4611 Egress Bundle Check Drop                      0000000062298576   1,3,5,7 -
4612 Egress LTL/CSH Capture Drop                   0000000000000001   1,3,5,7 -
4613 Egress CBL Drop                               0000000000000020   1,3,5,7 -
module-7#  sh hardware internal statistics device mac errors port 1 | i Egress
3631 Egress Port-3 VSL Dropped Packet Count        0000000358199107   1 -
4610 Egress Source Forwarding Check Drop           0000000063023739   1,3,5,7 -
4611 Egress Bundle Check Drop                      0000000062298578   1,3,5,7 -
4612 Egress LTL/CSH Capture Drop                   0000000000000001   1,3,5,7 -
4613 Egress CBL Drop                               0000000000000020   1,3,5,7 -

Confirmation like this means that the module will need swapping out. In this case, it was an N7K-M132XP-12.
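If you want more evidence before raising an RMA, the module's online (GOLD) diagnostic results are worth reviewing too, since a pass at bootup doesn't guarantee the hardware is still healthy:

```
LON-N7K-C-A# show diagnostic result module 7
```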

Is it safe to offline/online an RPS 675?

Short answer: No, not during business hours unless you’re feeling brave.

Long answer: At least with Cisco 3750s, these devices have in my experience been notorious for causing the attached switch(es) to reboot when failed over, despite mains power being fine. They (along with the Cisco 3750s themselves) have also had a high failure rate, so the sooner you can get rid of them and move to hardware with dual PSUs, the better.

Seeing as an RPS can only power one switch at a time, it doesn't make much sense to be cheap and use one as backup for multiple switches.

There is also another major issue with using RPSs: If the switch is running on the RPS DC output, you most likely won’t know if the mains power supply in the switch is OK until you offline the RPS. Not fun if you decide to try this during the day and the switch PSU has blown.

Cisco SVI equivalent config on Juniper MX Series

Scenario: how to configure a vlan with a routed SVI on the Juniper MX series with VRRP, trunked to a second MX, with a single access port (e.g. a server) in that vlan, and with the network advertised by OSPF (set as passive). VSTP is used here (the equivalent of rapid-pvst in Cisco land).

Set up the Vlan

bridge-domains {
    TestVlan {
        vlan-id 100;
        routing-interface irb.100;
    }
}

Configure the access port, the trunk to the other switch, and the irb (SVI) interface. VRRP is configured as the equivalent of a Cisco HSRP setup.

interfaces {
    ge-4/0/5 {
        description "Test Vlan 100 Server";
        unit 0 {
            family bridge {
                interface-mode access;
                vlan-id 100;
            }
        }
    }

    xe-5/2/0 {
        description "To MX2 Switch xe-5/2/0 Layer2 Link";
        mtu 2000;
        unit 0 {
            family bridge {
                interface-mode trunk;
                vlan-id-list [ 100 ];
            }
        }
    }

    irb {
        unit 100 {
            description "Test Vlan 100 SVI";
            family inet {
                mtu 1500;
                address 192.168.1.2/24 {
                    vrrp-group 100 {
                        virtual-address 192.168.1.1;
                        priority 120;
                        authentication-type simple;
                        authentication-key test100;
                        preempt {
                            hold-time 10;
                        }
                        accept-data;
                    }
                }
            }
        }
    }
}

NB: If you want to do interface tracking on the vrrp group, an example is given below. This is applied within the vrrp-group stanza. Note that priority-cost is the equivalent of the Cisco decrement command, so 21 here would reduce 120 to 99, causing the secondary (priority 100) to become the master.

Tip: To load the tracking config in easily:

me@mx1> edit
me@mx1# edit interfaces irb unit 100 family inet address 192.168.1.2/24 vrrp-group 100
me@mx1# load merge terminal relative

Interface tracking config:

track {
    interface xe-5/3/0 {
        priority-cost 21;
    }
    priority-hold-time 10;
}
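Putting the tip above together with this stanza, the paste looks something like the following (finish with Ctrl-D on its own line, after which Junos reports the result):

```
[edit interfaces irb unit 100 family inet address 192.168.1.2/24 vrrp-group 100]
me@mx1# load merge terminal relative
[Type ^D at a new line to end input]
track {
    interface xe-5/3/0 {
        priority-cost 21;
    }
    priority-hold-time 10;
}
^D
load complete
```

It's worth a `show | compare` before committing to confirm the stanza landed where you expected.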

NB: Route tracking on the vrrp group works the same way and is also applied within the vrrp-group stanza; again, a priority-cost of 21 would reduce 120 to 99, handing mastership to the secondary (priority 100).

Route tracking config:

track {
    route 10.0.0.0/24 routing-instance default priority-cost 21;
}

Set as OSPF passive so we advertise but won’t form adjacencies over the irb interface.

protocols {
    ospf {
        area 0.0.0.0 {
            interface irb.100 {
                passive;
            }
        }
    }
}

Set up spanning tree – we are using VSTP (equivalent of Rapid-PVST – maximum of 256 vlans supported). We block any edge ports that we receive BPDUs on, and set this switch as the root bridge (priority 0). “edge” is the equivalent of “spanning-tree portfast” here. “no-root-port” does what it says on the tin.

protocols {
    vstp {
        bpdu-block-on-edge;
        vlan 100 {
            bridge-priority 0;
            interface ge-4/0/5 {
                edge;
                no-root-port;
            }
            interface xe-5/2/0;
        }
    }
}

Equivalent configuration needs to be done on the second MX with the following changes:

irb interface: address 192.168.1.3/24
irb interface vrrp: priority lower than 120 (eg: 100)
vstp vlan config: bridge-priority 4096
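As a sketch, the deltas on the second MX boil down to something like this (assuming the same interface numbering; note that Junos expresses bridge priority in "k" notation, so 4096 is entered as 4k):

```
interfaces {
    irb {
        unit 100 {
            family inet {
                address 192.168.1.3/24 {
                    vrrp-group 100 {
                        virtual-address 192.168.1.1;
                        priority 100;
                        authentication-type simple;
                        authentication-key test100;
                        accept-data;
                    }
                }
            }
        }
    }
}
protocols {
    vstp {
        vlan 100 {
            bridge-priority 4k;
        }
    }
}
```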

Gotcha: Bear in mind that if you are trunking to Cisco kit, the default interface MTU on the Juniper (1514 bytes) leaves no room for the 4-byte 802.1Q tag, so tagged frames won't pass. Any trunks to Cisco kit can be configured as follows, or you can enable flexible-vlan-tagging:

Example 1:

ge-4/1/9 {
    description " To Cisco SW01 Gi0/1 Trunk ";
    mtu 1522;
    unit 0 {
        family bridge {
            interface-mode trunk;
            vlan-id-list [ 100 101 102 103 ];
        }
    }
}

Example 2:

ge-4/1/9 {
    description " To Cisco SW01 Gi0/1 Trunk ";
    flexible-vlan-tagging;
    unit 0 {
        family bridge {
            interface-mode trunk;
            vlan-id-list [ 100 101 102 103 ];
        }
    }
}

You can see here that flexible-vlan-tagging changes the MTU for you (to 1522, leaving room for two 4-byte tags on top of the 1514-byte default):

Physical interface: ge-4/1/9, Enabled, Physical link is Up
  Interface index: 187, SNMP ifIndex: 545
  Description:  To Cisco SW01 Gi0/1 Trunk 
  Link-level type: Ethernet, MTU: 1522, Speed: 1000mbps, BPDU Error: None, MAC-REWRITE Error: None, Loopback: Disabled,

Gotcha: The MTU on the trunk between the switches also needs to be raised to allow for the vlan tagging overhead (at least 4 bytes per 802.1Q tag) if it isn't already. A large MTU may well be desirable between the two anyway (support for jumbo frames etc.).
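For instance, raising the inter-switch trunk to a jumbo MTU might look like the below (9192 is an arbitrary value for illustration; pick whatever your hardware and the far end support, and match it on both sides):

```
xe-5/2/0 {
    description "To MX2 Switch xe-5/2/0 Layer2 Link";
    mtu 9192;
    unit 0 {
        family bridge {
            interface-mode trunk;
            vlan-id-list [ 100 ];
        }
    }
}
```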

Useful commands for monitoring/troubleshooting

show interfaces <interface name>
show bridge domain
show bridge mac-table
show spanning-tree bridge
show spanning-tree interface
show spanning-tree statistics interface
show vrrp brief
show ospf interface
show ospf neighbor

Re-enable a BPDU blocked interface:

clear error bpdu interface <interface name>