TACACS on 4431 Management Interface

Getting TACACS+ working via the Cisco 4431 management interface threw up a few issues and took a few tweaks. The final issue I found was that referencing all servers with the tacacs+ keyword doesn't work; you have to reference the TACACS group with the servers defined within it.
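
For illustration, the difference is in how the method list references the servers. The first form below is the sort of thing that didn't work for me over the Mgmt-intf VRF; the second, referencing the named group (defined further down with server-private entries), did:

! Did not work via the Mgmt-intf VRF - unnamed tacacs+ group
aaa authentication login REMOTE_ACCESS group tacacs+ enable
!
! Works - named group containing the server-private entries
aaa authentication login REMOTE_ACCESS group TACACS enable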

Below is a working configuration example for TACACS+ via the management port in the Mgmt-intf VRF. I've also included a few (non-exhaustive) examples of getting some other things working over the same interface.

! Mgmt interface config
!
interface GigabitEthernet0
 description ** Mgmt intf **
 vrf forwarding Mgmt-intf
 ip address 192.168.0.1 255.255.255.0
 negotiation auto
!
!
! Default route for Management VRF
!
ip route vrf Mgmt-intf 0.0.0.0 0.0.0.0 192.168.0.254
!
!
! Define source interface at global level
!
ip tacacs source-interface GigabitEthernet0
!
! aaa config
!
aaa new-model
!
!
! server-private keeps these servers private to this group (and its VRF).
! The VRF forwarding and source interface need to be configured
! within the aaa group context too.
!
aaa group server tacacs+ TACACS
 server-private 10.0.0.100 key MYKEY
 server-private 10.0.1.100 key MYKEY
 ip vrf forwarding Mgmt-intf
 ip tacacs source-interface GigabitEthernet0
!
! Fall back to the enable password if the TACACS+ servers are unreachable.
!
aaa authentication login REMOTE_ACCESS group TACACS enable
aaa authentication enable default group TACACS enable
aaa accounting exec REMOTE_ACCESS
 action-type stop-only
 group TACACS
!
aaa accounting commands 15 REMOTE_ACCESS
 action-type stop-only
 group TACACS
!
aaa session-id common
!
!
! Apply to vtys and console if you need to.
!
line vty 0 4
 accounting commands 15 REMOTE_ACCESS
 accounting exec REMOTE_ACCESS
 login authentication REMOTE_ACCESS
line vty 5 15
 accounting commands 15 REMOTE_ACCESS
 accounting exec REMOTE_ACCESS
 login authentication REMOTE_ACCESS
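
If you also want the same behaviour on the console, a minimal sketch using the same method lists would be:

line con 0
 accounting commands 15 REMOTE_ACCESS
 accounting exec REMOTE_ACCESS
 login authentication REMOTE_ACCESS

Once it's in, you can sanity-check the server group from exec mode (the username and password are placeholders, and the exact test aaa syntax can vary by release):

test aaa group TACACS [username] [password] legacy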

Syslog

logging source-interface GigabitEthernet0 vrf Mgmt-intf
logging host 10.0.0.101 vrf Mgmt-intf

TFTP (automatic config archive after wr mem)

ip tftp source-interface GigabitEthernet0

archive
 path tftp://10.0.0.101/configs/$h-
 write-memory

SNMP Traps

snmp-server trap-source GigabitEthernet0
snmp-server host 10.0.0.101 vrf Mgmt-intf version 2c MYCOMMUNITY

Recovering a Cisco 3850 that’s stuck in Boot Loop/ROMMON

This wasn't even after an upgrade, just a reload, but the messages below were displayed before the switch unmounted its filesystems and rebooted:

Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
Last heartbeat 0.00 secs ago

PID: 9225
Exit code: signal 11 (no core)

Quick steps:

1) To speed things up (tftp is painfully slow otherwise), set up a local switch that has the correct image on it as a tftp server.

tftp-server flash:[image-name.bin]

2) Configure an interface with an IP to allow for direct connection to the management port of the broken switch.

interface GigabitEthernet1/0/47
 description -= temp for tftp to other switch =-
 no switchport
 ip address 192.168.0.1 255.255.255.0

3) Connect (or ask someone on site to connect) the interface to the management port of the broken switch (duh!)

4) On the broken switch, hold MODE for 10 seconds to interrupt the boot loop, or just wait for five or so failures for it to drop to ROMMON.

5) Set up the IP info. The gateway is only necessary if you're tftping from a server outside the local subnet.

switch: set IP_ADDR 192.168.0.2/255.255.255.0

switch: set DEFAULT_ROUTER 192.168.0.1

6) Check for emergency files (you're looking for cat3k_caa-recovery.bin or similar).

switch: dir sda9:

7) Ping the tftp server

switch: ping 192.168.0.1
ping 192.168.0.1 with 32 bytes of data ...
Up 1000 Mbps Full duplex (port  0) (SGMII)
Host 192.168.0.1 is alive.

8) Start the tftp emergency install. On a local connection this will take 10-20 mins.

switch: emergency-install tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin
The bootflash will be erased during install operation, continue (y/n)?y
Starting emergency recovery (tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin)...
Reading full image into memory......................done
Nova Bundle Image
--------------------------------------
Kernel Address    : 0x6042e5d8
Kernel Size       : 0x31794f/3242319
Initramfs Address : 0x60745f28
Initramfs Size    : 0xdbec9d/14412957
Compression Format: .mzip

Bootable image at @ ram:0x6042e5d8
Bootable image segment 0 address range [0x81100000, 0x81b80000] is in range [0x80180000, 0x90000000].
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
File "sda9:cat3k_caa-recovery.bin" uncompressed and installed, entry point: 0x811060f0
Loading Linux kernel with entry point 0x811060f0 ...
Bootloader: Done loading app on core_mask: 0xf

### Launching Linux Kernel (flags = 0x5)



Initiating Emergency Installation of bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin


Downloading bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...

Validating bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Installing bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Verifying bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Package cat3k_caa-base.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-drivers.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-infra.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-iosd-universalk9.SPA.150-1.EZ3.pkg is Digitally Signed
Package cat3k_caa-platform.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-wcm.SPA.10.1.130.0.pkg is Digitally Signed
Preparing flash...
Syncing device...
Emergency Install successful... Rebooting
Restarting system.

Once this is done it’ll try and boot again. You need to disable manual boot.

The system is not configured to boot automatically.  The
following command will finish loading the operating system
software:

    boot

switch: set MANUAL_BOOT no
switch: boot

Also remember to change the confreg value if it's not 0x102 on the 3850 (it's shown on the last line of show version output). In this case no change was needed.


Configuration register is 0x102

Don’t forget to remove the tftp-server config and temporary stuff. :)
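
For the donor switch, the cleanup is just the reverse of steps 1 and 2, something along these lines:

no tftp-server flash:[image-name.bin]
!
default interface GigabitEthernet1/0/47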

GRE over IPSEC between Juniper and Cisco Router

This caused headaches as it needed slightly different configuration from normal. Not setting ip mtu on the tunnel was the cause of things sort-of-but-not-quite-working: the classic MTU symptom where small packets pass but larger ones get dropped once the GRE and IPsec overhead pushes them over the path MTU.

Normally with Cisco to Cisco over IPSEC we'd add "ip tcp adjust-mss 1392" to the Tunnel interfaces either side.

This is the config that worked in the end.

GRE Juniper router side
=======================

interfaces {
    lo0 {
        unit 0 {
            family inet {
                address 192.168.255.1/32;
            }
        }
    }

    gr-1/1/10 {
        unit 2 {
            clear-dont-fragment-bit;
            description "-= Gre Tunnel to Remote Office =-";
            tunnel {
                source 192.168.255.1;
                destination 192.168.255.2;
            }
            family inet {
                mtu 1400;
                address 10.0.0.2/30;
            }
        }
    }
}

routing-options {
    static {
        route 192.168.255.2/32 next-hop [IPSEC FW Addr];
    }
}


---------


GRE Cisco Side
==============

interface Loopback1
 description * Loopback for GRE Tunnel source/endpoint *
 ip address 192.168.255.2 255.255.255.255


interface Tunnel2
 description * GRE Tunnel to Juniper GR-1/1/10.2 *
 ip address 10.0.0.1 255.255.255.252
 ip mtu 1400
 load-interval 30
 tunnel source Loopback1
 tunnel destination 192.168.255.1
 hold-queue 2000 in
 hold-queue 2000 out

! Route GRE endpoint via IPSEC FW
ip route 192.168.255.1 255.255.255.255 [IPSEC FW Addr]
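
A quick way to confirm the MTU settings are doing their job is a DF-set ping across the tunnel from the Cisco side. In IOS the ping size is the full IP datagram size, so with ip mtu 1400 on the tunnel the first ping below should succeed (assuming the IPsec path can carry the encapsulated packet, which is the point of dropping the tunnel MTU) and the second should be rejected as too big:

ping 10.0.0.2 size 1400 df-bit
ping 10.0.0.2 size 1401 df-bit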

Reference config for a normal Cisco-to-Cisco tunnel:

interface Tunnel2
 description * GRE Tunnel to Remote site int tunnel2 *
 bandwidth 2000
 ip address 10.0.0.1 255.255.255.252
 ip tcp adjust-mss 1392
 load-interval 30
 tunnel source Loopback1
 tunnel destination 192.168.16.13
 hold-queue 2000 in
 hold-queue 2000 out

Nexus FEX Bouncing

I came across an odd problem where a FEX was bouncing and was asked to look at it. The logs were a flood of interfaces going up and down and FEX status messages; however, buried amongst them and quite easy to miss was the following, less frequent syslog message:

%SATCTRL-FEX132-2-SATCTRL_FEX_MISCONFIG: FEX-132 is being configured as 131 on different switch

Pretty obvious clue there. Configuration was correct for the uplinks on both 5Ks:

interface Ethernet1/13
  switchport mode fex-fabric
  fex associate 131
  channel-group 131

interface Ethernet1/14
  switchport mode fex-fabric
  fex associate 132
  channel-group 132

Checking the serial numbers of the attached FEXes confirmed the problem:


First 5K

FEX: 131 Description: FEX213 - CAB 28   state: Offline
  FEX version: 7.1(3)N1(1) [Switch version: 7.1(3)N1(1)]
  FEX Interim version: 7.1(3)N1(1)
  Switch Interim version: 7.1(3)N1(1)
  Extender Serial: FOC00011122

FEX: 132 Description: FEX214 - CAB 28   state: Online
  FEX version: 7.1(3)N1(1) [Switch version: 7.1(3)N1(1)]
  FEX Interim version: 7.1(3)N1(1)
  Switch Interim version: 7.1(3)N1(1)
  Extender Serial: FOC12345678

Second 5K


FEX: 131 Description: FEX213 - CAB 28   state: Registered
  FEX version: 7.1(3)N1(1) [Switch version: 7.1(3)N1(1)]
  FEX Interim version: 7.1(3)N1(1)
  Switch Interim version: 7.1(3)N1(1)

FEX: 132 Description: FEX214 - CAB 28   state: Online
  FEX version: 7.1(3)N1(1) [Switch version: 7.1(3)N1(1)]
  FEX Interim version: 7.1(3)N1(1)
  Switch Interim version: 7.1(3)N1(1)
  Extender Serial: FOC00011122

As we can see above, the same FEX (Extender Serial FOC00011122) is associated as FEX 131 on the first 5K and FEX 132 on the second 5K. The solution was to verify which serial number belonged to which FEX in the cabinets and then swap the cables for the two ports around on the incorrectly patched 5K. It looks like someone had been doing some patching and put things back the wrong way around! O_o
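
For anyone wanting to repeat the check, the Extender Serial field is included in the detailed FEX output on each 5K, e.g.:

show fex detail
show fex 131 detail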

Replacing a failed Nexus 5K and some bugs

Being given the task of replacing a failed Nexus 5596UP (no console output, powers up with fans but no lights except amber on the mgmt. module at the back), I quickly ran into some annoying problems trying to configure the FEX uplinks before actually racking it and plugging it in. I wanted to get as much config done beforehand as possible to minimize any interruptions – I was also a bit nervous as this unit was in the VPC primary role before it failed.

The 5Ks are running the FEXes in Active/Active (dual-homed) mode, similar to this diagram:

[Diagram: active-active FEX topology]

FEX definitions, port channels and other config went in fine, e.g.:

fex 102
  pinning max-links 1
  description "-=FEX-102=-"
  type N2248T

interface port-channel102
  description -= vPC 102 to FEX-102 / SW01 e1/1,e1/2 & SW02 e1/1,e1/2 =-
  switchport mode fex-fabric
  fex associate 102
  spanning-tree port-priority 128
  spanning-tree port type network
  vpc 102

But when trying to add the channel-group to the FEX ports (e1/1 and e1/2), it failed:

interface Ethernet1/1
  channel-group 102
command failed: port not compatible [port allowed VLAN list]

Removing the port channel let me configure the channel-group on only one of the ports; the second member then gave the same error, because the port-channel interface is automatically re-created as soon as the first member joins.

The only way to get around this was to remove the port channel and use a range command:

no interface port-channel102

interface Ethernet1/1-2
 channel-group 102

Then re-add the Port channel 102 config. Obviously you’re stuffed on this version if you split the uplinks across different ASICs.

This seems to be down to a buggy version of NX-OS, 5.1(3)N1(1a), which is quite old and already has some other bug warnings against it! Not much choice in this case, given the replacement has to run the same code and config as the peer.

Also, the FEX host interfaces can only be configured before the FEX is online if the slot is pre-provisioned.

Eg:

slot 102
 provision model N2K-C2248T

Cisco links for replacement (some important steps such as role priority):
Replacement Documentation
Pre-provisioning

Full Replacement Procedure

We aren't running config sync, so role priority wasn't a problem; I didn't need to change the priority on the replacement to a higher value. If using config sync, I'd follow Cisco's guidance. In summary, here are the steps I took (maybe a little paranoid, but there was no outage when the replacement was brought online):

Preparation:

1) Label all cables carefully then disconnect and unrack faulty unit
2) Rack replacement – do NOT connect any cables apart from console
3) Set a management address on the mgmt0 port (I used a different IP to be safe) and set a default route in the management VRF if your FTP server is in a different subnet, e.g.:

vrf context management
  ip route 0.0.0.0/0 10.243.0.254

4) Connect management cable
5) Upgrade/downgrade the code to the same version as the peer. FTP copy example below, with the images in /nos-images on the FTP server.

copy ftp://user:pass@10.0.0.1//nos-images/[kickstart-filename].bin bootflash:
copy ftp://user:pass@10.0.0.1//nos-images/[nxos-system-filename].bin bootflash:

install all kickstart bootflash:[kickstart-filename] system bootflash:[nxos-system-filename]

6) Shut down the mgmt0 port
7) Pre-provision the FEXes, otherwise you won't be able to paste the config for the FEX interfaces.
Be sure to specify the correct model: N2K-C2248T is correct for the N2K-C2248TP-1GE.

slot 102
provision model N2K-C2248T 
slot 103
provision model N2K-C2248T
…etc

Some config may look like it's in a different order to the peer when pre-provisioned, but it will sort itself out once the FEXes are online.
8) Apply the correct config via console (see the note above regarding the channel-group bug) and double-check any long lines such as spanning-tree vlan config. Remove the temporary management VRF default route from step 3 if it differs from the real config.
9) Shut down all ports (including mgmt0) using range commands, e.g. int e1/1-48 (see the sketch after this list)
10) Connect all cables to ports
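
A minimal sketch of step 9 on the NX-OS CLI (adjust the range to suit your port count):

interface Ethernet1/1-48
  shutdown
interface mgmt0
  shutdown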

NB: Cisco says to change the vPC role priority to a higher value on the replacement, but this was not necessary as we're not using config sync. Also, the vPC peer role does not pre-empt, so if replacing the primary, for example, the secondary will stay in the "secondary, operational primary" state.

Re-introduction:

Monitor console messages from the Dist layer (term mon) for anything strange; you should see the vPCs come up OK. I also set several pings running to hosts in the row that are connected to the now single-homed FEXes.
1) Open up the mgmt port from the console (it will now have the correct IP) and check you can ping the peer unit.
2) Open up all infrastructure links to the core/dist layer and the peer unit
3) Open FEX links (I did one pair of links at a time to catch any cabling issues – paranoid!)
4) Test!

Useful commands:

show vpc role
show fex
show vpc consistency-parameters vpc xxx
show vpc peer-keepalive
show port-channel database int poXXX