Adj resolve request: Failed to resolve… [ Cisco 3750X ]

After a migration to a pair of 3750Xs I was getting a bunch of disconcerting ARP errors in the logs. After a bit of digging, this appears to be a known bug. Example log messages are shown below:

Nov  4 12:41:00: %ADJ-3-RESOLVE_REQ: Adj resolve request: Failed to resolve 172.17.22.57 Vlan21
Nov  4 12:41:18: %ADJ-3-RESOLVE_REQ: Adj resolve request: Failed to resolve 172.17.24.53 Vlan28

This was fixed/worked around with:

no ip cef optimize neighbor resolution
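
To see which VLANs and hosts are affected (and to confirm the messages stop after the workaround), the log can be tallied offline. The following is only a minimal sketch in Python: the function name and the idea of feeding it saved syslog lines are my own assumptions, and the pattern is based purely on the two sample messages above.

```python
import re
from collections import Counter

# Pattern derived from the sample %ADJ-3-RESOLVE_REQ messages above.
# Captures the unresolved IPv4 address and the interface name that follows it.
ADJ_RE = re.compile(
    r"%ADJ-3-RESOLVE_REQ: Adj resolve request: Failed to resolve "
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) (?P<intf>\S+)"
)


def tally_adj_failures(lines):
    """Return (per-interface Counter, per-IP Counter) for matching log lines."""
    by_intf, by_ip = Counter(), Counter()
    for line in lines:
        match = ADJ_RE.search(line)
        if match:
            by_intf[match.group("intf")] += 1
            by_ip[match.group("ip")] += 1
    return by_intf, by_ip


# The two sample lines from this post, used as test input.
SAMPLE = [
    "Nov  4 12:41:00: %ADJ-3-RESOLVE_REQ: Adj resolve request: Failed to resolve 172.17.22.57 Vlan21",
    "Nov  4 12:41:18: %ADJ-3-RESOLVE_REQ: Adj resolve request: Failed to resolve 172.17.24.53 Vlan28",
]

if __name__ == "__main__":
    by_intf, by_ip = tally_adj_failures(SAMPLE)
    for intf, count in by_intf.most_common():
        print(intf, count)
```

Running it against a log captured before and after the change gives a quick before/after count per VLAN.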

TACACS on 4431 Management Interface

Getting TACACS working via the Cisco 4431 management interface threw up a few issues and took a few tweaks. The final issue I found was that referencing all servers with the tacacs+ keyword doesn’t work; you have to reference the named TACACS group with the servers defined within it.

Below is a working configuration example for TACACS via the management port in the Mgmt-intf VRF. I’ve also included a few non-exhaustive examples to get some other services working over the same interface.

! Mgmt interface config
!
interface GigabitEthernet0
 description ** Mgmt intf **
 vrf forwarding Mgmt-intf
 ip address 192.168.0.1 255.255.255.0
 negotiation auto
!
!
! Default route for Management VRF
!
ip route vrf Mgmt-intf 0.0.0.0 0.0.0.0 192.168.0.254
!
!
! Define source interface at global level
!
ip tacacs source-interface GigabitEthernet0
!
! aaa config
!
aaa new-model
!
!
! server-private keeps these servers private to this group and VRF.
! VRF forwarding and source interface need to be configured
! within the aaa group context too.
!
aaa group server tacacs+ TACACS
 server-private 10.0.0.100 key MYKEY
 server-private 10.0.1.100 key MYKEY
 ip vrf forwarding Mgmt-intf
 ip tacacs source-interface GigabitEthernet0
!
! Fall back to the enable password if TACACS is not working in this config.
!
aaa authentication login REMOTE_ACCESS group TACACS enable
aaa authentication enable default group TACACS enable
aaa accounting exec REMOTE_ACCESS
 action-type stop-only
 group TACACS
!
aaa accounting commands 15 REMOTE_ACCESS
 action-type stop-only
 group TACACS
!
aaa session-id common
!
!
! Apply to vtys and console if you need to.
!
line vty 0 4
 accounting commands 15 REMOTE_ACCESS
 accounting exec REMOTE_ACCESS
 login authentication REMOTE_ACCESS
line vty 5 15
 accounting commands 15 REMOTE_ACCESS
 accounting exec REMOTE_ACCESS
 login authentication REMOTE_ACCESS

Syslog

logging source-interface GigabitEthernet0 vrf Mgmt-intf
logging host 10.0.0.101 vrf Mgmt-intf

TFTP (config auto-archived after wr mem)

ip tftp source-interface GigabitEthernet0

archive
 path tftp://10.0.0.101/configs/$h-
 write-memory

SNMP Traps

snmp-server trap-source GigabitEthernet0
snmp-server host 10.0.0.101 vrf Mgmt-intf version 2c MYCOMMUNITY

Recovering a Cisco 3850 that’s stuck in Boot Loop/ROMMON

This wasn’t even after an upgrade, just a reload, but the below messages were displayed before the switch unmounted its filesystems and rebooted:

Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
Last heartbeat 0.00 secs ago

PID: 9225
Exit code: signal 11 (no core)

Quick steps:

1) Set up a local switch that has the correct image on it as a TFTP server. This expedites the fix, as TFTP is painful.

tftp-server flash:[image-name.bin]

2) Configure an interface with an IP to allow for direct connection to the management port of the broken switch.

interface GigabitEthernet1/0/47
 description -= temp for tftp to other switch =-
 no switchport
 ip address 192.168.0.1 255.255.255.0

3) Connect (or ask someone on site to connect) the interface to the management port of the broken switch (duh!)

4) On the broken switch, hold MODE for 10 seconds to interrupt the boot loop, or just wait for 5 or so failures for it to drop to ROMMON.

5) Set up IP info. GW only necessary if you’re tftping from a server outside the local subnet.

switch: set IP_ADDR 192.168.0.2/255.255.255.0

switch: set DEFAULT_ROUTER 192.168.0.1

6) Check for emergency files (you’re looking for cat3k_caa-recovery.bin or similar).

switch: dir sda9:

7) Ping the tftp server

switch: ping 192.168.0.1
ping 192.168.0.1 with 32 bytes of data ...
Up 1000 Mbps Full duplex (port  0) (SGMII)
Host 192.168.0.1 is alive.

8) Start the tftp emergency install. On a local connection this will take 10-20 mins.

switch: emergency-install tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin
The bootflash will be erased during install operation, continue (y/n)?y
Starting emergency recovery (tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin)...
Reading full image into memory......................done
Nova Bundle Image
--------------------------------------
Kernel Address    : 0x6042e5d8
Kernel Size       : 0x31794f/3242319
Initramfs Address : 0x60745f28
Initramfs Size    : 0xdbec9d/14412957
Compression Format: .mzip

Bootable image at @ ram:0x6042e5d8
Bootable image segment 0 address range [0x81100000, 0x81b80000] is in range [0x80180000, 0x90000000].
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
File "sda9:cat3k_caa-recovery.bin" uncompressed and installed, entry point: 0x811060f0
Loading Linux kernel with entry point 0x811060f0 ...
Bootloader: Done loading app on core_mask: 0xf

### Launching Linux Kernel (flags = 0x5)



Initiating Emergency Installation of bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin


Downloading bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...

Validating bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Installing bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Verifying bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Package cat3k_caa-base.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-drivers.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-infra.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-iosd-universalk9.SPA.150-1.EZ3.pkg is Digitally Signed
Package cat3k_caa-platform.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-wcm.SPA.10.1.130.0.pkg is Digitally Signed
Preparing flash...
Syncing device...
Emergency Install successful... Rebooting
Restarting system.

Once this is done it’ll try and boot again. You need to disable manual boot.

The system is not configured to boot automatically.  The
following command will finish loading the operating system
software:

    boot

switch: set MANUAL_BOOT no
switch: boot

Also remember to change the confreg value if it’s not 0x102 on the 3850 (it is shown on the last line of show version). In this case it wasn’t needed.


Configuration register is 0x102

Don’t forget to remove the tftp-server config and temporary stuff. :)
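
Going back to step 5: whether DEFAULT_ROUTER is needed comes down to whether the TFTP server sits in the same subnet as the IP_ADDR you set. A small sketch of that check using Python's standard ipaddress module is below; the function name is made up for illustration, and the addresses are the ones used in the steps above.

```python
import ipaddress


def needs_gateway(ip_addr_setting, tftp_server):
    """Return True if tftp_server is outside the subnet of ip_addr_setting.

    ip_addr_setting uses the same address/mask form as the ROMMON
    IP_ADDR variable, e.g. "192.168.0.2/255.255.255.0".
    """
    iface = ipaddress.ip_interface(ip_addr_setting)
    return ipaddress.ip_address(tftp_server) not in iface.network


if __name__ == "__main__":
    # Directly connected TFTP switch from the steps above: no gateway needed.
    print(needs_gateway("192.168.0.2/255.255.255.0", "192.168.0.1"))
    # A TFTP server in another subnet: DEFAULT_ROUTER would be required.
    print(needs_gateway("192.168.0.2/255.255.255.0", "10.0.0.101"))
```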

Juniper SRX High CPU

Today we came across an issue where an SRX had very high CPU usage. After a bit of digging it turned out to be the httpd process which runs jweb.

“show chassis routing-engine” normally gives output like the below (this is not actual output from the box with the problem and is only intended as an example). On the affected box, however, user CPU was close to 100%.

user@TESTFW02> show chassis routing-engine
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 55 degrees C / 131 degrees F
    Total memory               512 MB Max   394 MB used ( 77 percent)
      Control plane memory     336 MB Max   299 MB used ( 89 percent)
      Data plane memory        176 MB Max    95 MB used ( 54 percent)
    CPU utilization:
      User                       3 percent
      Background                 0 percent
      Kernel                     9 percent
      Interrupt                  0 percent
      Idle                      87 percent
    Model                          RE-SRX100B
    Serial ID                      XXXXXXXXXX
    Start time                     2017-04-10 02:30:18 UTC
    Uptime                         127 days, 12 hours, 1 minute, 12 seconds
    Last reboot reason             0x1000:reboot due to panic
    Load averages:                 1 minute   5 minute  15 minute
                                       0.18       0.17       0.11

node1:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 52 degrees C / 125 degrees F
    Total memory               512 MB Max   415 MB used ( 81 percent)
      Control plane memory     336 MB Max   316 MB used ( 94 percent)
      Data plane memory        176 MB Max    97 MB used ( 55 percent)
    CPU utilization:
      User                      10 percent
      Background                 0 percent
      Kernel                    14 percent
      Interrupt                  0 percent
      Idle                      75 percent
    Model                          RE-SRX100B
    Serial ID                      XXXXXXXXXXX
    Start time                     2017-04-10 02:10:31 UTC
    Uptime                         127 days, 12 hours, 21 minutes, 1 second
    Last reboot reason             0x1000:reboot due to panic
    Load averages:                 1 minute   5 minute  15 minute
                                       0.21       0.20       0.15

To nail down the culprit, you can do the following:

user@TESTFW02> start shell
% top

Bear in mind that some platforms have a process that deliberately sits at a high CPU value in order to maintain performance (e.g. flowd_octeon). Check against the Juniper documentation before jumping to conclusions about a particular process. We are looking for something unusual and pretty obvious.

The top output should hint at the culprit. In this case it was httpd (JWEB).

We could have restarted with:

restart web-management 

However, we are managing via Junos Space, which uses NETCONF, so for us it was safe to disable the service:

delete groups node0 system services web-management
delete groups node1 system services web-management
delete system services web-management
commit

Exiting and checking the routing-engine state with “show chassis routing-engine” showed CPU quickly come back down to normal. Bear in mind the figures look like a 1-minute rolling average, so it will take a minute for the figure to normalise completely.
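
If you want to keep an eye on the figure without reading the whole output each time, the "User" line per node can be pulled out of saved “show chassis routing-engine” text. This is a rough sketch only: the function names and the 90 percent threshold are my own choices, and it assumes clustered output with node0:/node1: headers like the example above.

```python
import re

NODE_RE = re.compile(r"^(node\d+):$")
USER_CPU_RE = re.compile(r"^User\s+(\d+) percent$")


def user_cpu_by_node(output):
    """Map node name -> 'User' CPU percent from 'show chassis routing-engine' text."""
    result = {}
    node = None
    for raw in output.splitlines():
        line = raw.strip()
        node_match = NODE_RE.match(line)
        if node_match:
            node = node_match.group(1)
            continue
        cpu_match = USER_CPU_RE.match(line)
        if cpu_match and node is not None:
            result[node] = int(cpu_match.group(1))
    return result


def busy_nodes(output, threshold=90):
    """Return the nodes whose user CPU is at or above threshold (arbitrary default)."""
    return [n for n, pct in user_cpu_by_node(output).items() if pct >= threshold]


# Trimmed copy of the example output above, used as test input.
SAMPLE = """\
node0:
    CPU utilization:
      User                       3 percent
      Idle                      87 percent
node1:
    CPU utilization:
      User                      10 percent
      Idle                      75 percent
"""

if __name__ == "__main__":
    print(user_cpu_by_node(SAMPLE))
    print(busy_nodes(SAMPLE))
```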

GRE over IPSEC between Juniper and Cisco Router

This caused headaches as it needed slightly different configuration to normal. ip mtu not being set here was the cause of things sort-of-but-not-quite-working.

Normally with Cisco to Cisco over IPSEC we’d add “ip tcp adjust-mss 1392” to the Tunnel interfaces either side.
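
For reference, the arithmetic linking an MSS value to an IP MTU value is simple enough to sketch. The snippet below is illustrative only: it assumes IPv4 and TCP headers without options (20 bytes each) and a basic GRE header with no key or checksum (4 bytes plus a 20-byte outer IPv4 header), and it deliberately leaves IPsec overhead out because that varies with the transform in use. Note that an MSS clamp only affects TCP, whereas an IP MTU applies to all traffic through the tunnel.

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
TCP_HEADER = 20    # bytes, assuming no TCP options
GRE_OVERHEAD = 24  # 20-byte outer IPv4 header + 4-byte basic GRE header


def mss_for_mtu(ip_mtu):
    """Largest TCP payload that fits in a packet of ip_mtu bytes."""
    return ip_mtu - IPV4_HEADER - TCP_HEADER


def packet_size_for_mss(mss):
    """IP packet size produced by a full-sized TCP segment with this MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


def gre_packet_size(inner_packet):
    """Size of the packet once GRE-encapsulated (before any IPsec overhead)."""
    return inner_packet + GRE_OVERHEAD


if __name__ == "__main__":
    print(packet_size_for_mss(1392))  # packet size implied by an MSS of 1392
    print(mss_for_mtu(1400))          # TCP MSS that fits within an IP MTU of 1400
    print(gre_packet_size(1400))      # a 1400-byte inner packet after GRE
```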

This is the config that worked in the end.

GRE Juniper router side
=======================

interfaces {
    lo0 {
        unit 0 {
            family inet {
                address 192.168.255.1/32;
            }
        }
    }

    gr-1/1/10 {
        unit 2 {
            clear-dont-fragment-bit;
            description "-= Gre Tunnel to Remote Office =-";
            tunnel {
                source 192.168.255.1;
                destination 192.168.255.2;
            }
            family inet {
                mtu 1400;
                address 10.0.0.2/30;
            }
        }
    }
}

routing-options {
    static {
        route 192.168.255.2/32 next-hop [IPSEC FW Addr];
    }
}


---------


GRE Cisco Side
==============

interface Loopback1
 description * Loopback for GRE Tunnel source/endpoint *
 ip address 192.168.255.2 255.255.255.255


interface Tunnel2
 description * GRE Tunnel to Juniper GR-1/1/10.2 *
 ip address 10.0.0.1 255.255.255.252
 ip mtu 1400
 load-interval 30
 tunnel source Loopback1
 tunnel destination 192.168.255.1
 hold-queue 2000 in
 hold-queue 2000 out

! Route GRE endpoint via IPSEC FW
ip route 192.168.255.1 255.255.255.255 [IPSEC FW Addr]

Reference config for a normal Cisco-to-Cisco tunnel:

interface Tunnel2
 description * GRE Tunnel to Remote site int tunnel2 *
 bandwidth 2000
 ip address 10.0.0.1 255.255.255.252
 ip tcp adjust-mss 1392
 load-interval 30
 tunnel source Loopback1
 tunnel destination 192.168.16.13
 hold-queue 2000 in
 hold-queue 2000 out