High CPU on Nexus 5K and no SNMP response – snmpd

Strange issue with SNMP not responding today on a nexus 5K. Tried removing and re-adding the SNMP config, removing the acl altogether that we use to control access and still no joy.

Upon checking CPU usage, it seemed quite high. show proc cpu sort output showed that snmpd was quite busy:

PID    Runtime(ms)  Invoked   uSecs  1Sec    Process
-----  -----------  --------  -----  ------  -----------
 4559           59  991518226      0   44.5%  snmpd
 4605          179        87   2065    9.0%  netstack
 1178         2091  1733135010      0    1.0%  kirqd
    1          157  25653477      0    0.0%  init
    2          837   3474116      0    0.0%  migration/0
    3          600  3970856252      0    0.0%  ksoftirqd/0

I was sure I’d dealt with this before and it seems that I was hitting a bug.

Official word is that: There is a memory leak in one of the processes called libcmd that is used by SNMP. Workaround is entering the hidden command:

no snmp-server load-mib dot1dbridgesnmp

The best solution, however, would be to perform a software upgrade to 5.0(3)N2(2) or later where this is fixed.

Checkpoint Firewall high interrupt CPU%

When this issue occurred, top was showing that the large majority of CPU was of interrupt category despite low traffic levels. Failing over to the secondary member of the cluster did not fix the problem; the fault moved. This issue can be reproduced on Nokia IP Appliances running IPSO and newer Checkpoint platforms running Gaia.

last pid: 59653;  load averages:  0.05,  0.07,  0.02   up 571+16:11:35 12:19:30
45 processes:  1 running, 44 sleeping
CPU states:  1.8% user,  0.0% nice,  1.8% system, 86.1% interrupt, 10.4% idle
Mem: 248M Active, 1321M Inact, 218M Wired, 72M Cache, 99M Buf, 143M Free
Swap: 4096M Total, 4096M Free

ps -aux was showing high CPU time consumed by [swi1: net_taskq0].

cpfw[admin]# ps -aux
USER   PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
root    14 98.2  0.0     0    16  ??  RL   10Feb12 65517:46.72 [swi1: net_taskq0]

Running netstat -ni showed errors incrementing on a few interfaces. At first this seemed like a hardware issue so failover to secondary was initiated. The problem moved to the other firewall.

After more digging, the culprit was found to be some new traffic streams of low bandwidth, but extremely high packet rate (in this case, some UDP syslog forwarding to a host beyond the firewall). A layer 3 switch at the source end was also having some issues so some of the traffic patterns may have been anomalous, compounding the issue.

This traffic was not permissioned on the firewall so was being matched by the drop rule. It seems that having a large rule base makes this issue even worse as traffic at a rate of thousands of packets per second is consuming a lot of CPU cycles. It was noted that adding a rule to permission the traffic near the top of the rule base dropped CPU usage significantly.

It makes sense to assume that as these streams are hitting the drop rule very frequently, rapid evaluations of the entire rulebase are taking place. The handling of “flows” for UDP traffic is probably more limited than is implied in IPSO/Gaia documentation.

It is worth enabling monitoring and finding this sort of traffic to allow you to create or move appropriate rules near the top of the rulebase to avoid unnecessary extra processing, especially if your rulebase is in the order of hundreds of rules.

I suppose you could conclude that you could quite easily DoS a policy-heavy checkpoint firewall by throwing a rapid stream of UDP packets to a far-side destination that doesn’t match anywhere in the rulebase. Note that this issue was encountered on an internal firewall where IPS was NOT enabled. IPS may mitigate this problem.