Checkpoint Connection Limit Woes

It’s been a while since I posted here so I thought I’d share something that’s been driving me absolutely insane for over a month at work.

We had episodes where the Checkpoint connection table on one of our internet stack firewalls was getting maxed out, and tracking it down proved extremely difficult. I started by dumping the Checkpoint firewall connections table for a quick bit of analysis:

fw tab -t connections -f -u | awk '{print $9","$11","$13","$15","$43}' > /tmp/connections.txt

Summary - top 20 sources
awk -F"," '{print $1}' /tmp/connections.txt | sort -n | uniq -c | sort -rn | head -20

Summary - top 20 destinations
awk -F"," '{print $3}' /tmp/connections.txt | sort -n | uniq -c | sort -rn | head -20
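With the same CSV in hand it can also be worth summarising source→destination pairs, since a single client hammering one destination stands out immediately. A quick sketch (the sample rows below are fabricated, but use the same src,sport,dst,dport,timeout field layout as the dump above):

```shell
# Fabricated sample rows in the same layout the fw tab dump produces
# (IPs are illustrative only).
printf '%s\n' \
  '10.0.0.1,3345,192.0.2.10,80,3600' \
  '10.0.0.1,3346,192.0.2.10,80,3600' \
  '10.0.0.2,4000,192.0.2.20,443,3600' \
  > /tmp/sample_connections.txt

# Top 20 source -> destination pairs by connection count
awk -F',' '{print $1" -> "$3}' /tmp/sample_connections.txt | sort | uniq -c | sort -rn | head -20
```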

…didn't really yield anything interesting, and the times at which the issue was happening were completely random. Increasing the connections table limit just moved the problem further up the stack to the perimeter firewall! The top sources were the proxies, which was to be expected. Without access to proxy logs this was also a pain (in a big organisation, you can't just jump onto their kit and take a look, sadly).

In the end we decided to create new service objects (http, https and the proxy ports) for our proxy traffic rule and set their timeouts low (10 minutes). When we graphed the connections table we noticed that the spikes died off after the low timeout we'd specified, proving beyond all doubt that the issue was either user or system based, but only for those clients set to use the proxies.
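For the graphing itself, all that's needed is the live entry count. `fw tab -t connections -s` prints a one-line summary whose #VALS column is the current number of entries; a small parsing helper makes it easy to log a sample per minute (a sketch only — the line and column positions are an assumption from our IPSO/Gaia output, so check yours):

```shell
# Extract the #VALS (current entries) column from "fw tab -t connections -s".
# Assumption: data is on line 2 and #VALS is the 4th field, as on our boxes.
conn_count() {
    awk 'NR==2 {print $4}'
}

# Usage on the firewall (commented out here), one sample per minute:
# while true; do
#     echo "$(date '+%H:%M:%S') $(fw tab -t connections -s | conn_count)" >> /tmp/conn_history.log
#     sleep 60
# done
```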

After this I set up reports on our netflow collector to get some stats on traffic hitting the proxies and did a bit of digging via awk to find the top destination IPs – nothing out of the ordinary, certainly a lot of google traffic but that must be legit, right? So, I turned it around and looked at Client IPs to get a clue. We had to use realtime graphing on the Checkpoint to pick out exactly when the spikes were occurring so we could investigate netflow within a one-minute window; otherwise it was like looking for a needle in a haystack.
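The awk digging looked roughly like this. I'm assuming a CSV export from the collector with client, server, flow-count and byte-count columns (the field positions and sample rows here are hypothetical — adjust to whatever your collector actually exports):

```shell
# Fabricated sample export: client,server,flows,bytes
printf '%s\n' \
  '10.1.1.50,172.16.0.1,17000,2048' \
  '10.1.1.51,172.16.0.1,40,900000' \
  > /tmp/sample_flows.csv

# Clients with lots of flows but almost no data moved - the smoking gun
# for connections that are opened and never actually used.
awk -F',' '$3 > 1000 && $4 / $3 < 10 {print $1, $3 " flows, " $4 " bytes"}' /tmp/sample_flows.csv
```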

In the end, and to cut a long story short, we found that some users had installed Google Chrome on their development PCs. For some reason, Google Chrome was creating over 17 THOUSAND connections in a very short space of time, and somehow these weren't being closed properly (whether by the browser or the proxy, I'm still not sure). I replicated this behaviour on a user's desktop with two perfectly legitimate sites in two tabs. The netstat -an output on the user's PC was not pretty… a scrolling mass of connections either established or in TIME_WAIT. Netflow suggests that almost all of these connections are never used to actually transfer any data, so it must be something to do with Chrome's network prediction (speculative pre-connect) behaviour.
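A quick way to quantify the "scrolling mass" is to count connections by state. The helper below just tallies the last column, which is where both Windows and Unix `netstat -an` put the TCP state (the sample input in the comment is fabricated):

```shell
# Tally TCP connection states from "netstat -an" output, busiest state first.
# Usage on the affected PC: netstat -an | netstat_states
netstat_states() {
    awk '/TCP|tcp/ {print $NF}' | sort | uniq -c | sort -rn
}
```

On the desktop in question this put thousands of entries against TIME_WAIT alone, which made the comparison with a healthy PC trivial.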

Anyway, we banned Chrome from user desktops and the issue has gone away. I also discovered that an older version of Opera on another user's desktop had the same problem.

I hope this helps someone else suffering the same weird issues. It’s not funny when your public IP PAT port pool for browsing gets exhausted during business hours thanks to some rogue browser going mental.

Checkpoint Firewall high interrupt CPU%

When this issue occurred, top was showing that the large majority of CPU was of interrupt category despite low traffic levels. Failing over to the secondary member of the cluster did not fix the problem; the fault moved. This issue can be reproduced on Nokia IP Appliances running IPSO and newer Checkpoint platforms running Gaia.

last pid: 59653;  load averages:  0.05,  0.07,  0.02   up 571+16:11:35 12:19:30
45 processes:  1 running, 44 sleeping
CPU states:  1.8% user,  0.0% nice,  1.8% system, 86.1% interrupt, 10.4% idle
Mem: 248M Active, 1321M Inact, 218M Wired, 72M Cache, 99M Buf, 143M Free
Swap: 4096M Total, 4096M Free

ps -aux was showing high CPU time consumed by [swi1: net_taskq0].

cpfw[admin]# ps -aux
root    14 98.2  0.0     0    16  ??  RL   10Feb12 65517:46.72 [swi1: net_taskq0]

Running netstat -ni showed errors incrementing on a few interfaces. At first this seemed like a hardware issue so failover to secondary was initiated. The problem moved to the other firewall.
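Errors that merely exist in `netstat -ni` output may be ancient; what matters is whether they are still incrementing. A sketch for diffing two snapshots (that Ierrs is field 6 of the Link lines is an assumption from FreeBSD-style `netstat -ni` output — check your platform):

```shell
# Compare two "ifname ierrs" snapshots, e.g. captured ten seconds apart with:
#   netstat -ni | awk '/Link/ {print $1, $6}' | sort > /tmp/errs_before
# and print only the interfaces whose error counter actually moved.
if_err_delta() {
    join "$1" "$2" | awk '$3 > $2 {print $1, $3 - $2, "new errors"}'
}
```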

After more digging, the culprit was found to be some new traffic streams of low bandwidth but extremely high packet rate (in this case, some UDP syslog forwarding to a host beyond the firewall). A layer 3 switch at the source end was also having some issues, so some of the traffic patterns may have been anomalous, compounding the problem.

This traffic was not permissioned on the firewall, so it was being matched by the drop rule. It seems that having a large rulebase makes this issue even worse, as traffic at a rate of thousands of packets per second consumes a lot of CPU cycles. It was noted that adding a rule near the top of the rulebase to permission the traffic dropped CPU usage significantly.
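One way to find candidates to move up the rulebase is to export a slice of the traffic log and count dropped flows. The CSV layout here (src,dst,service,action) and the sample rows are hypothetical — adjust the field numbers to whatever your SmartView Tracker export actually emits:

```shell
# Fabricated sample log export with columns src,dst,service,action
printf '%s\n' \
  '10.2.2.9,172.16.5.1,syslog,drop' \
  '10.2.2.9,172.16.5.1,syslog,drop' \
  '10.1.1.50,192.0.2.10,http,accept' \
  > /tmp/sample_fwlog.csv

# Noisiest dropped src -> dst (service) streams, busiest first
awk -F',' '$4 == "drop" {print $1" -> "$2" ("$3")"}' /tmp/sample_fwlog.csv | sort | uniq -c | sort -rn | head
```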

It makes sense to assume that as these streams are hitting the drop rule very frequently, rapid evaluations of the entire rulebase are taking place. The handling of “flows” for UDP traffic is probably more limited than is implied in IPSO/Gaia documentation.

It is worth enabling monitoring and finding this sort of traffic to allow you to create or move appropriate rules near the top of the rulebase to avoid unnecessary extra processing, especially if your rulebase is in the order of hundreds of rules.

I suppose you could conclude that you could quite easily DoS a policy-heavy Checkpoint firewall by throwing a rapid stream of UDP packets at a far-side destination that doesn't match anywhere in the rulebase. Note that this issue was encountered on an internal firewall where IPS was NOT enabled; IPS may mitigate this problem.