Checkpoint Connection Limit Woes

It’s been a while since I posted here so I thought I’d share something that’s been driving me absolutely insane for over a month at work.

We had episodes where the checkpoint connection table on one of our internet stack firewalls was getting maxed out, and tracking it down proved to be extremely difficult. Dumping the checkpoint firewall connections table and a quick bit of analysis:

fw tab –t –connections –f –u | awk ‘{print $9”,”$11”,”$13”,”$15”,”$43}’ > /tmp/connections.txt

Summary - top 20 sources
awk -F"," '{print $1}' /tmp/connections.txt | sort -n | uniq -c | sort -rn | head -20

Summary - top 20 destinations
awk -F"," '{print $3}' /tmp/connections.txt | sort -n | uniq -c | sort -rn | head -20

…didn’t really yield anything interesting and the times at which the issue was happening were completely random. Increasing the connections table limit just moved the problem further up the stack to the perimiter firewall! The top sources were the proxies which is to be expected. Without access to proxy logs this was also a pain (when you’re in a big organisation, you can’t just jump on to their kit and take a look, sadly).

In the end we decided to create new objects for http/https proxy, http and https etc for our Proxy traffic rule and set their timeouts low (10 mins). When we graphed the connections table we noticed that the spikes timed out after the low timeout we specified, proving beyond all doubt that the issue was either user or system based but only for those clients set to use the proxies.

After this I set up reports on our netflow collector to get some stats on traffic hitting the proxies and did a bit of digging via awk to find the top destination IPs – nothing out of the ordinary, certainly a lot of google traffic but that must be legit, right? So, I turned it around and looked at Client IPs to get a clue. We had to use realtime graphing on the Checkpoint to pick out exactly when the spikes were occuring so we could investigate netflow within a 1 minute window, otherwise it was like looking for a needle in a haystack.

In the end, and to cut a long story short, we found that some users had installed Google Chrome on their development PCs. For some reason, it seems that Google Chrome was creating over 17 THOUSAND connections in a very short space of time, and somehow, these weren’t being closed properly (whether by the browser or proxy, I’m still not sure). I replicated this behaviour on a user’s desktop with two perfectly legitimate sites in two tabs. netstat -an output on the user’s PC was not pretty… a scrolling mass of connections either established or in TIME_WAIT. Netflow suggests that almost all of these connections are never used to actually transfer any data so it must be something to do with the network behaviour prediction of Chrome.

Anyways, we banned Chrome from user desktops and now the issue has gone away. I also discovered that an older version of Opera on another user’s desktop had the same problem.

I hope this helps someone else suffering the same weird issues. It’s not funny when your public IP PAT port pool for browsing gets exhausted during business hours thanks to some rogue browser going mental.