Checkpoint Connection Limit Woes

It’s been a while since I posted here so I thought I’d share something that’s been driving me absolutely insane for over a month at work.

We had episodes where the Checkpoint connections table on one of our internet stack firewalls was maxing out, and tracking down the cause proved extremely difficult. Dumping the firewall connections table and doing a quick bit of analysis:

fw tab -t connections -f -u | awk '{print $9","$11","$13","$15","$43}' > /tmp/connections.txt

Summary - top 20 sources
awk -F"," '{print $1}' /tmp/connections.txt | sort -n | uniq -c | sort -rn | head -20

Summary - top 20 destinations
awk -F"," '{print $3}' /tmp/connections.txt | sort -n | uniq -c | sort -rn | head -20

…didn’t really yield anything interesting, and the times at which the issue was happening were completely random. Increasing the connections table limit just moved the problem further up the stack to the perimeter firewall! The top sources were the proxies, which is to be expected. Without access to proxy logs this was also a pain (when you’re in a big organisation, you can’t just jump on to their kit and take a look, sadly).
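Counting sources and destinations separately can hide a single client hammering a single service, so a variation on the summaries above is to count source/destination/port combinations instead. This is only a rough sketch: it assumes the same comma-separated /tmp/connections.txt, with fields 1 and 3 being source and destination as in the summaries, and field 4 being the destination port (an assumption, so check the columns against your own fw tab output). The function name top_flows is made up for illustration.

```shell
#!/bin/sh
# top_flows FILE [N]: print the N most frequent source,destination,port
# combinations from a comma-separated connections dump (default N=20).
# Assumption: field 1 = source, field 3 = destination, field 4 = dest port.
top_flows() {
    file=$1
    n=${2:-20}
    awk -F',' '{print $1 "," $3 "," $4}' "$file" |
        sort | uniq -c | sort -rn | head -n "$n"
}

# Example: top_flows /tmp/connections.txt 20
```

If one combination dominates the list, that is a much stronger lead than a busy proxy sitting at the top of the source summary.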

In the end we decided to create new objects for http/https proxy, http and https etc for our Proxy traffic rule and set their timeouts low (10 mins). When we graphed the connections table we noticed that the spikes timed out after the low timeout we specified, proving beyond all doubt that the issue was either user or system based but only for those clients set to use the proxies.
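If you have no graphing to hand, one way to get a crude version of the same picture is to sample the table size on the firewall itself. The sketch below assumes that fw tab -t connections -s prints its usual summary layout, with the current entry count (#VALS) in the fourth column of the connections row; the function names and the CSV output format are my own invention.

```shell
#!/bin/sh
# conn_count: read `fw tab -t connections -s` output on stdin and print the
# current number of entries. Assumption: the row for the connections table
# has the table name in column 2 and the count (#VALS) in column 4.
conn_count() {
    awk '$2 == "connections" {print $4}'
}

# log_conn_count FILE INTERVAL: append "epoch-seconds,count" to FILE every
# INTERVAL seconds until interrupted.
log_conn_count() {
    while :; do
        printf '%s,%s\n' "$(date +%s)" "$(fw tab -t connections -s | conn_count)"
        sleep "$2"
    done >> "$1"
}

# Example: log_conn_count /tmp/conn_count.csv 10
```

The resulting file can be plotted with whatever you like, and a spike that decays exactly one service timeout later should stand out clearly.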

After this I set up reports on our netflow collector to get some stats on traffic hitting the proxies and did a bit of digging via awk to find the top destination IPs. Nothing out of the ordinary; certainly a lot of Google traffic, but that must be legit, right? So I turned it around and looked at client IPs to get a clue. We had to use realtime graphing on the Checkpoint to pick out exactly when the spikes were occurring so we could investigate netflow within a 1 minute window, otherwise it was like looking for a needle in a haystack.
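The narrowing-down step can be sketched in awk as well. Every collector exports differently, so treat the input format here as purely hypothetical: a CSV with the flow start time in epoch seconds in field 1 and the client (source) IP in field 2. The function name is also just for illustration.

```shell
#!/bin/sh
# top_clients_in_window FILE START END [N]: count flows per client IP whose
# start time falls in the window START <= time < END (epoch seconds).
# Assumption: hypothetical CSV export, field 1 = start time, field 2 = client IP.
top_clients_in_window() {
    awk -F',' -v s="$2" -v e="$3" '$1 >= s && $1 < e {print $2}' "$1" |
        sort | uniq -c | sort -rn | head -n "${4:-20}"
}

# Example, for a 60 second window starting at a spike seen on the graph:
#   top_clients_in_window flows.csv 1300000000 1300000060 10
```

Keeping the window to a minute or so is what makes a handful of misbehaving clients rise above the background noise.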

In the end, and to cut a long story short, we found that some users had installed Google Chrome on their development PCs. For some reason, it seems that Google Chrome was creating over 17 THOUSAND connections in a very short space of time, and somehow, these weren’t being closed properly (whether by the browser or proxy, I’m still not sure). I replicated this behaviour on a user’s desktop with two perfectly legitimate sites in two tabs. netstat -an output on the user’s PC was not pretty… a scrolling mass of connections either established or in TIME_WAIT. Netflow suggests that almost all of these connections are never used to actually transfer any data so it must be something to do with the network behaviour prediction of Chrome.
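Rather than watching netstat scroll past, it is easier to summarise the connection states. A minimal sketch, assuming the state is the last column of each TCP row (true of the netstat -an output I have seen on both Windows and Unix, but check yours); on a Windows desktop you would save the netstat output to a file and run this somewhere with awk available.

```shell
#!/bin/sh
# tcp_state_counts: read `netstat -an` output on stdin and print a count per
# TCP state, busiest first. Matches "tcp"/"tcp6" (Unix) and "TCP" (Windows)
# rows and takes the state from the last column of each row.
tcp_state_counts() {
    awk 'tolower($1) ~ /^tcp/ {print $NF}' | sort | uniq -c | sort -rn
}

# Example: netstat -an | tcp_state_counts
```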

Anyways, we banned Chrome from user desktops and now the issue has gone away. I also discovered that an older version of Opera on another user’s desktop had the same problem.

I hope this helps someone else suffering the same weird issues. It’s not funny when your public IP PAT port pool for browsing gets exhausted during business hours thanks to some rogue browser going mental.

NNMi and Firewall Connections Monitoring

One thing that is sometimes overlooked on firewalls is the connection count. Badly written applications or incorrect firewall configurations can mean that the connections table becomes saturated, causing disconnections, connection failures and a myriad of other problems. This can result in people running tests, seeing packet loss, and concluding that there must be a duplex mismatch, erroring link, or something else fundamental along the path.

On a side note, one of the painful situations that causes this on Checkpoint is someone adjusting the TCP timeout value in global properties to something way above the default. TCP timeouts on Checkpoint should ALWAYS be set at the service object level, NOT globally.

We can monitor the connection tables on firewalls via NNMi and generate alerts and also affect the node status (colour) on node maps to help us find these problems.

This article will cover Nokia Checkpoint and Cisco ASA Firewalls as an example, but it can easily be replicated for any firewall by using a different OID for the number of concurrent connections.

Firstly, ensure that you have loaded the MIBs for your firewalls (CHECKPOINT-MIB for Nokias and/or the CISCO-FIREWALL-TC and CISCO-UNIFIED-FIREWALL-MIB for ASA).

Secondly, for all Nokia Checkpoint firewalls, you MUST run cpconfig and enable the SNMP extension. Bear in mind that you will need to restart Checkpoint services with cprestart, which is disruptive!
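Once the extension is enabled, it may be worth confirming from the NNMi server that the firewall actually answers for the connection count before building anything on top of it. The sketch below assumes the Net-SNMP command line tools, SNMP v2c, and that the numeric OID shown is fwNumConn.0 from CHECKPOINT-MIB; verify the OID against the MIB you loaded, and treat the host and community string in the example as placeholders.

```shell
#!/bin/sh
# Assumption: this is fwNumConn.0 in CHECKPOINT-MIB. Confirm it against the
# MIB you have loaded before relying on it.
FW_NUM_CONN_OID=".1.3.6.1.4.1.2620.1.1.25.3.0"

# fw_num_conn HOST COMMUNITY: print the firewall's current connection count.
# -Oqv asks snmpget to print only the value, without the OID or type.
fw_num_conn() {
    snmpget -v2c -c "$2" -Oqv "$1" "$FW_NUM_CONN_OID"
}

# Example (placeholder host and community): fw_num_conn 192.0.2.1 public
```

If this times out or returns "No Such Object", the SNMP extension is probably not running yet, and NNMi will not be able to poll the value either.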

Now, we must create Node Groups for monitoring our firewalls. I suggest that you create node groups that define members by SysOID, as each model/configuration of firewall will have a different limit to the maximum amount of concurrent connections allowed. Low, Mid and Top-end groups are a good idea so you can define a reasonably granular threshold for each group. You should consult the vendor documents to decide on what is appropriate for your environment.

The following SysOIDs may be useful:

nokiaIP110      .
nokiaIP1220     .
nokiaIP1260     .
nokiaIP1280     .
nokiaIP150      .
nokiaIP2255     .
nokiaIP2450     .
nokiaIP260      .
nokiaIP266      .
nokiaIP290      .
nokiaIP3400     .
nokiaIP350      .
nokiaIP380      .
nokiaIP390      .
nokiaIP3XX      .
nokiaIP400      .
nokiaIP410      .
nokiaIP440      .
nokiaIP4XX      .
nokiaIP530      .
nokiaIP560      .
nokiaIP600      .
nokiaIP650      .
nokiaIP690      .
nokiaIP6XX      .
nokiaIP710      .
nokiaIP740      .
ASA5505         .
ASA5510         .
ASA5520         .
ASA5540         .
ASA5550         .

Now, we create a MIB expression for the OID we want to monitor.

Clicking on the right hand side of MIB Variable lets us drill down the MIB tree to the OID we want. In this case, it’s – fwNumConn. For ASAs, you want

This gives us the following:

We now create a Custom Poller Policy (The 25000 here suggests devices that have a maximum of 25K connections limit, but you could call it anything you like, such as “Low-End-CP-Firewalls-25K-MAX”). From here, we create a new collection policy (see right hand side of image below). We select “Generate incident” on Node collection to generate incidents when the threshold is breached, and we also select “Affect Node Status”, since connection count being over threshold is going to impact performance.

If using NPS, export the collection. You may also prefer to change the “Incident Source Object” to “Custom node Collection” rather than “Custom Polled Instance” as instances tend to work better for multiple objects within the same OID such as BGP peerings.

And then define a threshold… This should be a bit below the maximum number of supported connections for the group of devices you are monitoring. E.g. for a 25K connections device, select 20000.
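The 25K to 20000 example works out at 80% of the limit. That ratio is just my reading of the example rather than any vendor guidance, but if you want to apply it consistently across several node groups, the arithmetic is trivial to script. The function name is hypothetical.

```shell
#!/bin/sh
# suggest_threshold MAX [PERCENT]: print PERCENT (default 80) of MAX, rounded
# down, as a candidate alerting threshold. The 80% default is an assumption
# taken from the 25K -> 20000 example, not a vendor recommendation.
suggest_threshold() {
    max=$1
    pct=${2:-80}
    echo $(( max * pct / 100 ))
}

# Example: suggest_threshold 25000      (prints 20000)
#          suggest_threshold 25000 90   (prints 22500)
```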

Now, back in the Custom Poller Policy form, we can assign our Node Group. E.g. Checkpoint_LowEnd_Firewalls (a node group that includes IP260/IP290 and selects nodes by SysOIDs).

Once all forms are saved, we can verify this is working by navigating to Monitoring > Custom Polled Instances.

This configuration means that firewalls that are added to the topology will automatically fall into the correct node groups and alerting thresholds. It also means that the map will change when the threshold is breached. Split the node groups down as much as you want, but bear in mind that you will have to create a new polling policy/collection+threshold for each group.

It should also be noted that in NNMi, if you adjust a threshold, the collection policy will be suspended and you will have to re-enable it. Don’t let this catch you out!

You may find that keeping an eye on this particular aspect of your firewalls may save you some real headaches later on.