Checkpoint Connection Limit Woes

It’s been a while since I posted here so I thought I’d share something that’s been driving me absolutely insane for over a month at work.

We had episodes where the checkpoint connection table on one of our internet stack firewalls was getting maxed out, and tracking it down proved to be extremely difficult. Dumping the checkpoint firewall connections table and a quick bit of analysis:

fw tab –t –connections –f –u | awk ‘{print $9”,”$11”,”$13”,”$15”,”$43}’ > /tmp/connections.txt

Summary - top 20 sources
awk -F"," '{print $1}' /tmp/connections.txt | sort -n | uniq -c | sort -rn | head -20

Summary - top 20 destinations
awk -F"," '{print $3}' /tmp/connections.txt | sort -n | uniq -c | sort -rn | head -20

…didn’t really yield anything interesting and the times at which the issue was happening were completely random. Increasing the connections table limit just moved the problem further up the stack to the perimiter firewall! The top sources were the proxies which is to be expected. Without access to proxy logs this was also a pain (when you’re in a big organisation, you can’t just jump on to their kit and take a look, sadly).

In the end we decided to create new objects for http/https proxy, http and https etc for our Proxy traffic rule and set their timeouts low (10 mins). When we graphed the connections table we noticed that the spikes timed out after the low timeout we specified, proving beyond all doubt that the issue was either user or system based but only for those clients set to use the proxies.

After this I set up reports on our netflow collector to get some stats on traffic hitting the proxies and did a bit of digging via awk to find the top destination IPs – nothing out of the ordinary, certainly a lot of google traffic but that must be legit, right? So, I turned it around and looked at Client IPs to get a clue. We had to use realtime graphing on the Checkpoint to pick out exactly when the spikes were occuring so we could investigate netflow within a 1 minute window, otherwise it was like looking for a needle in a haystack.

In the end, and to cut a long story short, we found that some users had installed Google Chrome on their development PCs. For some reason, it seems that Google Chrome was creating over 17 THOUSAND connections in a very short space of time, and somehow, these weren’t being closed properly (whether by the browser or proxy, I’m still not sure). I replicated this behaviour on a user’s desktop with two perfectly legitimate sites in two tabs. netstat -an output on the user’s PC was not pretty… a scrolling mass of connections either established or in TIME_WAIT. Netflow suggests that almost all of these connections are never used to actually transfer any data so it must be something to do with the network behaviour prediction of Chrome.

Anyways, we banned Chrome from user desktops and now the issue has gone away. I also discovered that an older version of Opera on another user’s desktop had the same problem.

I hope this helps someone else suffering the same weird issues. It’s not funny when your public IP PAT port pool for browsing gets exhausted during business hours thanks to some rogue browser going mental.

Quickly getting a Nokia IP appliance restored

How to quickly restore the base IPSO config (not firewall policies).

Config will previously have been backed up ( /config/active ) to a management server as myfirewall-active.txt.

1) scp the active.txt to the Nokia IP Appliance

testbox# scp myfirewall-active.txt admin@myappliance:/config/active.txt

2) rename the current config

myappliance# cd /config
myappliance# mv active active.old
myappliance# mv active.txt active

3) Reload the device

All base configuration is now restored (interfaces, static routes etc etc). Now you can establish SIC and push the policy.

Checkpoint Firewall high interrupt CPU%

When this issue occurred, top was showing that the large majority of CPU was of interrupt category despite low traffic levels. Failing over to the secondary member of the cluster did not fix the problem; the fault moved. This issue can be reproduced on Nokia IP Appliances running IPSO and newer Checkpoint platforms running Gaia.

last pid: 59653;  load averages:  0.05,  0.07,  0.02   up 571+16:11:35 12:19:30
45 processes:  1 running, 44 sleeping
CPU states:  1.8% user,  0.0% nice,  1.8% system, 86.1% interrupt, 10.4% idle
Mem: 248M Active, 1321M Inact, 218M Wired, 72M Cache, 99M Buf, 143M Free
Swap: 4096M Total, 4096M Free

ps -aux was showing high CPU time consumed by [swi1: net_taskq0].

cpfw[admin]# ps -aux
root    14 98.2  0.0     0    16  ??  RL   10Feb12 65517:46.72 [swi1: net_taskq0]

Running netstat -ni showed errors incrementing on a few interfaces. At first this seemed like a hardware issue so failover to secondary was initiated. The problem moved to the other firewall.

After more digging, the culprit was found to be some new traffic streams of low bandwidth, but extremely high packet rate (in this case, some UDP syslog forwarding to a host beyond the firewall). A layer 3 switch at the source end was also having some issues so some of the traffic patterns may have been anomalous, compounding the issue.

This traffic was not permissioned on the firewall so was being matched by the drop rule. It seems that having a large rule base makes this issue even worse as traffic at a rate of thousands of packets per second is consuming a lot of CPU cycles. It was noted that adding a rule to permission the traffic near the top of the rule base dropped CPU usage significantly.

It makes sense to assume that as these streams are hitting the drop rule very frequently, rapid evaluations of the entire rulebase are taking place. The handling of “flows” for UDP traffic is probably more limited than is implied in IPSO/Gaia documentation.

It is worth enabling monitoring and finding this sort of traffic to allow you to create or move appropriate rules near the top of the rulebase to avoid unnecessary extra processing, especially if your rulebase is in the order of hundreds of rules.

I suppose you could conclude that you could quite easily DoS a policy-heavy checkpoint firewall by throwing a rapid stream of UDP packets to a far-side destination that doesn’t match anywhere in the rulebase. Note that this issue was encountered on an internal firewall where IPS was NOT enabled. IPS may mitigate this problem.

Resolve MAC addresses to Port, IP and DNS Name

Resolving MAC address to port, IP and DNS or name service name (or more simply for some, resolve mac to name) is a challenge that every network engineer has come across at some point in their career. It’s easily solved with a bit of thought and logic. Unfortunately the past few products I’ve dealt in the past with for this purpose have either been abandoned or aren’t as multi-vendor as I’d like, so it seems that the only solution is to write your own… bash and expect is sufficient.

If you’re thinking about doing this (and it’s a great learning exercise), you need to get around the following:

– Determining which interfaces are trunks on the switches so you can strip those MAC entries out (CDP works quite well)
– Converting ARP and MAC info into a “clean” format (eg: CatOS and IOS output is a different format)
– Detecting the fields across various pieces of hardware as display output isn’t always consistent for the same commands
– Inconsistent logins/passwords
– Correlating the IP/MAC/Interface information together. This can be done with the UNIX join command and some awk/sed
– What you do with MACs that don’t resolve to an IP address (I include a flag to print these if required)
– Whether the machine you run DNS queries on will be able to resolve the IPs to PTR records
– If using expect, stripping out stray characters (eg \r) that will mess up your greps and other string searches
– Add plenty of debugging so you can quickly tell why something isn’t working properly

I used expect to go and grab the ARP, CDP and MAC information seeing as you can’t get all the required information from SNMP on many devices these days. In my case, this results in the following type of output:

Switch       Interface       VLAN  MAC             IP               DNSName
nycsw12      Fa3/10          100   0060.b0aa.0000    NO_DNS
nycsw12      Fa2/16          99    1060.4b61.0001
nycsw12      Fa2/37          101   1060.4b64.0002
nycsw12      Fa2/42          101   1060.4b68.0003
nycsw12      Fa2/45          98    1060.4b6a.0004
nycsw12      Fa2/32          98    1060.4b6a.0005
nycsw12      Fa3/3           100   2c41.389e.d19f
nycsw13      Fa2/4           100   5c26.0a01.0ac4
nycsw13      Fa2/6           100   6c3b.e531.2ddf

Of course, you can always just use Excel to do a VLOOKUP of your mac-address table output against a sorted table containing all your arp entries, but that’s a bit less automatic.

NNMi and Firewall Connections Monitoring

One thing that is sometimes overlooked on firewalls is the connection count. Badly written applications or incorrect firewall configurations can mean that the connections table becomes saturated, causing disconnections, connection failures and a myriad of other problems. This can result in people running tests, seeing packet loss, and concluding that there must be a duplex mismatch, erroring link, or something else fundamental along the path.

On a side node, one of the painful situations that causes this on Checkpoint is someone adjusting the TCP timeout value in global properties to something way above the default. TCP timeouts on Checkpoint should ALWAYS be done on the service object level, NOT globally.

We can monitor the connection tables on firewalls via NNMi and generate alerts and also affect the node status (colour) on node maps to help us find these problems.

This article will cover Nokia Checkpoint and Cisco ASA Firewalls as an example, but it can easily be replicated for any firewall by using a different OID for the number of concurrent connections.

Firstly, ensure that you have loaded the MIBs for your firewalls (CHECKPOINT-MIB for Nokias and/or the CISCO-FIREWALL-TC and CISCO-UNIFIED-FIREWALL-MIB for ASA)

Secondly, for all Nokia Checkpoint firewalls, you MUST run cpconfig, and enable the SNMP extension. Bear in mind that you will need to restart checkpoint services with cp restart which is disruptive!

Now, we must create Node Groups for monitoring our firewalls. I suggest that you create node groups that define members by SysOID, as each model/configuration of firewall will have a different limit to the maximum amount of concurrent connections allowed. Low, Mid and Top-end groups are a good idea so you can define a reasonably granular threshold for each group. You should consult the vendor documents to decide on what is appropriate for your environment.

The following SysOIDs may be useful:

nokiaIP110      .
nokiaIP1220     .
nokiaIP1260     .
nokiaIP1280     .
nokiaIP150      .
nokiaIP2255     .
nokiaIP2450     .
nokiaIP260      .
nokiaIP266      .
nokiaIP290      .
nokiaIP3400     .
nokiaIP350      .
nokiaIP380      .
nokiaIP390      .
nokiaIP3XX      .
nokiaIP400      .
nokiaIP410      .
nokiaIP440      .
nokiaIP4XX      .
nokiaIP530      .
nokiaIP560      .
nokiaIP600      .
nokiaIP650      .
nokiaIP690      .
nokiaIP6XX      .
nokiaIP710      .
nokiaIP740      .
ASA5505         .
ASA5510         .
ASA5520         .
ASA5540         .
ASA5550         .

Now, we create a MIB expression for the OID we want to monitor.

Clicking on the right hand side of MIB Variable lets us drill down the MIB tree to the OID we want. In this case, it’s – fwNumConn. For ASAs, you want

This gives us the following:

We now create a Custom Poller Policy (The 25000 here suggests devices that have a maximum of 25K connections limit, but you could call it anything you like, such as “Low-End-CP-Firewalls-25K-MAX”). From here, we create a new collection policy (see right hand side of image below). We select “Generate incident” on Node collection to generate incidents when the threshold is breached, and we also select “Affect Node Status”, since connection count being over threshold is going to impact performance.

If using NPS, export the collection. You may also prefer to change the “Incident Source Object” to “Custom node Collection” rather than “Custom Polled Instance” as instances tend to work better for multiple objects within the same OID such as BGP peerings.

And then define a threshold… This should be a bit below the maximum amount of supported connections for the group of devices you are monitoring. Eg: for a 25K connections device, select 20000.

Now, back in the Custom poller policy form, we can assign our Node Group. EG: Checkpoint_LowEnd_Firewalls (a node group that includes IP260/IP290 and selects nodes by SysOIDs )

Once all forms are saved, we can verify this is working by navigating to Monitoring > Custom Polled Instances.

This configuration means that firewalls that are added to the topology will automatically fall into the correct node groups and alerting thresholds. It also means that the map will change when the threshold is breached. Split the node groups down as much as you want, but bear in mind that you will have to create a new polling policy/collection+threshold for each group.

It should also be noted that in NNMi, if you adjust a threshold, the collection policy will be suspended and you will have to re-enable it. Don’t let this catch you out!

You may find that keeping an eye on this particular aspect of your firewalls may save you some real headaches later on.