Linux: high load average but low CPU/memory – rogue cron jobs

NB: All the PIDs and commands here are from my Red Hat system. Do NOT paste any of these PIDs after a kill command on your own system – they are examples only.

I was asked to look into a server that was showing a very high load average (indeed, the app’s self-monitoring had reported it). Below is the top output that greeted me. Unfortunately this was one of those systems that, given its importance, you really don’t want to reboot if you can help it. O_o

top - 10:43:26 up 38 days, 23:16, 27 users,  load average: 226.11, 225.34, 223.05
Tasks: 1384 total,   1 running, 1383 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.2%us,  0.5%sy,  0.0%ni, 95.6%id,  0.6%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  66001768k total, 65069620k used,   932148k free,  2713600k buffers
Swap:  4194296k total,    49020k used,  4145276k free, 45540940k cached

As the output above shows, CPU usage is fine.

‘free -m’ output was normal for this system, showing memory usage was fine too.

‘iostat -xm 5’ output showed some quite scary wait times, which looked to be the issue, so it was time to dig into what was happening and look for any heavy file operations being run. Note: use ‘ls -lrt /dev/mapper’ to map device numbers back to filesystems.
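For reference, this is the sort of check I mean – a sketch only, as the device and volume names will obviously differ on your system:

# Extended stats in MB every 5 seconds; watch the await and %util columns
iostat -xm 5
# Map a busy dm-N device number back to its volume group/logical volume
ls -lrt /dev/mapper
# ...then match that LV against the mounted filesystems
mount | grep mapper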

I tend to be a fan of ‘ps aexf’ when it’s all going wrong, as it can quite quickly show any rogue process that has spawned a lot of processes/threads. In this case I found several hundred user processes trying to copy the same large file, so it looked like some failed logic in a user script run from cron. Since I knew the file wasn’t very important, I also quickly removed it to stop things from getting worse.

ps aexf extract:

 6351 ?        S      0:00  \_ crond
 6352 ?        Ss     0:00  |   \_ /bin/bash /home/someroguescript
26548 ?        D      0:04  |   |   \_ cp /tmp/somestupidlylargefile ...
 6354 ?        S      0:00  |   \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t
...and many more repeated instances.

The output above has been tidied up, but I could tell that those crond PIDs were spawned by the root crond process. Looking at the owner of that crond session in the normal ps -ef output quickly ID’d the culprit.

I tried killing one of the script instances but it zombied on me, i.e. went <defunct>. It looks like crond is quite fussy about having its children killed off.
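As an aside, you can list any zombies hanging around as below. A zombie can’t be killed again – its parent has to reap it, or die so that init adopts and reaps it:

# List <defunct> (zombie) processes
ps -ef | grep defunct | grep -v grep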

I’m always a bit nervous about killing off child processes of system essentials, but without much choice it was time to get brutal. I absolutely despise pkill, having been burnt by it once before, so it was time to gather all the “someroguescript” processes, along with their crond parent PIDs, in a safer manner.

# ps -ef | grep someroguescript | grep -v grep
5000       712   710  0 04:10 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000       964   959  0 Nov07 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000      1050  1049  0 11:55 ?        00:00:00 [someroguescript] <defunct>
5000      1154  1152  0 Nov07 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
-truncated-
5000     22059 22055  0 00:25 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000     22413 22412  0 06:55 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript

OK, that looks good. Pull out the PIDs and PPIDs and stick them all on one continuous line that kill can process.

ps -ef | grep someroguescript | grep -v grep | awk '{print $2" "$3}' | tr '\n' ' ' ; printf "\n"
1081 1433 1635 1805 1809 2288 3673 3688 3700 4574 4740 4750 
5104 5396 5406 5416 5427 5439 5449 5459 5924 5945 6259 6605 
7391 7402 7412 7422 7423 7433 7606 7617 7878 8165 8708 8718 
8730 8764 8845 8968 9015 9354 9445 9460 9756 9927 9939 9949 ...etc

Being super paranoid, I wanted to verify again that none of those PIDs or PPIDs were anything critical.

for pid in $(ps -ef | grep someroguescript | grep -v grep | awk '{print $2" "$3}'); do
    # Print the full ps line for each PID/PPID so nothing critical sneaks in
    ps -ef | awk '$2 == "'"$pid"'"'
done

I then pasted the very long PID/PPID list after the kill command. You could pipe the output through to ‘xargs kill’, but in all honesty there’s nothing worse than piping output into a command and then suddenly realising you did something wrong. It’s always best to take a few moments to ask yourself, “is this the correct thing to do?” :)
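If you do prefer the pipe, one safer pattern is to dry-run it with echo first – a sketch, using the [s] bracket trick so grep doesn’t match itself (which also saves the grep -v):

# Dry run: prints the kill command instead of running it
ps -ef | grep '[s]omeroguescript' | awk '{print $2" "$3}' | xargs echo kill
# Happy with what it printed? Drop the echo and run it for real.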

I checked those someroguescript processes were gone by seeing how many remained with:

ps -ef | grep someroguescript | grep -v grep | wc -l

Annoyingly, after checking my top and ‘ps aexf’ output again, I found that the system was still loaded and the file copy operations were still going. Time for another go at it:

Basic check to ensure we don’t catch anything unintended:

# ps -ef | grep somestupidlylargefile | grep -v grep

5000     32489     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
5000     32500     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
5000     32501     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
...and hundreds more

All looked good, so time to grab all the PIDs. I noted that these were all spawned by PID 1 (init), so I definitely didn’t want to kill that! Hence: only extract the PID ($2).

ps -ef | grep somestupidlylargefile | grep -v grep | awk '{print $2}' | tr '\n' ' ' ; printf "\n"
1081 1433 1635 1805 1809 2288 3673 3688 3700 4574 4740 4750 
5104 5396 5406 5416 5427 5439 5449 5459 5924 5945 6259 6605 
7391 7402 7412 7422 7423 7433 7606 7617 7878 8165 8708 8718 
8730 8764 8845 8968 9015 9949 10272 10287 10297 10333 10343 ...etc

I pasted all of those after ‘kill’ and everything was back to normal – the load average dropped pretty quickly. Also, the parent system crond process was still running jobs and there was no worrying output in /var/log/cron. :)

top - 15:06:41 up 39 days,  3:39, 21 users,  load average: 1.64, 1.83, 2.38
Tasks: 493 total,   1 running, 492 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.2%us,  0.6%sy,  0.0%ni, 95.5%id,  0.7%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  66001768k total, 65740228k used,   261540k free,  3008088k buffers
Swap:  4194296k total,   114096k used,  4080200k free, 46928236k cached

Lessons learned: killing user crond processes is safe; people don’t always put full sanity checks in their scripts; and user cron permissions should be carefully guarded. :)

X11 forwarding over SSH on firewalled CentOS host

I had a few issues with X11 forwarding over SSH on one of my CentOS hosts. After a bit of fiddling, I discovered that there were a couple of things I hadn’t taken into account.

I’d set my PuTTY session up to allow X11 forwarding, and set the X display location to “localhost”. On the server, I installed xclock and its dependencies for testing, and set the following in /etc/ssh/sshd_config:

X11Forwarding yes
X11DisplayOffset 10
X11UseLocalhost yes

I restarted sshd; however, it still wasn’t working.

In short, I was missing two things:

1) xauth wasn’t installed. This is required! (See the install sketch below the firewall rule.)
2) I wasn’t allowing connections to localhost in my iptables config. This was fixed in my ruleset with:

iptables -A INPUT -i lo -j ACCEPT
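For reference, xauth on CentOS comes from the xorg-x11-xauth package – the package name is from memory, so do check with ‘yum search xauth’ on your release:

# Install the X11 authority tooling that sshd needs for forwarding
yum install xorg-x11-xauth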

After installing xauth and adding the firewall rule, I restarted sshd and it now works a treat!
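A quick way to verify from a fresh SSH session – the :10.0 display number follows from the X11DisplayOffset 10 setting above:

echo $DISPLAY   # should show something like localhost:10.0
xclock &        # should pop up on your local desktop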

Spoofing SNMP Traps for testing

Somehow I missed the fact that JunOS allows you to spoof SNMP traps. I discovered this recently and must say it’s very handy, especially when testing new NNMi or other NMS incident configurations. It helpfully populates the varbinds for you with preset values (although you can specify them if desired).

This is done as follows from the JunOS command line:

user@SRX> request snmp spoof-trap ospfNbrStateChange variable-bindings "ospfNbrState = 8"

You can look up varbind names in the appropriate MIB.
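If you have net-snmp to hand, snmptranslate will do the name-to-OID lookup for you, assuming the relevant MIB is installed where net-snmp can find it:

# Print the numeric OID for the ospfNbrState varbind
snmptranslate -On OSPF-MIB::ospfNbrState

That should come back as .1.3.6.1.2.1.14.10.1.6, which is the varbind used in the shell version below.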

A bit easier than the traditional equivalent in shell:

/usr/bin/snmptrap -v 2c -c mycommunity nms.mydomain.com:162 '' .1.3.6.1.2.1.14.16.2.0.2 \
.1.3.6.1.2.1.14.1.1 a 0.0.0.0 \
.1.3.6.1.2.1.14.10.1.1 a "1.2.3.4" \
.1.3.6.1.2.1.14.10.1.2 i "0" \
.1.3.6.1.2.1.14.10.1.3 a "2.3.4.5" \
.1.3.6.1.2.1.14.10.1.6 i "8"

Bypassing header checks for local clients (Postfix/Amavis)

Issue: email with non-legitimate headers (e.g. generated from scripts) sent from one of my servers was being trashed by Postfix/Amavis. Very annoying, and in the end I had to modify /etc/amavisd/amavisd.conf as follows to make things work properly:

$policy_bank{'MYNETS'} = {   # mail originating from @mynetworks
  originating => 1,  # is true in MYNETS by default, but let's make it explicit
  os_fingerprint_method => undef,  # don't query p0f for internal clients
  bypass_spam_checks_maps   => [1],  # don't spam-check internal mail
  bypass_banned_checks_maps => [1],  # don't banned-check internal mail
  bypass_header_checks_maps => [1],  # don't header-check internal mail
};

Note that @mynetworks is defined as follows, where 1.2.3.4/32 is an additional server not in RFC 1918 space.

@mynetworks = qw( 127.0.0.0/8 [::1] [FE80::]/10 [FEC0::]/10
                  10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 1.2.3.4/32 );
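Don’t forget to reload amavisd afterwards so the policy bank takes effect – a sketch, as the init script name can vary by distro/packaging:

# Reload amavisd to pick up the config change
service amavisd reload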

Filtering by Host and Message in syslog-ng

I had a requirement to filter logs by source IP and by part of the message text, sending the matches off to another file. The proof of concept was as follows:

– Single source IP for logs
– Match two source subnets in message text

Configuration in /etc/syslog-ng.conf is:

source s_net {
       tcp(max-connections(10));
       udp();
};

destination d_test { file ("/var/log/test.log"); };

filter f_test { netmask(178.79.153.178/32) and
                    (message("Client IP: 10\.1\.1\.") or
                     message("Client IP: 10\.10\.10\.")); };

log { source(s_net); filter(f_test); destination(d_test); };

This separates out logs containing the specified Client IP addresses and doesn’t seem to hit the CPU too hard. netmask() was used rather than host() to avoid any issues with hostnames being resolved and then failing to match the filter.
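A rough way to test it – a sketch, assuming nc is available; the message must be sent from the 178.79.153.178 source host so the netmask() match fires, and <syslog-server> stands in for your collector’s address:

# Syntax-check the config before restarting
syslog-ng -s
# Fire a matching test message at the UDP listener
echo '<13>test: Client IP: 10.1.1.42 hello' | nc -u -w1 <syslog-server> 514
# Confirm it landed
tail /var/log/test.log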