Linux High load average but low CPU/memory – rogue cron jobs

NB: All the PIDs and commands here are from my RedHat system. Do NOT copy any of those PIDs after the kill command on your own system – these are examples.

I was asked to look into a server that was showing a very high load average (indeed the app self monitoring reported this). Below is the top output that greeted me. Unfortunately this was one of those systems that, given its importance, you really don’t want to reboot if you can help it. O_o

top - 10:43:26 up 38 days, 23:16, 27 users,  load average: 226.11, 225.34, 223.05
Tasks: 1384 total,   1 running, 1383 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.2%us,  0.5%sy,  0.0%ni, 95.6%id,  0.6%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  66001768k total, 65069620k used,   932148k free,  2713600k buffers
Swap:  4194296k total,    49020k used,  4145276k free, 45540940k cached

As above, CPU is fine.

‘free -m’ was normal for the system showing memory usage was fine.

‘iostat -xm 5’ output showed some quite scary wait times which looked to be the issue so it was time to dig into what was happening. Time to look for any heavy file operations being run… note: use ls -lrt /dev/mapper to map device numbers to filesystems.

I tend to be a fan of ‘ps aexf’ when it’s all going wrong as this can quite quickly show any rogue process that has spawned a lot of processes/threads. In this case I found several hundred user processes trying to copy the same large file so it looked like some failed logic in a user script run by cron. I also quickly removed this file to stop things from getting worse seeing as I knew it wasn’t very important.

ps aexf extract:

6351 ?        S      0:00  \_ crond 
6352 ?        Ss     0:00  |   \_ /bin/bash /home/someroguescript 
26548 ?        D      0:04 |   |   \_ cp /tmp/somestupidlylargefile ...
6354 ?        S      0:00  |   \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t 
...and repeated instances times many. 

Output above is tidied up but I could tell that the crond PIDs were spawned by the root crond process. Looking at the owner of that crond process in normal ps -ef output quickly ID’d the culprit.

I tried killing one of the script instances but it zombied on me ie: <defunct>. it looks like crond is quite fussy about having its children killed off.

I’m always a bit nervous about killing off child processes of system essentials but without much choice, it was time to get brutal. I absolutely despise pkill having been burnt once before, so it was time to go and get all the “someroguescript” processes, along with their crond parent PIDs in a safer manner.

# ps -ef | grep someroguescript | grep -v grep
5000       712   710  0 04:10 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000       964   959  0 Nov07 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000      1050  1049  0 11:55 ?        00:00:00 [someroguescript] <defunct>
5000      1154  1152  0 Nov07 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
-truncated-
5000     22059 22055  0 00:25 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000     22413 22412  0 06:55 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript

OK that looks good. Pull out the PIDs and PPIDs and stick them all on a continuous line that kill can process.

ps -ef | grep someroguescript | grep -v grep | awk '{print $2" "$3}' | tr '\n' ' ' ; printf "\n"
1081 1433 1635 1805 1809 2288 3673 3688 3700 4574 4740 4750 
5104 5396 5406 5416 5427 5439 5449 5459 5924 5945 6259 6605 
7391 7402 7412 7422 7423 7433 7606 7617 7878 8165 8708 8718 
8730 8764 8845 8968 9015 9354 9445 9460 9756 9927 9939 9949 ...etc

Being super paranoid, I wanted to verify again that none of those PIDS or PPIDs were anything critical.

for pid in `ps -ef | grep someroguescript | grep -v grep | awk '{print $2" "$3}' | tr '\n' ' '`; do ps -ef | awk '$2 == "'"$pid"'"{print $0}'; done

I then pasted the very long PID/PPID list after the “kill” command. You could pipe the command through to ‘xargs kill` but in all honesty there’s nothing worse than piping output through to a command and then suddenly realising you did something wrong. Always best to take a few moments to ask yourself, “is this the correct thing to do?” :)

I checked those someroguescript processes were gone by seeing how many remained with:

ps -ef | grep someroguescript | grep -v grep | wc -l

Annoyingly after checking my top and ‘ps aexf’ output again, I found that the system was still loaded and file copy operations were still going. Time for another go at it:

Basic check to ensure we don’t catch anything unintended:

# ps -ef | grep somestupidlylargefile | grep -v grep

5000     32489     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
5000     32500     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
5000     32501     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
...and hundreds more

All looked good so time to grab all the PIDs. I noted here that these were all spawned by PID1 (init) so definitely didn’t want to kill that! So.. only extract PID ($2).

ps -ef | grep somestupidlylargefile | grep -v grep | awk '{print $2}' | tr '\n' ' ' ; printf "\n"
1081 1433 1635 1805 1809 2288 3673 3688 3700 4574 4740 4750 
5104 5396 5406 5416 5427 5439 5449 5459 5924 5945 6259 6605 
7391 7402 7412 7422 7423 7433 7606 7617 7878 8165 8708 8718 
8730 8764 8845 8968 9015 9949 10272 10287 10297 10333 10343 ...etc

I pasted all of those after ‘kill’ and everything was back to normal. The load average dropped pretty quickly. Also, the the parent system crond process was still running jobs and there was no worrying output in /var/log/cron. :)

top - 15:06:41 up 39 days,  3:39, 21 users,  load average: 1.64, 1.83, 2.38
Tasks: 493 total,   1 running, 492 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.2%us,  0.6%sy,  0.0%ni, 95.5%id,  0.7%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  66001768k total, 65740228k used,   261540k free,  3008088k buffers
Swap:  4194296k total,   114096k used,  4080200k free, 46928236k cached

Lessons learned: killing user crond processes is safe. People don’t always do full sanity checks in their scripts. User cron permission should be carefully guarded. :)

Tagged , , , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *