Linux High load average but low CPU/memory – rogue cron jobs

NB: All the PIDs and commands here are from my Red Hat system. Do NOT copy any of those PIDs after the kill command on your own system – these are examples.

I was asked to look into a server that was showing a very high load average (indeed, the app’s self-monitoring had reported it). Below is the top output that greeted me. Unfortunately this was one of those systems that, given its importance, you really don’t want to reboot if you can help it. O_o

top - 10:43:26 up 38 days, 23:16, 27 users,  load average: 226.11, 225.34, 223.05
Tasks: 1384 total,   1 running, 1383 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.2%us,  0.5%sy,  0.0%ni, 95.6%id,  0.6%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  66001768k total, 65069620k used,   932148k free,  2713600k buffers
Swap:  4194296k total,    49020k used,  4145276k free, 45540940k cached

As above, CPU is fine.

‘free -m’ was normal for this system, showing that memory usage was fine.

‘iostat -xm 5’ output showed some quite scary wait times, which looked to be the issue, so it was time to dig into what was happening and look for any heavy file operations being run. Note: use ls -lrt /dev/mapper to map the device numbers to filesystems.
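
For reference, the two commands I was bouncing between at this point (nothing exotic, just stock sysstat and coreutils):

iostat -xm 5            # watch await/%util per device; LVM volumes show up as dm-N
ls -lrt /dev/mapper     # the device numbers here let you tie a busy dm-N back to its volume/filesystem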

I tend to be a fan of ‘ps aexf’ when it’s all going wrong, as it can quite quickly show any rogue process that has spawned a lot of processes/threads. In this case I found several hundred user processes trying to copy the same large file, so it looked like some failed logic in a user script run by cron. I also quickly removed the file to stop things getting worse, seeing as I knew it wasn’t very important.

ps aexf extract:

6351 ?        S      0:00  \_ crond 
6352 ?        Ss     0:00  |   \_ /bin/bash /home/someroguescript 
26548 ?        D      0:04 |   |   \_ cp /tmp/somestupidlylargefile ...
6354 ?        S      0:00  |   \_ /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t 
...and many more repeated instances.

The output above is tidied up, but I could tell that these crond PIDs were spawned by the root crond process. Looking at the owner of that crond process in the normal ps -ef output quickly identified the culprit.

I tried killing one of the script instances but it zombied on me, i.e. went <defunct>. It looks like crond is quite fussy about having its children killed off.
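
As an aside, if you want to confirm who is responsible for reaping a defunct entry, ps can show its parent directly – a quick sketch, with <pid> standing in for whichever instance zombied on you:

ps -o ppid=,stat=,cmd= -p <pid>

A ‘Z’ in the STAT column confirms the zombie, and the PPID is the parent that needs to reap it (or be killed/restarted) before the entry disappears.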

I’m always a bit nervous about killing off child processes of system essentials, but without much choice it was time to get brutal. I absolutely despise pkill, having been burnt by it once before, so it was time to go and get all the “someroguescript” processes, along with their crond parent PIDs, in a safer manner.

# ps -ef | grep someroguescript | grep -v grep
5000       712   710  0 04:10 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000       964   959  0 Nov07 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000      1050  1049  0 11:55 ?        00:00:00 [someroguescript] <defunct>
5000      1154  1152  0 Nov07 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
-truncated-
5000     22059 22055  0 00:25 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript
5000     22413 22412  0 06:55 ?        00:00:00 /bin/bash /home/someuser/scripts/./someroguescript

OK, that looks good. Pull out the PIDs and PPIDs and stick them all on one continuous line that kill can process.

ps -ef | grep someroguescript | grep -v grep | awk '{print $2" "$3}' | tr '\n' ' ' ; printf "\n"
1081 1433 1635 1805 1809 2288 3673 3688 3700 4574 4740 4750 
5104 5396 5406 5416 5427 5439 5449 5459 5924 5945 6259 6605 
7391 7402 7412 7422 7423 7433 7606 7617 7878 8165 8708 8718 
8730 8764 8845 8968 9015 9354 9445 9460 9756 9927 9939 9949 ...etc

Being super paranoid, I wanted to verify again that none of those PIDs or PPIDs were anything critical.

for pid in `ps -ef | grep someroguescript | grep -v grep | awk '{print $2" "$3}' | tr '\n' ' '`; do ps -ef | awk '$2 == "'"$pid"'"{print $0}'; done
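
(As an aside, the same check can be done in a single ps call – a sketch, assuming a procps-style ps that accepts a comma-separated list for -p:)

ps -fp "$(ps -ef | grep someroguescript | grep -v grep | awk '{print $2","$3}' | paste -sd, -)"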

I then pasted the very long PID/PPID list after the “kill” command. You could pipe the output through to ‘xargs kill’ but, in all honesty, there’s nothing worse than piping output through to a command and then suddenly realising you did something wrong. Always best to take a few moments to ask yourself, “is this the correct thing to do?” :)
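
If you do prefer the xargs route, a safer pattern (just a sketch reusing the pipeline above) is to dry-run it with echo first:

ps -ef | grep someroguescript | grep -v grep | awk '{print $2" "$3}' | xargs echo kill

Check that the printed kill command looks sane, then drop the echo – or paste the list after kill by hand, as I did.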

I checked those someroguescript processes were gone by seeing how many remained with:

ps -ef | grep someroguescript | grep -v grep | wc -l

Annoyingly, after checking my top and ‘ps aexf’ output again, I found that the system was still loaded and the file copy operations were still going. Time for another go at it:

Basic check to ensure we don’t catch anything unintended:

# ps -ef | grep somestupidlylargefile | grep -v grep

5000     32489     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
5000     32500     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
5000     32501     1  0 11:33 ?        00:00:01 cp /tmp/somestupidlylargefile ...
...and hundreds more

All looked good, so time to grab all the PIDs. I noted here that these were all spawned by PID 1 (init), which I definitely didn’t want to kill! So, only extract the PID ($2) this time.

ps -ef | grep somestupidlylargefile | grep -v grep | awk '{print $2}' | tr '\n' ' ' ; printf "\n"
1081 1433 1635 1805 1809 2288 3673 3688 3700 4574 4740 4750 
5104 5396 5406 5416 5427 5439 5449 5459 5924 5945 6259 6605 
7391 7402 7412 7422 7423 7433 7606 7617 7878 8165 8708 8718 
8730 8764 8845 8968 9015 9949 10272 10287 10297 10333 10343 ...etc

I pasted all of those after ‘kill’ and everything was back to normal. The load average dropped pretty quickly. Also, the parent system crond process was still running jobs and there was no worrying output in /var/log/cron. :)

top - 15:06:41 up 39 days,  3:39, 21 users,  load average: 1.64, 1.83, 2.38
Tasks: 493 total,   1 running, 492 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.2%us,  0.6%sy,  0.0%ni, 95.5%id,  0.7%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  66001768k total, 65740228k used,   261540k free,  3008088k buffers
Swap:  4194296k total,   114096k used,  4080200k free, 46928236k cached

Lessons learned: killing user crond processes is safe. People don’t always do full sanity checks in their scripts. User cron permissions should be carefully guarded. :)

Policy tracing from the CLI on firewalls

This post is just a quick reference on how to see whether or not some traffic will pass the rulebase.

Tracing policy matching on Juniper SRX (for actual live sessions use ‘show security flow session’)

show security match-policies from-zone trust to-zone untrust source-port 1024 destination-port 40961 protocol tcp source-ip 10.243.0.1 destination-ip 10.243.15.12

Tracing policy matching on Cisco ASA (for live connections, use ‘show conn’)

packet-tracer input inside tcp 10.0.0.1 1024 4.2.2.1 443 [detailed]
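
For reference, the positional arguments in that ASA example are (the detailed keyword is optional):

packet-tracer input <ingress-interface> tcp <source-ip> <source-port> <destination-ip> <destination-port> [detailed]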

Tracing policy matching on Palo Alto

test security-policy-match application twitter-posting source-user acme\mcanha destination 199.59.150.7 destination-port 80 source 10.40.14.197 protocol 6 

NB: protocol 6 = TCP, 17 = UDP, 1 = ICMP
see https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml

NNMi 9.2x AF Cluster Reformation DB Start Failure

I was trying to reform an application failover cluster after having to restore a backup and was running into issues. The primary came up absolutely fine but the secondary kept coming up with status “STANDBY_DB_FAILED”.

Looking into the /var/opt/OV/shared/nnm/databases/Postgresql/pg_log files, I initially had this after the DBZIP was received and extracted, and the DB start was attempted:

2017-09-01 13:14:48.980 GMT: :4875:0LOG:  database system was interrupted; last known up at 2017-09-01 12:18:23 GMT
2017-09-01 13:14:49.129 GMT: :4875:0LOG:  entering standby mode
2017-09-01 13:14:49.497 GMT: :4875:0LOG:  restored log file "0000000D000005F700000095" from archive
2017-09-01 13:14:49.589 GMT: postgres:4971:0FATAL:  the database system is starting up
2017-09-01 13:14:49.855 GMT: :4875:0LOG:  restored log file "0000000D000005F700000093" from archive
2017-09-01 13:14:49.878 GMT: :4875:0FATAL:  could not access status of transaction 168395881
2017-09-01 13:14:49.878 GMT: :4875:0DETAIL:  Could not read from file "pg_clog/00A0" at offset 155648: Success.
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  startup process (PID 4875) exited with exit code 1
2017-09-01 13:14:49.880 GMT: :4873:0LOG:  aborting startup due to startup process failure

Shutting down, starting in standalone mode, then stopping again made a bit more progress – I saw some cleanup activity before it started – but I still got this:

2017-09-01 13:28:08.503 GMT: :21253:0LOG:  database system was shut down at 2017-09-01 13:26:27 GMT
2017-09-01 13:28:08.649 GMT: :21253:0LOG:  entering standby mode
2017-09-01 13:28:08.799 GMT: :21253:0WARNING:  WAL was generated with wal_level=minimal, data may be missing
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
2017-09-01 13:28:08.799 GMT: :21253:0FATAL:  hot standby is not possible because wal_level was not set to "hot_standby" on the master server
2017-09-01 13:28:08.799 GMT: :21253:0HINT:  Either set wal_level to "hot_standby" on the master, or turn off hot_standby here.
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  startup process (PID 21253) exited with exit code 1
2017-09-01 13:28:08.800 GMT: :21251:0LOG:  aborting startup due to startup process failure

In the end, what worked was the following:

1) On the Active member, within nnmcluster, trigger a manual DB backup with dbsync
2) On the Standby member, I removed the /var/opt/OV/shared/nnm/node-cluster.sentinel file – this may or may not have helped
3) On the Standby server, I restarted nnmcluster (nnmcluster -daemon) – see the sketch below.
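
For clarity, the standby-side part of that (steps 2 and 3) boils down to the following, using the path and command exactly as above – how you stop any nnmcluster process that is already running is left out here:

rm /var/opt/OV/shared/nnm/node-cluster.sentinel
nnmcluster -daemon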

In summary, it looks like there was perhaps some corruption or other stale data in the initial backup, because the errors I saw on the standby weren’t present on the primary. Once the standby had come back up cleanly, the cluster status looked healthy again:


  Local?    NodeType  State                     OvStatus     Hostname/Address
  ------    --------  -----                     --------     ----------------------------
* LOCAL     DAEMON    ACTIVE_NNM_RUNNING        RUNNING      servera/servera-22547
  (SELF)    ADMIN     n/a                       n/a          servera/servera-64198
  REMOTE    DAEMON    STANDBY_READY             DB_RUNNING   serverb/serverb-40214

TACACS on 4431 Management Interface

Getting TACACS working via the Cisco 4431 management interface threw up a few issues and took a few tweaks. The final issue I found was that referencing all servers with the tacacs+ keyword doesn’t work; you have to reference the TACACS group with the servers defined within it.

Below is a working configuration example for TACACS via the management port in the Mgmt-intf VRF. I’ve also included a non-exhaustive set of examples for getting a few other things working over the same interface.

! Mgmt interface config
!
interface GigabitEthernet0
 description ** Mgmt intf **
 vrf forwarding Mgmt-intf
 ip address 192.168.0.1 255.255.255.0
 negotiation auto
!
!
! Default route for Management VRF
!
ip route vrf Mgmt-intf 0.0.0.0 0.0.0.0 192.168.0.254
!
!
! Define source interface at global level
!
ip tacacs source-interface GigabitEthernet0
!
! aaa config
!
aaa new-model
!
!
! Server-private restricts only within this VRF.
! VRF forwarding and source interface need to be configured
! within the aaa group context too.
!
aaa group server tacacs+ TACACS
 server-private 10.0.0.100 key MYKEY
 server-private 10.0.1.100 key MYKEY
 ip vrf forwarding Mgmt-intf
 ip tacacs source-interface GigabitEthernet0
!
! Falls back to the enable password if TACACS is not available in this config.
!
aaa authentication login REMOTE_ACCESS group TACACS enable
aaa authentication enable default group TACACS enable
aaa accounting exec REMOTE_ACCESS
 action-type stop-only
 group TACACS
!
aaa accounting commands 15 REMOTE_ACCESS
 action-type stop-only
 group TACACS
!
aaa session-id common
!
!
! Apply to vtys and console if you need to.
!
line vty 0 4
 accounting commands 15 REMOTE_ACCESS
 accounting exec REMOTE_ACCESS
 login authentication REMOTE_ACCESS
line vty 5 15
 accounting commands 15 REMOTE_ACCESS
 accounting exec REMOTE_ACCESS
 login authentication REMOTE_ACCESS
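
Once that’s in place, a couple of standard IOS commands are handy for confirming the router is actually talking to the TACACS servers (nothing here is specific to this config):

show tacacs
debug aaa authentication
debug tacacs

Remember to ‘undebug all’ when you’re done.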

Syslog

logging source-interface GigabitEthernet0 vrf Mgmt-intf
logging host 10.0.0.101 vrf Mgmt-intf

TFTP (automatic config archive after wr mem)

ip tftp source-interface GigabitEthernet0

archive
 path tftp://10.0.0.101/configs/$h-
 write-memory

SNMP Traps

snmp-server trap-source GigabitEthernet0
snmp-server host 10.0.0.101 vrf Mgmt-intf version 2c MYCOMMUNITY

Recovering a Cisco 3850 that’s stuck in Boot Loop/ROMMON

This wasn’t even after an upgrade, just a reload, but the messages below were displayed before an unmount and reboot of the switch:

Start type: SRV_OPTION_RESTART_STATELESS (23)
Death reason: SYSMGR_DEATH_REASON_FAILURE_SIGNAL (2)
Last heartbeat 0.00 secs ago

PID: 9225
Exit code: signal 11 (no core)

Quick steps:

1) Set up a local switch with the correct image on it as a tftp server to expedite the fix, as tftp is painfully slow.

tftp-server flash:[image-name.bin]

2) Configure an interface with an IP to allow for direct connection to the management port of the broken switch.

interface GigabitEthernet1/0/47
 description -= temp for tftp to other switch =-
 no switchport
 ip address 192.168.0.1 255.255.255.0

3) Connect (or ask someone on site to connect) the interface to the management port of the broken switch (duh!)

4) On the broken switch, hold MODE for 10 seconds to interrupt the boot loop, or just wait for 5 or so failures for it to drop to ROMMON.

5) Set up the IP info. The default gateway is only necessary if you’re tftping from a server outside the local subnet.

switch: set IP_ADDR 192.168.0.2/255.255.255.0

switch: set DEFAULT_ROUTER 192.168.0.1

6) Check for emergency files (you’re looking for cat3k_caa-recovery.bin or similar).

switch: dir sda9:

7) Ping the tftp server

switch: ping 192.168.0.1
ping 192.168.0.1 with 32 bytes of data ...
Up 1000 Mbps Full duplex (port  0) (SGMII)
Host 192.168.0.1 is alive.

8) Start the tftp emergency install. On a local connection this will take 10-20 mins.

switch: emergency-install tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin
The bootflash will be erased during install operation, continue (y/n)?y
Starting emergency recovery (tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin)...
Reading full image into memory......................done
Nova Bundle Image
--------------------------------------
Kernel Address    : 0x6042e5d8
Kernel Size       : 0x31794f/3242319
Initramfs Address : 0x60745f28
Initramfs Size    : 0xdbec9d/14412957
Compression Format: .mzip

Bootable image at @ ram:0x6042e5d8
Bootable image segment 0 address range [0x81100000, 0x81b80000] is in range [0x80180000, 0x90000000].
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
File "sda9:cat3k_caa-recovery.bin" uncompressed and installed, entry point: 0x811060f0
Loading Linux kernel with entry point 0x811060f0 ...
Bootloader: Done loading app on core_mask: 0xf

### Launching Linux Kernel (flags = 0x5)



Initiating Emergency Installation of bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin


Downloading bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...

Validating bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Installing bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Verifying bundle tftp://192.168.0.1/cat3k_caa-universalk9.SPA.03.03.03.SE.150-1.EZ3.bin...
Package cat3k_caa-base.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-drivers.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-infra.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-iosd-universalk9.SPA.150-1.EZ3.pkg is Digitally Signed
Package cat3k_caa-platform.SPA.03.03.03SE.pkg is Digitally Signed
Package cat3k_caa-wcm.SPA.10.1.130.0.pkg is Digitally Signed
Preparing flash...
Syncing device...
Emergency Install successful... Rebooting
Restarting system.

Once this is done it’ll try to boot again. If it stops at the prompt below, you need to disable manual boot.

The system is not configured to boot automatically.  The
following command will finish loading the operating system
software:

    boot

switch: set MANUAL_BOOT no
switch: boot

Also remember to change the confreg value if it’s not 0x102 on the 3850 (it’s shown on the last line of show version). In this case it wasn’t needed.


Configuration register is 0x102

Don’t forget to remove the tftp-server config and temporary stuff. :)
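
For completeness, the cleanup on the donor switch is roughly the reverse of steps 1 and 2 – a sketch, so adjust for however that port is normally used:

no tftp-server flash:[image-name.bin]
default interface GigabitEthernet1/0/47

Then re-apply whatever configuration that interface usually carries.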