Today we came across an issue where an SRX had very high CPU usage. After a bit of digging it turned out to be the httpd process, which runs J-Web.
“show chassis routing-engine” normally produces output like the example below (this is not the actual output from the affected box, it is only intended as an example); on the problem box, however, user CPU was close to 100%.
user@TESTFW02> show chassis routing-engine
node0:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 55 degrees C / 131 degrees F
    Total memory              512 MB Max  394 MB used ( 77 percent)
      Control plane memory    336 MB Max  299 MB used ( 89 percent)
      Data plane memory       176 MB Max   95 MB used ( 54 percent)
    CPU utilization:
      User                       3 percent
      Background                 0 percent
      Kernel                     9 percent
      Interrupt                  0 percent
      Idle                      87 percent
    Model                          RE-SRX100B
    Serial ID                      XXXXXXXXXX
    Start time                     2017-04-10 02:30:18 UTC
    Uptime                         127 days, 12 hours, 1 minute, 12 seconds
    Last reboot reason             0x1000:reboot due to panic
    Load averages:                 1 minute   5 minute  15 minute
                                       0.18       0.17       0.11
node1:
--------------------------------------------------------------------------
Routing Engine status:
    Temperature                 52 degrees C / 125 degrees F
    Total memory              512 MB Max  415 MB used ( 81 percent)
      Control plane memory    336 MB Max  316 MB used ( 94 percent)
      Data plane memory       176 MB Max   97 MB used ( 55 percent)
    CPU utilization:
      User                      10 percent
      Background                 0 percent
      Kernel                    14 percent
      Interrupt                  0 percent
      Idle                      75 percent
    Model                          RE-SRX100B
    Serial ID                      XXXXXXXXXXX
    Start time                     2017-04-10 02:10:31 UTC
    Uptime                         127 days, 12 hours, 21 minutes, 1 second
    Last reboot reason             0x1000:reboot due to panic
    Load averages:                 1 minute   5 minute  15 minute
                                       0.21       0.20       0.15
To nail down the culprit, you can do the following:
user@TESTFW02> start shell
% top
Bear in mind that some platforms have a process that deliberately sits at a high CPU value in order to maintain performance (e.g. flowd_octeon). Check against Juniper documentation before jumping to conclusions about a particular process. We are looking for something unusual and fairly obvious.
The top output should hint at the culprit. In this case it was httpd (J-Web).
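If you would rather stay in the Junos CLI than drop to the shell, a standard operational command gives a similar per-process view (the exact columns vary by platform and release). A minimal example, filtering for the process we suspected here:

user@TESTFW02> show system processes extensive | match httpd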
We could have restarted with:
restart web-management
However, we are managing this box via Junos Space, which uses NETCONF, so for us it was safe to disable the service entirely:
delete groups node0 system services web-management
delete groups node1 system services web-management
delete system services web-management
commit
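An alternative, if you want to keep the configuration around for later, is to deactivate the stanzas rather than delete them; deactivated configuration stays in place and can be brought back with "activate". A rough sketch, assuming the same node-group layout as above:

deactivate groups node0 system services web-management
deactivate groups node1 system services web-management
deactivate system services web-management
commit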
Exiting the shell and checking the routing-engine state with “show chassis routing-engine” showed the CPU quickly coming back down to normal. Bear in mind the figures appear to be a rolling one-minute average, so it will take a minute or so for them to normalise completely.
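If J-Web is needed again later, the service can simply be configured back in and committed. A minimal sketch, assuming plain HTTP under the same node groups as before (adjust for HTTPS, certificates and interface restrictions as appropriate for your setup):

set groups node0 system services web-management http
set groups node1 system services web-management http
set system services web-management http
commit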