Increasing the PHP-FPM timeout setting halved the load on a server experiencing an odd race condition.
High Load Issue
We have been suffering abysmal performance on a high traffic Wordpress website. The server was experiencing extremely high persistent load and I couldn't figure out the culprit. Traffic has increased substantially over the past few months and to mitigate the higher load, we installed php-fpm on the server. PHP-FPM is a pooling mechanism that allows PHP to create a cordoned off area whereby the other PHP applications on the server are not effected by the high traffic experienced by one website. This worked well, insuring that the other apps on the server weren't effected by the traffic increase, but in May the server began experiencing some extremely high load conditions. The website was becoming unresponsive and even worse, other sites were being effected.
On June 26th the server went into a tailspin. MySQL was using 4 cores of processing. I ssh'd into the server and started the MySQL console and started running
show processlist; to analyze the load. All of the available query slots were full but they were all stalled -- the COMMAND column was showing that they were all set to "SLEEP". I ran a quick query to kill the sleeping processes, but another set immediately replaced them. I did this repeatedly, and instantly the processlist was showing that another set of SLEEPing processes were waiting. WTF?
I spent the good part of two hours trying to elicit some meaning from this madness. My initial thought was that we were experiencing a DDOS attack. I tailed the Apache logs; there was traffic, but nothing warranting the massive load that was occurring. The problem was spiraling out of control as other MySQL reliant websites were now unresponsive. I decided that it was time to restart MySQL. This is not recommended. Occasionally MySQL tables can be corrupted by a restart. I was out of options though as it was stalling all of the websites on the server. I ran the restart command, but it failed. My stomach dropped. That systemctl
status command showed that it was running, but there were no MySQL processes running. Something was very wrong. With few readily available options I rebooted the server. After rebooting, MySQL came back up, and initially it seemed to be working, but gradually over the next few minutes the processlist showed that the query queue was filling with SLEEPing processes.
PHP-FPM Timeout and a Slow Query Race Condition
I did some slow query logging in May when I was investigating the load. There were a few queries that were taking more than ten seconds to complete. I was able to alleviate some of the bottleneck by tweaking the cache plugin, insuring that these queries were run infrequently. The caching worked for a few weeks, but traffic kept increasing, and by June 26th I was dreading the idea that I was going to have to write a set of faster queries for a 3rd-party Wordpress plugin. When MySQL kept filling up with SLEEPing processes, I knew that there must be another culprit. Why weren't these queries finishing or clearing and when I cleared the current set, why were they immediately replaced by another set?
Maybe the Wordpress PHP/MySQL configuration was wonky. I looked at the config file but there was nothing out of place. The configuration was standard -- used by the other non-Wordpress sites on the server. The only difference was having PHP-FPM enabled for the website. I looked at the PHP-FPM configuration in cPanel:
timeout was set to 5 seconds. I suddenly realized what was happening. PHP-FPM was sending the queries to MySQL, the query was taking longer than 5 seconds, so the PHP-FPM process was timing out, leaving MySQL with an ever-growing queue of queries while the original PHP code had long ago been killed.
I changed the PHP-FPM timeout setting to 30 seconds, and voila: the load halved almost immediately. The processlist completely cleared within a matter of minutes.