I'm not a sysadmin, but I end up doing my best now and then when one of my sites gets into trouble. This is a sort of "after action report" of an incident that I just resolved (hopefully).
I woke up and happened to check email on my phone (I don't always do this, but I will now) and was greeted with an Uptime Robot email saying that one of my sites was down, and had been for about 4 hours. I quickly checked the site on my phone and yup, it wasn't loading.
Ran to the office and hopped on my laptop. SSH to the server, and everything seems fine. Very little load on the server (AWS instance). Restarted Apache/PHP/MySQL and the site is still down. Weird. Running the site's index.php file on the command line works as expected, and fast.
Asked a few other people to check, and it's down for them too. Then I logged into the AWS console and checked on status there - everything is up and running.... WTF? This is a Lightsail instance, and then I noticed the outgoing network traffic had dropped to zero - exactly at the time that Uptime Robot had said the site went down.
Back in SSH, start checking for changes on the server. Nothing out of the ordinary. Apache is up and listening on the appropriate ports (http/https). No weird logins to the server. MySQL is running as well.
I try creating a tiny php file in the web root - even that comes back with a timeout.
I go down a rabbit hole thinking it's some sort of weird networking issue, especially since AWS has had outages three times already this month. But no evidence of it, all AWS status pages are ok...
Then I tried accessing the site's CSS file - which loads immediately. So, only URLs being handled by PHP are failing.
Looking into the Apache error logs, there are some cryptic errors like this:
[proxy_fcgi:error] [pid 20763:tid 140450371356416] (70007)The timeout specified has expired: [client REDACTED] AH01075: Error dispatching request to : (polling)
So now I'm thinking that somehow the PHP settings have been changed and go down a rabbit hole of googling that and other related error messages. Most google results are talking about how to extend your timeout settings for long-running PHP scripts. These are not long-running.
Finally I look at the Apache access logs and it quickly becomes apparent that the site is under a DoS attack. An IP in Romania is making all sorts of funky HTTP requests, about 20 per second. When those all fail, he starts just requesting "/foo/" twenty times per second. Once that IP is blocked, we're back up and running.
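For next time, here's a rough Python sketch of the kind of access log check that would have surfaced this quickly. It is not what I actually ran during the incident. It assumes the log is in Apache's common/combined format (client address first, then a bracketed timestamp with one-second resolution), and the function names and the log path in the usage note are made up for illustration.

```python
import re
from collections import Counter

# Matches the start of an Apache common/combined log line:
# client address, identity, user, [timestamp], "request line"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)"')


def top_clients(lines, n=5):
    """Return the n busiest client addresses as (ip, request_count) pairs."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # silently skip lines that don't look like log entries
            counts[m.group(1)] += 1
    return counts.most_common(n)


def peak_per_second(lines, ip):
    """Return the highest number of requests `ip` made within one second."""
    per_second = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(1) == ip:
            # The timestamp has one-second resolution, so it works as a bucket key.
            per_second[m.group(2)] += 1
    return max(per_second.values(), default=0)


def report(path):
    """Print the busiest clients in the log at `path` with their peak rate."""
    with open(path, errors="replace") as f:
        lines = f.readlines()
    for ip, count in top_clients(lines):
        print(f"{ip}\t{count} requests\tpeak {peak_per_second(lines, ip)}/s")
```

Calling something like `report("/var/log/apache2/access.log")` (the path depends on your distro and vhost config) prints one line per busy client. A single address sitting far above everyone else, with a peak around 20 requests per second, is the pattern I was looking at here.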
So, what was happening is this script kiddie runs his script and hits the site hard. PHP-FPM quickly gets overloaded and stops responding. The site is down, but the server doesn't show any heavy load because the PHP requests are just sitting there waiting until they time out rather than actually running. Static files like CSS and images load fine. So when I logged into the server, it didn't "feel" slow and showed no load. Usually with a DoS in the past I'd see heavy load on the server, sometimes MySQL overloaded, lots of PHP processes running, etc., but not this time, so I didn't immediately realize what it was.
Next time - check access logs right away!