Starting this morning around 3am GMT, Scalingo’s has experienced its largest outage since the service has been opened commercially (1st of February). Most applications have been unavailable until 7:30am GMT, for a total of more than 4h30 of downtime.
Around 3am GMT Wednesday 29th, our internal services started raising unusual exceptions. We are using Rollbar to track errors and get notifications from them. Unfortunately, no one of the team has seen the notification on time and some of our servers have been unavailable during the night. It’s at 6am GMT that we’ve started investigating what was actually wrong and how to solve it.
Several servers were severly affected and it was not possible to access them again, all the applications running on them were not available either. Among the running servers, we got a clue about what happened, Docker runtime partition was 100% full. This, as a consequence, makes Docker unresponsive, no new container could have been created, and any operation on the running containers may fail.
After rebooting the failed servers, we’ve observed that they were in the same situation, but additionaly, logs suggested that the Linux kernel suffered of soft-lockup related to the IO queues waiting to write on the Docker partition but failing because of its state. This would explain why the servers were not responding.
We usually over-provision the Docker partition to be sure that such situation does not happen has we are warned that space on this partition is required to operate to containers correctly, but in this case something unusual filled up the partition.
Cause of the issue
One application with a very high volume of visitors (actually sensors) has been stopped by its owner recently, the application was receiving several tens of request per second. In this case our front servers divert the HTTP request to another application which renders a (often cached) web page telling that the application has been stopped. You can see an example here: https://sample-php-base.scalingo.io/
This application handling the error is a ruby on rails application. Commonly, ruby on rails applications are displaying the parameters of the the request. That’s what happened here, all the parameters of the request made to the stopped application were written in the logs, and the body was particularly important: almost 7KB. Each line was then weighing this size.
As a result the log size of the application exploded really quickly, filling Docker partition several GB per minute. Once one node has been considered unavailable, the application has been moved to another node to avoid downtime. Here starts a so called “cascade effect” putting our servers down one after the other. Resulting in downtime for a lot of applications.
Once the problem has been correctly identified, resolving it was time consuming but simpler, servers have been rebooted, huge log files have been truncated and then, all the down databases and applications have been restarted. Once this has been done, everything was available again and it’s time to work on the non-repeatability of what happened last night.
What will we do to avoid such errors?
First, the main problem here has been the human factor. If the error would have been dealt with at the time of the first exception we could have avoid any downtime. We will improve the way we got notificated, critical errors should be able to wake up the whole team instantly.
Secondly, one of the technical causes of the problem has been the way Docker is handling the logs. As stated in Sirupsen blog yesterday, http://sirupsen.com/production-docker, logging is still a bit limited with Docker, so we will have to find another way to get the logs from your apps to avoid filling disks with them, logging should be completely isolated from applications runtime environment. Additionaly, the way we handle error pages should be done differently, parameters should not be logged at all to avoid such situation to happen.
The whole Scalingo’s team apologizes for what happend last night. There was a bit of downtime, but the good part is that it tought us a lot about underestimated parts of the infrastructure and we’ll be able to improve to provide an always better service.