This last month, the platform sustained a few outages. One of them was serious enough to lead to a postmortem blog post explaining to our customers what happened: https://scalingo.com/articles/2015/07/29/outage-of-wed-29th-july-postmortem. After a few weeks of reflection, we've drawn up a list of tasks to improve the overall availability of the platform. This article details the actions which followed the outage and what is planned now.
Done: Critical alerts notification
First things first: we could not afford to miss another server alert as we did on the 29th of July. That's why we integrated PagerDuty as our notification service. PagerDuty is triggered each time an error classified as 'critical' is raised, and the alert notifications don't stop until someone on the team acknowledges them.
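Concretely, the integration boils down to sending a 'trigger' event to PagerDuty's Events API whenever a critical error is raised. A minimal sketch in Python, assuming the generic Events API v1 endpoint; the service key and error details are placeholders, and the actual HTTP call is left as a comment:

```python
import json

# PagerDuty generic Events API v1 endpoint (v1 was current at the time of writing)
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"

def critical_alert(service_key, description, details=None):
    """Build a PagerDuty 'trigger' event payload for a critical error."""
    return {
        "service_key": service_key,  # integration key of the PagerDuty service
        "event_type": "trigger",     # opens an incident that pages until acknowledged
        "description": description,
        "details": details or {},
    }

event = critical_alert("YOUR_SERVICE_KEY",
                       "docker daemon unreachable on node-12",
                       {"severity": "critical"})
print(json.dumps(event))
# The payload would then be POSTed to PAGERDUTY_EVENTS_URL,
# e.g. with urllib.request.urlopen(...).
```

Because the event type is `trigger`, PagerDuty keeps notifying the on-call rotation until the incident is acknowledged or resolved.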
Done: End-to-end monitoring of the infrastructure
A new entity, external to the infrastructure, now performs application deployments multiple times per hour. Thanks to this, we can be sure that the complete production chain is working as expected. If anything goes wrong, an alert is triggered and the team is notified immediately.
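The probing side of such a check can be sketched as follows; this is an illustration only, where the canary URL and the `trigger_alert` hook are hypothetical names, not our actual tooling:

```python
import time
import urllib.request

CANARY_URL = "https://canary.example.com/"  # hypothetical canary application

def probe(url, retries=3, delay=30):
    """Return True if the canary application answers with HTTP 200,
    retrying a few times before declaring the deployment chain broken."""
    for _ in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # network error or non-200 response: retry after a pause
        time.sleep(delay)
    return False

# if not probe(CANARY_URL):
#     trigger_alert("end-to-end deployment check failed")  # hypothetical alert hook
```

The retries avoid paging the team for a single transient network blip while still catching a genuinely broken deployment chain quickly.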
Doing: For more transparency, a status page is coming
http://status.scalingo.com is coming: there we'll be able to communicate more transparently about the problems the infrastructure is undergoing. It will also let us give real-time information about the recovery process, as well as about the maintenance windows which may affect the platform.
Isolation of the build process

We've found that the most sensitive part of the infrastructure is the building process. The build is the process which happens when you start a new deployment: it downloads, uncompresses and (sometimes) compiles a lot of files, which can consume a lot of resources.
So far, builds were done in containers on the same servers that run the users' applications, but we found out this could degrade the applications' performance more than we thought. We first decided to limit the resources of the build container, but that led to a high rate of errors impacting the whole server (see the next section concerning BTRFS).
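For the curious, limiting a container's resources is done with standard `docker run` flags such as `--memory` and `--cpu-shares`. A sketch assembling such an invocation; the image name, entrypoint and limit values are illustrative, not our production settings:

```python
def build_container_cmd(image, app_id, mem_limit="1g", cpu_shares=512):
    """Assemble a `docker run` invocation that caps a build's resources
    so a heavy build cannot starve application containers on the same host."""
    return [
        "docker", "run", "--rm",
        "--memory", mem_limit,            # hard cap on the container's RAM
        "--cpu-shares", str(cpu_shares),  # relative CPU weight vs. other containers
        "--name", "build-%s" % app_id,
        image, "/bin/build",              # hypothetical build entrypoint
    ]

print(" ".join(build_container_cmd("scalingo/buildstep", "app-42")))
```

The catch, as described above, is that capping memory makes heavy builds fail more often, and those failures turned out to have server-wide side effects.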
Docker storage backend BTRFS: time for a change?
It is no secret that we chose Docker as the isolation technology for our users' applications. From the beginning of its development until now, three file systems have been shipped as “stable” storage backends. We've used all three, and we have mixed feelings about each of them.
At first, we did not have a choice: our servers were using AUFS as the storage backend for Docker. Then, when Red Hat pushed Docker to support ext4 with devicemapper, and since we were hitting some annoying bugs with AUFS, we decided to migrate our infrastructure to this backend.
With devicemapper, container creation/deletion performance dropped drastically, but above all we hit some serious deadlock issues with this storage backend (examples: https://github.com/docker/docker/pull/4951, https://github.com/docker/docker/issues/5653).
We were pretty happy when BTRFS was integrated into Docker and considered stable, because it was another promising choice. At the beginning, it seemed to us the most efficient driver for Docker: fast and stable. But as we scaled up and the number of operations applied to the partition grew, new troubles appeared. Recently, two outages of a single server, impacting some customer applications, were related to what seems to be a BTRFS deadlock (more details at http://article.gmane.org/gmane.comp.file-systems.btrfs/47314).
So far, no file system has seemed perfect to us. Recently, with Docker 1.7+, two new backends have been integrated into the product: OverlayFS and ZFS (still tagged as experimental, https://github.com/docker/docker/pull/9411). We'll probably give them a try in order to find a better alternative and improve the overall platform stability.
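Before experimenting with a new backend, the first step is knowing which driver each node currently runs; `docker info` reports it, and switching is done by restarting the Docker daemon with its `--storage-driver` option. A small sketch parsing that output (the sample output below is illustrative):

```python
def storage_driver(docker_info_output):
    """Extract the storage driver name from `docker info` output."""
    for line in docker_info_output.splitlines():
        if line.strip().startswith("Storage Driver:"):
            return line.split(":", 1)[1].strip()
    return None  # line absent: output came from an unexpected source

sample = "Containers: 42\nStorage Driver: btrfs\nKernel Version: 3.19"
print(storage_driver(sample))  # -> btrfs
```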
All the things must be distributed and fault tolerant
Some of our components still use Redis as a backend to handle message queues (Sidekiq, go-workers). Even though Redis is a great database with some interesting features for handling queues, it is not a distributed database. Redis Sentinel and Redis Cluster are worth looking at, but they're not built to be a zero-downtime message queue engine. That's why we'll migrate all these components to NSQ (http://nsq.io/). This piece of software, once installed on several nodes, is highly available and does not contain any single point of failure. It will help us handle node failures more serenely.
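To give an idea of how simple publishing to NSQ is: besides its native TCP protocol, each `nsqd` node exposes an HTTP `/pub` endpoint (port 4151 by default). A sketch building such a publish request; the topic name and message are illustrative, and real consumers would use an official NSQ client library rather than raw HTTP:

```python
import urllib.parse
import urllib.request

def pub_request(nsqd_http_addr, topic, message):
    """Build an HTTP request publishing one message to nsqd's /pub endpoint."""
    url = "http://%s/pub?%s" % (nsqd_http_addr,
                                urllib.parse.urlencode({"topic": topic}))
    return urllib.request.Request(url, data=message.encode("utf-8"), method="POST")

req = pub_request("127.0.0.1:4151", "deployments", '{"app": "app-42"}')
print(req.full_url)  # -> http://127.0.0.1:4151/pub?topic=deployments
# Sending it (urllib.request.urlopen(req)) requires a running nsqd.
```

Since any node of the NSQ cluster can accept the message, a producer only needs one healthy `nsqd` to keep the queue flowing.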
Improve dysfunctional server isolation and recovery
There are several cases in which a server can be considered dysfunctional. Whatever the problem is (hardware issues, network issues, software bugs or kernel freezes), the priority is to isolate the faulty server. Whether all the applications of a node are affected or only one, the failure has to be detected and handled. Major problems are easy to detect; it is much trickier to detect minor anomalies. Part of this can and should be done automatically, so that the affected applications are moved immediately.
Store application images to an object storage solution
Today, the image registry is hosted alongside the application nodes, which generates a lot of disk and network I/O to distribute images across the complete infrastructure. Even if the storage is redundant and highly available, we know that beyond a certain scale it simply won't be efficient enough.
Part of this issue can be solved by using an external storage solution, more specifically an object storage solution. It has the advantage of completely abstracting the storage backend and its capacity, of scaling easily, and of being easy to interface with any technology.
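As an illustration of how little glue this requires, the open-source Docker Registry (v2) can be pointed at an S3-compatible object store through its configuration file alone; a sketch with placeholder credentials and bucket names, not our actual setup:

```yaml
# Docker Registry v2 configuration: store image layers in an
# S3-compatible object store instead of local disk.
version: 0.1
storage:
  s3:
    accesskey: AWS_ACCESS_KEY   # placeholder credentials
    secretkey: AWS_SECRET_KEY
    region: eu-west-1
    bucket: registry-images     # placeholder bucket name
http:
  addr: :5000
```

With such a setup, the registry nodes become stateless: any of them can serve any image, and the object store handles durability and capacity.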
High Availability of databases
Today, the addons we offer are single-node. It is not possible to achieve high availability without redundancy. That's why one of our big goals is to let you choose and/or upgrade to highly available databases, with the assurance that the database cluster will be located close to your application and will keep the same level of performance as today. We are currently looking for the best model, one which would be easy to integrate into your projects. Our goal is to make this completely transparent for you and your applications.
Conclusion for today
Our priority is to provide the best possible quality of service. There is no silver bullet for this; numerous decisions and actions have to be considered and applied. As detailed in this article, we've started working on improving the global stability of the platform. This list is far from exhaustive, and we'll communicate our progress in future articles.