The Turtle and the Hare: A Recap of 3 Days of Downtime

By Eric Chiang on Mar 25, 2016 in Stories

Like many companies, Gliffy has been on a journey to migrate our services to an Infrastructure as a Service (IaaS) offering such as Amazon Web Services (AWS), Microsoft Azure or Google Cloud Platform. In fact, our latest Atlassian add-on product, Gliffy Diagrams in JIRA Cloud, was built from the ground up using an IaaS and architected as a collection of microservices enabled by Docker.

Being pragmatic, we chose to migrate pieces of our Gliffy Online (GO) platform towards this architecture over time rather than taking on a complete rewrite. Therefore, the bulk of our GO infrastructure still resides at a fully-managed hosting provider. It was this part of our infrastructure that suffered downtime as a result of a simple human error.

I point out this distinction because there are many features built into a service like AWS that could have reduced or even prevented this type of error, but unfortunately, we weren’t quite there yet.

Now onto the sequence of events…

Thursday 03.17.16

Our alerting system identified an issue with the replication of one of our databases. We use MySQL in a master-master-slave configuration, and the secondary master node had fallen too far behind the primary master node, so a full reseed was required to restart replication. Maintenance was scheduled to repair this issue over the weekend.
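
For readers less familiar with MySQL replication, the check behind that alert looks roughly like the sketch below; the exact query and thresholds our monitoring uses are assumptions on my part, not our actual configuration.

    -- Run on the replica side of a replication link (here, the secondary master).
    -- Illustrative check only; our monitoring wraps something equivalent.
    SHOW SLAVE STATUS\G
    -- Fields worth watching:
    --   Seconds_Behind_Master   lag in seconds (NULL means replication has stopped)
    --   Slave_IO_Running        should read "Yes"
    --   Slave_SQL_Running       should read "Yes"
    -- Once the lag outgrows the binlog retention window on the source, the missing
    -- events are gone for good and the node has to be reseeded from a fresh copy.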

Sunday 03.20.16, 10:30 PM PDT: Disaster Strikes

A system administrator was tasked with reseeding our secondary master. Any replication seed begins with a command to drop all tables from the schema. Unfortunately, they failed to sever the master-master link before executing the restore, so the drop table commands cascaded to our primary node and ultimately to the slaves. This happened in a matter of seconds, and all of our data was deleted. Our alerting system triggered once again, and our engineering team was immediately notified that the application had failed. We kicked off a restoration process from our last daily backup, which had been taken late Saturday evening.
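
To make the failure mode concrete, here is a rough sketch of the safeguard that was skipped. The statements are standard MySQL, but the sequence is an illustration rather than a copy of our runbook.

    -- Hypothetical pre-reseed checklist, not our literal procedure.
    -- In the restore session on the secondary master, keep the seed out of the
    -- binary log so its DROP TABLE statements cannot replicate to the primary:
    SET SESSION sql_log_bin = 0;
    -- Belt and braces: on the primary, stop applying events from the secondary
    -- for the duration of the reseed:
    --   STOP SLAVE;
    -- ...load the seed dump (DROP TABLE ...; CREATE TABLE ...; INSERT ...)...
    SET SESSION sql_log_bin = 1;
    -- Only re-enable replication in both directions after the seed is verified.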

Because we use replication, MySQL records binary logs (binlogs) of every change executed on the master. The combination of our Saturday backup and our binlog retention policy (~2-3 days of data) gave us enough overlap to perform a full and complete restore of all customer data up to the point of disaster. This gave us some level of comfort.
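
For those curious what that retention policy looks like in practice, it boils down to a couple of server settings along these lines; the values shown are illustrative, not our production configuration.

    # Illustrative my.cnf fragment (values are assumptions, not our real settings)
    [mysqld]
    server-id        = 2
    log_bin          = mysql-bin     # record every change to the binary log
    expire_logs_days = 3             # keep roughly three days of binlogs for
                                     # point-in-time recovery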

Monday morning to Monday evening, 03.21.16

Every company should have a backup and recovery plan. We did. The last time we tested it, our restoration took 10-12 hours. An inconvenient amount of downtime, but not catastrophic.

We watched the restore process and waited. However, the combination of a recent change to use table-level compression in our configuration, the sheer amount of data we had built up since our last backup/recovery test, and the fact that the MySQL restore process is single-threaded meant that our restore would take an estimated 4+ days to complete. We were caught off guard by how long this process would run, and it was clearly an unacceptable amount of downtime. We left this process running and called it the “turtle” in the race.
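
For context, the “turtle” was essentially a plain logical restore along the lines below, which replays the dump through a single connection, one statement at a time; the file name is a placeholder.

    # Single-threaded logical restore (the "turtle"); file name is illustrative.
    # The client feeds the dump to the server as one long stream of statements.
    mysql -u root -p < saturday_full_backup.sql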

We brainstormed several ideas to get us back to a functional state in a shorter amount of time. One of them was to ship our backup to AWS, use a powerful Elastic Compute Cloud (EC2) instance to restore the data without compression, and then ship it back to our production facility. We believed this would still be faster than the original restore already under way. We labeled this process our “hare”.
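
A minimal sketch of what “restore without compression” can look like, assuming the compression was declared per table in the dump’s CREATE TABLE options; the file names and exact edits are illustrative.

    # Strip table-level compression options from the dump before loading it on the
    # EC2 instance (hypothetical file names):
    sed -e 's/ROW_FORMAT=COMPRESSED//g' -e 's/KEY_BLOCK_SIZE=[0-9]*//g' \
        saturday_full_backup.sql > saturday_full_backup_uncompressed.sql
    mysql -u root -p < saturday_full_backup_uncompressed.sql

The obvious trade-off is that uncompressed tables need considerably more disk space, which is exactly where our first attempts ran into trouble.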

While we continued to prototype other ideas, we put this plan into action. However, once we got our backup into AWS and began restoring our data, we encountered issues: we kept running out of disk space on our instances because of the sheer size of the data once decompressed. Our first and second restore attempts failed for this reason, and they consumed the bulk of our day.

Monday evening, 03.21.16 to Tuesday evening, 03.22.16

We waited for our third restore attempt to complete on our EC2 instances. The process was significantly faster and finished at around 8 PM. From there, we shipped the restored database back to our production facility and waited for the data transfer to complete.

Even though we had the binary logs, it took some investigation to find the exact start position that matched up with our Saturday backup, as well as to remove the offending drop table commands. We used this time to take snapshots of our data in AWS so that we could begin testing our binary log restore process and get a full and complete data restoration.
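
That investigation was essentially a point-in-time recovery exercise with mysqlbinlog, along these lines; the file names and positions are invented for the example.

    # Inspect the logs to find the position that matches the Saturday backup and
    # the position just before the first DROP TABLE (values below are made up):
    mysqlbinlog --verbose mysql-bin.000123 | less
    # Replay only the safe range on top of the restored database:
    mysqlbinlog --start-position=107 --stop-position=734216890 \
        mysql-bin.000123 mysql-bin.000124 | mysql -u root -p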

Wednesday 03.23.16, 11:00 AM PDT

Our data copy completed. We had a restored database in our production environment.

While there were several other processes running in parallel in this race to reduce our downtime, we thought our “hare” would beat them all. Remember that original database restore that was kicked off on Sunday night? Well, it had accelerated and was only a few hours away from completing.

We decided to wait and apply our binary logs on top of that restored database, as it was less risky than using the slightly different database configuration we had used in AWS.

As the story goes, the “turtle” beat the “hare” after all.

The remaining time was spent seeding the other database nodes in order to reconfigure master-master-slave replication.
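
Reconfiguring a freshly seeded node comes down to pointing it at its source and restarting replication; the sketch below uses placeholder hostnames, credentials, and coordinates.

    -- Run on the node being (re)attached; all values are placeholders.
    CHANGE MASTER TO
        MASTER_HOST     = 'primary.db.internal',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = '********',
        MASTER_LOG_FILE = 'mysql-bin.000125',
        MASTER_LOG_POS  = 107;
    START SLAVE;
    SHOW SLAVE STATUS\G  -- confirm Slave_IO_Running and Slave_SQL_Running are both "Yes"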

Wednesday 03.23.16, 09:45 PM PDT

Full restoration of the system was completed, with total downtime of seventy-one (71) hours and fifteen (15) minutes.

Ironically, we ended up following our restoration playbook exactly as it was written.

Next Steps

In the 24 hours since our restoration, we’ve already built an improved process to prevent this level of downtime in the future. We’ll be doing a full retrospective in the next few days, which will help us prioritize short- and long-term goals to improve our reliability and availability for our customers. Given the encouraging feedback we’ve received through our support forums, social media, and popular tech news outlets, we will be sharing many of our learnings with you in a series of technical blog posts in the near future.

We greatly appreciate the outpouring of support we received from our customers and the technical community at large, and we hope everyone can learn from our mistakes and from our plan for preventing them in the future.