Getting Back on Track Post Outage (Part 1)

By Liza Mock on Jun 08, 2016 in Let's Talk Tech

As you may remember, Gliffy recently suffered a 3-day outage, the longest in our 11-year history. In our post-outage blog, The Turtle and the Hare, we promised to keep you in the loop as we made changes to ensure that nothing like this ever happens again. Here, as promised, is the first post about what we did. We hope you benefit from our mistakes and from the solutions we’ve come up with.

After the outage, we took a long, hard look at where we were and compared it to where we wanted to be in terms of infrastructure and operations. What we had was an application layer that relied heavily on a MySQL setup with master-master replication and a reporting slave. The secondary master node was used for hot failover and nightly backups, but we had no automated restore process (a red flag); we had no team that prioritized the maintenance of our database (another red flag); and our backups never left the data center (a yellow flag). We were keeping our database in working order, but not much beyond that.

The wish list for where we wanted to be included a team directly responsible for all operational decisions, one that would create the configuration, scripts, and tests that define and verify our infrastructure. We wanted our infrastructure to be incrementally deployable and our database to restore itself multiple times a day. In short, we wanted a DevOps team empowered to build the future of Gliffy. We had our soaring vision; now we needed to come back down to Earth and prioritize quick wins that we could roll out iteratively in order to protect ourselves from another large outage.

Baby Steps

After some discussion, we came up with the following list of priorities:

1. Prevention
2. Increased speed
3. Automation & verification

First and foremost, we needed to reduce the risk of another outage. After that was taken care of, we could focus on reducing the total time to back up and restore our massive database. Shorter backup and restore times have a dramatic ripple effect, increasing general availability, the number of backups that can be performed, and the number of options for spinning up replicas in the future: all big wins.

Next, we would turn to automation, removing the potential for human error from mechanical procedures and making it a snap to reliably restore our database without intervention. Finally, we would implement two types of testing to give us confidence in the process, verifying that our restoration environment mimicked our production environment and testing the restored data to ensure that it was up to date and whole.
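To make that second kind of test concrete, here is a minimal Python sketch of the sort of post-restore checks we have in mind. The table names, freshness threshold, and connection details are placeholders for illustration, not our actual scripts.

```python
# Hypothetical post-restore sanity checks: the restored database should be
# whole (key tables present and non-empty) and up to date (recent rows).
# Table names, threshold, and credentials below are placeholders.
from datetime import datetime, timedelta

import mysql.connector

KEY_TABLES = ["users", "documents"]      # hypothetical key tables
MAX_STALENESS = timedelta(hours=24)      # newest row must be this fresh

def verify_restore(host, user, password, database):
    conn = mysql.connector.connect(host=host, user=user,
                                   password=password, database=database)
    cur = conn.cursor()
    try:
        for table in KEY_TABLES:
            # Wholeness: the table exists and contains data.
            cur.execute(f"SELECT COUNT(*) FROM {table}")
            (count,) = cur.fetchone()
            assert count > 0, f"{table} is empty after restore"

            # Freshness: the newest row is recent enough.
            cur.execute(f"SELECT MAX(updated_at) FROM {table}")
            (newest,) = cur.fetchone()
            assert newest and datetime.utcnow() - newest < MAX_STALENESS, \
                f"{table} looks stale (newest row: {newest})"
    finally:
        cur.close()
        conn.close()
```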

Prevention

Prevention came in two flavors: ownership and replication configuration. For a long time, our database infrastructure was managed in an ad-hoc manner. As our needs changed, we made manual updates to the configuration, schema, and other moving parts of the system. The database was simply an accepted fact of life, with no clear ownership or maintenance duties. If you’re in a similar situation, the importance of owning your infrastructure and its processes cannot be overstated. Having an owner’s mindset allowed our team to prioritize and come up with an iterative approach to the problem at hand.

Another simple win was dropping our master-master replication configuration. While master-master replication is great in that it affords options for load-balancing workloads and failover for traffic during maintenance, the price tag is a more complex architecture that leaves room for human error. A devastating example was the too-easy-to-make mistake of restoring a backup on a master node without disabling replication first.

After giving it some thought, we decided that more complex database maintenance was a worthy tradeoff for daily operations that are simpler and better protected against human error.
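To make the risk concrete, here is a rough Python sketch (using the mysql-connector-python client) of the guard the old topology required before any restore on a master; hostnames and credentials are placeholders. Forgetting either step could push the restore's writes straight to the other master.

```python
# Sketch of the pre-restore guard a master-master node needs: stop applying
# the peer's changes and keep this session's restore statements out of the
# binary log so they do not replicate back. Connection details are placeholders.
import mysql.connector

def prepare_master_for_restore(host, user, password):
    conn = mysql.connector.connect(host=host, user=user, password=password)
    cur = conn.cursor()
    cur.execute("STOP SLAVE")            # stop replicating from the peer master
    cur.execute("SET sql_log_bin = 0")   # this session's writes stay out of the binlog
    cur.close()
    return conn                          # run the restore over this same session
```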

Increased Speed

The next challenge was speeding up our backup and restore processes. During the outage and shortly thereafter, we read many pages of documentation detailing MySQL’s backup and restore procedures. Our process before the outage was to make a logical backup of our hot-standby master using mysqldump. This process ran nightly, and we kept two backups at any given time. We investigated several tools and options for parallelizing and tweaking the vanilla mysqldump process, but many of them were not sufficient for our needs.
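For context, that old nightly job boiled down to something like the following Python sketch; the paths, credentials, and retention count are placeholders rather than our real cron job.

```python
# Simplified picture of the pre-outage approach: a nightly logical dump with
# mysqldump, compressed, keeping only the two most recent backups.
# Paths and credentials are placeholders.
import subprocess
from datetime import datetime
from pathlib import Path

BACKUP_DIR = Path("/var/backups/mysql")   # hypothetical backup location
KEEP = 2                                  # retain the two newest dumps

def nightly_dump(host, user, password, database):
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    target = BACKUP_DIR / f"{database}-{datetime.utcnow():%Y%m%d}.sql.gz"

    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction",
         f"--host={host}", f"--user={user}", f"--password={password}", database],
        stdout=subprocess.PIPE)
    with open(target, "wb") as out:
        subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
    dump.stdout.close()
    if dump.wait() != 0:
        raise RuntimeError("mysqldump failed")

    # Prune everything except the newest KEEP dumps.
    for old in sorted(BACKUP_DIR.glob(f"{database}-*.sql.gz"))[:-KEEP]:
        old.unlink()
```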

Given the size (hundreds of GB under high compression) and nature of our data (clustered into a few key tables), a logical backup of any type was never going to work. However, we were able to perform a cold, physical backup on a read slave. Switching to a physical backup strategy gave us an incredible 18x speed improvement; now we can back up, sync the data to another system, and restore it in less than 5 hours! This increased speed paved the way for other critical work that needed to be done, namely automation.
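In outline, a cold physical backup on a read slave looks something like this sketch; the service name, paths, and destination host are placeholders, and the real scripts live in the automation described below.

```python
# Sketch of a cold physical backup on a read slave: stop mysqld so nothing is
# writing, archive the data directory, restart, then ship the archive to the
# machine that will restore it. Paths and hosts are placeholders.
import subprocess

DATADIR = "/var/lib/mysql"                    # typical MySQL data directory
ARCHIVE = "/backups/mysql-datadir.tar.gz"     # hypothetical archive path
DEST = "backup-host:/backups/"                # hypothetical rsync destination

def cold_physical_backup():
    subprocess.run(["systemctl", "stop", "mysql"], check=True)   # cold: no writes
    try:
        subprocess.run(["tar", "-czf", ARCHIVE, "-C", DATADIR, "."], check=True)
    finally:
        subprocess.run(["systemctl", "start", "mysql"], check=True)
    # Sync the archive to the system where it will be restored.
    subprocess.run(["rsync", "-a", ARCHIVE, DEST], check=True)
```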

Automation

We had accrued significant technical debt around operational automation. Fortunately, the last few years have seen a renaissance in this area. We set our sights on building the kind of stable, immutable platform that many have espoused as the best approach.

Since we had a team of programmers, we decided to attack it like a programming problem and refactor the infrastructure itself. Any good refactoring project starts with high-quality tests, and in many cases that means testing the existing code, or in our case, the existing infrastructure. This is how we ended up dipping our toes into the vast sea of DevOps tooling, replicating a database environment programmatically and then restoring a database backup into that environment.

Tools of Choice

Test Kitchen: An infrastructure test harness and integration suite that allows you to isolate a clean environment for testing.
Ansible: A configuration management, provisioning, and application deployment platform.
Serverspec: Infrastructure tests executed with RSpec. (This is what allowed us to refactor our infrastructure in a stable and sane way.)

There were a handful of libraries we used to fuse these tools together, but the three tools above formed the core of our infrastructure tooling for this project. We started by writing Serverspec tests for our existing database slaves. This provided two benefits: declarative, testable documentation for the existing environments and verification scripts for any automated infrastructure built in the future.
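Serverspec checks are written in Ruby; to give a flavor of them without switching languages in this post, here is roughly the same set of assertions expressed with the Python testinfra library (shown for illustration only, it is not the tool we used). The package name, service name, and config path are typical defaults, not necessarily ours.

```python
# Declarative infrastructure tests for a database slave, in the spirit of a
# Serverspec suite but expressed with testinfra (pytest plus its `host` fixture).
# Run with e.g.: py.test --hosts=ssh://db-slave test_db_slave.py
def test_mysql_package_is_installed(host):
    assert host.package("mysql-server").is_installed

def test_mysql_service_is_running_and_enabled(host):
    mysql = host.service("mysql")
    assert mysql.is_running
    assert mysql.is_enabled

def test_mysql_listens_on_3306(host):
    assert host.socket("tcp://0.0.0.0:3306").is_listening

def test_slave_is_configured_read_only(host):
    cnf = host.file("/etc/mysql/my.cnf")
    assert cnf.exists
    assert cnf.contains("read_only")
```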

Satisfied that our database slave was well specified, we built an Ansible playbook to provision database slave servers. We used Test Kitchen to create ephemeral EC2 instances, which we would “converge” to the desired state with Ansible and finally verify against our previously created Serverspec tests.

With the core infrastructure in place, it was a simple matter of adding scripts to back up, encrypt, store in S3, and restore the database. This was very powerful for us because it meant that we could create a new production database slave in less than 4 hours. Better yet, no humans (or human errors) were involved.
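As an example of the shape of those scripts, here is a hedged sketch of the encrypt-and-ship step in Python with GnuPG and boto3; the bucket name, object key, GPG recipient, and paths are placeholders, not our production values.

```python
# Sketch of the "encrypt and store in S3" step: encrypt the backup archive
# for a GPG recipient before it leaves the host, then upload it with boto3.
# Bucket, key, recipient, and paths are placeholders.
import subprocess

import boto3

ARCHIVE = "/backups/mysql-datadir.tar.gz"     # produced by the backup step
ENCRYPTED = ARCHIVE + ".gpg"
BUCKET = "example-db-backups"                 # hypothetical bucket name

def encrypt_and_upload():
    # Encrypt the archive so only the holder of the backup key can read it.
    subprocess.run(
        ["gpg", "--batch", "--yes",
         "--recipient", "backups@example.com",   # hypothetical key
         "--output", ENCRYPTED, "--encrypt", ARCHIVE],
        check=True)

    # Store the encrypted archive offsite in S3.
    boto3.client("s3").upload_file(ENCRYPTED, BUCKET, "mysql/latest.tar.gz.gpg")
```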

Putting It All Together

[Diagram: Gliffy’s database backup and restore process]
Click on this diagram to use it as a template.

As of the writing of this post, we’ve managed to address the main concerns with our database backup and restore procedures. We’ve learned a lot about our systems and discovered new tools we could use to automate our processes. The diagram above gives a high-level outline of the automation scripts we created.

This is the first step to greater operational ownership for Gliffy. We promise to keep you abreast of what we learn and implement next. If you have any questions/advice/tools you want to tell us about, tweet at us with the hashtag #DevOps.

Happy Diagramming!