GitLab.com meltdown: how 5 different backup strategies failed


From time to time we hear about companies running large clusters of machines suffering catastrophic failures, but they are usually non-IT companies. This time the tragedy happened within GitLab.com's own walls: the startup behind GitLab, the popular GitHub alternative, apparently nuked its own database on a production machine.

The beginning of the end

The story starts on 01/31/2017, between 16:00 and 17:00 UTC, when an engineer working on database replication creates an LVM snapshot of the system in the hope of using it to bootstrap the replicas. Unfortunately the process is estimated to take roughly 20 hours to complete, so the engineer stops working and decides to ask for help from a colleague who is not working that day.
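
For readers unfamiliar with LVM, taking such a snapshot is a single command against the volume that holds the PostgreSQL data directory. A minimal sketch follows; the volume group and logical volume names are hypothetical, not GitLab's actual configuration:

    # Create a copy-on-write snapshot of the logical volume backing the Postgres data
    # directory. "vg0" and "pg_data" are made-up names used purely for illustration.
    lvcreate --size 10G --snapshot --name pg_data_snap /dev/vg0/pg_data

    # Verify the snapshot exists and keep an eye on how much of its space gets used.
    lvs vg0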

About five hours later, at 01/31/2017 21:00 UTC, an unusually high database load is detected. The cause turns out to be a user employing GitLab.com as some sort of CDN. The user is removed and, with some additional fixes, the load slowly starts to return to normal.

About one hour later, at 01/31/2017 22:00 UTC, an alert is triggered: db2, a production machine, is lagging behind db1 by about 4GB. At this point db2's PostgreSQL data directory is wiped so that replication can restart from a clean state. After many attempts db2 still refuses to cooperate: replication fails, then it can't connect to db1, then the connection is reported as too slow, then it hangs completely.
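
Re-seeding a broken replica normally means stopping it, clearing its data directory, and pulling a fresh base backup from the primary. A minimal sketch of that standard PostgreSQL procedure, with hypothetical host names and paths rather than GitLab's exact commands, looks like this:

    # On the replica (db2): stop PostgreSQL and clear ONLY the replica's data directory.
    # Double-check which host you are on before running anything destructive.
    sudo systemctl stop postgresql
    sudo -u postgres rm -rf /var/lib/postgresql/9.6/main/*

    # Pull a fresh base backup from the primary (db1) and stream the WAL needed to catch up.
    # "db1.example.com" and the "replication" role are assumptions for illustration.
    sudo -u postgres pg_basebackup -h db1.example.com -U replication \
        -D /var/lib/postgresql/9.6/main -X stream -P

    sudo systemctl start postgresql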

About one hour later, at 01/31/2017 23:00 UTC, the frustrated engineer is still working on the issue. Thinking that a tool is complaining because the data directory left over from the unclean replication attempt is still there, he decides to remove that directory entirely. The ominous command rm -rf is issued.

After a few seconds the engineer realizes he did not execute it on db2; he issued it on db1 instead: the only production machine holding current data. As soon as he notices the mistake he stops the process, but it is already too late. The meltdown is complete: out of 310GB of data, only about 4.5GB is left.

Crisis, backups

At this point the entire GitLab.com site is down, as plenty of tweets point out, and the team acknowledges the outage on Twitter.

After a few minutes it becomes clear that files can’t be recovered by normal means and that a backup needs to be used. But the tragedy is not over yet:

  • The most recent backup is the aforementioned LVM snapshot, taken about 6 hours before the incident. LVM snapshots are normally taken every 24 hours; this fresher-than-usual candidate exists only by chance.
  • Regular backups are also performed on a 24-hour basis, but the engineer can't figure out where they are stored. They are eventually found, but they turn out to be unusable: the files are only a few bytes in size.
  • Due to a pg_dump version mismatch between machines, the SQL backups failed silently: the Omnibus package pins PostgreSQL 9.6 on the database servers, but that pin doesn't exist on the workers, so 9.2 ended up installed and used there (see the version-check sketch after this list).
  • Azure's disk snapshots are enabled for the NFS server, but not for the production database servers.
  • S3 backups are in place, but the bucket containing them is empty.
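
A failure like the pg_dump one is exactly what a tiny pre-flight check in the backup script can catch. Here is a minimal sketch of such a check, assuming a standard PostgreSQL client and hypothetical connection details:

    # Abort the backup if the local pg_dump major version differs from the server's.
    # Host, user and database names are placeholders for illustration.
    SERVER_VERSION=$(psql -h db1.example.com -U gitlab -d gitlabhq_production \
        -tAc "SHOW server_version" | cut -d. -f1,2)
    CLIENT_VERSION=$(pg_dump --version | awk '{print $NF}' | cut -d. -f1,2)

    if [ "$SERVER_VERSION" != "$CLIENT_VERSION" ]; then
        echo "pg_dump $CLIENT_VERSION does not match server $SERVER_VERSION, aborting" >&2
        exit 1
    fi

    pg_dump -h db1.example.com -U gitlab -d gitlabhq_production -F c \
        -f /backups/gitlab-$(date +%F).dump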

A difficult decision is made: the LVM snapshot taken 6 hours before the incident will be used to restore GitLab.com.

The recovery of GitLab.com

At the time of writing this article, the restore process is being live-streamed on YouTube and is not yet complete.

You can find the full story here, and stay up-to-date by following @gitlabstatus on Twitter.

Conclusion

This incident is a textbook example of how problems can accumulate quietly for a long time and then explode all at once. I usually say: if you care, back up. But that is only part of the story. The worst thing that can happen is having corrupted backups and mistaking them for good ones. Be sure to verify your backups regularly, ideally by actually restoring them.
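
In practice, "double-checking" can be as simple as a script that refuses to accept a dump that is suspiciously small or unreadable, which is exactly the failure mode GitLab hit. The paths and threshold below are assumptions for illustration:

    # Fail loudly if the latest dump is missing or suspiciously small
    # (remember the few-bytes backups from the list above).
    DUMP=/backups/gitlab-$(date +%F).dump
    MIN_BYTES=1000000

    if [ ! -s "$DUMP" ] || [ "$(stat -c%s "$DUMP")" -lt "$MIN_BYTES" ]; then
        echo "Backup $DUMP is missing or too small" >&2
        exit 1
    fi

    # A custom-format dump can be sanity-checked by listing its table of contents.
    pg_restore --list "$DUMP" > /dev/null || { echo "Backup $DUMP is not readable" >&2; exit 1; }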

Although the GitLab team had five different backup strategies, none of them was working correctly; the only functioning mechanism, the LVM snapshots, normally runs on a 24-hour cycle. If it weren't for the snapshot taken by chance 6 hours earlier, the only backup available would have been up to 24 hours old.

As a last note: rm -rf is a dangerous tool. No matter how experienced you are, double-check what you typed, and on which machine you typed it, and take a second to think about the consequences.
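
One cheap habit that helps is forcing destructive commands to confirm which machine they are running on. A hypothetical guard, assuming bash and hostnames like db1/db2, could look like this:

    # Refuse to run a destructive command unless the current hostname matches the expected one.
    # The function name and host naming scheme are made up for illustration.
    confirm_host() {
        expected="$1"
        actual=$(hostname -s)
        if [ "$actual" != "$expected" ]; then
            echo "Refusing: you are on '$actual', not '$expected'" >&2
            return 1
        fi
    }

    # Usage: only wipe the data directory if we really are on db2.
    confirm_host db2 && rm -rf /var/lib/postgresql/9.6/main/*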
