GitLab.com meltdown: how 5 different backup strategies failed


From time to time we hear about companies running large clusters of machines suffering catastrophic failures, but they are usually non-IT companies. This time the tragedy happened within GitLab.com's own walls: the startup behind GitLab, the popular GitHub alternative, apparently nuked its own database on a production machine.

The beginning of the end

The story starts on 01/31/2017, between 16:00 and 17:00 UTC, when an engineer working on database replication creates an LVM snapshot of the system in the hope of using it to bootstrap the replicas. Unfortunately the process is estimated to take roughly 20 hours to complete, so the engineer stops working and decides to ask for help from a colleague who is not working that day.
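
For readers unfamiliar with LVM, taking such a snapshot is a single command against the volume that holds the PostgreSQL data directory. A minimal sketch follows; the volume group and logical volume names are hypothetical, not GitLab's actual configuration:

    # Create a copy-on-write snapshot of the logical volume backing the Postgres data
    # directory. "vg0" and "pg_data" are made-up names used purely for illustration.
    lvcreate --size 10G --snapshot --name pg_data_snap /dev/vg0/pg_data

    # Verify the snapshot exists and keep an eye on how much of its space gets used.
    lvs vg0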

About five hours later, at 01/31/2017 21:00 UTC, an unusually high database load is detected. The cause turns out to be a user employing GitLab.com as some sort of CDN. The user is removed and, with some additional fixes, the load slowly starts to return to normal.

About one hour later, at 01/31/2017 22:00 UTC, an alert is triggered: db2, a production machine, is lagging behind db1 by about 4GB. At this point db2's PostgreSQL data directory is wiped so that replication can restart from a clean state. After many attempts db2 still refuses to cooperate: replication fails, then it can't connect to db1, then the connection is reported as too slow, then it hangs completely.
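
Re-seeding a broken replica normally means stopping it, clearing its data directory, and pulling a fresh base backup from the primary. A minimal sketch of that standard PostgreSQL procedure, with hypothetical host names and paths rather than GitLab's exact commands, looks like this:

    # On the replica (db2): stop PostgreSQL and clear ONLY the replica's data directory.
    # Double-check which host you are on before running anything destructive.
    sudo systemctl stop postgresql
    sudo -u postgres rm -rf /var/lib/postgresql/9.6/main/*

    # Pull a fresh base backup from the primary (db1) and stream the WAL needed to catch up.
    # "db1.example.com" and the "replication" role are assumptions for illustration.
    sudo -u postgres pg_basebackup -h db1.example.com -U replication \
        -D /var/lib/postgresql/9.6/main -X stream -P

    sudo systemctl start postgresql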

About one hour later, at 01/31/2017 23:00 UTC, the frustrated engineer is still working on the issue. Thinking that a tool is complaining because the data directory left over from the unclean replication attempt is still there, he decides to remove that directory entirely. The ominous command rm -rf is issued.

After a few seconds the engineer realizes he did not execute it on db2; he issued it on db1 instead: the only production machine holding current data. As soon as he notices the mistake he stops the process, but it is already too late. The meltdown is complete: out of 310GB of data, only about 4.5GB is left.

Crisis, backups

At this point the entire GitLab.com site is down, as plenty of tweets point out, and the team acknowledges the outage on Twitter.

After a few minutes it becomes clear that files can’t be recovered by normal means and that a backup needs to be used. But the tragedy is not over yet:

  • The most recent backup is the aforementioned LVM snapshot, taken about 6 hours before the incident. LVM snapshots are normally taken every 24 hours; this fresher-than-usual candidate exists only by chance.
  • Regular backups are also performed on a 24-hour basis, but the engineer can't figure out where they are stored. They are eventually found, but they turn out to be unusable: the files are only a few bytes in size.
  • Due to a pg_dump version mismatch between machines, the SQL backups failed silently: the Omnibus package pins PostgreSQL 9.6 on the database servers, but that pin doesn't exist on the workers, so 9.2 ended up installed and used there (see the version-check sketch after this list).
  • Azure's disk snapshots are enabled for the NFS server, but not for the production database servers.
  • S3 backups are in place, but the bucket containing them is empty.
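
A failure like the pg_dump one is exactly what a tiny pre-flight check in the backup script can catch. Here is a minimal sketch of such a check, assuming a standard PostgreSQL client and hypothetical connection details:

    # Abort the backup if the local pg_dump major version differs from the server's.
    # Host, user and database names are placeholders for illustration.
    SERVER_VERSION=$(psql -h db1.example.com -U gitlab -d gitlabhq_production \
        -tAc "SHOW server_version" | cut -d. -f1,2)
    CLIENT_VERSION=$(pg_dump --version | awk '{print $NF}' | cut -d. -f1,2)

    if [ "$SERVER_VERSION" != "$CLIENT_VERSION" ]; then
        echo "pg_dump $CLIENT_VERSION does not match server $SERVER_VERSION, aborting" >&2
        exit 1
    fi

    pg_dump -h db1.example.com -U gitlab -d gitlabhq_production -F c \
        -f /backups/gitlab-$(date +%F).dump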

A difficult decision is made: the LVM snapshot taken 6 hours before the incident will be used to restore GitLab.com.

The recovery of GitLab.com

At the time of writing this article, the restore process is being live-streamed on YouTube and is not yet complete.

You can find the full story here, and stay up-to-date by following @gitlabstatus on Twitter.

Conclusion

This incident is a textbook example of how problems can accumulate quietly for a long time and then explode all at once. I usually say: if you care, back up. But that is only part of the story. The worst thing that can happen is having corrupted backups and mistaking them for good ones. Be sure to verify your backups regularly, ideally by actually restoring them.
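
In practice, "double-checking" can be as simple as a script that refuses to accept a dump that is suspiciously small or unreadable, which is exactly the failure mode GitLab hit. The paths and threshold below are assumptions for illustration:

    # Fail loudly if the latest dump is missing or suspiciously small
    # (remember the few-bytes backups from the list above).
    DUMP=/backups/gitlab-$(date +%F).dump
    MIN_BYTES=1000000

    if [ ! -s "$DUMP" ] || [ "$(stat -c%s "$DUMP")" -lt "$MIN_BYTES" ]; then
        echo "Backup $DUMP is missing or too small" >&2
        exit 1
    fi

    # A custom-format dump can be sanity-checked by listing its table of contents.
    pg_restore --list "$DUMP" > /dev/null || { echo "Backup $DUMP is not readable" >&2; exit 1; }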

Although the GitLab team had five different backup strategies, none of them was working correctly; the only functioning mechanism, the LVM snapshots, normally runs on a 24-hour cycle. If it weren't for the snapshot taken by chance 6 hours earlier, the only backup available would have been up to 24 hours old.

As a last note: rm -rf is a dangerous tool. No matter how experienced you are, double-check what you typed, and on which machine you typed it, and take a second to think about the consequences.
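
One cheap habit that helps is forcing destructive commands to confirm which machine they are running on. A hypothetical guard, assuming bash and hostnames like db1/db2, could look like this:

    # Refuse to run a destructive command unless the current hostname matches the expected one.
    # The function name and host naming scheme are made up for illustration.
    confirm_host() {
        expected="$1"
        actual=$(hostname -s)
        if [ "$actual" != "$expected" ]; then
            echo "Refusing: you are on '$actual', not '$expected'" >&2
            return 1
        fi
    }

    # Usage: only wipe the data directory if we really are on db2.
    confirm_host db2 && rm -rf /var/lib/postgresql/9.6/main/*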
