VDO: Linux deduplication and how to use it

by mark · Published 20 February 2019 · Updated 27 February 2019

Following Red Hat’s acquisition of Permabit Technologies, the former decided to release the latter’s proprietary technology, VDO, as open source. But what is VDO, and how can you use it?

What is Deduplication?

Before you delve into VDO, it is important to understand what deduplication is. Imagine you have two files that differ by only one byte:

File1: 200KiB
File2: 200KiB

They are identical except for that one byte. How much space do you need to store them on a drive? 400KiB, of course! Now take the same example with deduplication turned on. How much space do you need? In the ideal case the answer is 200KiB + 1 byte. A block-level deduplicator has to store the whole block that contains the differing byte, so in practice it is 200KiB plus one block.

The idea behind deduplication is pretty straightforward: find chunks of data that are identical and store only one copy of them. When the same data is written again, it is marked as a duplicate and the original copy is referenced instead of storing a second copy.
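To make the idea concrete, here is a toy sketch that counts how many distinct 4KiB blocks two files contain. It is only an illustration of block-level deduplication, not how VDO is implemented; the `count_blocks` helper, the file names and the 4KiB chunk size are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Toy illustration of block-level deduplication (not VDO's implementation).

# count_blocks FILE...: print "<total> <unique>" counts of 4 KiB blocks.
count_blocks() {
    local tmp f i=0 total unique
    tmp=$(mktemp -d)
    for f in "$@"; do
        # Cut each file into 4 KiB chunks named <index>.aaaa, <index>.aaab, ...
        split -b 4096 -a 4 "$f" "$tmp/$i."
        i=$((i + 1))
    done
    total=$(( $(find "$tmp" -type f | wc -l) ))
    # Blocks with the same hash are treated as duplicates.
    unique=$(( $(find "$tmp" -type f -exec sha256sum {} + | awk '{print $1}' | sort -u | wc -l) ))
    rm -rf "$tmp"
    echo "$total $unique"
}

# Demo: two 200 KiB files that differ by exactly one byte.
work=$(mktemp -d)
seq 1 100000 > "$work/numbers.txt"
head -c 204800 "$work/numbers.txt" > "$work/file1"
cp "$work/file1" "$work/file2"
# Overwrite the first byte ("1") of file2 with "X".
printf 'X' | dd of="$work/file2" bs=1 seek=0 conv=notrunc 2>/dev/null
count_blocks "$work/file1" "$work/file2"   # prints "100 51"
rm -rf "$work"
```

Out of 100 blocks only 51 are distinct, so a deduplicating store would keep 51 blocks (204KiB) instead of 100 (400KiB).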

That’s all there is to deduplication. Wonderful, am I right? In the example above, only about 50% of the space otherwise needed was used to store the same files. So how come most systems don’t use deduplication out of the box? Because it is a trade-off: the deduplication process costs CPU and RAM. How much depends on the solution.

How does VDO work?

VDO (Virtual Data Optimizer, formerly known as Albireo VDO) was originally developed by Permabit Technologies, which was later acquired by Red Hat, and has since been open-sourced. VDO is a transparent deduplication and compression layer that sits directly on top of a block device.

The actions VDO takes to optimize data can be summarized in three phases:

Zero-block elimination: blocks that contain only zeroes are recorded in metadata only.
Deduplication: data that has already been stored is recognized as redundant and is not stored again; instead, a reference to the stored copy is written.
Compression: LZ4 is applied to compress the data blocks.

This is achieved by two important components:

UDS (Universal Deduplication Services): decides whether a block can be deduplicated using an index, the UDS Index. The UDS Index is stored on the same block device.
VDO: presents existing block devices as new block devices with deduplication and compression capabilities.

Both are implemented as kernel modules, and VDO (kvdo) interacts directly with the Device Mapper layer.

Figure: VDO disk organization (a VDO volume)

VDO Requirements

At the time of writing, VDO is not yet available in the upstream Linux kernel (there is an ongoing effort to make this happen), so on most distributions you will have to build it manually. On CentOS/RHEL 7.5 and later you can simply install it through yum. I haven’t tested it on other distributions, but I have read reports from people who tried and said that the module doesn’t load properly.

You can find the official Red Hat requirements here, and example memory and storage consumption by amount of managed storage here. In short, you will need memory for the two components:

UDS: at least 250MB but can grow up to 1GB. According to Red Hat, a sparse index can manage up to 40TB of disk space.
VDO: at least 370MB plus an additional 268MB per 1TB of managed storage.
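As a rough worked example, the figures above can be turned into a lower-bound RAM estimate. This is only a back-of-the-envelope sketch: the `min_ram_mb` function is a hypothetical helper, it uses the 250MB UDS minimum rather than the 1GB upper figure, and it accepts whole terabytes only.

```shell
#!/usr/bin/env bash
# min_ram_mb PHYSICAL_TB: lower-bound RAM estimate in MB (integer TB only).
min_ram_mb() {
    local tb=$1
    local uds_mb=250                      # UDS minimum quoted above
    local vdo_mb=$(( 370 + 268 * tb ))    # VDO: 370MB + 268MB per managed TB
    echo $(( uds_mb + vdo_mb ))
}

for tb in 1 4 10; do
    echo "${tb}TB managed -> at least $(min_ram_mb "$tb")MB of RAM"
done
# 1TB managed -> at least 888MB of RAM
# 4TB managed -> at least 1692MB of RAM
# 10TB managed -> at least 3300MB of RAM
```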

On top of that, remember that VDO uses a portion of the device to store the UDS Index, so you will have less usable space once you create a VDO volume.

Where does VDO sit in the storage stack?

According to Red Hat documentation:

Under VDO: DM-Multipath, DM-Crypt, and software RAID (LVM or mdraid).
On top of VDO: LVM cache, LVM Logical Volumes, LVM snapshots, and LVM Thin Provisioning.

Also, be aware that the following configurations are not supported:

VDO on top of VDO volumes: storage → VDO → LVM → VDO
VDO on top of LVM Snapshots
VDO on top of LVM Cache
VDO on top of the loopback device
VDO on top of LVM Thin Provisioning
Encrypted volumes on top of VDO: storage → VDO → DM-Crypt
Partitions on a VDO volume: fdisk, parted, and similar partitioning tools
RAID (LVM, MD, or any other type) on top of a VDO volume

VDO Performance

As mentioned before, when you save disk storage using deduplication you will have to pay using memory and CPU power, but there’s also another thing to take into account: performance.

You would expect certain operations, such as writing new (non-deduplicable) data, to be slower, and others, such as writing data that is already stored (deduplicable), to be much faster. According to this article by Red Hat, however, VDO is significantly slower in every case measured. An excerpt from the article’s table:

filesystem backend	deploy to file system	copy on file system
XFS on top of normal LVM volume	28 sec	35 sec
XFS on VDO device, async mode	55 sec	58 sec
XFS on VDO device, sync mode	71 sec	92 sec
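To put those timings in perspective, the slowdown relative to plain LVM can be computed directly from the table. The `slowdown` helper below is a hypothetical name used only for this calculation.

```shell
#!/usr/bin/env bash
# slowdown VDO_SECONDS BASELINE_SECONDS: print the ratio with two decimals.
slowdown() {
    awk -v vdo="$1" -v base="$2" 'BEGIN { printf "%.2f\n", vdo / base }'
}

echo "async deploy: $(slowdown 55 28)x"   # async deploy: 1.96x
echo "async copy:   $(slowdown 58 35)x"   # async copy:   1.66x
echo "sync deploy:  $(slowdown 71 28)x"   # sync deploy:  2.54x
echo "sync copy:    $(slowdown 92 35)x"   # sync copy:    2.63x
```

In this particular test, VDO took roughly 1.7 to 2.6 times as long as the plain LVM volume.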

How to get started

Important

I take NO responsibility for what you do with your machine; use this tutorial as a guide, and remember that you can cause data loss if you touch things carelessly.

Creating a VDO volume is surprisingly easy, as long as you are using RHEL/CentOS 7.5 or later:

# yum install vdo kmod-kvdo
# vdo create --name=vdo_volume --device=/dev/vda

Beware: you should use a persistent device name (for example, one under /dev/disk/by-id/) rather than /dev/vda; otherwise VDO may run into problems at boot if the kernel assigns device names in a different order.
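One way to find a persistent name is to look for the symlinks under /dev/disk/by-id/ that point at your device. The sketch below assumes a udev-based system; `persistent_names` is a hypothetical helper, and the by-id name in the final comment is a placeholder you have to fill in yourself.

```shell
#!/usr/bin/env bash
# persistent_names DEVICE: list the /dev/disk/by-id symlinks that resolve to DEVICE.
persistent_names() {
    local target link
    target=$(readlink -f "$1") || return 0
    for link in /dev/disk/by-id/*; do
        [ -e "$link" ] || continue          # skip an unmatched glob
        if [ "$(readlink -f "$link")" = "$target" ]; then
            echo "$link"
        fi
    done
    return 0
}

persistent_names /dev/vda

# Then pass one of the printed names to vdo instead of /dev/vda, e.g.:
#   vdo create --name=vdo_volume --device=/dev/disk/by-id/<name-printed-above>
```

If nothing is printed, the device has no by-id link on your system (or does not exist), and you will need another stable identifier.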

You can now access the newly created volume at /dev/mapper/vdo_volume and create a filesystem on top of it. If you need assistance, here’s a complete guide: Disks, Partitions and Filesystems. Keep in mind that partitioning a VDO volume is not supported.
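As a minimal sketch of that last step, the following creates an XFS filesystem on the volume, mounts it and prints the space statistics. The mount point /mnt/vdo_volume and the `run`/`make_vdo_fs` helpers are assumptions made for this example. The script only prints the commands by default; set DRY_RUN=0 and run it as root to actually execute them.

```shell
#!/usr/bin/env bash
# Dry-run by default: prints each command; DRY_RUN=0 executes them (as root).
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# make_vdo_fs DEVICE MOUNTPOINT: format, mount and inspect a VDO volume.
make_vdo_fs() {
    local dev=$1 mnt=$2
    run mkfs.xfs -K "$dev"           # -K: do not discard blocks at mkfs time
    run mkdir -p "$mnt"
    run mount "$dev" "$mnt"
    run vdostats --human-readable    # physical space used and space saved
}

make_vdo_fs /dev/mapper/vdo_volume /mnt/vdo_volume
```

Skipping the discard pass with -K keeps mkfs from sending discard requests for the whole, thinly provisioned volume, which speeds up formatting.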

Images courtesy of William Warby and Red Hat


