VDO: Linux deduplication and how to use it
Following Red Hat’s acquisition of Permabit Technology Corporation, the former decided to release the latter’s proprietary technology, VDO, as open source. But what is VDO and how can you use it?
What is Deduplication?
Before you delve into VDO, it is important to understand what deduplication is. Imagine you have two files which differ by only one byte:
- File1: 200KiB
- File2: 200KiB
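Both of them are identical, except for that one byte. How much space do you need to store them on a drive? 400KiB, of course! Now let’s take the same example with deduplication turned on. How much space do you need this time? Ideally, the answer is: 200KiB + 1 byte. In practice, deduplication works on fixed-size chunks (4KiB blocks in the case of VDO), so the real figure is 200KiB plus the one chunk that contains the differing byte.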
The idea behind deduplication is pretty straightforward: take chunks of data that are identical and store only one copy of them. When the exact same data is written again, it is marked as a duplicate and the original copy is referenced instead of storing a second copy.
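To make the idea concrete, here is a minimal sketch that cuts files into 4KiB blocks, hashes every block, and compares the total number of blocks with the number of distinct ones. It only illustrates the principle and is not how VDO is implemented; `count_blocks`, `BLOCK_SIZE`, and the demo files are names made up for this example.

```shell
#!/bin/sh
# Sketch: count the fixed-size blocks in a set of files and how many of
# them are distinct. count_blocks and BLOCK_SIZE are illustrative names,
# not part of VDO.
BLOCK_SIZE=4096

count_blocks() {
    workdir=$(mktemp -d)
    n=0
    for file in "$@"; do
        n=$((n + 1))
        # Cut each file into BLOCK_SIZE-byte pieces.
        split -a 6 -b "$BLOCK_SIZE" "$file" "$workdir/file${n}_"
    done
    total=$(find "$workdir" -type f | wc -l)
    # Identical blocks have identical digests, so the number of distinct
    # digests is the number of blocks that really need to be stored.
    unique=$(find "$workdir" -type f -exec sha256sum {} + |
        cut -d ' ' -f 1 | sort -u | wc -l)
    rm -rf "$workdir"
    echo "$((total)) $((unique))"
}

demo=$(mktemp -d)
# Build File1 out of 50 distinct 4KiB blocks (200KiB in total).
i=1
while [ "$i" -le 50 ]; do
    printf '%04096d' "$i"
    i=$((i + 1))
done > "$demo/File1"
# File2 is a copy of File1 with only its first byte changed.
cp "$demo/File1" "$demo/File2"
printf 'X' | dd of="$demo/File2" bs=1 count=1 conv=notrunc 2>/dev/null

count_blocks "$demo/File1" "$demo/File2"   # prints: 100 51
```

Out of 100 blocks only 51 are distinct: the 50 blocks of File1 plus the single block of File2 that contains the changed byte.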
That’s all there is behind deduplication. Wonderful, am I right? In the example above, only 50% of the space needed was used to store the same files. So how come no system uses deduplication out of the box? Because it is a trade-off. The deduplication process costs CPU and RAM. How much depends on the solution.
How does VDO work?
VDO (Virtual Data Optimizer, formerly known as Albireo VDO) was originally developed by Permabit Technology Corporation, which was later acquired by Red Hat; the technology was then open-sourced. VDO is a transparent deduplication and compression layer that sits directly on top of a block device.
The actions VDO takes to optimize data can be summarized in three phases:
- Zero-block Elimination: data that contains only zeroes is recorded as metadata only.
- Deduplication: data that has already been stored is deemed redundant and is not stored again; instead, a reference to the stored copy is written.
- Compression: LZ4 is applied to compress data.
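The first two phases can be sketched in a few lines of shell. This is only an illustration of the order of the checks, not VDO’s actual code: `classify_blocks` is a hypothetical helper, SHA-256 stands in for whatever fingerprint the real index uses, and the compression phase is left out (in VDO, the blocks counted as new here are the ones that would go on to be compressed).

```shell
#!/bin/sh
# Sketch of the zero-block and duplicate checks on 4KiB blocks.
# classify_blocks is a hypothetical helper; it counts blocks and stores nothing.
classify_blocks() {
    workdir=$(mktemp -d)
    split -a 6 -b 4096 "$1" "$workdir/blk_"
    : > "$workdir/seen"            # digests of the blocks seen so far
    zero=0 dup=0 new=0
    for blk in "$workdir"/blk_*; do
        [ -f "$blk" ] || continue
        # Phase 1: a block with nothing left after removing NUL bytes is all zeroes.
        if [ "$(tr -d '\000' < "$blk" | wc -c)" -eq 0 ]; then
            zero=$((zero + 1))
        else
            # Phase 2: look the block's digest up in the list of known digests.
            digest=$(sha256sum < "$blk" | cut -d ' ' -f 1)
            if grep -qx "$digest" "$workdir/seen"; then
                dup=$((dup + 1))
            else
                echo "$digest" >> "$workdir/seen"
                new=$((new + 1))
            fi
        fi
    done
    rm -rf "$workdir"
    echo "zero=$zero duplicate=$dup new=$new"
}

sample=$(mktemp)
{
    head -c 8192 /dev/zero    # two all-zero blocks
    printf '%04096d' 1        # block A
    printf '%04096d' 1        # block A again
    printf '%04096d' 2        # block B
} > "$sample"

classify_blocks "$sample"     # prints: zero=2 duplicate=1 new=2
```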
This is achieved by two important components:
- UDS (Universal Deduplication Service): uses an index, the UDS Index, to decide whether a block is a duplicate of one already stored. The UDS Index is stored on the same block device.
- VDO: abstracts existing block devices into new block devices with deduplication and compression capabilities.
Both of them are implemented as kernel modules, and VDO (kvdo) interacts directly with the Device Mapper layer.
At the time of writing, VDO is not yet available in upstream Linux (there is an ongoing effort to make this happen). Because of that, on most distributions you will have to build it manually. On CentOS/RHEL 7.5 and later, however, you can simply install it through yum. I haven’t tested it on other distributions, but I have read reports from people who tried and found that the module doesn’t load properly.
Both components also need memory:
- UDS: at least 250MB, but it can grow up to 1GB. According to Red Hat, a sparse index can manage up to 40TB of disk space.
- VDO: at least 370MB, plus an additional 268MB per 1TB of managed storage.
On top of that, remember that VDO requires a portion of the medium to store the UDS Index, so you will have less space available once you create a VDO volume.
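As a quick back-of-the-envelope check, the memory figures above can be turned into a small calculation. `vdo_ram_mb` is a helper made up for this example, not a VDO tool, and it simply adds the numbers quoted in this article.

```shell
#!/bin/sh
# Rough RAM estimate in MB, using the figures quoted above:
#   VDO: 370 MB + 268 MB per TB of managed storage
#   UDS: 250 MB minimum (the index can grow up to 1 GB)
# vdo_ram_mb is an illustrative helper, not a VDO tool.
vdo_ram_mb() {
    managed_tb=$1
    uds_mb=${2:-250}    # second argument: memory used by the UDS index
    echo $((370 + 268 * managed_tb + uds_mb))
}

vdo_ram_mb 4         # 370 + 268*4 + 250  = 1692
vdo_ram_mb 4 1024    # 370 + 268*4 + 1024 = 2466
```

So a 4TB volume needs somewhere between roughly 1.7GB and 2.5GB of RAM, depending on how large the index grows.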
Where does VDO sit in the storage stack?
According to Red Hat’s documentation, the following layers can be placed:
- Under VDO: DM-Multipath, DM-Crypt, and software RAID (LVM or mdraid).
- On top of VDO: LVM cache, LVM Logical Volumes, LVM snapshots, and LVM Thin Provisioning.
Also, be aware that the following configurations are not supported:
- VDO on top of VDO volumes: storage → VDO → LVM → VDO
- VDO on top of LVM Snapshots
- VDO on top of LVM Cache
- VDO on top of the loopback device
- VDO on top of LVM Thin Provisioning
- Encrypted volumes on top of VDO: storage → VDO → DM-Crypt
- Partitions on a VDO volume: fdisk, parted, and similar partitioning tools
- RAID (LVM, MD, or any other type) on top of a VDO volume
As mentioned before, when you save disk storage using deduplication you will have to pay using memory and CPU power, but there’s also another thing to take into account: performance.
You would expect certain operations, such as adding new (non-deduplicable) data, to be slower, and others, such as adding data that is already stored (deduplicable), to be much faster. According to this article by Red Hat, however, the performance of VDO is significantly worse in every case measured. Here is an excerpt of the table from the article:
| Filesystem backend | Deploy to file system | Copy on file system |
| XFS on top of normal LVM volume | 28 sec | 35 sec |
| XFS on VDO device, async mode | 55 sec | 58 sec |
| XFS on VDO device, sync mode | 71 sec | 92 sec |
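To put those numbers in perspective, the slowdown relative to plain LVM can be computed directly from the table. `slowdown` is a throwaway helper written for this example.

```shell
#!/bin/sh
# slowdown BASELINE_SECONDS MEASURED_SECONDS
# Prints how many times slower the measured run is than the baseline.
slowdown() {
    awk -v base="$1" -v measured="$2" 'BEGIN { printf "%.2f\n", measured / base }'
}

slowdown 28 55    # deploy, async mode: 1.96
slowdown 35 58    # copy,   async mode: 1.66
slowdown 28 71    # deploy, sync mode:  2.54
slowdown 35 92    # copy,   sync mode:  2.63
```

In other words, in this particular test VDO was between about 1.7 and 2.6 times slower than XFS on a normal LVM volume.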
How to get started
Creating a VDO volume is surprisingly easy, as long as you are using RHEL/CentOS >= 7.5:
# yum install vdo kmod-kvdo
# vdo create --name=vdo_volume --device=/dev/vda
Beware: you should use a persistent device name (for example, a path under /dev/disk/by-id/) rather than /dev/vda; otherwise VDO may encounter problems when booting up.
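If you are not sure which persistent name belongs to your disk, a small loop over the symlinks in /dev/disk/by-id can tell you. This is a sketch under the assumption that udev has created such links for your device (some virtual disks have none, in which case nothing is printed); `persistent_names` is a helper name invented here.

```shell
#!/bin/sh
# persistent_names DEVICE [DIR]
# Lists the symlinks in DIR (default: /dev/disk/by-id) that resolve to DEVICE.
# Hypothetical helper, not part of the vdo tooling.
persistent_names() {
    target=$(readlink -f "$1")
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$target" ]; then
            echo "$link"
        fi
    done
}

persistent_names /dev/vda
```

Any path it prints can be passed to `--device=` in place of /dev/vda.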
You can now access the newly created volume at /dev/mapper/vdo_volume and create a filesystem on top of it. If you need assistance, here’s a complete guide: Disks, Partitions and Filesystems. Keep in mind that partitioning a VDO volume is not supported.
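For reference, here is one way the next steps could look with XFS. Treat it as a sketch rather than a recipe: the mount point /mnt/vdo and the `vdo_mkfs_steps` helper are assumptions made for this example, and by default the script only prints the commands; set RUN=1 and run it as root to actually execute them. The `-K` option tells mkfs.xfs not to discard blocks at creation time, and `vdostats --human-readable` reports how much space the volume is really using.

```shell
#!/bin/sh
# vdo_mkfs_steps VOLUME_NAME [MOUNTPOINT]
# Prints the commands to put XFS on a VDO volume and mount it.
# With RUN=1 the commands are also executed (requires root).
# The default mount point /mnt/vdo is an assumption made for this example.
vdo_mkfs_steps() {
    volume=/dev/mapper/$1
    mountpoint=${2:-/mnt/vdo}
    for step in \
        "mkfs.xfs -K $volume" \
        "mkdir -p $mountpoint" \
        "mount -o discard $volume $mountpoint" \
        "vdostats --human-readable"
    do
        echo "$step"
        if [ "${RUN:-0}" = 1 ]; then
            $step || return 1
        fi
    done
}

vdo_mkfs_steps vdo_volume
```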