Hard Disk Optimized Deduplication
Albireo technology supports purpose-built configurations for Hard Disk Drive (HDD) environments. These configurations are designed to maximize performance, minimize resource utilization, and manage the scalability of deduplication on mechanical storage. The techniques used in these environments include:
- Support for SHA-256 hashes in the Albireo SDK – allowing OEMs to avoid additional reads from disk in hardware environments that are able to support high performance strong cryptographic hash generation
- In-memory caching of the Albireo index – reducing the average number of expensive IOs to below 0.1% of lookups by taking advantage of advanced caching techniques
- Asynchronous write support in Albireo VDO – reducing the number of mechanical seeks by queuing up metadata information before synchronizing to disk
- Write coalescence in Albireo VDO – reducing the number of mechanical seeks required for random writes by using VDO virtualization capabilities to convert these into sequential physical writes to disk
Support for SHA-256 hashes in the Albireo SDK
The Albireo SDK supports both cryptographic and non-cryptographic hash functions for generating the chunk signatures used to identify duplicates during deduplication. The cryptographic option is SHA-256, a Secure Hash Algorithm designed by the U.S. National Security Agency. The non-cryptographic option is a variant of MurmurHash3, a hash function created by Austin Appleby. The cryptographic option is particularly useful in HDD environments, where seeks to disk are expensive. With a non-cryptographic hash, a matching signature is not proof that two chunks are identical, so any chunk identified as a duplicate must first be read back from disk and compared bit-by-bit against the incoming data. With a cryptographic hash, the chance of two different chunks sharing a signature is negligible, so this verification read can be skipped. The result is fewer seeks and better overall deduplication performance.
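The difference between the two hash options can be illustrated with a small sketch. This is not the Albireo SDK API: the `ChunkStore` class and its fields are hypothetical, and CRC-32 from the Python standard library stands in for a fast non-cryptographic hash such as MurmurHash3. The sketch only counts how many verification reads each option would cost.

```python
import hashlib
import zlib


class ChunkStore:
    """Toy chunk store (hypothetical) that counts verification reads from disk."""

    def __init__(self, strong_hash):
        self.strong_hash = strong_hash
        self.index = {}      # signature -> physical block number
        self.disk = []       # list position stands in for a disk address
        self.disk_reads = 0  # reads needed to verify suspected duplicates

    def _signature(self, chunk):
        if self.strong_hash:
            return hashlib.sha256(chunk).digest()
        # Assumption: CRC-32 is only a stand-in for a non-cryptographic hash.
        return zlib.crc32(chunk)

    def write(self, chunk):
        sig = self._signature(chunk)
        pbn = self.index.get(sig)
        if pbn is not None:
            if self.strong_hash:
                return pbn            # signature match is trusted, no read
            self.disk_reads += 1      # read the canonical copy back
            if self.disk[pbn] == chunk:
                return pbn            # verified duplicate
            # Signature collision: fall through and store as a new chunk.
        self.disk.append(chunk)
        pbn = len(self.disk) - 1
        self.index.setdefault(sig, pbn)
        return pbn


chunks = [b"a" * 4096, b"b" * 4096, b"a" * 4096, b"a" * 4096]
weak, strong = ChunkStore(strong_hash=False), ChunkStore(strong_hash=True)
for chunk in chunks:
    weak.write(chunk)
    strong.write(chunk)
print(weak.disk_reads, strong.disk_reads)
```

In this sketch every duplicate written through the non-cryptographic path costs one extra read, while the SHA-256 path never touches the disk for verification.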
In-memory caching in the Albireo Index
The Albireo index uses in-memory caching techniques to avoid unnecessary disk I/Os. These techniques exploit both the temporal and the spatial locality of data, properties characterized through more than 9 years of experience with real-world enterprise deduplication environments. As a result, fewer than 1,000 disk I/Os (under 0.1%) are required per million new blocks written to a system that uses the index.
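As a rough illustration of how temporal locality keeps index lookups off the disk, the sketch below puts a plain least-recently-used cache in front of a dictionary that stands in for the on-disk index. The `CachedIndex` class and its method names are hypothetical, and a simple LRU policy is an assumption made for the example; the source does not describe the index's actual caching techniques.

```python
from collections import OrderedDict


class CachedIndex:
    """Hypothetical LRU cache in front of a slow, on-disk signature index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # most recently used entries at the end
        self.backing = {}           # stands in for the on-disk index
        self.lookups = 0
        self.disk_ios = 0

    def _remember(self, sig, value):
        self.cache[sig] = value
        self.cache.move_to_end(sig)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used

    def insert(self, sig, value):
        self.backing[sig] = value
        self._remember(sig, value)

    def lookup(self, sig):
        self.lookups += 1
        if sig in self.cache:
            self.cache.move_to_end(sig)     # cache hit, no disk access
            return self.cache[sig]
        self.disk_ios += 1                  # cache miss costs one disk I/O
        value = self.backing.get(sig)
        self._remember(sig, value)
        return value

    def io_rate(self):
        return self.disk_ios / self.lookups if self.lookups else 0.0
```

The stated target of fewer than 1,000 disk I/Os per million blocks corresponds to `io_rate()` staying below 0.001 in a model like this one.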
Asynchronous write support in Albireo VDO
VDO provides an asynchronous write mode to meet the needs of mechanical storage solutions. In asynchronous mode, data resiliency is guaranteed only after Albireo processes a flush request. Completing a flush request requires that all data writes that preceded it, together with their associated metadata, reside on stable storage.
During development of the asynchronous write policy, care was taken to ensure that requests such as REQ_FLUSH and REQ_FUA are handled properly. When operating in asynchronous mode, Albireo advertises to the kernel that it can receive REQ_FLUSH and REQ_FUA requests, even if the underlying device does not itself advertise a write-back cache. When Albireo receives either of these requests, it initiates writes for the appropriate metadata, waits for them to complete, and then, if the underlying device requires it, sends a REQ_FLUSH command down to the device itself. Only after this sequence completes does Albireo respond to the request.
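The flush sequence described above can be modeled in a few lines. This is a user-space sketch only, not kernel or VDO code: `BackingDevice`, `AsyncVolume`, and their methods are hypothetical names, and the writes here are synchronous, so the "wait for completion" step is implicit.

```python
class BackingDevice:
    """Hypothetical device that records the order of operations it receives."""

    def __init__(self, has_write_back_cache):
        self.has_write_back_cache = has_write_back_cache
        self.log = []

    def write(self, what):
        self.log.append(("write", what))

    def flush(self):
        self.log.append(("flush", None))


class AsyncVolume:
    """Toy model of asynchronous mode: metadata is queued until a flush."""

    def __init__(self, device):
        self.device = device
        self.pending_metadata = []

    def write(self, block):
        self.device.write(f"data:{block}")
        self.pending_metadata.append(f"meta:{block}")
        return "ack"  # acknowledged before the metadata is on stable storage

    def handle_flush(self):
        # Step 1: write out the queued metadata (synchronous in this sketch).
        for entry in self.pending_metadata:
            self.device.write(entry)
        self.pending_metadata.clear()
        # Step 2: pass a flush down only if the device needs one.
        if self.device.has_write_back_cache:
            self.device.flush()
        # Step 3: only now respond to the original request.
        return "flush-complete"
```

Queuing the metadata and writing it in one batch at flush time is what lets asynchronous mode trade a window of unguaranteed writes for fewer mechanical seeks.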
Write coalescence in Albireo VDO
If a system is restarted after an unclean shutdown, Albireo performs a rebuild to verify the consistency of its metadata and repairs it if necessary. Rebuilds are automatic and do not require user intervention. Because data resiliency in asynchronous mode is guaranteed only after VDO processes a flush request, the rebuild recovers all writes that were acknowledged before the last acknowledged flush request.
Albireo also provides performance benefits because it is designed to send sequential write requests to storage. Mechanical disks perform best when reading and writing data in sequence, since sequential access avoids head seeks, so this design plays to the natural strengths of hard disk storage.
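One way to picture write coalescence is a logical-to-physical block map that places each incoming write at the next free physical block, regardless of its logical address. The sketch below is a simplified model under that assumption: `CoalescingMapper` and `seek_distance` are hypothetical names, and the model ignores deduplication and space reclamation.

```python
class CoalescingMapper:
    """Hypothetical block map that turns random logical writes into
    sequential physical writes."""

    def __init__(self):
        self.block_map = {}     # logical block -> physical block
        self.next_physical = 0  # next free physical block (append-only here)

    def write(self, logical_block):
        physical = self.next_physical
        self.next_physical += 1
        # A rewrite of the same logical block simply remaps it.
        self.block_map[logical_block] = physical
        return physical


def seek_distance(addresses):
    """Total head travel, in blocks, when visiting addresses in order."""
    return sum(abs(b - a) for a, b in zip(addresses, addresses[1:]))


logical = [907, 12, 5541, 300, 12]
mapper = CoalescingMapper()
physical = [mapper.write(block) for block in logical]
print(seek_distance(logical), seek_distance(physical))
```

Comparing the two totals shows the effect: the scattered logical addresses imply long head movements, while the remapped physical addresses are adjacent.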