Hard Disk Optimized Deduplication


Albireo technology supports purpose-built configurations for Hard Disk Drive (HDD) environments. These configurations are designed to maximize performance, minimize resource utilization, and keep deduplication scalable when used with mechanical storage. The techniques used in these environments include:

  • Support for SHA-256 hashes in the Albireo SDK – allowing OEMs to avoid additional reads from disk in hardware environments that are able to support high performance strong cryptographic hash generation
  • In-memory caching of the Albireo index – reducing expensive disk I/Os to fewer than 0.1% of lookups on average by taking advantage of advanced caching techniques
  • Asynchronous write support in Albireo VDO – reducing the number of mechanical seeks by queuing up metadata information before synchronizing to disk
  • Write coalescence in Albireo VDO – reducing the number of mechanical seeks required for random writes by using VDO virtualization capabilities to convert them into sequential physical writes to disk

Support for SHA-256 hashes in the Albireo SDK

The Albireo SDK supports both cryptographic and non-cryptographic hash functions for generating the chunk signatures used to identify duplicates during deduplication. The cryptographic option uses SHA-256, a Secure Hash Algorithm designed by the U.S. National Security Agency, while the non-cryptographic option uses a variant of the MurmurHash3 function developed by Austin Appleby. The cryptographic option is particularly useful in HDD environments, where seeks to disk are expensive. When a non-cryptographic hash is used, a signature match alone is not sufficient proof of duplication: the canonical copy must first be read from disk and compared bit-by-bit against the incoming chunk. A cryptographic signature makes this second step unnecessary, which reduces seeks and improves overall deduplication performance.
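The difference between the two hash options can be illustrated with a minimal Python sketch. This is not the Albireo SDK API: `ChunkStore` and its methods are hypothetical names, the "disk" is an in-memory list, and `zlib.crc32` merely stands in for a fast non-cryptographic hash (it is not MurmurHash3). The sketch only counts how many verification reads each option would need.

```python
import hashlib
import zlib


class ChunkStore:
    """Toy chunk store (hypothetical) that counts simulated disk reads."""

    def __init__(self, strong_hash):
        self.strong_hash = strong_hash
        self.index = {}       # signature -> position of the canonical copy
        self.disk = []        # stored chunks, standing in for on-disk blocks
        self.disk_reads = 0   # verification reads issued so far

    def _signature(self, chunk):
        if self.strong_hash:
            return hashlib.sha256(chunk).digest()
        # Assumption: CRC-32 as a stand-in for a non-cryptographic hash.
        return zlib.crc32(chunk)

    def write(self, chunk):
        """Store a chunk and return True if it was deduplicated."""
        sig = self._signature(chunk)
        pos = self.index.get(sig)
        if pos is not None:
            if self.strong_hash:
                return True           # signature match is trusted, no read
            self.disk_reads += 1      # weak hash: read the canonical copy
            if self.disk[pos] == chunk:
                return True           # comparison confirmed the duplicate
            # Signature collision: fall through and store as new data.
        self.disk.append(chunk)
        self.index.setdefault(sig, len(self.disk) - 1)
        return False


chunks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
weak = ChunkStore(strong_hash=False)
strong = ChunkStore(strong_hash=True)
weak_results = [weak.write(c) for c in chunks]
strong_results = [strong.write(c) for c in chunks]
print(weak.disk_reads, strong.disk_reads)
```

Both stores keep two unique chunks, but only the store using the non-cryptographic stand-in has to read the canonical copy back for each duplicate it finds.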

In-memory caching in the Albireo index

The Albireo index uses in-memory caching techniques to avoid unnecessary disk I/Os. These techniques take advantage of the temporal and spatial locality of data, characterized through more than nine years of experience with real-world enterprise deduplication environments, to ensure that fewer than 1,000 disk I/Os are required per million new blocks written to a system using the index.
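The source does not describe the caching algorithm itself, so the following Python sketch uses a generic least-recently-used (LRU) cache purely to illustrate how temporal locality lets an in-memory layer absorb index lookups. `CachedIndex` and all of its members are hypothetical, and this is not a description of how the Albireo index is implemented. For reference, the stated bound of 1,000 disk I/Os per 1,000,000 blocks corresponds to 0.1%.

```python
from collections import OrderedDict


class CachedIndex:
    """Generic LRU cache in front of a simulated on-disk index."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # in-memory entries, least recent first
        self.on_disk = {}            # stands in for the on-disk index
        self.disk_ios = 0            # lookups that had to go to disk

    def insert(self, signature, location):
        self.on_disk[signature] = location
        self._remember(signature, location)

    def lookup(self, signature):
        if signature in self.cache:
            self.cache.move_to_end(signature)   # mark as most recently used
            return self.cache[signature]
        self.disk_ios += 1                      # cache miss: one disk I/O
        location = self.on_disk.get(signature)
        if location is not None:
            self._remember(signature, location)
        return location

    def _remember(self, signature, location):
        self.cache[signature] = location
        self.cache.move_to_end(signature)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used


idx = CachedIndex(capacity=2)
for location, signature in enumerate(["a", "b", "c"]):
    idx.insert(signature, location)

# "c" and "b" are still cached; "a" was evicted and costs one disk I/O,
# after which it is cached again.
found = [idx.lookup(s) for s in ["c", "b", "a", "a"]]
print(found, idx.disk_ios)
```

Recently used signatures are answered from memory, so only the lookup for the evicted entry reaches the simulated disk.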

Asynchronous write support in Albireo VDO 

VDO provides an asynchronous write mode to meet the needs of mechanical storage solutions. In asynchronous mode, data resiliency is guaranteed only after Albireo processes a flush request, which requires that all data writes that preceded the flush request, along with their associated metadata, reside on stable storage.

During development of the asynchronous write policy, care was taken to ensure that requests such as REQ_FLUSH and REQ_FUA are handled properly. When operating in asynchronous mode, Albireo advertises to the kernel that it can receive REQ_FLUSH and REQ_FUA requests, even if the underlying device does not itself advertise a write-back cache. When Albireo receives either of these requests, it initiates writes for the appropriate metadata, waits for them to complete, and then, if the underlying device requires it, sends a REQ_FLUSH command down to the device itself. Only after this sequence completes does it respond to the request.
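The ordering described above can be sketched as a small Python simulation. It is not kernel code and does not use any real block-layer API: `BackingDevice`, `AsyncVolume`, and their methods are hypothetical, and the assumption that data is written immediately while metadata is queued is a simplification for illustration.

```python
class BackingDevice:
    """Simulated device that records the operations it receives."""

    def __init__(self, has_write_back_cache):
        self.has_write_back_cache = has_write_back_cache
        self.log = []

    def write(self, what):
        self.log.append(f"write {what}")

    def flush(self):
        self.log.append("flush")


class AsyncVolume:
    """Toy model of the flush sequence in asynchronous mode."""

    def __init__(self, device):
        self.device = device
        self.pending_metadata = []   # metadata updates not yet on disk

    def write(self, block):
        self.device.write(f"data {block}")
        self.pending_metadata.append(block)   # queued, not yet synchronized
        return "ack"

    def flush(self):
        # 1. Write out the metadata for everything that preceded the flush.
        for block in self.pending_metadata:
            self.device.write(f"metadata {block}")
        self.pending_metadata.clear()
        # 2. Pass a flush down only if the device needs one.
        if self.device.has_write_back_cache:
            self.device.flush()
        # 3. Only now is the flush request answered.
        return "flush complete"


dev = BackingDevice(has_write_back_cache=True)
vol = AsyncVolume(dev)
vol.write(7)
vol.write(3)
result = vol.flush()
print(dev.log)
```

In this model the metadata writes always precede the device flush, and the caller sees the completion only after both steps have finished.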

Write coalescence in Albireo VDO

When a system is restarted after an unclean shutdown, Albireo performs a rebuild to verify the consistency of its metadata and repair it if necessary. Rebuilds are automatic and do not require user intervention. Because asynchronous mode guarantees data resiliency only after VDO processes a flush request, all writes that were acknowledged before the last acknowledged flush request will be rebuilt.

Albireo also provides performance benefits because it is designed to send sequential write requests to storage. This plays to the natural strength of hard disk storage: mechanical disks perform best when reading and writing data in sequence, because sequential access avoids seeks.
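A minimal Python sketch of the idea, assuming a simple logical-to-physical block map with append-only allocation: `RemappingVolume` is a hypothetical name, and the sketch ignores space reclamation, deduplication, and persistence of the map. It is meant only to show how a virtualization layer can turn random logical writes into sequential physical ones.

```python
class RemappingVolume:
    """Toy block map that places every write at the next physical block."""

    def __init__(self):
        self.block_map = {}      # logical block address -> physical block address
        self.next_physical = 0   # next free physical block, handed out in order

    def write(self, logical_address):
        physical = self.next_physical
        self.next_physical += 1
        self.block_map[logical_address] = physical
        return physical


vol = RemappingVolume()
random_logical_writes = [907, 12, 455, 3]
physical_order = [vol.write(lba) for lba in random_logical_writes]
rewrite_location = vol.write(12)   # an overwrite also lands at the next block
print(physical_order, rewrite_location)
```

However scattered the logical addresses are, the physical addresses come out consecutive, so the disk head does not need to seek between writes.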

 

 
