Dedupe isn’t always “dedupe” and Dedupe Everywhere isn’t always “everywhere”
In my last two posts, I explored why deduplication approaches vary widely and why dedupe everywhere sometimes isn’t really “everywhere”. I hope I showed that two products can both be called dedupe, or dedupe everywhere, while delivering very different things. Although the approaches vary, each one stands or falls on how well it addresses the key performance issues that have, until now, kept deduplication from being deployed more broadly outside of the backup use case.
The two key issues are indexing and memory efficiency. Let’s take a look at them both:
Indexing is the critical linchpin in using deduplication. You may recall that files are reduced to data strings, or chunks. Each data chunk is run through a hashing algorithm (in the case of Albireo it is SHA-256) to create a hash key for that particular chunk. The hash keys are housed in an index, and when a new key is created it is compared to the existing hash keys to determine whether it already exists. If it does, a pointer to the original data is created and the actual data is not stored again. If it does not, the new key is added to the index and the data is stored. The same steps repeat for every chunk that follows.
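To make those steps concrete, here is a minimal sketch in Python. It is not Albireo's implementation: the `DedupeStore` class, the fixed 4 KB chunk size and the in-memory dictionary are assumptions chosen only for illustration. The one detail taken from the description above is the use of SHA-256 to produce the hash key.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


class DedupeStore:
    """Toy chunk store: keeps one copy of each unique chunk."""

    def __init__(self):
        self.index = {}   # hash key -> position of the chunk in self.chunks
        self.chunks = []  # the unique chunk data actually stored

    def write(self, data):
        """Store data and return one pointer per chunk."""
        pointers = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            key = hashlib.sha256(chunk).digest()  # 256-bit hash key
            position = self.index.get(key)
            if position is None:
                # New key: store the data and add the key to the index.
                position = len(self.chunks)
                self.chunks.append(chunk)
                self.index[key] = position
            # Known key: nothing is stored, only a pointer is recorded.
            pointers.append(position)
        return pointers

    def read(self, pointers):
        """Rebuild the original data from its pointers."""
        return b"".join(self.chunks[p] for p in pointers)


store = DedupeStore()
data = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE + b"A" * CHUNK_SIZE
pointers = store.write(data)
print(pointers, len(store.chunks))
```

In this run the third chunk repeats the first, so only two chunks are physically stored and the third is represented by a pointer to the first.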
So far so good. This is simple, right? Not exactly! When the hash keys (256 bits each) for millions or billions of data chunks populate the index, that index gets very large, and this is where it starts to get difficult. To query the index efficiently for a duplicate hash key, the index should be kept in memory, because memory delivers the lowest latency. However, a GB of memory fills up pretty quickly, and that caps the amount of data that can be analyzed for duplication. Whether it’s 1 GB or many GBs, there will be a ceiling. Today data stores are exploding in size, and the ability to scale out deduplication is critical to reducing the amount of storage consumed.
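A quick back-of-the-envelope calculation shows how fast that ceiling arrives. The sketch below assumes a 4 KB chunk size and counts only the raw key bytes, ignoring pointers and any other index overhead, so treat the results as rough orders of magnitude. The function name and parameters are hypothetical.

```python
GIB = 2 ** 30
TIB = 2 ** 40


def addressable_bytes(memory_bytes, bytes_per_key, chunk_size=4096):
    """Data that an in-memory index of the given size can cover.

    chunk_size is an assumed value; only raw key bytes are counted.
    """
    keys = int(memory_bytes / bytes_per_key)
    return keys * chunk_size


# Full 256-bit (32-byte) hash keys held in 1 GiB of memory.
full_keys = addressable_bytes(GIB, 32)
print(full_keys // GIB, "GiB of data per GiB of index")

# A much denser key representation, e.g. 0.1 bytes per key on average.
dense_keys = addressable_bytes(GIB, 0.1)
print(round(dense_keys / TIB), "TiB of data per GiB of index")
```

Under these assumptions, 1 GiB of full 32-byte keys covers about 128 GiB of data, while a representation that averages 0.1 bytes per key would cover on the order of 40 TiB in the same memory. That gap is why the size of each index entry matters so much for scale.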
Without getting into too much detail in this blog, Permabit has developed patented deduplication techniques that combine dense and sparse indexing in a hybrid approach to address both scale-out and resource efficiency. The result is a key representation that can average less than 0.1 bytes per key, scale-out to over 20PB depending on available memory, and ingest rates of over 400GB/s, which addresses the latency issues. That’s where the resource efficiency comes into play! In short, Permabit engineers solved the “dedupe conundrum” of scale-out and resource efficiency to take deduplication beyond the initial use case of backup. As a result of their technical leadership, Albireo is the industry-leading dedupe technology in performance and efficiency. You may want to review the recent Lab Analysis by ESG on Permabit Albireo. The ESG folks saw the initial Albireo deployment (reviewed 11/2009) and completed an updated Lab Review in 8/2011.
Albireo is packaged in an SDK that enables it to be deployed into software and firmware environments with minimal additional code, typically within a few weeks. Because the performance and scalability issues have been solved, the SDK can be embedded into applications, operating systems and databases as well as storage devices (local and cloud), enabling the Dedupe Everywhere I have discussed earlier.
The Dedupe Everywhere approach is complete and integrated and, as a result, both the effective cost of storage and the physical amount of storage are reduced at every layer. Simply put, money is saved on storage acquisition, space is saved on storage devices and operating costs are dramatically reduced across the board. With this type of compound savings, the impact on CAPEX and OPEX makes Dedupe Everywhere universal.
Deduplication has been a checkbox feature found in backup and a few primary storage implementations. With the financial and efficiency impact Dedupe Everywhere brings, it is about to become a core competency and requisite capability for storage and full-service vendors. Thanks to its flexibility in deployment, Dedupe Everywhere will change the information storage landscape dramatically: it effectively identifies duplicates, scales, performs, integrates and maintains data integrity for the duration of the data lifecycle (from application through multiple tiers of storage, including backup and archive, through to the cloud). Permabit Albireo “Dedupe Everywhere” is the next generation of this cost-effective, storage-saving technology.