
Deduplication – Moving Beyond Backup

It was just a matter of time: the Dedupe 1.0 era is coming to an end! For readers who may not know the phrase, Dedupe 1.0 is what we call the initial phase of deduplication, when the use case was backup. The reason is simple enough: backups by their nature contain a very high rate of duplicate data. Since the daily backup is 90% duplicate of the previous day's, it stands to reason that the initial use case for dedupe would be backup.
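To put rough numbers on that, here is a small sketch in Python. It takes the 90% figure above at face value and assumes a hypothetical 1,000 GB full backup taken daily for 30 days; the function name, the sizes and the simple "only the changed 10% is new" model are illustrative assumptions, not measurements.

```python
def stored_after_dedupe(full_backup_gb, days, duplicate_rate):
    """Estimate space used when each daily backup repeats `duplicate_rate`
    of the previous day's data (an assumed model, not a vendor formula)."""
    if days < 1:
        return 0.0
    first_day = full_backup_gb                              # first backup is stored whole
    new_per_day = full_backup_gb * (1.0 - duplicate_rate)   # only changed data is added
    return first_day + new_per_day * (days - 1)


raw_gb = 1000 * 30                                 # 30 full copies, no dedupe
deduped_gb = stored_after_dedupe(1000, 30, 0.90)
print(f"raw: {raw_gb} GB, deduplicated: {deduped_gb:.0f} GB, "
      f"ratio: {raw_gb / deduped_gb:.1f}:1")
```

Under those assumptions, a month of backups shrinks from 30,000 GB to roughly 3,900 GB, which is why backup was such an easy first target.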

Unfortunately, the backup environment dictated the design and application of deduplication. For example, backups are not usually petabytes or more in size, so the design targeted smaller data sets. In addition, ingestion of the backup data needed to be as short as possible to minimize the impact on running applications, so some vendors just did a quick copy and deduped the backup data later. Dedupe uses hash algorithms to create a unique ID (a hash key) for each piece of data so that duplicates can be found. Applying an algorithm with a shorter key, such as MD5 or SHA-1, delivered more processing efficiency. That let vendors use less costly, slower processors, keeping price points down or maybe leaving a bit more margin in the offering. Unfortunately, it also exposed some of the data to possible corruption, because shorter hash keys are more prone to collisions, meaning two different pieces of data can end up with the same key! OOPS!
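For readers who want to see the hash key idea in action, here is a minimal sketch in Python using the standard `hashlib` module. The `DedupeStore` class, the fixed 4 KB chunk size and the sample data are all assumptions made for illustration; this is not how any particular product is built.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products vary


class DedupeStore:
    """Toy content-addressed store: each unique chunk is kept once."""

    def __init__(self, hash_name="sha256"):
        self.hash_name = hash_name
        self.chunks = {}  # hash key -> chunk bytes

    def write(self, data):
        """Split data into chunks, store unseen ones, return the list of keys."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            key = hashlib.new(self.hash_name, chunk).digest()
            # The key alone decides whether the chunk is "already stored".
            # If two different chunks ever shared a key, the second one
            # would be silently replaced by the first on read-back.
            if key not in self.chunks:
                self.chunks[key] = chunk
            recipe.append(key)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its list of keys."""
        return b"".join(self.chunks[key] for key in recipe)


store = DedupeStore()
monday = b"A" * CHUNK_SIZE * 9 + b"B" * CHUNK_SIZE   # made-up "backup" data
tuesday = b"A" * CHUNK_SIZE * 9 + b"C" * CHUNK_SIZE  # 90% the same as Monday
r1 = store.write(monday)
r2 = store.write(tuesday)
print(len(r1) + len(r2), "chunks written,", len(store.chunks), "chunks stored")
```

Swapping `hash_name` to `"md5"` or `"sha1"` gives shorter keys (16 and 20 bytes versus 32 for SHA-256), which is the trade-off described above.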

Those days are behind us. The advent of quad-core processors that can be deployed in pairs provided enough compute capability to perform deduplication inline, eliminating the post-processing step, and even to use SHA-256 hashing for better data integrity. The result of these advances is that dedupe is now ready to move beyond backup into tier 2 storage and, finally, primary storage. We are calling this Dedupe 2.0!
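One rough way to compare key sizes is the standard birthday-bound approximation for accidental collisions. The sketch below is in Python; the function name, the 1 PB data set and the 4 KB chunk size are assumptions for illustration. It only models chance collisions between random keys. It says nothing about deliberately constructed collisions, which have been publicly demonstrated for both MD5 and SHA-1.

```python
import math


def collision_probability(num_chunks, hash_bits):
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -num_chunks * (num_chunks - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(exponent)  # equals 1 - exp(exponent), accurate for tiny values


# Assumed workload: 1 PB of data split into 4 KB chunks.
chunks_in_1pb = 2 ** 50 // 4096

for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    p = collision_probability(chunks_in_1pb, bits)
    print(f"{name:8s} ({bits} bits): {p:.2e}")
```

Each extra bit of key length halves the chance of an accidental collision for the same amount of data, so a longer key leaves far more headroom as data sets grow.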

Dedupe 2.0 applies to data in disk-based archives, tier 2 data on less costly NAS architectures, and even primary data. Duplicates exist in each of these data stores, and the more efficient the stores are, the less storage, power, cooling and floor space they require and, in these budget-challenged times for IT, the lower the cost of storing that data!

I have been tracking market trends and buying patterns on this subject for the last year and am very glad to see the 2010 IT user survey published in Search Storage on February 15. The survey of more than 360 IT professionals across all market segments indicated that 77% are either deploying or evaluating deduplication in primary or tier 2-n storage in 2010. So 2010 will be a watershed year for deduplication: almost 80% of IT professionals will take dedupe to the next level, to Dedupe 2.0. Why? Because it will save them money, resources and, most importantly, time and effort. The next step in deduplication is arriving this year!
