Compression and Dedupe: Redux
Yesterday The Storage Alchemist at Storwize posted a complaint about Tom’s discussion of compression and deduplication. We certainly aren’t savaging compression technologies. I think it’s clearer to read our points not as a criticism of compression, but as a list of concerns about bump-in-the-wire optimization appliances. We absolutely agree with Steve the Alchemist that data compression and data deduplication are two technologies that complement one another well: we use both in our Enterprise Archive Value NAS and Cloud Storage offerings, and we make it possible for our partners to compress (if they so choose) when using our Albireo SDK.
I’ll comment on his technical concerns.
Compression and deduplication are very similar in that they identify and eliminate redundant data, but the scope of this duplicate identification is vastly different. Traditional compression works over a small window of data and matches short duplicate segments, so that the compression tables fit efficiently in a very small amount of memory. Deduplication, by contrast, looks for much larger duplicate segments across the whole data set. Storwize may not be using a 64 KB window, but I imagine the order of magnitude is about right… and that’s not a criticism of their technology at all. In fact, the way Storwize manages data in chunks so that they can maintain performance is very clever.
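To make the difference in scope concrete, here is a minimal Python sketch. It is an illustration only, not a description of how Storwize or Albireo work: zlib's DEFLATE, with its 32 KB window, stands in for traditional compression, and a hypothetical fixed-size chunk index (`dedupe_size` and `CHUNK` are names made up for this example) stands in for deduplication. The input is a block of random bytes followed by an identical copy 256 KiB later, which is well outside the compressor's window.

```python
import hashlib
import os
import zlib

CHUNK = 4096  # hypothetical fixed chunk size, chosen only for this sketch


def dedupe_size(data: bytes, chunk: int = CHUNK) -> int:
    """Bytes of unique chunk payload left after fixed-size chunk dedupe.

    Index and reference overhead are ignored to keep the sketch short.
    """
    seen = set()
    unique = 0
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        digest = hashlib.sha256(piece).digest()
        if digest not in seen:
            seen.add(digest)
            unique += len(piece)
    return unique


block = os.urandom(256 * 1024)  # random bytes: no short-range redundancy
data = block + block            # the duplicate sits 256 KiB away

compressed = len(zlib.compress(data, 9))
deduped = dedupe_size(data)

print("original  :", len(data))
print("compressed:", compressed)
print("deduped   :", deduped)
```

On this deliberately one-sided input the compressor cannot see the second copy, while the chunk index removes it entirely. Real data mixes both kinds of redundancy, so the numbers here say nothing about any particular product.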
Calling deduplication lossy is nonsense; both compression and dedupe replace redundant data with references to other instances of that data, just at different scales, as I note above. Unlike Ocarina’s NFO, which frighteningly throws away actual content, both dedupe and traditional compression return the original bitstream. Tom’s point was that Albireo embedded dedupe leverages existing file and block system concepts to make those references, so no interaction with our software is required on read. A compression appliance, on the other hand, modifies the data format before it reaches the storage array, which creates data lock-in. Take away the appliance, and the storage is full of uninterpretable data. That’s a concern for storage vendors and users alike.
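The lossless point is easy to check with a toy example. The Python sketch below is an assumption-laden illustration, not Albireo's implementation: `dedupe` and `read_back` are hypothetical helpers that keep one copy of each fixed-size chunk plus an ordered list of references, and zlib stands in for traditional compression. Both paths hand back exactly the bytes that went in.

```python
import hashlib
import zlib


def dedupe(data: bytes, chunk: int = 4096):
    """Split data into fixed-size chunks and keep one copy of each.

    Returns (store, recipe): store maps a chunk hash to its bytes, and
    recipe is the ordered list of hashes that references those chunks.
    """
    store = {}
    recipe = []
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        key = hashlib.sha256(piece).hexdigest()
        store.setdefault(key, piece)  # a duplicate chunk is stored once
        recipe.append(key)
    return store, recipe


def read_back(store, recipe) -> bytes:
    """Follow the references in order to rebuild the original bytes."""
    return b"".join(store[key] for key in recipe)


original = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"tail"

store, recipe = dedupe(original)
restored_from_dedupe = read_back(store, recipe)
restored_from_zlib = zlib.decompress(zlib.compress(original))

print("chunks referenced:", len(recipe), "chunks stored:", len(store))
print("dedupe lossless:", restored_from_dedupe == original)
print("zlib lossless  :", restored_from_zlib == original)
```

The repeated chunk of `A` bytes is stored once and referenced twice, and nothing is discarded along the way.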
As to the chart, when you look at this as ‘embedded dedupe’ vs. ‘appliance-mediated compression’, you can see why Tom says that appliance compression alters the data, and Albireo dedupe does not require ‘rehydration’. As for ‘optimizes block’, I haven’t yet seen Storwize’s block optimization products, so I can’t comment, but I do wonder how they make the space saved by compression available to the user. We agree that savings are absolutely data dependent. In general, deduplication alone offers more savings than compression alone, and both together give the best results by far. Perhaps we can work together to ensure Albireo and Storwize yield optimal results?
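For anyone who wants to see how the two techniques stack, here is a small Python sketch that measures compression alone, dedupe alone, and dedupe followed by compression of the unique chunks. Everything in it is an assumption made for illustration: the input is synthetic text-like data repeated four times, `unique_chunks` is a hypothetical fixed-size chunker, and zlib stands in for a traditional compressor. The ratios it prints describe this input only, since savings depend on the data.

```python
import hashlib
import zlib

CHUNK = 4096  # hypothetical fixed chunk size, chosen only for this sketch


def unique_chunks(data: bytes, chunk: int = CHUNK) -> list:
    """Return one copy of each distinct fixed-size chunk, in first-seen order."""
    seen = {}
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        seen.setdefault(hashlib.sha256(piece).digest(), piece)
    return list(seen.values())


# Synthetic text-like records, truncated to a whole number of chunks so
# that the repeated copies line up on chunk boundaries.
text = "".join(
    f"record {i}: status=ok value={(i * 7919) % 1000003}\n" for i in range(20000)
)
block = text.encode()[: 64 * CHUNK]
data = block * 4  # four identical copies, each 256 KiB apart

chunks = unique_chunks(data)

original = len(data)
compression_only = len(zlib.compress(data, 9))
dedupe_only = sum(len(c) for c in chunks)
both = sum(len(zlib.compress(c, 9)) for c in chunks)

print("original        :", original)
print("compression only:", compression_only)
print("dedupe only     :", dedupe_only)
print("dedupe + zlib   :", both)
```

Dedupe removes the long-range copies that the compressor's window never sees, and compression then shrinks the local redundancy inside each remaining chunk, which is why the combined figure comes out smallest here.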