
Compression and Dedupe: Redux

Yesterday The Storage Alchemist at Storwize posted a complaint about Tom’s discussion of compression and deduplication. We certainly aren’t savaging compression technologies. I think it’s clearer to read our points not as a criticism of compression, but as a list of concerns regarding bump-in-the-wire optimization appliances. We absolutely agree with Steve the Alchemist that data compression and data deduplication are two technologies that complement one another well: we use both in our Enterprise Archive Value NAS and Cloud Storage offerings, and we make it possible for our partners to compress (if they so choose) when using our Albireo SDK.

I’ll comment on his technical concerns.

Compression and deduplication are very similar in that they identify and eliminate redundant data, but the scope of this duplicate identification is vastly different. Traditional compression works on a small window of data and with short duplicate segments so that the compression tables fit efficiently in a very small amount of memory. Storwize may not be using a 64 KB window, but I imagine the order of magnitude is about right… and that’s not a criticism of their technology at all. In fact, the way Storwize manages data in chunks so that they can maintain performance is very clever.
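
To make the difference in scope concrete, here is a minimal Python sketch. It is not a description of how Storwize or Albireo work. It assumes zlib (DEFLATE, whose match window is at most 32 KB) as a stand-in for traditional compression, and a hypothetical `dedupe_size` helper that hashes fixed 4 KB chunks with SHA-256. The input is a 256 KB block of pseudo-random bytes written twice, so the only redundancy is a duplicate that sits far outside the compression window.

```python
import hashlib
import random
import zlib

CHUNK = 4096  # hypothetical fixed chunk size, chosen only for illustration


def dedupe_size(data: bytes, chunk: int = CHUNK) -> int:
    """Bytes left after storing each distinct fixed-size chunk only once."""
    seen = {}
    for off in range(0, len(data), chunk):
        piece = data[off:off + chunk]
        # Key each chunk by its SHA-256 digest; keep the first copy only.
        seen.setdefault(hashlib.sha256(piece).digest(), len(piece))
    return sum(seen.values())


# Deterministic pseudo-random block: incompressible on its own.
rng = random.Random(0)
n = 256 * 1024
block = rng.getrandbits(n * 8).to_bytes(n, "big")

# The same block written twice: the duplicate is 256 KB away from the original.
data = block + block

compressed_size = len(zlib.compress(data, 9))
deduped_size = dedupe_size(data)

print("original bytes:  ", len(data))
print("zlib bytes:      ", compressed_size)
print("deduped bytes:   ", deduped_size)
```

On this input zlib finds essentially nothing to remove, because the repeat lies beyond its window, while chunk-level dedupe stores the block once. Real data is far less extreme, but the scope difference is the same.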

Calling deduplication lossy is nonsense; both compression and dedupe replace redundant data with references to other instances of that data, just at different scales as I note above. Unlike Ocarina’s NFO, which frighteningly throws away actual content, both dedupe and traditional compression return the original bitstream.  Tom’s point was that Albireo embedded dedupe leverages existing file and block system concepts to make those references so no interaction with our software is required on read, while a compression appliance modifies the data format before it reaches the storage array, which creates data lock-in. Take away the appliance, and the storage is full of uninterpretable data. That’s a concern for storage vendors and users alike.

As to the chart, when you look at this as ‘embedded dedupe’ vs. ‘appliance-mediated compression’, you can see why Tom says that appliance compression alters the data, and Albireo dedupe does not require ‘rehydration’. As for ‘optimizes block’, I haven’t yet seen Storwize’s block optimization products, so I can’t comment, but I do wonder how they make the space saved by compression available to the user. We agree that savings are absolutely data dependent. In general, deduplication alone offers more savings than compression alone, and both together give the best results by far. Perhaps we can work together to ensure Albireo and Storwize yield optimal results?

  1. Steve - The Storage Alchemist on June 23, 2010 at 7:41 am

    Jered,

    Thanks for some additional clarification. I have a couple of comments / questions.

    First, perhaps we should talk about the value of the Storwize compression technology – we have no initial segment size. It is a part of our patented technology that looks at doing compression an entirely new way. Maybe there is a joint solution here, who knows.

    Also, my point, as I called out in my blog, wasn’t to say that deduplication was lossy; it was just no more ‘lossy’ than what Tom had alluded to in his piece.

    Next, if I read your comment right, you are saying that if you remove the Albireo software, customers can still read the deduplicated data created with your software off of disk? I find this extremely hard to believe. I think if an application (and that is what Albireo is) writes data, it also needs to be the one that reads the data, so I am confused on this point.

    Now all of that said, it would be interesting to see if and how the technologies could complement one another because I think, as we all agree, deduplication and compression are friends not enemies.

  2. Jered Floyd on June 28, 2010 at 10:15 pm

    That is correct; even if you remove the Albireo software the customer can still read the deduplicated data created with our software. Albireo operates as a sophisticated “duplicate advisory service” that delivers results across orders of magnitude more data than similar technologies. Once duplicates are identified, the block or extent merges are recorded in the existing file or block system metadata.

    For example, a file system implementation would use our advice to have two (or more) files share a duplicate block across multiple inodes. Once this has been done, the normal file system read path applies. As we acknowledge, this requires an integration effort on the part of the storage vendor, but it generally leverages tools they already have in their stack, and represents many benefits above their other dedupe options.
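
    The division of labor described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Albireo API: `ToyFileSystem`, `DuplicateAdvisor`, `advise` and `merge` are hypothetical names, blocks are whole byte strings, and the advisor is simply a SHA-256 index. The point it illustrates is that the advisor only suggests which blocks match; the file system records the merge in its own inode metadata, and the read path never consults the advisor.

```python
import hashlib


class ToyFileSystem:
    """Toy file system: inodes are lists of block ids (illustrative only)."""

    def __init__(self):
        self.blocks = {}   # block id -> bytes
        self.inodes = {}   # file name -> list of block ids
        self._next_id = 0

    def write(self, name, chunks):
        ids = []
        for chunk in chunks:
            self.blocks[self._next_id] = chunk
            ids.append(self._next_id)
            self._next_id += 1
        self.inodes[name] = ids

    def read(self, name):
        # Normal read path: follow the inode's block references.
        return b"".join(self.blocks[b] for b in self.inodes[name])

    def merge(self, keep, drop):
        """Repoint references from `drop` to `keep`, then free `drop`."""
        if keep == drop or self.blocks[keep] != self.blocks[drop]:
            return  # verify the advice before acting on it
        for ids in self.inodes.values():
            for i, block_id in enumerate(ids):
                if block_id == drop:
                    ids[i] = keep
        del self.blocks[drop]


class DuplicateAdvisor:
    """Hypothetical advisory index: content hash -> first block id seen."""

    def __init__(self):
        self.index = {}

    def advise(self, block_id, data):
        digest = hashlib.sha256(data).digest()
        return self.index.setdefault(digest, block_id)


fs = ToyFileSystem()
fs.write("a", [b"A" * 4096, b"B" * 4096])
fs.write("b", [b"B" * 4096, b"C" * 4096])
original_b = fs.read("b")

advisor = DuplicateAdvisor()
for block_id, data in list(fs.blocks.items()):
    canonical = advisor.advise(block_id, data)
    if canonical != block_id:
        fs.merge(canonical, block_id)

del advisor  # remove the advisory service entirely

print(len(fs.blocks), fs.read("b") == original_b)
```

    After the advisor is deleted, both files still read back unchanged, and the two inodes share one physical copy of the duplicate block.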

    I would certainly like to talk more about how we can work more closely together. Some basic documentation regarding Storwize’s on-disk compressed data representation would allow us to identify deduplication boundaries. What would be the best way to contact you on this?

  3. Mike Davis on July 9, 2010 at 4:55 pm

    Hi Steve:

    There’s a layman’s misconception about Albireo that it is a self-sufficient dedupe engine, whereas in reality it is a “chunk-and-lookup” service that just relays information to the OEM’s data management layer. In other words, all the additional work of dedupe (such as garbage collection in a post-process implementation) is implemented by the OEM developer who owns the block manager or file system. It’s a fundamental difference in philosophy from a more self-sufficient embedded dedupe package such as Ocarina’s that manages file maps and data containers. It will be interesting to see how the OEMs react to one approach or the other.

    Jered is careful (thanks Jered!) to only claim that Permabit software isn’t involved in the rehydration. But it is totally bogus for them to imply no rehydration is required and I think this is coming across in some of the marketing. In this case they’ve passed the buck to the OEM to do the rehydration. There’s no magic here, files still need rehydration.

    I do take exception to any generic claim that “dedupe alone offers more savings than compression alone.” Sure if you’re a D2D backup target that’s true, but in online storage applications compression typically delivers more results than dedupe, and in vertical applications this is *vastly* true. Ocarina’s product is the only product in the market with concurrent dedupe and compression, and empirically compression delivers better results in isolation. If you do it right both D+C can do better. And no, I’m not talking about imaging apps (although the answer there is even more obvious).

    • Jered Floyd on July 12, 2010 at 1:41 pm

      Mike,

      Thanks for stopping by. You’re right that the core parts of the Albireo software act to rapidly and scalably identify duplicates and the subsequent block or extent unification is handled by the OEM’s existing technologies — universally this is something OEMs tell us is a major strength of the system. Because a standard integration does not provide the mapping/redirection layer we eliminate having our software in the read path and thus eliminate data lock-in. If our software is disabled or removed, there is no threat to the end user’s data. This isn’t the case with compression — if the reader layer or application goes away the user data becomes unrecoverable. That’s why we feel strongly that while Albireo dedupe can be effectively licensed, integrated compression technology must be owned by the OEM, either through internal development or acquisition.

      “Rehydration” is perhaps poorly defined, but I feel comfortable saying that an Albireo integration requires no rehydration. Retrieving blocks from a deduplicated file involves reading the block references from an inode, which is the same operation as when Albireo is not in use. Unless you would call the normal block-gathering done by a standard file system read to be rehydration, a read of a deduped file does not require such an operation.

      We’re in agreement that deduplication plus compression yields the best results. Based on our customer data with Enterprise Archive, I stand by my statement that on an average user data set dedupe alone outperforms lossless compression alone. Some verticals certainly have non-dedupeable data that is stored in a very inefficient format (large decimal value representations that use less than 1/16th of the character set come to mind), and other customers may find lossy compression of that image data acceptable, but in the general enterprise use case deduplication performs better.
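
      The decimal-representation case can be illustrated with a small sketch on synthetic data (not customer data). It assumes 200,000 pseudo-random ASCII digits as a stand-in for such a format, zlib as the lossless compressor, and hypothetical fixed 4 KB chunks for the duplicate check. Digits use only 10 of 256 possible byte values, so an ideal coder needs about log2(10)/8, or roughly 41.5%, of the original size, while there are no repeated chunks for dedupe to find.

```python
import random
import zlib

CHUNK = 4096  # hypothetical fixed chunk size, for illustration only

# Synthetic "decimal representation" data: ASCII digits only.
rng = random.Random(0)
digits = "".join(rng.choice("0123456789") for _ in range(200_000)).encode("ascii")

# Dedupe view: count distinct fixed-size chunks.
chunks = [digits[i:i + CHUNK] for i in range(0, len(digits), CHUNK)]
unique_chunks = len(set(chunks))

# Compression view: lossless zlib at its highest level.
compressed = zlib.compress(digits, 9)
ratio = len(compressed) / len(digits)

print("chunks:", len(chunks), "unique:", unique_chunks)
print("compressed/original: %.2f" % ratio)
```

      Every chunk is unique, so dedupe saves nothing here, while compression removes a large share of the bytes simply because the encoding wastes most of each byte.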