Permabits and Petabytes blog: OEM data optimization for next generation storage

Primary Storage Deduplication is the Future

Until now, I’ve chosen to stay out of the little tempest in a teapot that’s going on over at Chuck Hollis’ blog, but it doesn’t seem to be quieting down. He basically says that data dedupe has no place on primary storage, which flies in the face of where the dedupe market is going… but it’s not a bad position to take when your company makes a lot of money off of very expensive primary storage.

Their biggest NAS competitor took the bait, and NetApp jumped into the fray. In between the vitriol, some good points are made about why Chuck is wrong. For example, if deduplication increases access to common blocks, you’ll see much better cache efficiency, which will offset the additional load on the drives. The “boot storms” he talks about with many virtual machines hosted on the same storage are actually less likely to occur with deduplication than without!
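To make the cache argument concrete, here is a toy model of a boot storm. It is only a sketch under stated assumptions: every VM boots from an identical image, the VMs read their blocks one after another, the cache is a plain LRU, and the names `LRUCache` and `simulate_boot_storm` are made up for illustration. It is not how any particular array implements caching.

```python
from collections import OrderedDict


class LRUCache:
    """A minimal least-recently-used block cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, physical_block):
        if physical_block in self.entries:
            # Cache hit: mark the block as most recently used.
            self.entries.move_to_end(physical_block)
            self.hits += 1
        else:
            # Cache miss: the block has to come from disk.
            self.misses += 1
            self.entries[physical_block] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the oldest entry


def simulate_boot_storm(num_vms, blocks_per_image, cache_blocks, deduplicated):
    """Return (hits, misses) when every VM reads its whole boot image once."""
    cache = LRUCache(cache_blocks)
    for vm in range(num_vms):
        for block in range(blocks_per_image):
            # Assumption: with dedupe, identical blocks in every image map to
            # one shared physical block; without it, each VM has its own copy.
            physical = block if deduplicated else (vm, block)
            cache.read(physical)
    return cache.hits, cache.misses


with_dedupe = simulate_boot_storm(20, 1000, 2000, deduplicated=True)
without_dedupe = simulate_boot_storm(20, 1000, 2000, deduplicated=False)
print("with dedupe (hits, misses):", with_dedupe)
print("without dedupe (hits, misses):", without_dedupe)
```

In this model only the first VM goes to disk when the images are deduplicated; the other nineteen are served from cache. Without dedupe every read is a miss, because each VM’s copy of the same block lives at a different physical address.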

Now Hu Yoshida has weighed in on Chuck’s side, but then laid out the much more reasonable view that virtualization, tiering and dynamic provisioning are critical in an environment with costly top-tier primary storage. That’s correct, but it doesn’t mean that deduplication at that top tier isn’t a huge win as well.

Deduplication has seen its first successes in the D2D backup space, where it’s easy to get a lot of deduplication due to the data patterns and traditional backup schedule. Applying deduplication beyond backup is hard, because the opportunities for deduplication are fewer and further between, and so these D2D backup devices have never been able to address archive or primary storage effectively. That doesn’t mean dedupe is bad for primary; it just means that it’s harder to do.
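A rough sketch shows why the backup schedule makes dedupe so easy there. Everything in it is an assumption for illustration: fixed 4 KiB chunks, SHA-256 fingerprints, a synthetic 100-chunk volume, and two changed chunks per week. It is not a description of Permabit’s technology or of any D2D product.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, in bytes


def make_chunk(label):
    """Build one deterministic 4 KiB chunk whose content depends only on label."""
    return hashlib.sha256(label.encode()).digest() * (CHUNK_SIZE // 32)


def dedupe_ratio(streams):
    """Logical chunks written divided by unique chunks actually stored."""
    unique = set()
    total = 0
    for data in streams:
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            unique.add(hashlib.sha256(chunk).digest())
            total += 1
    return total / len(unique)


# A hypothetical volume of 100 distinct chunks, backed up in full every week.
volume = [make_chunk(f"file-{i}") for i in range(100)]
backups = []
for week in range(5):
    if week > 0:
        # Assume only two chunks change between weekly full backups.
        volume[week] = make_chunk(f"edit-a-week-{week}")
        volume[week + 50] = make_chunk(f"edit-b-week-{week}")
    backups.append(b"".join(volume))

backup_ratio = dedupe_ratio(backups)         # five weekly fulls, mostly identical
primary_ratio = dedupe_ratio([backups[-1]])  # the live volume on its own
print(f"backup set: {backup_ratio:.2f}:1")
print(f"single primary copy: {primary_ratio:.2f}:1")
```

Five nearly identical full backups store 500 logical chunks in 108 unique ones, while the single live copy in this toy has nothing to fold together. Real primary data does contain duplicates, but they are scattered and have to be found inline and at speed, which is the hard part.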

At Permabit, we consider dedupe for backup to be Dedupe 1.0, and the future for innovation is in Dedupe 2.0, which includes dedupe for primary and cloud storage. We host a forum over at Dedupe 2.0 to discuss this further, and recently released our Permabit Cloud Storage product to address these new customer needs. I can’t give too much detail, but we’re constantly at work making our deduplication technology available to ever broader markets.

Dedupe for primary is a huge win for the storage consumer, but it’s taken us nearly a decade of extensive technology and patent development to solve the scalability and speed challenges needed for that market. I think it’s no coincidence that the two voices denouncing primary dedupe the most, HDS and EMC, have no products to offer that include a feature which will soon become a customer requirement.

If you’re going to be at Storage Networking World next week and would like to hear more on primary dedupe, Arun Taneja is moderating a panel, “Primary Storage: The New Frontier for Data Deduplication”. I’ll be there, along with Val Bercovici from NetApp, Carter George from Ocarina, and Peter Smails from Storwize. It should be a lively discussion! Perhaps Chuck will stop by?

  1. Hubert Yoshida on October 7, 2009 at 12:04 am

    Jered, thanks for commenting on and linking to my blog. As I explained in my blog, I am not against dedupe at any level. I am making the point that dedupe is really addressing the symptom and not the cause. The cause is stale data and over-allocation, which is not only backed up and deduped over and over again, but is copied and moved many times over. There may be 20 copies of some primary volumes. Do you redupe before you copy and move? Archive and Dynamic (thin) Provisioning address the source.

    http://blogs.hds.com/hu/2009/10/i-agree-with-chuck-on-data-dedupe.html#more-1581

  2. Jered Floyd on October 12, 2009 at 12:53 pm

    Hu, I see what you’re getting at here, but I also think that addressing the “symptom” (multiple copies of data) is sometimes the right thing to do. Consider the problem of dozens of revisions of a PowerPoint document. We could beg Microsoft to include revision control in PowerPoint (I’d actually like this very much), or we can let the storage solve the problem. Sometimes that’s the right solution.

    Finding and eliminating stale data is a great thing to do, but once you’ve hit the easy targets the cost grows significantly. You can be limited by technological restrictions like my example above. Having a platonically ideal solution (eliminate the data redundancy at its source) doesn’t mean that another method (have the storage identify and eliminate redundancy) might not work better in many cases. Just because I change my sheets once a week doesn’t mean there’s no sense in making the bed in the morning.

    So, I think we are in agreement. Different technologies can solve different problems (dedupe doesn’t help when moving between different storage systems), but different technologies can also solve the same problem in a cooperative way.