
All Deduplication is Not the Same

In a few short years deduplication technologies have become commonplace, and before long they will be a requirement for any new storage purchase. When everyone offers deduplication, are they all the same? Definitely not! Most of the deduplication technologies out there perform well only in very specific use cases like backup, and cannot handle general purpose data storage.

The biggest dichotomy in deduplication is dedupe for backup, be it VTL or disk-to-disk (D2D), and dedupe for general purpose primary or archive data storage. These uses are very different. In the case of dedupe for backup the system doesn’t have to work as hard to find commonalities in data so it can “cheat” a bit when looking for identical data. This is largely because dedupe for backup depends on telling customers to keep on going with their same old inefficient backup schedules of regular full backups, even though those backups are now going to spinning disk. It’s really easy to get 20x dedupe when you tell the customer to store the exact same data 20 times!
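To put numbers on that last point, here is a back-of-the-envelope sketch. The figures (a 1 TB system, 20 unchanged full backups) are hypothetical and chosen only to make the arithmetic obvious.

```python
def dedupe_ratio(logical_bytes, stored_bytes):
    """Ratio of data written by clients to data actually kept on disk."""
    return logical_bytes / stored_bytes

# Hypothetical figures: one 1 TB system, backed up in full 20 times,
# with nothing changing between backups.
full_backup = 1_000_000_000_000
backups = 20

# Backup workload: 20 TB written, 1 TB of unique data kept.
backup_ratio = dedupe_ratio(full_backup * backups, full_backup)

# Primary or archive workload: the same data is written only once,
# so there is no built-in redundancy to remove.
primary_ratio = dedupe_ratio(full_backup, full_backup)

print(backup_ratio, primary_ratio)
```

The 20x figure comes entirely from the backup schedule, not from anything clever the dedupe engine does.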

At the heart of any deduplication system is an index of data that has been seen before, catalogued by some form of “data fingerprint”. (Permabit uses SHA-256, the only fingerprint algorithm currently allowed for Federal data security.) As new data comes in, the system must rapidly fingerprint the data and determine whether it has been seen before. Doing this quickly and efficiently is the core of any deduplication engine, and is an extremely hard problem to solve for hundreds of terabytes or more of data.
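As a rough illustration of the idea, here is a minimal in-memory fingerprint index. It is a sketch, not how any shipping product is built: the `FingerprintIndex` class, the fixed 4 KB chunk size, and the list used as a chunk store are all assumptions made for the example. Only the use of SHA-256 comes from the text above.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


class FingerprintIndex:
    """Toy dedupe index: maps a SHA-256 fingerprint to a stored chunk."""

    def __init__(self):
        self._seen = {}   # fingerprint -> position in the chunk store
        self._store = []  # stand-in for disk: unique chunks only

    def write(self, chunk):
        """Store a chunk; return (location, was_duplicate)."""
        fingerprint = hashlib.sha256(chunk).digest()
        location = self._seen.get(fingerprint)
        if location is not None:
            return location, True  # seen before: just reference it
        self._store.append(chunk)
        location = len(self._store) - 1
        self._seen[fingerprint] = location
        return location, False

    def unique_chunks(self):
        return len(self._store)


def split_chunks(data, size=CHUNK_SIZE):
    """Cut a byte string into fixed-size chunks (the last may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


index = FingerprintIndex()
data = b"a" * CHUNK_SIZE + b"b" * CHUNK_SIZE + b"a" * CHUNK_SIZE
results = [index.write(chunk) for chunk in split_chunks(data)]
```

A Python dictionary makes the lookup look trivial. The hard part described above is doing the same lookup when the index covers hundreds of terabytes and no longer fits in memory.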

How does backup dedupe cheat? Dedupe systems for backup, like those from NEC, IBM, Sepaton and others, depend on something called temporal locality — a technical way of saying that because these are backup images, data that was written together before is likely to be written together again. A full backup of a system is going to look a whole lot like the last full backup of that same system, or a full backup of a very similar system.

Back to the deduplication index: because these backup systems know that they’re going to see the same data over and over, they don’t try to keep an index of all the data stored. Instead they break it up into smaller indexes based on time, each covering only a handful of terabytes of data, or maybe a dozen backups. When they see a new backup stream start, they check whether it matches any of those small indexes. If it does, they read in that index and use only it for deduplication. This catches a lot of duplicate data, but misses anything that was already seen in a different period. In practice, it works well enough for backup, but it only works well enough for backup.
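The trade-off is easier to see in a toy model. The sketch below is an assumption-laden simplification of the approach just described, not any vendor's actual design: `SegmentedIndex`, the two-chunk probe, and the byte-string "chunks" are all invented for the example.

```python
import hashlib


def fp(chunk):
    return hashlib.sha256(chunk).digest()


class SegmentedIndex:
    """Toy model: many small indexes, only one consulted per stream."""

    def __init__(self):
        self.segments = []  # each segment is a set of fingerprints

    def pick_segment(self, first_chunks):
        """Choose the segment that best matches the start of the stream."""
        probe = {fp(c) for c in first_chunks}
        best, best_hits = None, 0
        for segment in self.segments:
            hits = len(probe & segment)
            if hits > best_hits:
                best, best_hits = segment, hits
        return best

    def ingest(self, stream, probe_len=2):
        """Dedupe a stream against one segment; return duplicates found."""
        segment = self.pick_segment(stream[:probe_len])
        if segment is None:  # nothing looks similar: start a new segment
            segment = set()
            self.segments.append(segment)
        duplicates = 0
        for chunk in stream:
            fingerprint = fp(chunk)
            if fingerprint in segment:
                duplicates += 1
            else:
                segment.add(fingerprint)
        return duplicates


index = SegmentedIndex()
first = index.ingest([b"a1", b"a2", b"a3"])       # new segment
other = index.ingest([b"b1", b"b2", b"shared"])   # another new segment
# This stream starts like the first backup, so only the first segment is
# loaded. "shared" lives in the second segment and goes undetected.
repeat = index.ingest([b"a1", b"a2", b"shared"])
print(first, other, repeat)
```

The third stream contains three chunks the system has stored before, but only two are caught. A single index covering everything would have caught all three.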

Dedupe for Archive and Primary

For archive and primary data storage, on the other hand, the storage system needs to search hard to find duplicate data, looking not just for whole files that are the same but also for smaller parts of files that are repeated across different versions of the same file. This is a much harder problem because, to get any real efficiency, the system must compare incoming data in detail against all other data in the system, hundreds of terabytes or more. Without inspecting all data in the system, significant deduplication is unlikely.
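Here is a small sketch of why sub-file matching matters for versioned files. The 8-byte chunk size and the two sample "file versions" are made up to keep the example readable, and the fixed-size chunking shown is only one possible way to cut up a file.

```python
import hashlib

CHUNK = 8  # tiny assumed chunk size so the example stays readable


def fingerprints(data, size=CHUNK):
    """SHA-256 fingerprint of each fixed-size chunk of a file."""
    return [hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)]


v1 = b"AAAAAAAABBBBBBBBCCCCCCCC"  # original file
v2 = b"AAAAAAAABBBBBBBBDDDDDDDD"  # new version: only the last part changed

# Whole-file dedupe: the two versions hash differently, so nothing is saved.
whole_file_match = hashlib.sha256(v1).digest() == hashlib.sha256(v2).digest()

# Sub-file dedupe against an index of everything already stored.
global_index = set(fingerprints(v1))
shared_chunks = sum(1 for f in fingerprints(v2) if f in global_index)
total_chunks = len(fingerprints(v2))

print(whole_file_match, shared_chunks, total_chunks)
```

Whole-file comparison finds no match at all, while chunk-level comparison finds that two of the three chunks in the new version are already stored. Note that fixed-size chunks line up here only because the edit did not shift any bytes. An insertion near the start of the file would move every later chunk boundary, which is one reason sub-file dedupe is harder than it looks.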

As you might imagine, maintaining a deduplication index at this scale is an extremely challenging prospect. Most vendors that claim to serve archive and primary storage take the easy way out: they just don’t scale. Data Domain, for example, tops out at around 30 terabytes of disk, and NetApp won’t let you dedupe outside of a single 16 TB WAFL volume. You’re not likely to get much deduplication there! Others, like NEC’s HydraStor, will apply the backup-specific approach I described above and happily charge you primary storage costs for archive storage with minimal if any deduplication. Bolt-on solutions like Ocarina are really only just recompressing your JPEG images more efficiently, and require custom reader software to go back and decompress them later.

Only Permabit’s Scalable Data Reduction (SDR) delivers truly scalable data deduplication for non-backup data. Our advanced grid architecture allows us to protect data far better than RAID by use of our RAIN-EC data protection, and also allows us to distribute the deduplication indexing problem. This means that Permabit Enterprise Archive can efficiently index and dedupe across hundreds of terabytes of disk, all in real time. In just thousandths of a second we identify whether each incoming chunk has ever been seen before, something that no other vendor on the market today can claim within an order of magnitude.
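For readers wondering what "distributing the indexing problem" can look like in general, here is one common pattern: partition the index by fingerprint so that each node owns a slice of it. This is a generic sketch and should not be read as a description of SDR's internals. `ShardedIndex`, the node count, and the modulo placement rule are all assumptions for illustration.

```python
import hashlib


class ShardedIndex:
    """Toy global index split across nodes by fingerprint value."""

    def __init__(self, num_nodes):
        # Each set stands in for the slice of the index held by one node.
        self.nodes = [set() for _ in range(num_nodes)]

    def node_for(self, fingerprint):
        # SHA-256 output is effectively uniform, so taking the leading
        # bytes modulo the node count spreads entries evenly.
        return int.from_bytes(fingerprint[:8], "big") % len(self.nodes)

    def seen_before(self, chunk):
        """Check one node only, yet still cover every chunk ever stored."""
        fingerprint = hashlib.sha256(chunk).digest()
        shard = self.nodes[self.node_for(fingerprint)]
        if fingerprint in shard:
            return True
        shard.add(fingerprint)
        return False


grid = ShardedIndex(num_nodes=4)
first_time = grid.seen_before(b"some chunk of data")
second_time = grid.seen_before(b"some chunk of data")
entries = sum(len(node) for node in grid.nodes)
```

Because a given fingerprint always maps to the same node, every lookup touches a single node while the index as a whole still covers all data, in contrast to the time-sliced indexes used for backup.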

Don’t get stuck with a deduplication solution that doesn’t perform. Before making your next storage purchase decision, ask your vendor how broad the scope of their deduplication is. Do they dedupe sub-file? Do they dedupe against all previously written data? What are the limits on the amount of unique data that they can store? The answers may surprise you.
