The Better Dedupe Series – Implementation: inline or post process
Deduplication can be implemented in a number of different ways depending on the needs of the technology provider. Each approach has trade-offs that can affect cost and performance. Let’s explore them:
Post Process – With post-process deduplication, new data is first stored on a storage device and then analyzed for duplication at a later time. The benefit is that there is no need to wait for the hash calculations and index lookup to complete before storing the data, so incoming data is not delayed and the deduplication work is not visible to the end user. The trade-off, however, is that a significant amount of disk capacity is consumed to house the data that has not yet been processed (caching). This leaves less capacity for actual long-term storage, diminishing storage efficiency and raising the overall cost. What happens when you start running out of the disk space you have allocated for the cache? How can you predict when the process will be completed? The term “dedupe window” was introduced to describe the time it takes for a post-process effort to complete. When this technique is applied to backup, users have to make sure the dedupe process completes before the next backup cycle begins, or it will fall further behind with every cycle and never catch up. Post-process is typically implemented for deduplication technologies that are not fast enough to run in real time without hurting performance.
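To make the capacity trade-off concrete, here is a minimal sketch of a post-process store in Python. It is not any vendor's implementation: the class name `PostProcessStore`, the tiny fixed 4-byte chunk size, and the use of SHA-256 as the chunk hash are all assumptions chosen to keep the example small.

```python
import hashlib

CHUNK_SIZE = 4  # assumed, unrealistically small so the example is easy to follow


class PostProcessStore:
    """Toy post-process deduplicating store (illustrative only)."""

    def __init__(self):
        self.staging = []    # raw chunks landed at full speed, not yet deduplicated
        self.chunks = {}     # SHA-256 digest -> unique chunk (long-term storage)
        self.recipe = []     # ordered digests needed to rebuild the data
        self.peak_bytes = 0  # highest capacity consumed so far

    def used_bytes(self):
        staged = sum(len(c) for c in self.staging)
        stored = sum(len(c) for c in self.chunks.values())
        return staged + stored

    def write(self, data: bytes):
        # Ingest path: no hashing and no index lookup, so writes are not delayed.
        for i in range(0, len(data), CHUNK_SIZE):
            self.staging.append(data[i:i + CHUNK_SIZE])
        self.peak_bytes = max(self.peak_bytes, self.used_bytes())

    def run_dedupe_pass(self):
        # Runs later; how long this takes is the "dedupe window".
        for chunk in self.staging:
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep only the first copy
            self.recipe.append(digest)
        self.staging.clear()

    def read(self) -> bytes:
        # Deduplicated data first, then anything still waiting in staging.
        deduped = b"".join(self.chunks[d] for d in self.recipe)
        return deduped + b"".join(self.staging)


store = PostProcessStore()
store.write(b"ABCDABCDABCDWXYZ")  # three copies of ABCD, one WXYZ
print(store.peak_bytes)           # capacity needed before the pass runs
store.run_dedupe_pass()
print(store.used_bytes())         # capacity needed after the pass
```

In this toy run the store has to hold all 16 incoming bytes until the pass runs, even though only 8 bytes of unique data remain afterwards. That gap is the cache capacity the post-process approach has to budget for.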
Inline – With inline deduplication, the hash calculations are performed on the target device in real time as the data enters the device. The benefit of inline deduplication over post-process deduplication is that it requires less storage: duplicate data is never written in the first place, so there is no need to set aside storage for data caching.
On the negative side, it is frequently argued that because hash calculations and index lookups take time, inline deduplication slows data ingestion and reduces the throughput of the device. Since many storage vendors spend millions of dollars optimizing the performance of their storage, anything that impedes it is not looked upon favorably.
With the multicore processors available in storage devices today and the emergence of intelligent hashing, this is much less of a concern. There is now enough compute power to perform the hash calculations and the index lookup for duplicates as data arrives and, combined with the optimization intelligent hashing brings, data can be processed much more efficiently and the need for post-process can be eliminated. The inline deduplication approach has now become the preferred method.
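For comparison, the inline approach can be sketched the same way. Again, this is only an illustration under assumptions: `InlineStore`, the 4-byte chunk size, and SHA-256 are hypothetical choices, and a real device would use far larger chunks and a persistent index.

```python
import hashlib

CHUNK_SIZE = 4  # assumed, unrealistically small so the example is easy to follow


class InlineStore:
    """Toy inline deduplicating store (illustrative only)."""

    def __init__(self):
        self.chunks = {}         # SHA-256 digest -> unique chunk
        self.recipe = []         # ordered digests needed to rebuild the data
        self.bytes_received = 0  # what the client sent
        self.peak_bytes = 0      # highest capacity consumed so far

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())

    def write(self, data: bytes):
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            self.bytes_received += len(chunk)
            # Hash and index lookup happen on the write path, before storing.
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk  # only new data reaches storage
            self.recipe.append(digest)
            self.peak_bytes = max(self.peak_bytes, self.stored_bytes())

    def read(self) -> bytes:
        return b"".join(self.chunks[d] for d in self.recipe)


store = InlineStore()
store.write(b"ABCDABCDABCDWXYZ")  # three copies of ABCD, one WXYZ
print(store.bytes_received, store.peak_bytes)
```

Here the hashing cost is paid on every write, which is the throughput concern described above, but the store never holds more than the 8 bytes of unique data, so no staging capacity is needed.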
Better Dedupe = inline dedupe.
In my next post I’ll look at Source vs. Target deduplication….