The Better Dedupe Series – Source vs. Target
There are two places where deduplication technology can be deployed: at the source and at the target. Each has its benefits, so the real issue is identifying which is the better choice for your application. When implementing a deduplication technology, you need to decide whether data is processed at the source location (i.e., before it is transmitted over the wire) or at the target destination (i.e., after it arrives). There are pros and cons to both approaches.
Source Deduplication takes place at the source application/server. It can be done with software residing on the server or, more recently, on a PCI card with the dedupe software onboard. Sometimes the card also contains a processor that further increases performance. With this approach, the dedupe system scans new files, creates hashes (digital fingerprints) of them, and compares those to the hashes of files already stored. The file data may be stored locally or at a central location. In most dedupe implementations, when duplicate files are found they are replaced with pointers to the original files. This approach distributes the processing load across multiple source applications/servers or cards rather than concentrating it in the single process typical of a target-based approach. The benefit is that, with multiple sources, each one deduplicates a smaller amount of data, so the performance impact on any one server may not be significant. If the data load is large, however, overall server performance can drop and the applications that usually run on that server could experience degradation.
With the PCI card approach, the burden is removed from the server and placed on the card’s onboard processor. Unfortunately, that processor could be a slower single-core part (if manufacturing costs were minimized), while a multicore processor could raise overall costs significantly when many cards are deployed across multiple servers. Additionally, the onboard memory needed to store the hash keys is a cost that must be considered. If memory is limited, it may restrict the ability of the dedupe process to scale out and manage larger amounts of data.
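To make the hash-and-pointer idea concrete, here is a minimal Python sketch of file-level dedupe. It is an illustration only, not how any particular product works: the function name dedupe_files, the in-memory dictionaries, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib

def dedupe_files(files):
    """Hypothetical file-level dedupe over an in-memory set of files.

    files: dict mapping file name -> bytes content.
    Returns (store, pointers): store holds one copy of each unique
    content, pointers maps each duplicate to the original file name.
    """
    index = {}     # fingerprint -> name of the first file with that content
    store = {}     # unique files actually kept
    pointers = {}  # duplicate file name -> original file name

    for name, data in files.items():
        # The fingerprint is a hash of the whole file (SHA-256 is an assumption).
        fingerprint = hashlib.sha256(data).hexdigest()
        if fingerprint in index:
            # Duplicate content: keep only a pointer to the original.
            pointers[name] = index[fingerprint]
        else:
            # New content: remember its fingerprint and store the data.
            index[fingerprint] = name
            store[name] = data
    return store, pointers
```

Calling dedupe_files on three files where the third has the same bytes as the first would store two files and record one pointer. A real system would also have to persist the index and decide how to handle hash collisions, which this sketch ignores.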
Target Deduplication is the process of removing duplicate data at the target data store, which is usually a single data store with multiple sources feeding it. When target dedupe was first implemented, the processing demands were thought to be significant because of the volume of data and the number of hash computations that had to be completed, especially where the hash keys were more complex (larger), as they are with SHA-2 hashing algorithms. As a result, those who choose the target approach may decide to use a less resource-intensive hashing method (SHA-1 or MD5), as many have done in the backup space. However, the more effective but also more processor-intensive SHA-2 algorithms can be used very effectively if multicore processors are available to provide the needed horsepower without impacting storage processing. In addition, a target approach allows much more memory to be placed in a single data storage device. That enables more scalability, because the ability to scale is directly tied to the amount of hash key information that can be effectively managed in memory. There have also been dramatic advances in memory efficiency and utilization with hybrid dense and sparse indexing approaches (I’ll discuss those in a future post).
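The link between hash key size and memory can be shown with some rough arithmetic. The Python sketch below is a back-of-the-envelope estimate only: the function name index_memory_bytes and the 8-byte pointer per entry are assumptions, and a real index carries additional overhead that is not modeled here.

```python
import hashlib

def index_memory_bytes(num_entries, algorithm, pointer_bytes=8):
    """Rough size of a flat in-memory hash index (hypothetical model).

    Each entry is assumed to hold one digest plus one pointer to the
    stored data; table overhead and metadata are deliberately ignored.
    """
    # digest_size is the hash output length in bytes, e.g. 20 for SHA-1.
    digest_bytes = hashlib.new(algorithm).digest_size
    return num_entries * (digest_bytes + pointer_bytes)
```

Under these assumptions, one billion entries would need about 28 GB with SHA-1 (20-byte digests) and about 40 GB with SHA-256 (32-byte digests), which is why the amount of memory available at the target matters for scale.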
Better Dedupe = Target deduplication (with indexing and multicore processors)
In my next post I’ll be discussing file/subfile and content-aware dedupe.