In our next series of entries, we will begin to look at companies that do source-side deduplication. Actually, we already looked at one: Atempo. Source-side dedupe means that redundant data is eliminated before it travels across the network to the backup server. On the whiteboard, this seems like the most logical place to eliminate redundant data, but it is not without its challenges, and we will try to address those as we go.
The advantage of source-side dedupe is that after the initial backup is complete, only unique data is sent across the wire. This can be done either via a traditional deduplication process or via a block-level incremental backup. With traditional deduplication, a process compares changed segments of information to what has already been sent to the backup target, and that comparison typically spans all the data, from multiple sources, that has been sent to that target. For example, if server A and server B have the same file, when it becomes server B's turn to send that file, it does not need to, since server A has already sent it. Think of source-side deduplication as an enterprise-wide comparison that eliminates redundancy before data transmission.
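To make the server A / server B example concrete, here is a minimal sketch of the comparison step, assuming fixed-size segments and a simple in-memory set standing in for the target's fingerprint index (real products vary in chunking method and index design):

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size segments for simplicity; shipping products often chunk differently

def chunk_hashes(data: bytes):
    """Split the data into fixed-size segments and fingerprint each one."""
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        yield hashlib.sha256(chunk).hexdigest(), chunk

def backup(data: bytes, target_index: set) -> int:
    """Send only segments the target has never seen; return bytes put on the wire."""
    sent = 0
    for digest, chunk in chunk_hashes(data):
        if digest not in target_index:   # the comparison happens on the source side
            target_index.add(digest)     # "transmit" the unique segment
            sent += len(chunk)
    return sent

target = set()                           # fingerprints of everything already at the target
shared_file = b"a" * 4096 + b"b" * 4096

print(backup(shared_file, target))       # server A's turn: the whole file crosses the wire -> 8192
print(backup(shared_file, target))       # server B's turn: same file, nothing to send -> 0
```

Because the index covers everything any source has sent, server B's copy of the file deduplicates against server A's, which is what makes the comparison enterprise-wide.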
Block-level incremental (BLI) backups, after the initial backup, also send only changed segments of information. These segments, however, are typically tied to the boundaries of the blocks laid out by the file system. BLI backups tend to keep an exact mirror of the systems they protect at the backup target; they are a volume-to-volume matching technique more than a deduplication technique. Most leverage some form of snapshot to provide point-in-time rollbacks. For obvious marketing reasons, companies that offer a BLI solution want to be lumped into the deduplication category. BLI backups do eliminate the need for redundant backups, and they are smaller than classic incrementals because they send and store only changed blocks instead of entire files. Finally, some of these companies also run a post-process deduplication pass on the data to eliminate cross-server redundancy.
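The volume-to-volume distinction can be sketched the same way. In this illustration (again assuming fixed-size blocks and an in-memory stand-in for the target mirror), blocks are matched by offset against the mirror, so unlike deduplication, two identical blocks at different offsets are both sent:

```python
import hashlib

BLOCK_SIZE = 4096  # segment boundaries tied to the file system's block layout

def block_incremental(volume: bytes, mirror: dict) -> int:
    """Compare each block to the mirror kept at the target; copy only changed
    blocks, so the target stays an exact block-for-block copy of the source."""
    sent = 0
    for i in range(0, len(volume), BLOCK_SIZE):
        block = volume[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if mirror.get(i) != digest:      # volume-to-volume match, keyed by block offset
            mirror[i] = digest           # "send" just this block to the target
            sent += len(block)
    return sent

mirror = {}                              # block offset -> fingerprint held at the target
vol = bytearray(b"a" * 8192)

print(block_incremental(bytes(vol), mirror))  # initial backup: every block goes -> 8192
vol[0:1] = b"z"                               # one byte changes on the volume...
print(block_incremental(bytes(vol), mirror))  # ...only its 4 KB block is resent -> 4096
```

Note that the initial run sends both blocks even though their contents are identical; that is the sense in which BLI is mirroring rather than deduplication.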
A concern with source-side deduplication is the impact the deduplication comparison step has on the client. All the vendors we have spoken to in preparation, and I am sure the ones we are getting ready to speak with, claim there is "little to no client impact." You need to test this yourself. What we can say is that the problem is not as severe as it was several years ago: the client-side software has matured, and the processing resources available to the client are significantly greater than they were.
Lab tests and checks with users typically put the impact of the dedupe comparison at about 5 to 10 percent. The BLI technique, because it uses a fixed data segment and only a volume-to-volume comparison, does not require as much CPU. Also, many file systems will provide the requesting software a changed-block list via an API. BLIs do not, however, deliver enterprise-wide data reduction unless they add a separate, post-process deduplication pass.
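Where the file system does supply a changed-block list, the client can skip per-block hashing and comparison entirely, which is part of why the CPU cost drops. A rough sketch, with `changed_offsets` standing in for a hypothetical list returned by such an API (the real interfaces are platform-specific):

```python
BLOCK_SIZE = 4096

def incremental_via_changed_list(volume: bytes, changed_offsets) -> int:
    """Given a changed-block list from the file system, read and send only
    those blocks -- no hashing or comparison work on the client at all."""
    sent = 0
    for offset in changed_offsets:
        sent += len(volume[offset:offset + BLOCK_SIZE])  # "send" the flagged block
    return sent

vol = b"a" * 16384                                  # four 4 KB blocks
print(incremental_via_changed_list(vol, [4096]))    # file system reports one changed block -> 4096
```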