I've said many times that nowhere in the field of information technology are the words "your mileage may vary" truer than when discussing data deduplication. How much your data shrinks when run through a given vendor's deduplication engine can vary significantly depending on the data you're trying to dedupe and how well that particular engine handles that kind of data. One critical factor is how the deduping engine breaks your data down into chunks.
Most data deduplication engines work by breaking the data into chunks and using a hash function to identify which chunks contain the same data. Once the system has identified duplicate data, it stores one copy and uses pointers in its internal file (or chunk) management system to keep track of where each chunk belonged in the original data set.
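To make that concrete, here's a minimal sketch in Python of the hash-and-pointer part of the process. It assumes SHA-256 as the hash function and an in-memory dictionary standing in for the appliance's chunk store; real engines use on-disk indexes, handle hash collisions, and usually compress the chunks as well.

```python
import hashlib

chunk_store = {}   # hash -> unique chunk data (one stored copy per chunk)

def dedupe(chunks):
    """Store each unique chunk once; return the list of hashes ("pointers")
    needed to reconstruct the original stream."""
    recipe = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:      # first time we've seen this data
            chunk_store[digest] = chunk
        recipe.append(digest)              # duplicate or not, record the pointer
    return recipe

def rehydrate(recipe):
    """Rebuild the original data from its list of chunk pointers."""
    return b"".join(chunk_store[digest] for digest in recipe)
```

The chunks themselves can come from either of the chunking strategies described next.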
While most deduplication systems use this basic technique, the details of how they decide what data goes into what chunk vary significantly. Some systems just take your data and break it into fixed-size chunks. The system may, for example, decide that a chunk is 8KBytes or 64KBytes and then break your data into 8KByte chunks, regardless of the content of the data.
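A fixed-size chunker is the simpler of the two; a sketch of it might look like this, with the 8KByte size chosen arbitrarily for illustration.

```python
def fixed_chunks(data: bytes, size: int = 8 * 1024):
    """Yield fixed-size chunks (8KBytes here), ignoring the data's content."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]
```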
Other systems analyze the data mathematically, using the spots where their secret chunk-making function generates particular values as the boundaries between data chunks. On these systems, chunk sizes vary with the magic formula but within some limits, so chunks may run from 8KBytes to 64KBytes, depending on the data.
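Each vendor's magic formula is its own; the sketch below uses a gear-style rolling hash purely as a stand-in, not any particular vendor's method. It cuts a chunk wherever the low bits of the rolling hash hit a chosen pattern, with a floor of 8KBytes and a ceiling of 64KBytes.

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed random byte table

MIN_CHUNK = 8 * 1024      # 8KBytes
MAX_CHUNK = 64 * 1024     # 64KBytes
BOUNDARY_MASK = 0x1FFF    # boundary fires roughly once per 8KBytes of data

def variable_chunks(data: bytes):
    """Yield content-defined chunks: cut wherever the rolling hash matches the
    boundary pattern, but never below MIN_CHUNK or above MAX_CHUNK."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & BOUNDARY_MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]                # whatever is left at the end
```

Either chunker can feed the dedupe() sketch above; the advantage of the variable-size approach is that chunk boundaries follow the content, so they tend to line up again after data shifts by a few bytes.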
If we implement these two techniques on backup appliances and back up a set of servers with a conventional backup application like NetBackup or Arcserve, the backup app will walk the file system and concatenate the data into a tape-format file on the backup appliance.