Data deduplication, at least for backup data, has made it to the mainstream. However, it's important to remember that the term "data deduplication" covers a relatively wide range of technologies that all manage to store data once, even when they're told to store it many times. Since all of these technologies are sensitive to the data being stored, nowhere in IT is the phrase "your mileage may vary" more true than in dedupe. As 2010 winds down, I figured I'd share a few tips on how to get the most out of deduplication.
Do check that the dedupe solution you're looking at supports your backup application. While most deduplication systems will find some duplicate data in an arbitrary data stream, they'll get better results if they have some context about the data they're working with. Deduplication systems based on hashes break the data into chunks and eliminate duplicate chunks. While they'll all start a new chunk at the beginning of each new file they see, most backup applications wrap your data in aggregate files that resemble Unix tarballs or ZIP files, hiding the boundaries of the individual files inside.
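To make the chunk-and-hash idea concrete, here's a minimal sketch. The 4 KB fixed chunk size, SHA-256 fingerprints and in-memory index are my own illustrative choices, not any particular appliance's design:

```python
# Minimal hash-based dedupe sketch: split a stream into fixed-size chunks,
# fingerprint each chunk, and keep a chunk's bytes only the first time
# that fingerprint is seen. Chunk size and the dict index are illustrative.
import hashlib

CHUNK_SIZE = 4096  # illustrative; real systems use larger and/or variable chunks

def dedupe_store(stream: bytes, store: dict[str, bytes]) -> list[str]:
    """Return the chunk fingerprints that reconstruct `stream`,
    adding only previously unseen chunks to `store`."""
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:      # new data: keep the bytes
            store[digest] = chunk
        recipe.append(digest)        # duplicate or not, record the reference
    return recipe

store: dict[str, bytes] = {}
monday = b"A" * 8192 + b"B" * 8192
tuesday = b"A" * 8192 + b"B" * 8192   # identical backup the next night
dedupe_store(monday, store)
dedupe_store(tuesday, store)
print(len(store))                      # 2 unique chunks stored, not 8 logical ones
```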
If your dedupe system knows about the aggregate file format your backup app uses, it can start a new chunk for each source file inside the backup, which lets it identify more duplicate data. In addition to your data, aggregate files include index information that the backup application uses to speed restores. If you store backup data on a fixed-chunk dedupe system, like most of the primary storage systems that dedupe data, this index information can shift the data so the system won't recognize that today's backup contains the same data as yesterday's.
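Here's a toy illustration of that fragility. The header strings and 4 KB fixed chunks are invented for the example, not any backup app's real format: shifting the same unchanged data behind a slightly longer catalog header leaves no matching chunks at all.

```python
# Why fixed chunking is fragile: the same unchanged data, preceded by
# backup-catalog headers of slightly different lengths on two nights,
# lands on completely different chunk boundaries.
import hashlib
import os

CHUNK_SIZE = 4096

def chunk_hashes(stream: bytes) -> set[str]:
    return {
        hashlib.sha256(stream[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(stream), CHUNK_SIZE)
    }

file_data = os.urandom(1 << 20)               # 1 MB of unchanged user data
monday = b"HDR-0001\n" + file_data            # catalog header written Monday
tuesday = b"HDR-000002\n" + file_data         # a slightly longer header on Tuesday

print(len(chunk_hashes(monday) & chunk_hashes(tuesday)))  # 0: a 2-byte shift breaks every chunk
```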
Do keep similar data sources in the same dedupe pool. If your dedupe system isn't capable of storing all your data in a single dedupe pool, split your data so that servers that hold similar data are in the same pool. Putting file servers in one pool and Oracle servers in another, for example, will give you a better dedupe ratio than storing all the data from the New York office in one pool and all the data from Chicago in another.
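If you want to see why pooling by data type wins, here's a back-of-the-envelope calculation with made-up chunk fingerprints (the numbers are invented, purely illustrative): duplicates between New York's and Chicago's file servers only save space when those servers land in the same pool.

```python
# Toy dedupe-ratio arithmetic with invented chunk fingerprints. A duplicate
# only saves space if both copies land in the same pool, so the pooling
# decision changes the ratio even though the data never changes.
def pool_ratio(sources: list[set[str]]) -> float:
    """Dedupe ratio (logical chunks / unique chunks) of one pool holding `sources`."""
    logical = sum(len(s) for s in sources)
    unique = len(set().union(*sources))
    return logical / unique

ny_files   = {f"doc{i}" for i in range(80)} | {"ny-only-1", "ny-only-2"}
chi_files  = {f"doc{i}" for i in range(80)} | {"chi-only-1"}   # mostly the same documents
ny_oracle  = {f"db{i}" for i in range(50)}
chi_oracle = {f"db{i}" for i in range(50)} | {"chi-redo"}      # mostly the same blocks

# Pool by site: nothing in either site's pool matches anything else in that pool.
print(pool_ratio([ny_files, ny_oracle]), pool_ratio([chi_files, chi_oracle]))   # 1.0 1.0
# Pool by data type: the cross-site duplicates finally cancel out.
print(pool_ratio([ny_files, chi_files]), pool_ratio([ny_oracle, chi_oracle]))   # ~1.96 ~1.98
```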
Don't encrypt your data before sending it to the deduping appliance. Encryption algorithms are designed so that similar plaintext inputs produce completely different ciphertext, which prevents your deduping appliance from recognizing duplicate data. Compression has much the same effect, so leave compression to the back-end deduplication device rather than the backup software.
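Here's a small sketch of the encryption problem, using the third-party Python "cryptography" package (the package choice, AES-CTR mode and key/IV handling are my own illustrative assumptions, not any backup product's behavior): encrypting the identical backup stream on two nights with a fresh IV each time leaves the appliance nothing to match, even though the underlying data never changed.

```python
# Encrypting before dedupe: the same 1 MB backup, encrypted with a fresh IV
# each night, shares zero chunk fingerprints between the two nights.
import hashlib
import os

from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

CHUNK_SIZE = 4096
KEY = os.urandom(32)

def chunk_hashes(stream: bytes) -> set[str]:
    return {
        hashlib.sha256(stream[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(stream), CHUNK_SIZE)
    }

def encrypt(data: bytes) -> bytes:
    iv = os.urandom(16)   # fresh IV per backup job
    encryptor = Cipher(algorithms.AES(KEY), modes.CTR(iv), backend=default_backend()).encryptor()
    return iv + encryptor.update(data) + encryptor.finalize()

backup = os.urandom(1 << 20)                   # the very same 1 MB both nights
monday, tuesday = encrypt(backup), encrypt(backup)
print(len(chunk_hashes(monday) & chunk_hashes(tuesday)))   # 0: identical data, zero duplicate chunks
```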