A recent blog post by CommVault's Dipesh Patel has reopened the argument over the value of higher deduplication ratios. Like others who claim that dedupe ratios don't matter, he points out that a dedupe ratio of 10:1 reduces the size of your data by 90 percent, and argues that this is the zone of diminishing returns, since a 20:1 dedupe ratio will only get you another 5 percent reduction. To me that argument sounds an awful lot like Lucy Ricardo explaining how she saved a lot of money on the new living room furniture because it was on sale.
I tend to think more like Ricky and worry not about how much was saved but about how much the new living room suite, or backup solution, cost. A system that gets 20:1 is twice as good at data reduction as one that gets 10:1, not 5 percent better. It's the absolute space needed, not the incremental savings, that we shell out our hard-earned budget dollars for.
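To put numbers on the two ways of reading a dedupe ratio, here is a minimal Python sketch. The function names are my own, made up for illustration; the only inputs are the 10:1 and 20:1 ratios discussed above.

```python
def percent_reduction(ratio: float) -> float:
    """Percentage of the original data eliminated at a given dedupe ratio."""
    return 100.0 * (1.0 - 1.0 / ratio)


def stored_fraction(ratio: float) -> float:
    """Fraction of the original data that still has to be stored."""
    return 1.0 / ratio


if __name__ == "__main__":
    for ratio in (10, 20):
        print(
            f"{ratio}:1 -> {percent_reduction(ratio):.0f}% reduction, "
            f"{stored_fraction(ratio) * 100:.0f}% of the data left to store"
        )
```

The reduction figures differ by only five points, but what you actually buy disk for is the stored fraction, and that is cut in half.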
If I have 500TB of backup data I want to keep online, I'll need 50TB of usable space in a system that reduces data 10:1 but just 25TB in one that gets 20:1. Assuming 1TB drives and 10+2 RAID-6, that would be 60 1TB drives for the system that gets 10:1 but just 30 for the 20:1 system. Even if the initial purchase price of the two solutions was the same, I'd still need to pay more to rack, maintain, power and cool twice as many disk drives.
Things get really interesting when we start replicating the data. Dedupe twice as effectively and you've cut the amount of data to replicate in half. That could easily be the difference between meeting an SLA to replicate backups off-site in 24 hours with one T-3 line or having to pony up for a pair of T-3s.
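The drive and bandwidth arithmetic can be sketched the same way. The helper names below are hypothetical, and a few assumptions are mine rather than from any vendor: drive counts are scaled by the raw 12/10 RAID-6 overhead (rounding up to whole 10+2 groups would push the 20:1 system to 36 drives), a T-3 is taken at its nominal 44.736 Mbit/s, and protocol overhead is ignored.

```python
import math

T3_MBPS = 44.736  # nominal T-3 (DS3) line rate in megabits per second


def usable_tb(backup_tb: float, ratio: float) -> float:
    """Usable capacity needed after deduplication."""
    return backup_tb / ratio


def drives_needed(usable: float, drive_tb: float = 1.0,
                  data: int = 10, parity: int = 2,
                  whole_groups: bool = False) -> int:
    """Drive count for a data+parity RAID layout (10+2 RAID-6 by default)."""
    data_drives = usable / drive_tb
    if whole_groups:
        # Assumption: capacity is only sold in complete RAID groups.
        return math.ceil(data_drives / data) * (data + parity)
    # Simple scaling by the parity overhead, as in the text above.
    return math.ceil(data_drives * (data + parity) / data)


def tb_per_day(lines: int = 1, mbps: float = T3_MBPS) -> float:
    """Decimal terabytes a number of lines can move in 24 hours at line rate."""
    bytes_per_day = lines * mbps * 1e6 / 8 * 86_400
    return bytes_per_day / 1e12


if __name__ == "__main__":
    for ratio in (10, 20):
        need = usable_tb(500, ratio)
        print(f"{ratio}:1 -> {need:.0f}TB usable, {drives_needed(need)} drives")
    print(f"one T-3:  {tb_per_day(1):.3f} TB/day")
    print(f"two T-3s: {tb_per_day(2):.3f} TB/day")
```

At line rate a single T-3 moves a little under half a terabyte a day, so whether a night's deduplicated backup fits in one line or needs two depends directly on how much data survives dedupe.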
Even if they can't accept my logic, which I admit may not be as solid as Spock's, vendor spokespeople like Mr. Patel should avoid arguing that metrics like dedupe ratios don't matter, because some readers will assume they're trying to downplay a weakness in their product. There's little data available on how well various products dedupe. Add in that deduplication rates can vary greatly as the amount of duplicate data varies from one data set to another, and one can understand why users are skeptical about dedupe ratio claims and counterclaims.
In addition to Mr. Patel, I'd like to thank Curtis Preston (Blog Entry 1, Blog Entry 2) and Sepaton's Jay Livens (Blog Entry) for adding to the discussion in their blog posts.