While the financial press speculates about how the EU's antitrust concerns may put the kibosh on the OraSun (or is it Sunacle?) merger, Sun blogger and ZFS creator Jeff Bonwick announced this week that ZFS now includes inline deduplication. We've been waiting since July for Sun to get its deduplication working, and I'm intrigued by both the details of how ZFS dedupe works and the ramifications of including deduplication in reasonably priced, server-based storage solutions.
When I first heard that Sun was going to add dedupe to ZFS, I expected something resembling NetApp dedupe (formerly known as A-SIS): a post-process system with relatively low data reduction that would be interesting to Sun users. I've mentioned before that the enterprise NAS guys have been very conservative about adding data reduction technologies, so that their customers never have a reason to think a new feature might slow their NAS box down in any way.
Sun, on the other hand, has recognized that server CPU cycles are growing much faster than disk I/O bandwidth and has decided to spend the available CPU cycles on managing storage. This lets it design one server that can be either a compute node or a storage node in the data center.
Like NetApp dedupe, ZFS identifies duplicate blocks using the per-block checksums it already calculates, to ensure data integrity, as each block is written to disk. Admins can turn dedupe on for a storage pool with a single command. They can also choose not to trust the highly collision-resistant SHA-256 hash algorithm and turn on byte-by-byte verification. Clever users could even use the less compute-intensive fletcher4 checksum to identify "similar" blocks and rely on verification to ensure they don't deduplicate data that isn't really duplicated in the first place.
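To make the checksum-plus-verify idea concrete, here is a toy sketch in Python. It is not how ZFS is implemented; the DedupStore class, its method names and the weak_sum function are all made up for illustration, and weak_sum is a deliberately poor stand-in for a cheap checksum rather than a real fletcher4. It only shows the general technique: look up each incoming block by its checksum, and optionally compare bytes before treating a match as a true duplicate.

```python
import hashlib


def sha256_sum(block: bytes) -> bytes:
    """Strong checksum: a collision is vanishingly unlikely."""
    return hashlib.sha256(block).digest()


def weak_sum(block: bytes) -> bytes:
    """Deliberately weak stand-in for a cheap checksum (NOT real fletcher4)."""
    return bytes([sum(block) % 256])


class DedupStore:
    """Hypothetical in-memory block store that deduplicates inline, on write."""

    def __init__(self, checksum=sha256_sum, verify=False):
        self.checksum = checksum
        self.verify = verify
        self.table = {}          # checksum -> list of [block_bytes, refcount]
        self.logical_blocks = 0  # blocks written by callers

    def write(self, block: bytes):
        """Store one block and return a (checksum, index) reference to it."""
        self.logical_blocks += 1
        key = self.checksum(block)
        candidates = self.table.setdefault(key, [])
        for index, entry in enumerate(candidates):
            # Without verify, a checksum match is trusted outright.
            # With verify, the bytes must match too.
            if not self.verify or entry[0] == block:
                entry[1] += 1
                return key, index
        # No (verified) match: keep a new physical copy.
        candidates.append([block, 1])
        return key, len(candidates) - 1

    def read(self, ref) -> bytes:
        key, index = ref
        return self.table[key][index][0]

    def physical_blocks(self) -> int:
        """Number of distinct blocks actually stored."""
        return sum(len(entries) for entries in self.table.values())


store = DedupStore()
for block in (b"A" * 4096, b"A" * 4096, b"B" * 4096):
    store.write(block)
print(store.logical_blocks, store.physical_blocks())  # prints "3 2"
```

With weak_sum and verify=False, the blocks b"ab" and b"ba" share a checksum and the second write is silently folded into the first, which is exactly the failure verification exists to prevent. With verify=True the same two writes are kept as separate blocks, at the cost of a byte comparison on every checksum match.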
Add in the compression that ZFS has included for years, and a server running NexentaStor (or a Sun appliance) could really be a general-purpose storage system with good data reduction for NFS, iSCSI or even FC-attached systems.