PICK AND CHOOSE
While we talk about data deduplication as if it were a single technique, there are several ways to slice and dice data. The simplest is single-instance storage, which uses symbolic links so one file can appear to be in multiple places with different names. Microsoft's Windows Storage Server has a background task that finds and eliminates duplicate files this way. Single-instance storage is a good start, but to get really impressive data reduction, you have to work at a finer level.
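In pseudocode terms, single-instance storage boils down to hashing whole files and pointing duplicate names at one stored copy. The Python sketch below is purely illustrative--it assumes a plain directory tree and ordinary symbolic links, where a product like Windows Storage Server relies on its own background process and link mechanism.

    import hashlib, os

    def single_instance(root):
        seen = {}                        # whole-file hash -> path of the copy we keep
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue             # already just a link, nothing to save
                with open(path, "rb") as f:
                    digest = hashlib.sha1(f.read()).hexdigest()
                if digest in seen:
                    os.remove(path)                  # drop the duplicate copy...
                    os.symlink(seen[digest], path)   # ...and leave a link in its place
                else:
                    seen[digest] = path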
Hash-based systems such as NEC's Hydrastor and Quantum's DXi series divide data into blocks and use an algorithm like MD5 or SHA-1 to create a fingerprint for each block. When a second block generates the same fingerprint, the system creates a pointer to the first block rather than saving the second.
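In outline, the approach looks something like the sketch below, which uses fixed-size blocks and SHA-1 purely for simplicity; the block size, data structures, and names are ours, not any vendor's.

    import hashlib

    BLOCK_SIZE = 4096
    store = {}          # fingerprint -> the single stored copy of that block

    def dedupe_write(data):
        """Store data, returning the fingerprints ("pointers") that represent it."""
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fingerprint = hashlib.sha1(block).hexdigest()
            if fingerprint not in store:
                store[fingerprint] = block    # first time we've seen this block
            pointers.append(fingerprint)      # duplicates cost only a pointer
        return pointers

    def dedupe_read(pointers):
        return b"".join(store[fp] for fp in pointers)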
While hash-based deduplication sounds simple, the devil, as they say, is in the details. The first gotcha is figuring out where the block boundaries should go. Fixed-length blocks are easier to implement, but a small amount of data inserted into an existing file shifts every block boundary after the insertion point, so data that hasn't actually changed no longer matches. Variable-length block schemes are more complex but usually result in greater deduplication.
The other snag is the possibility of hash collisions. While the likelihood of a collision is on the order of 1 in 10^20 per petabyte of data--a few billion times less likely than being hit by lightning--vendors have recognized that customers worry about even this remote possibility, so they've added a byte-by-byte comparison or a second hash calculated with a different algorithm as a double check.
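Variable-block schemes typically get around the boundary problem with content-defined chunking: cut a block wherever a rolling checksum of the most recent bytes hits a chosen pattern, so an insertion disturbs only the chunks around the edit instead of shifting every boundary after it. The sketch below uses a deliberately crude checksum and made-up thresholds; real systems rely on stronger rolling hashes such as Rabin fingerprints and enforce minimum and maximum chunk sizes.

    MIN_CHUNK = 2048      # don't cut chunks smaller than this
    MASK = 0x1FFF         # boundary when the low 13 bits are zero (~8 KB average)

    def chunk(data):
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            # crude checksum that depends only on roughly the last 32 bytes
            rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
            if (rolling & MASK) == 0 and i - start >= MIN_CHUNK:
                chunks.append(data[start:i + 1])   # cut at a content-defined boundary
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])            # whatever is left at the end
        return chunks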
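Conceptually, the double check is just a second test that runs only after the primary fingerprints already match, along the lines of the sketch below; real products generally pick one safeguard or the other rather than both.

    import hashlib

    def confirm_duplicate(new_block, stored_block):
        """Called only after the primary SHA-1 fingerprints have matched."""
        # Safeguard 1: a second hash computed with a different algorithm.
        if hashlib.md5(new_block).digest() != hashlib.md5(stored_block).digest():
            return False
        # Safeguard 2: a direct byte-by-byte comparison.
        return new_block == stored_block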
Hashing isn't the only way to deduplicate data. Content-aware systems like ExaGrid's and Sepaton's DeltaStor understand the format used by the backup applications that write to them. They compare the version of each file backed up to the version from the last backup, then store only the changes.
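Stripped of the format awareness, the core of that approach is a delta: line the new version up against the old one and keep only the regions that differ. The toy Python version below leans on the standard library's SequenceMatcher and generic byte strings, whereas the real products parse the backup application's own tape or file formats.

    from difflib import SequenceMatcher

    def delta(previous, current):
        """Build instructions that recreate `current` from the stored `previous`."""
        ops = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, previous, current).get_opcodes():
            if tag == "equal":
                ops.append(("copy", i1, i2))            # reuse bytes already stored
            else:
                ops.append(("insert", current[j1:j2]))  # keep only the new bytes
        return ops

    def rebuild(previous, ops):
        out = b""
        for op in ops:
            out += previous[op[1]:op[2]] if op[0] == "copy" else op[1]
        return out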
The rub is that content awareness alone only identifies data duplication in the temporal realm. It can be efficient at storing the 15 new messages in a huge e-mail file, but it won't reduce the amount of space that 400 copies of the corporate memo template take up across all users' home directories.
Whereas hash-based systems can deduplicate any data you throw at them, the makers of content-aware systems have to explicitly build in support for the file and/or tape data format of each backup application they support. This has caused some consternation--for example, among users who bought Sepaton VTLs and then tried to use them with EMC's NetWorker or CommVault Galaxy, which Sepaton doesn't yet support. They're stuck waiting for Sepaton to add support for their backup programs.
Saving disk space in the data center is a pretty neat trick, but it's in remote-office and branch-office backups that deduplication really shines, reducing not just the amount of disk space you need at headquarters to back up all your remote offices, but also the amount of WAN bandwidth it takes to send the data. One solution is to eliminate remote-office tape handling entirely by replacing the tape drives in remote offices with a deduplicating appliance, such as Quantum's DXi or Data Domain's systems, that replicates its data back to another appliance in the data center while keeping existing backup software and processes in place.
The other solution is remote-office and branch-office backup software such as EMC's Avamar, Symantec's NetBackup PureDisk, or Asigra's Televaulting, all of which perform hash-based data deduplication at the source to vastly reduce the WAN bandwidth needed to transfer backup data to company headquarters.
Like any conventional backup application making an incremental backup, these products use the usual methods, such as archive bits, last-modified dates, and the file system change journal, to identify the files that have changed since the last backup. They then slice, dice, and julienne each file into smaller blocks and calculate a hash for each block.
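Condensed to its essentials, that source-side pass might look like the sketch below, which assumes a simple last-modified-time catalog rather than archive bits or a change journal, and fixed 8-KB blocks for brevity.

    import hashlib, os

    BLOCK = 8192
    catalog = {}        # path -> last-modified time recorded at the previous backup

    def changed_files(root):
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                mtime = os.path.getmtime(path)
                if catalog.get(path) != mtime:    # new or modified since last run
                    catalog[path] = mtime
                    yield path

    def block_hashes(path):
        with open(path, "rb") as f:
            while block := f.read(BLOCK):
                yield hashlib.sha1(block).hexdigest(), block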
Why Deduplicate?
LOCAL BACKUP
Deduplicating virtual tape libraries or other disk-based systems lets them store 10 to 30 times as much data as nondeduplicated systems, thereby extending retention and simplifying restores.
REMOTE BACKUP
Replicating deduplicated data makes backup across the WAN practical. Global deduplication even eliminates duplicates across multiple remote offices.
ARCHIVES
Deduplication saves disk space in the archive--effectively expanding available storage--and the hashes double as data signatures that help ensure data integrity.
WAN ACCELERATION
Hashing data, eliminating duplicate blocks, and caching in WAN acceleration appliances speed applications and replication without the cache coherency problems associated with wide area file services.
The hashes are then compared against a local cache of the hashes of blocks already backed up at that site. Hashes that don't appear in the local cache, along with the file system metadata, are sent to a grid of servers that serves as the central backup data store, which compares them against its own hash tables. The central store sends back a list of the hashes it hasn't seen before; the server being backed up then sends the data blocks those hashes represent for safekeeping.
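That back-and-forth reduces to a simple exchange of hash lists, something like the sketch below; the in-memory structures and function names here are ours, not any vendor's wire protocol.

    local_cache = set()      # hashes of blocks this site has already backed up
    central_index = {}       # at headquarters: hash -> stored block

    def central_wants(hashes):
        """Runs at the central store: return only the hashes it has never seen."""
        return [h for h in hashes if h not in central_index]

    def backup(blocks):
        """blocks: mapping of hash -> data produced by the local chunking pass."""
        candidates = [h for h in blocks if h not in local_cache]   # skip local hits
        for h in central_wants(candidates):                        # ask headquarters
            central_index[h] = blocks[h]                           # ship only new blocks
        local_cache.update(blocks)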
These backup applications could reach even higher data-reduction levels than deduplicating backup targets because they deduplicate not just the data from the set of servers backed up to a single target, or even a cluster of targets, but data from across the entire enterprise. If the CEO sends a 100-MB PowerPoint presentation to all 500 branch offices, it will be backed up from the office whose backup schedule runs first. All the others will just send hashes to the home office and be told, "We already got that, thanks."
This approach is also less susceptible to the scalability issues that affect hash-based backup targets. Since each remote server caches only the hashes for its local data, that hash table shouldn't outgrow the space available, and since the disk I/O system at the central site is much faster than the WAN feeding the backups, even searching a huge hash index on disk is much faster than sending the data.
Although Avamar, NetBackup PureDisk, and Televaulting all share a similar architecture and are priced based on the size of the deduplicated data store, there are some differences. NetBackup PureDisk uses a fixed 128-KB block size, whereas Televaulting and Avamar use variable block sizes, which could result in greater deduplication. Asigra also markets Televaulting for service providers so small businesses that don't want to set up their own infrastructure can take advantage of deduplication, too.