As we begin our dive into deduplication backup software, we will start with the guys that started it all: Avamar. Since the early part of this decade, Avamar, originally a standalone company and later part of EMC, has been working to convince people that delivering deduplication in the backup software, and starting the process at the client, is the way to go. Over the past few years they appear to have been gaining steam, driven mostly by a clear articulation of where their source-side technology plays best, especially since one of those areas seems to be the VMware use case.
Architecturally speaking, this is essentially an enterprise backup software application that is designed to send data to a custom back-end disk target. The software installs as an agent on the servers that are going to be protected and then sends backup data to a grid of interconnected server and storage nodes. There are several forms of delivery for this grid, but the predominant delivery package is a disk backup appliance called an Avamar Data Store.
The client software, unlike other solutions, does all of the deduplication processing and communicates with the server grid to ensure cross-client deduplication. The benefit is that only the changed segments are sent across the network to the disk target. With source-side deduplication, the bulk of the time is spent identifying and minimizing what to back up; with target-side dedupe, the bulk of the time is spent transferring all the data across the wire. Source-side dedupe means minimal use of LAN/WAN bandwidth, shorter backup transfer windows and, of course, savings in backup storage, at the potential expense of source processor utilization.
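To make the source-side flow concrete, here is a minimal Python sketch of the general idea, not Avamar's actual implementation. It assumes fixed-size chunks and SHA-256 fingerprints purely for simplicity; `DedupeServer`, `split_chunks`, `backup` and `CHUNK_SIZE` are hypothetical names invented for this illustration. The client hashes its data locally, asks the shared store which fingerprints it has never seen, and sends only those chunks.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and tiny, so the demo data stays readable


class DedupeServer:
    """Hypothetical stand-in for the back-end grid: a hash-addressed chunk store."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes, shared by all clients

    def missing(self, digests):
        """Return the fingerprints the store does not hold yet."""
        return {d for d in digests if d not in self.chunks}

    def put(self, digest, data):
        self.chunks[digest] = data


def split_chunks(data, size=CHUNK_SIZE):
    """Cut the data into fixed-size chunks (an assumption of this sketch)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def backup(server, data):
    """Fingerprint at the client, then send only chunks the server lacks.

    Returns (manifest, bytes_sent); the manifest is the ordered list of
    fingerprints needed to rebuild the data.
    """
    chunks = split_chunks(data)
    digests = [hashlib.sha256(c).hexdigest() for c in chunks]
    needed = server.missing(digests)  # one cheap round trip of hashes only
    sent = 0
    for digest, chunk in zip(digests, chunks):
        if digest in needed:
            server.put(digest, chunk)
            needed.discard(digest)  # do not resend a chunk repeated in this backup
            sent += len(chunk)
    return digests, sent


server = DedupeServer()
_, sent_a = backup(server, b"AAAABBBBCCCC")  # first client: everything is new
_, sent_b = backup(server, b"AAAABBBBDDDD")  # second client: only DDDD is new
print(sent_a, sent_b)
```

In this toy run the second client transfers a third of what the first one did, because two of its three chunks were already stored on behalf of another client. That is the cross-client effect described above.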
Processor utilization at the client has historically been perceived as a concern with source-side deduplication technology, and it was an issue when we first looked at the technology almost seven years ago. Of course, in seven years we have seen massive advances in server processing power, as well as improvements in the overall efficiency of deduplication backup software. As a result, what a customer should typically see today is a modest spike in CPU utilization at the client, but for a shorter amount of time than with traditional backup software.
The short-duration impact of deduplication processing should be manageable for most servers, and where there is a concern, the amount of CPU resources used can be capped at customer-specified limits. While this may lengthen the backups a bit, it allows you to maintain a service level on the host being backed up. This is especially important in VMware environments, where there is sensitivity to CPU consumption for backup, and where vMotion and other measures are often triggered by excessive CPU usage.
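The trade-off between a CPU cap and backup duration can be illustrated with a simple duty-cycle throttle. This is only a sketch of one common way to cap CPU use, and it is an assumption that it resembles what any particular product does; `pause_for` and `run_throttled` are hypothetical names. After each unit of work, the loop sleeps long enough that the busy time stays at or below the chosen fraction of wall-clock time.

```python
import time


def pause_for(busy_seconds, cpu_limit):
    """Sleep time that keeps busy/(busy + sleep) equal to cpu_limit.

    cpu_limit is a fraction in (0, 1]; 0.25 means "use at most 25% of a core".
    """
    if not 0 < cpu_limit <= 1:
        raise ValueError("cpu_limit must be in (0, 1]")
    return busy_seconds * (1 - cpu_limit) / cpu_limit


def run_throttled(work_items, process, cpu_limit,
                  sleep=time.sleep, clock=time.perf_counter):
    """Process each item, pausing after it to honor the CPU cap."""
    results = []
    for item in work_items:
        start = clock()
        results.append(process(item))
        sleep(pause_for(clock() - start, cpu_limit))
    return results


# At a 25% cap, 0.25 s of work is followed by 0.75 s of idle time,
# so the same job takes roughly four times as long on the wall clock.
print(pause_for(0.25, 0.25))
```

The arithmetic shows why a lower cap lengthens the backup: the work itself does not shrink, it is just spread over more elapsed time so the host keeps headroom for its real workload.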
Once redundant, sub-file segments of data have been identified and eliminated (within and across clients), only unique, new data is sent across the wire to be backed up. In unstructured data environments, Avamar claims that they can reduce data by over 99 percent. The backup data is received and written to disk at the Avamar Data Store.
In the Avamar Data Store, data is striped across the storage in the grid, and the processing load for the backup is distributed across the grid as well. Each node in the grid stores its data in a RAID 5 data protection scheme, and then RAIN protection (Redundant Array of Independent Nodes, a grid-like RAID) is applied across the nodes. RAIN provides resilience against the failure of any individual node, and also allows you to scale the grid without excessive downtime.
In addition to RAID and RAIN, Avamar also offers data recovery verification. Data is validated twice daily to make sure that whatever has been backed up is always in a recoverable state. Since Avamar does not rely on a full plus incremental recovery scheme, all recoveries from Avamar are one-step recoveries from logical full backups. This means there is no need to pull the last full backup from the weekend and layer the nightly incrementals on top of it.
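The idea of a logical full backup can be sketched in a few lines of Python. This is an illustration of the general technique under simplifying assumptions (fixed-size chunks, SHA-256 fingerprints, an in-memory store), not a description of Avamar's internals; `SnapshotStore` and its methods are hypothetical names. Each backup stores only new chunks but records a complete, ordered list of fingerprints, so any day can be restored in one step, and a verification pass can check that every referenced chunk is present and intact.

```python
import hashlib


def fingerprint(chunk):
    return hashlib.sha256(chunk).hexdigest()


class SnapshotStore:
    """Hypothetical chunk store where every backup is a full manifest."""

    def __init__(self):
        self.chunks = {}     # fingerprint -> chunk bytes (stored once)
        self.snapshots = {}  # backup name -> ordered list of fingerprints

    def backup(self, name, data, size=4):
        """Store unseen chunks only, but record a complete manifest."""
        manifest = []
        for i in range(0, len(data), size):
            chunk = data[i:i + size]
            digest = fingerprint(chunk)
            self.chunks.setdefault(digest, chunk)
            manifest.append(digest)
        self.snapshots[name] = manifest

    def restore(self, name):
        """One-step restore: follow a single manifest, no incrementals to layer."""
        return b"".join(self.chunks[d] for d in self.snapshots[name])

    def verify(self, name):
        """Check that each referenced chunk exists and still matches its hash."""
        return all(
            d in self.chunks and fingerprint(self.chunks[d]) == d
            for d in self.snapshots[name]
        )


store = SnapshotStore()
store.backup("monday", b"AAAABBBBCCCC")
store.backup("tuesday", b"AAAAXXXXCCCC")  # only XXXX is new on disk
print(store.restore("tuesday"), len(store.chunks), store.verify("tuesday"))
```

Tuesday's backup added a single chunk to the store, yet its manifest describes the whole data set, which is why the restore never has to consult Monday's backup.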