Whitewater deduplicates backup data inline, storing it in the local cache and sending the deduplicated 4-Kbyte data blocks to the cloud storage provider as quickly as it can over your Internet connection. When the local Whitewater cache fills, it overwrites the least recently accessed data first, on the assumption that the oldest data is the least likely to be recalled.
The Whitewater appliance compares incoming data only to the local cache to see whether a data block is a duplicate; it does not compare incoming data to data blocks stored in the cloud service. If you perform a backup containing data blocks that were deduplicated once but have since aged out of the local cache, those blocks will be stored in the cache and sent to the cloud provider again, because the Whitewater appliance sees them as new data. Ultimately, you can end up with duplicate data blocks in the cloud, and the longer you retain backups, the more new and duplicate data blocks will be stored. Naturally, if a backup file is deleted locally, it is deleted from the cache and the cloud as well.
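To make that behavior concrete, here's a minimal sketch of the ingest path as described above, assuming a SHA-256 fingerprint per 4-Kbyte block and a simple LRU cache. The names, cache size and hash choice are our illustration, not Riverbed's implementation:

```python
# A minimal sketch of the ingest behavior described above -- not Riverbed's
# code, just an illustration of deduplicating against a local-only LRU cache.
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4 * 1024          # Whitewater dedupes in 4-Kbyte blocks
CACHE_CAPACITY = 250_000       # hypothetical cache size, in blocks

cache = OrderedDict()          # fingerprint -> block, ordered by recency of use

def upload_to_cloud(fingerprint, block):
    """Placeholder for the asynchronous upload to the cloud provider."""
    pass

def ingest(data):
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp in cache:
            cache.move_to_end(fp)          # duplicate: nothing stored or sent
            continue
        # Only the local cache is consulted, never the cloud, so a block that
        # aged out of the cache is treated as brand-new data.
        cache[fp] = block
        upload_to_cloud(fp, block)
        if len(cache) > CACHE_CAPACITY:
            cache.popitem(last=False)      # evict the least recently accessed block
```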
At restore time, the Whitewater appliance reassembles your data using blocks from its cache, which will probably include all the blocks from last night's backup, so performance is what you would expect from a deduplicating backup appliance. When you restore older backup data, some blocks may have aged out of the cache, and the Whitewater will retrieve those from your cloud storage provider. If more than 80% of the data you're restoring is in cache, you should get good restore performance; as the amount of data that has to be retrieved from the cloud increases, performance will be limited by your Internet connection.
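Continuing the sketch above, the restore path is the mirror image: blocks still in the local cache come back at local-disk speed, while anything that has aged out must be fetched over the WAN. Again, the function names are ours, not Riverbed's:

```python
def fetch_from_cloud(fingerprint):
    """Placeholder for a GET from the cloud storage provider."""
    raise NotImplementedError

def restore(fingerprints):
    """Reassemble a backup from its ordered list of block fingerprints."""
    blocks = []
    for fp in fingerprints:
        if fp in cache:
            cache.move_to_end(fp)                # cache hit: fast, local
            blocks.append(cache[fp])
        else:
            blocks.append(fetch_from_cloud(fp))  # cache miss: limited by the WAN
    return b"".join(blocks)
```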
In the event of a Whitewater appliance failure or bigger disaster, users can just set up a new Whitewater and start restoring their data. Since there's a virtual appliance version, you can start restoring data without waiting for Riverbed to overnight a replacement appliance.
Riverbed sent us the top-of-the-line Whitewater 2010 to test, and we put it through its paces at the Lyndhurst, N.J.-based DeepStorage.net lab. We backed up our production file server, which holds about 720 Gbytes of assorted Office documents, software install points and our collection of World War II training films. We then used the Whitewater GUI and an SNMP monitor on our Internet gateway to see how the Whitewater reduced the data and sent it to the Amazon S3 instance Riverbed set up for our testing.
On the initial backup, the Whitewater's GUI indicated that it reduced the data about 2.1-to-1. Given that our initial backup included a collection of compressed files and media along with the Office files, that was the level of deduplication we expected. We then ran scripts to introduce about 2% new and changed files and backed the data up again, for a total of two full backups. As it should, the Whitewater reported our deduplication ratio climbing, ending up at about 4-to-1.
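As a rough sanity check on those ratios, here is our own arithmetic, assuming the second backup adds only its roughly 2% of new and changed data to the store:

```python
first_backup = 720                     # Gbytes in one full backup
stored_first = first_backup / 2.1      # ~343 Gbytes after the initial 2.1-to-1 reduction
stored_second = 0.02 * first_backup    # ~14 Gbytes of new/changed data from the second pass
logical = 2 * first_backup             # two full backups, ~1,440 Gbytes logical
print(logical / (stored_first + stored_second))   # ~4.0, in line with the reported 4-to-1
```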
Checking the S3 site, we saw that Amazon also reported roughly 350 Gbytes of data stored for our 1.4 Tbytes of backups, which told us the Whitewater UI wasn't lying to us. Of course, sending that initial 350 Gbytes took a while over the lab's 50-Mbps down/10-Mbps up cable modem connection, but the Whitewater kept the uplink at 90%-plus saturation for the duration of the upload. We'd advise Whitewater users to set priorities in their routers so the Whitewater can send data without bogging down other applications, though that wasn't a problem on our asymmetrical link.
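For perspective, a back-of-the-envelope estimate of that initial seeding time over a 10-Mbps upstream link held at about 90% saturation:

```python
data_gbytes = 350
uplink_mbps = 10 * 0.9                      # ~90% of the 10-Mbps upstream
seconds = data_gbytes * 8 * 1000 / uplink_mbps
print(seconds / 3600)                       # roughly 86 hours, or three and a half days
```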
When we switched to performance testing, we had difficulty getting our test system to ingest data at its full rated speed of 1 Tbyte per hour, though we did manage to back up data at more than 720 Gbytes per hour. We had this problem at least in part because we had to manually allocate backup streams across the Whitewater's four Ethernet ports. Since we completed our testing, Riverbed has updated the Whitewater software to support NIC teaming, which should simplify the process and make it easier to cram data into the higher-end Whitewaters quickly.
All in all, we're quite pleased with the Whitewater, but we'd like to see Riverbed enhance the reporting functions. You can see how much data is flowing into and out of the appliance, but not through each Ethernet port. You can see how much data is waiting to be replicated, but not how long that replication will take. Finally, we'd like to see some reporting on how data is distributed between the cache and the cloud storage back end. Knowing that restores from any backup made in the last 10 days would come entirely from cache would be reassuring.