10:15 AM -- Data de-duplication made its first real market inroads as a backup target. It offered an alternative to standard disk-to-disk backup that let you retain data for a longer period of time. Backup is tailor-made for de-duplication because successive full backup jobs contain so much of the same data. But does de-duplication make sense in archiving?
As is always the case, where you end up in this discussion depends on how you define archiving, how long you need to retain data, and what your motivation is for retaining it.
De-duplication devices in the backup market will claim 20X or more storage efficiency, but most leaders in this market are factoring in a certain frequency of full backups. Between daily incremental jobs, you may typically achieve only 4X to 6X efficiency. On average, we tend to see about 12X to 16X storage efficiency with a backup data de-duplication system. (In an upcoming entry we will go into detail on backup de-dupe rates.)
Archiving today has many use cases, but two of the more common motivations are moving older data off primary storage to reduce costs and storing data to fulfill a legal or corporate governance requirement. In both cases, data is placed on the device deliberately, for a specific purpose. The files are often unique and, as a result, the amount of commonality between them is limited -- storage efficiencies of 2X to 4X are typical.
There are exceptions where de-dupe efficiencies can be fairly high in archive storage. I know of several organizations that archive their production databases every night so that they can view that data as of any point in time. For example, one uses a database to track trading activity. They want the ability to trace back any trading inconsistencies or malicious activity within the database. While this database receives thousands of updates a day, as a percentage of its total size it does not change much from day to day. The archive system that they're using can do sub-file level data de-duplication and, as a result, the de-duplication efficiency on that system is well over 30X.
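
To see why sub-file de-duplication pays off so well on nightly database archives, here is a minimal Python sketch. It is not the archive system described above, and every number in it is an assumption chosen for illustration: a simulated 500-block database, 60 nightly copies, three blocks rewritten per night, and fixed 4 KB blocks identified by SHA-256 hash. The helper names (`nightly_snapshots`, `dedupe_ratio`) are hypothetical.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def random_bytes(rng, n):
    """Return n pseudo-random bytes from a seeded generator."""
    return rng.getrandbits(n * 8).to_bytes(n, "big")


def nightly_snapshots(nights, blocks, changed_per_night, seed=0):
    """Yield one full copy of a simulated database per night.

    Every night after the first, a few randomly chosen blocks are
    overwritten to stand in for that day's updates.
    """
    rng = random.Random(seed)
    db = bytearray(random_bytes(rng, blocks * BLOCK_SIZE))
    for night in range(nights):
        if night > 0:
            for index in rng.sample(range(blocks), changed_per_night):
                start = index * BLOCK_SIZE
                db[start:start + BLOCK_SIZE] = random_bytes(rng, BLOCK_SIZE)
        yield bytes(db)


def dedupe_ratio(snapshots, chunk_size):
    """Return logical bytes archived divided by unique bytes kept."""
    seen = set()  # hashes of chunks already stored
    logical = 0
    stored = 0
    for snapshot in snapshots:
        for offset in range(0, len(snapshot), chunk_size):
            chunk = snapshot[offset:offset + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return logical / stored


if __name__ == "__main__":
    # Sub-file de-dupe: compare the archive block by block.
    print(dedupe_ratio(nightly_snapshots(60, 500, 3), BLOCK_SIZE))
    # Whole-file de-dupe: treat each nightly copy as a single chunk.
    print(dedupe_ratio(nightly_snapshots(60, 500, 3), 500 * BLOCK_SIZE))
```

Under these made-up parameters, the block-level pass keeps 500 + 59 x 3 = 677 unique blocks out of 30,000 archived, or roughly 44X. The whole-file pass stays at 1X, because every nightly copy differs from the last by at least a few blocks. Real products often use variable-size chunking, so treat this as a way to reason about the effect rather than a prediction for any particular system.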