As we continue our tour through the deduplication maze, one of the
battle cries of data deduplication suppliers is how well their product
scales. Scalability, however, is in the eye of the beholder. In the
deduplication system space, this results in the classic battle of the
single box storage system vs. a grid or clustered storage system. In
the software space it raises questions.

The value of a single box system is that it is simple. You plug it in
one time and it works. Problem solved. Compare that with the potential
challenge of a multi-node storage cluster, where you have multiple parts
to put together. You have to decide whether a single box, like
those offered by Data Domain and Nexsan, will be fast enough
to meet your backup performance and scalability needs over time. Clustered
systems, like those offered by Exagrid and Sepaton, require a bigger
upfront footprint but have the potential to scale both performance and
capacity over the long term.

Single unit systems have kept pace with user demands by riding the
technology waves of faster processors and higher capacity hard drives.
From a raw storage I/O standpoint, most of these systems can keep pace
with the multi-node cluster offerings, especially for the typical data
center. They may also be more power efficient than multi-node clusters
and may more readily implement MAID-like functions.

Scaling capacity beyond what one box holds, however, requires adding a
second system that has to be managed separately. You as the customer have to decide where the
breaking point is for your environment. Managing two data deduplication
systems is not a challenge for most, but managing ten might be a problem.
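
One way to locate that breaking point is to project how many separately managed boxes you would own as backup data grows. The following is only a back-of-the-envelope sketch in Python. The function names, the per-box usable capacity, the starting size, and the annual growth rate are all hypothetical inputs, not figures from any vendor.

```python
import math


def boxes_needed(stored_tb, box_capacity_tb):
    """Number of single box systems required to hold stored_tb of data.

    Both arguments are hypothetical planning inputs, in terabytes.
    """
    if box_capacity_tb <= 0:
        raise ValueError("box_capacity_tb must be positive")
    # Always at least one box; round up because a partly filled box
    # is still one more system to manage.
    return max(1, math.ceil(stored_tb / box_capacity_tb))


def project_boxes(start_tb, annual_growth, box_capacity_tb, years):
    """Return (year, stored_tb, boxes) tuples, compounding growth yearly."""
    projection = []
    stored = start_tb
    for year in range(years + 1):
        projection.append((year, stored, boxes_needed(stored, box_capacity_tb)))
        stored *= 1 + annual_growth
    return projection


if __name__ == "__main__":
    # Assumed example: 20 TB stored today, 50% annual growth, 30 TB per box.
    for year, stored, boxes in project_boxes(20, 0.5, 30, 5):
        print(f"year {year}: {stored:.1f} TB stored, {boxes} box(es) to manage")
```

Run with your own numbers, the year in which the box count passes what your team is willing to manage separately is your breaking point.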

In the future, I expect single box deduplication systems to manage
multiple units in the background, presenting a single
virtual IP address to the backup server. This essentially creates a
loosely coupled cluster.

Deduplication systems typically reach capacity because of a desire to
keep backup data for a long time, potentially eliminating tape. Rather
than adding boxes, another option is to move older backups to a
Recovery Service Provider like Simply Continuous, or even to a straight
archive system like those offered by
Permabit, Caringo, Nexsan and others. By shifting the longer term
retention of data to a dedicated archive or a provider, the local box
does not need to scale. Management of different retention times on multiple
boxes is available from several of these vendors, and those that support
Symantec's OST have even greater flexibility and control.
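
To see why retention drives capacity, and why moving older backups elsewhere relieves the local box, a rough estimate helps. The sketch below uses a deliberately simplistic model that is an assumption of this example, not a description of how any product deduplicates: the box stores one baseline copy of the protected data plus the changed data from each retained daily backup. The function name and all figures are hypothetical.

```python
def local_capacity_tb(protected_tb, daily_change_rate, retention_days):
    """Estimate data stored on the local box, in terabytes.

    Assumed model: one baseline copy of the protected data, plus the
    unique (changed) data from each daily backup still under retention.
    """
    if not 0 <= daily_change_rate <= 1:
        raise ValueError("daily_change_rate must be between 0 and 1")
    if retention_days < 0:
        raise ValueError("retention_days must not be negative")
    return protected_tb * (1 + daily_change_rate * retention_days)


if __name__ == "__main__":
    # Assumed example: 10 TB protected, 2% of it changes each day.
    keep_everything = local_capacity_tb(10, 0.02, 365)  # a year kept locally
    keep_recent = local_capacity_tb(10, 0.02, 30)       # older backups moved off
    print(f"365 days on the local box: about {keep_everything:.0f} TB")
    print(f" 30 days on the local box: about {keep_recent:.0f} TB")
```

Under these assumed numbers, keeping a year of backups locally needs roughly 83 TB, while keeping 30 days locally and sending the rest to an archive or provider needs roughly 16 TB.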

If managing multiple single box deduplication
systems or outsourcing the storage of older backups is a concern
for you, this is where clustered or grid systems come into play.
That is something we will delve into in our next entry, "Scaling Backup
Deduplication with Clustered Storage".