Deduplication As An API

In my last entry I
discussed the value of primary storage deduplication and stated that,
for the benefits of the technology to be realized, storage vendors were
going to have to get it implemented. This can be done via a third-party
appliance, of course, but many vendors are trying to figure out how to
build it themselves. If they don't have the technology fully baked at
this point, the development cycle may be too long and the storage system
supplier risks missing the boat. To fill that need, we are seeing a
small handful of vendors address this market with deduplication as an
API.

Deduplication, even primary storage deduplication, is not a brand-new
feature; several operating systems and NAS vendors have had the
capability for a year or so, though it is certainly newer than the
backup use case. It's clear, though, that users are interested in the
capability because of the value it can potentially bring to the
environments we discussed in the last entry. Deduplication as an API
allows vendors to embed the technology directly into their existing
storage source code. This not only gives the vendor a shortcut to
offering what will become a must-have capability but also, and maybe
more importantly, more control over how that data is stored.
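
To make the idea concrete, here is a minimal sketch of what an embeddable deduplication API might look like, assuming fixed-size chunking and SHA-256 fingerprints. The DedupEngine class and its method names are hypothetical, invented for illustration; they are not any vendor's actual SDK:

```python
import hashlib

# A hypothetical embeddable dedupe engine: the names, chunking strategy,
# and call shapes are invented for illustration, not a real vendor API.
class DedupEngine:
    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size   # fixed-size chunking keeps the sketch simple
        self.store = {}                # fingerprint -> unique chunk payload

    def write(self, data: bytes) -> list:
        """Chunk the data, store only chunks not seen before, and
        return the list of fingerprints that reference them."""
        refs = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.store:   # new chunk: store the payload once
                self.store[fp] = chunk
            refs.append(fp)            # duplicate: just record a reference
        return refs

    def read(self, refs) -> bytes:
        """Reassemble the original data from the stored unique chunks."""
        return b"".join(self.store[fp] for fp in refs)

# The storage vendor would call write()/read() from its own I/O path:
engine = DedupEngine()
data = b"A" * 4096 * 8 + b"B" * 4096    # eight identical chunks plus one unique
refs = engine.write(data)
assert engine.read(refs) == data
print(len(refs), "references ->", len(engine.store), "unique chunks")  # 9 -> 2
```

The point of the shortcut is visible in the last line: the vendor's write path keeps its own layout and metadata, and only the chunk-and-fingerprint step is delegated to the API.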

This control over, and knowledge of, the deduplication process could
prove to be very valuable. Think of it the same way Symantec's
OpenStorage (OST) support changed the way backup applications
interacted with disk storage devices: once the backup application had
control over the device, the process became much smoother. In the same
way, once the storage system has control over the deduplication
process, it can make better use of the technology. For example, the
storage system could process all data inline while there is no
measurable impact on performance, then shift to post-process
deduplication if storage I/O begins to be measurably affected.
Similarly, vendors could leverage the API to provide smarter, more
efficient SAN replication than before, not sending data that has
already been sent from another site, as some backup deduplication
products already do today.
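
Here is a rough sketch of how that adaptive inline/post-process policy might work, reusing the hypothetical DedupEngine from above. The latency budget, the AdaptiveDedup interface, and the health check are all assumptions for illustration:

```python
import time
from collections import deque

# A hypothetical adaptive policy: dedupe inline while write latency is
# healthy, fall back to post-process (deferred) dedupe when I/O is
# measurably impacted. Threshold and interface are illustrative only.
LATENCY_BUDGET_MS = 2.0                    # assumed acceptable added write latency

class AdaptiveDedup:
    def __init__(self, engine):
        self.engine = engine
        self.deferred = deque()            # raw blocks awaiting post-process
        self.recent_ms = deque(maxlen=100) # rolling window of write latencies

    def write(self, data: bytes):
        start = time.perf_counter()
        if self._io_is_healthy():
            refs = self.engine.write(data) # inline: dedupe in the write path
        else:
            self.deferred.append(data)     # post-process: land the raw block,
            refs = None                    # dedupe it later (address elided)
        self.recent_ms.append((time.perf_counter() - start) * 1000)
        return refs

    def _io_is_healthy(self) -> bool:
        """Stay inline until average added latency exceeds the budget."""
        if not self.recent_ms:
            return True
        return sum(self.recent_ms) / len(self.recent_ms) < LATENCY_BUDGET_MS

    def post_process_pass(self):
        """Run during idle periods: dedupe everything that landed raw."""
        while self.deferred:
            self.engine.write(self.deferred.popleft())
```

This is exactly the kind of decision a third-party appliance can't make well, because only the storage system itself sees its own I/O load in real time.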

The question for the suppliers of these APIs is what the impact on
system performance will be and how complex the API is; in other words,
how long will it take to integrate the API set? The other issue is
going to be the data modification impact. While an API makes it easy to
turn something like deduplication on, will you be able to turn it off,
and what are the effects of doing so? That is going to be a critical
issue.
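
One reason turning it off is so critical: every deduplicated block has to be rehydrated back to full size, and the array needs enough free capacity to absorb that expansion. A hypothetical back-of-the-envelope check, with purely illustrative numbers:

```python
# Illustrative arithmetic only; the figures are assumptions, not vendor data.
def rehydration_shortfall_tb(logical_tb, dedup_ratio, free_tb):
    """Extra capacity (TB) needed to turn dedupe off, or 0 if it fits."""
    physical_tb = logical_tb / dedup_ratio    # what dedupe currently consumes
    expansion_tb = logical_tb - physical_tb   # growth when references re-expand
    return max(0.0, expansion_tb - free_tb)

# 40 TB logical at 4:1 occupies 10 TB; disabling dedupe adds 30 TB back,
# so with only 12 TB free the array comes up 18 TB short.
print(rehydration_shortfall_tb(logical_tb=40, dedup_ratio=4, free_tb=12))  # 18.0
```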

I believe deduplication will be an expected feature on primary storage
within the next one to two years, just as snapshots are today. If
vendors can't get a primary deduplication product out within that time
frame, they need to be looking at an API-type solution ASAP. You don't
want to be the only vendor bringing a knife to a gunfight.