RAID has slowly been choking on the growing capacity of disk drives. The problem is that as capacities increase, the time to rebuild a failed drive onto a spare grows linearly with them. Today's drives have reached 10 TB in capacity, and rebuilds can take days. Statistics show that the probability of a second failure during a rebuild is now high enough to raise real concerns about data integrity. The result is that we have had to find new solutions, one of which is erasure coding.
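To see why rebuilds take so long, consider a back-of-envelope calculation. The 150 MB/s sustained rate below is an assumed typical HDD figure, and it represents the best case; real rebuilds run far slower under live workloads and parity reconstruction:

    # Best-case time to rewrite a full 10 TB drive at an assumed
    # sustained rate of 150 MB/s; live arrays rebuild far slower.
    capacity_bytes = 10e12
    rate_bytes_per_s = 150e6
    hours = capacity_bytes / rate_bytes_per_s / 3600
    print(f"{hours:.1f} hours")   # roughly 18.5 hours, best case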
The storage industry first attempted to address the problem of rebuild time by adding a second parity drive to the RAID set. The extra parity calculation slowed write operations dramatically while not improving data integrity enough. Rather than add yet another parity drive, the industry looked to more radical alternatives. The first was a concept pushed in object storage systems, where the data is replicated multiple times. This is like RAID 1 mirroring, but done at the appliance level, so there is protection against both drive and appliance failures.
One benefit of this replication approach is that it adds geo-diversity to the replica set. One or more copies can be at a remote site, so that a natural disaster won’t cause data loss. In combination with a cloud approach to servers, this also means that loss of a cloud data center is covered, since the remote copy can be hooked up to replacement compute instances fairly quickly.
However, replication uses a lot of storage space; the minimum for full protection is three copies. The continuing search for alternatives led to a technique called erasure coding (EC). This is like RAID in that data is diced up, striped across drives, and given redundant information to allow rebuilding. Unlike RAID, a typical EC configuration can survive many simultaneous drive failures. A common choice is a 10+6 mapping, which is shorthand for 10 data elements plus 6 erasure-coding elements. These are written across a 16-drive set, and the result allows a rebuild from any 10 of the drives. In other words, up to 6 drives can fail or go offline before data is jeopardized.
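To make the 10+6 idea concrete, here is a toy sketch in Python using the third-party reedsolo package (pip install reedsolo; the tuple return from decode assumes a recent version of the library). Each byte position in a 16-byte codeword stands in for one drive. This illustrates the recovery math, not production striping:

    from reedsolo import RSCodec

    rsc = RSCodec(6)                  # 6 erasure-coding symbols

    data = b'0123456789'              # 10 data symbols ("drives")
    stripe = rsc.encode(data)         # 16 symbols: 10 data + 6 EC
    assert len(stripe) == 16

    # Simulate 6 "drive" failures at known positions (erasures).
    lost = [1, 4, 7, 10, 13, 15]
    damaged = bytearray(stripe)
    for pos in lost:
        damaged[pos] = 0              # contents gone, position known

    # Any 10 surviving symbols are enough to rebuild the data.
    recovered, _, _ = rsc.decode(damaged, erase_pos=lost)
    assert bytes(recovered) == data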
With erasure coding, the 16 drives are spread over multiple appliances. The best protection in the common 10+6 configuration requires at least four appliances, each holding four of the 16 drives. The loss of any one appliance thus knocks out only four drives in the set, comfortably within the six-drive failure budget, so the data is still available and protected.
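The arithmetic behind that claim is simple; this sketch assumes the stripe is distributed evenly across the appliances:

    stripe_drives = 16                            # 10 data + 6 EC
    appliances = 4
    per_appliance = stripe_drives // appliances   # 4 drives each

    failure_budget = 6                            # tolerable drive losses
    assert per_appliance <= failure_budget        # appliance loss is safe
    print(f"one appliance failure costs {per_appliance} of "
          f"{failure_budget} tolerable drive losses")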
EC takes considerably longer to compute the codes and write the data to disk. This makes it unsuitable for the very fast world of flash/SSD, where replication makes more sense. The lower raw-capacity overhead (1.6x, compared with 2x or 3x for replication) and the better resistance to drive failures make the technology ideal for archive storage, with its non-time-critical writes and the expectation that data will not change for years after being written.
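Those multipliers fall straight out of the layout: a scheme with k data elements and m code elements consumes (k + m)/k bytes of raw capacity per byte of user data. A quick check in Python:

    def overhead(k, m):
        # k data elements plus m erasure-coding elements
        return (k + m) / k

    print(f"10+6 erasure coding: {overhead(10, 6):.1f}x raw capacity")
    print("two-copy replication: 2.0x")
    print("three-copy replication: 3.0x")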
Erasure coding first entered the market from the academic community around three years ago as the storage mechanism employed by Cleversafe to store archival data. Interest in the approach and confidence in its capability have grown since then, and a number of vendors now support the technology.
Both of the leading object storage software stacks support erasure coding: Ceph provides EC support, while Caringo supports both EC and replication. GlusterFS has followed suit, and Intel and Cloudera are adding EC support to HDFS, which covers Hadoop.
Storage box vendors are also picking up EC technology. NetApp, NEC, Fujitsu, EMC-Isilon, Dell and HP are all on board, and the Azure and HP Helion cloud services support it. The bottom line is that erasure coding has arrived in the mainstream and will be a key feature of HDD-tier strategies going forward.