In my last blog post, I explained how advanced erasure codes can provide a higher level of reliability than more common RAID techniques. In this post, I’ll look at how vendors and open-source software projects are using erasure coding today, what we can look forward to in the future, and why erasure codes aren’t the data protection panacea some have made them out to be.
Erasure codes -- or, more specifically, erasure codes that provide data protection beyond double parity -- are today primarily used in scale-out object storage systems. These systems distribute the erasure-coded data blocks across multiple storage nodes to provide protection against not just drive failures but also node failures. Since object stores frequently hold hundreds of terabytes to petabytes of data, the 20 to 40% overhead of erasure coding lets operators save racks of storage nodes compared to the alternative: three- or four-way mirroring/replication.
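To put rough numbers on that efficiency argument, here’s a minimal back-of-the-envelope sketch (my own illustrative Python, not taken from any product); it treats overhead as the redundant share of raw capacity, which lines up with the 20 to 40% range quoted above.

```python
# Back-of-the-envelope capacity math, using my own illustrative helper names.
# "k of n" means any k of the n encoded strips are enough to rebuild the data.

def raw_per_usable_tb_ec(k: int, n: int) -> float:
    """Raw capacity consumed per usable TB with a k-of-n erasure code."""
    return n / k

def raw_per_usable_tb_replication(copies: int) -> float:
    """Raw capacity consumed per usable TB with full replicas."""
    return float(copies)

def redundant_share_of_raw(k: int, n: int) -> float:
    """Fraction of raw capacity spent on redundancy -- one common way to quote 'overhead'."""
    return (n - k) / n

if __name__ == "__main__":
    print(f"10-of-12 erasure code: {raw_per_usable_tb_ec(10, 12):.2f} raw TB per usable TB, "
          f"{redundant_share_of_raw(10, 12):.0%} of raw is redundancy")
    print(f"10-of-16 erasure code: {raw_per_usable_tb_ec(10, 16):.2f} raw TB per usable TB, "
          f"{redundant_share_of_raw(10, 16):.0%} of raw is redundancy")
    print(f"3-way replication:     {raw_per_usable_tb_replication(3):.2f} raw TB per usable TB")
    print(f"4-way replication:     {raw_per_usable_tb_replication(4):.2f} raw TB per usable TB")
```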
Over the past year or two, most object stores, from commercial solutions like DataDirect Networks' WOS and Caringo Swarm to open-source projects such as Ceph and Swift, have joined the pioneers of erasure coding, Cleversafe and Amplidata. Some object stores, like Ceph, limit erasure coding to a single storage pool and rely on replication between storage pools, and therefore between datacenters, to provide geographic protection.
The most sophisticated systems extend the erasure coding scheme to disperse encoded data chunks across multiple datacenters. A system using a 10-of-15 encoding scheme could store three chunks in each of five datacenters. This would allow the system to survive a datacenter failure with less than 40% storage overhead.
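The arithmetic behind that example is simple enough to sketch; the function below is my own illustration, not any vendor’s placement logic, and it assumes chunks are spread evenly across sites.

```python
# A placement sanity check for dispersed erasure coding (illustrative only).
# With 10-of-15 and three chunks per site across five sites, losing a whole
# site still leaves 12 readable chunks, more than the 10 needed to decode.

def survives_site_loss(k: int, n: int, sites: int) -> bool:
    """True if any single site can fail and at least k of the n chunks remain."""
    if n % sites != 0:
        raise ValueError("this sketch assumes chunks are spread evenly across sites")
    chunks_per_site = n // sites
    return n - chunks_per_site >= k

if __name__ == "__main__":
    print(survives_site_loss(k=10, n=15, sites=5))  # True: 12 chunks survive
    print(survives_site_loss(k=10, n=15, sites=3))  # True: exactly 10 survive
    print(survives_site_loss(k=10, n=12, sites=3))  # False: only 8 survive
```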
There’s no such thing as a free lunch, and storage system architects do have to pay a price for the high reliability and storage efficiency of erasure coding. The most obvious cost is in compute power. Calculating Reed-Solomon or turbo codes takes a lot more compute horsepower than simple parity, so systems using erasure coding need more CPU cores per PB than those using simple RAID. Luckily, the ceaseless increases in compute power predicted by Moore’s Law have made those extra cycles readily available.
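For a feel of where those cycles go, here’s a toy sketch of my own (nowhere near a production encoder, which would use lookup tables and SIMD): RAID-style parity is one XOR per byte, while each Reed-Solomon coding strip needs a Galois-field multiply plus an XOR per data byte.

```python
# Simple parity versus Reed-Solomon-style coding, per byte of data.

def xor_parity(strips: list[bytes]) -> bytes:
    """RAID-5-style parity: byte-wise XOR across all data strips."""
    parity = bytearray(len(strips[0]))
    for strip in strips:
        for i, byte in enumerate(strip):
            parity[i] ^= byte
    return bytes(parity)

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) with the reducing polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def rs_coding_strip(strips: list[bytes], coefficients: list[int]) -> bytes:
    """One Reed-Solomon coding strip: a GF(2^8)-weighted sum of the data strips."""
    out = bytearray(len(strips[0]))
    for strip, c in zip(strips, coefficients):
        for i, byte in enumerate(strip):
            out[i] ^= gf_mul(c, byte)
    return bytes(out)

if __name__ == "__main__":
    data = [bytes([i] * 8) for i in range(1, 11)]   # ten 8-byte data strips
    print(xor_parity(data).hex())                   # one XOR pass
    # Arbitrary coefficient row, standing in for one row of a real generator matrix.
    print(rs_coding_strip(data, list(range(1, 11))).hex())
```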
But, unfortunately, no matter how much CPU horsepower we throw at them, erasure-coded systems will also have higher latency and require more back-end storage I/O operations than simpler data protection schemes like replication or parity RAID. Under normal conditions, a conventional RAID system can simply read the data it needs, reading its parity strips only when it can’t read a data strip.
An erasure-coded system, even one that spreads its erasure codes across local drives, would have to read at least the minimum number of data strips needed to recover a data block and then recalculate the original data from them. For data encoded in, say, a 10-of-16 scheme, even the smallest read would require 10 I/O operations on the back-end storage, plus a delay for the calculations.
When writing data, the latency and I/O amplification created by erasure coding are even worse. Imagine a database writing random 8K blocks to a storage system that uses a 10-of-16 encoding scheme. To write 8K of data, the system would have to read 10 strips, each at least 4K to match the sector size of today’s drives, recalculate the coded strips, and write all 16 strips back out, turning one I/O request into 26 I/O operations on the back end.
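Spelled out as arithmetic (a trivial sketch with my own function names, just to make the 10-of-16 numbers concrete):

```python
# The read and write amplification described above, as arithmetic.
# Illustrative only; real systems mitigate this with caching, batching, and
# full-stripe writes where the workload allows.

def backend_reads_per_read(k: int) -> int:
    """A read must fetch at least k strips before the original data can be rebuilt."""
    return k

def backend_ops_per_small_overwrite(k: int, n: int) -> int:
    """A sub-stripe overwrite reads k strips, re-encodes, and writes all n strips back."""
    return k + n

if __name__ == "__main__":
    print(backend_reads_per_read(10))               # 10 back-end reads for even a small read
    print(backend_ops_per_small_overwrite(10, 16))  # 26 back-end I/Os for one 8K write
```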
In today’s scale-out object systems, the access node responding to a read request sends requests for all 16 strips to the storage nodes holding them and uses the first 10 responses to recalculate the data. Recalling all those data strips across the network generates a significant amount of network traffic.
While today’s high-bandwidth, low-latency datacenter networks minimize this impact for local access, systems using dispersal codes to spread data strips across multiple datacenters will be limited in performance by the bandwidth and latency of their WAN connections. Since these systems request all the data strips at once and use the first ones to arrive, they consume network bandwidth shipping data strips that won’t actually be needed to reconstruct the original data.
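One rough way to see the waste (my own simplification; it ignores request overhead and assumes equal-size strips): the share of per-read strip traffic that gets thrown away is simply (n - k) / n.

```python
# When every strip is requested and only the first k responses are used, the
# remaining (n - k) strips are fetched and then discarded. Illustrative numbers.

def discarded_strip_fraction(k: int, n: int) -> float:
    """Share of per-read strip traffic that is never used to rebuild the data."""
    return (n - k) / n

if __name__ == "__main__":
    print(f"10-of-16: {discarded_strip_fraction(10, 16):.0%} of strip traffic discarded")
    print(f"10-of-20: {discarded_strip_fraction(10, 20):.0%} of strip traffic discarded")
```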
I’m hoping that one or more of the vendors making dispersal-based systems comes up with a configuration option that stores enough data strips to satisfy reads in a primary datacenter and recalls strips from remote datacenters only as needed. Such a system with a 10-of-20 coding scheme could keep 10 strips in the primary datacenter and five in each of two remote datacenters. The system could survive the loss of a datacenter with just 50% overhead.
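As a thought experiment, the policy I have in mind boils down to a few lines. This sketch is entirely hypothetical; the names and structure are mine, not any shipping product’s.

```python
# Hypothetical local-first placement: hold enough strips in the primary
# datacenter to serve reads locally, and go to the WAN only when local
# strips are missing.

from dataclasses import dataclass

@dataclass
class Placement:
    k: int                 # strips needed to rebuild the data
    local_available: int   # readable strips in the primary datacenter
    remote_available: int  # readable strips across the remote datacenters

def strips_to_fetch(p: Placement) -> tuple[int, int]:
    """Return (local_reads, remote_reads) needed to satisfy one read request."""
    local = min(p.k, p.local_available)
    remote = p.k - local
    if remote > p.remote_available:
        raise RuntimeError("not enough readable strips anywhere: data is unrecoverable")
    return local, remote

if __name__ == "__main__":
    # 10-of-20 with 10 strips local and five in each of two remote sites.
    print(strips_to_fetch(Placement(k=10, local_available=10, remote_available=10)))  # (10, 0): no WAN reads
    print(strips_to_fetch(Placement(k=10, local_available=8, remote_available=10)))   # (8, 2): WAN only for the gap
    print(strips_to_fetch(Placement(k=10, local_available=0, remote_available=10)))   # (0, 10): site lost, still readable
```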
Regardless of how they’re implemented, high-reliability erasure codes are best suited to those applications that do large I/Os, just like the object stores where they’ve advanced from cutting-edge technology a few years ago to a standard feature today. Those looking to use them for more transactional applications will be disappointed at their performance.
Disclosure: Amplidata has been a client of DeepStorage LLC, and Mark Garos, CEO of Caringo, bought me a nice dinner the last time I saw him in New York.