What is erasure coding?
Erasure coding is a parity-based data protection scheme similar to RAID 5 and 6, but it operates at a finer level of granularity. In RAID 5 and 6, the unit of protection is the volume, whereas with erasure coding it is the object. This means that after a drive or node failure, only the objects on that drive or node need to be re-created, not the entire volume.
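To make the parity idea concrete, here is a minimal sketch of object-level protection using a single XOR parity fragment, the simplest possible case and the same arithmetic RAID 5 relies on. It is an illustration under stated assumptions, not how any particular product works: production erasure coding typically uses codes such as Reed-Solomon that tolerate several lost fragments, and the function names encode, recover and decode are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(obj: bytes, k: int) -> List[bytes]:
    """Split one object into k data fragments plus one XOR parity fragment."""
    frag_len = -(-len(obj) // k)  # ceiling division
    padded = obj.ljust(frag_len * k, b"\0")  # pad so fragments are equal length
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = reduce(xor_bytes, frags)
    return frags + [parity]


def recover(frags: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one lost fragment (marked None) from the survivors."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost fragment")
    rebuilt = list(frags)
    if missing:
        survivors = [f for f in frags if f is not None]
        rebuilt[missing[0]] = reduce(xor_bytes, survivors)
    return rebuilt


def decode(frags: List[bytes], original_len: int) -> bytes:
    """Join the data fragments (all but the parity) and strip the padding."""
    return b"".join(frags[:-1])[:original_len]


obj = b"erasure coding protects objects"
stored = encode(obj, k=4)   # imagine each fragment on a different node
stored[2] = None            # one node fails
restored = decode(recover(stored), len(obj))
```

Only the fragments of this one object are read and recomputed; nothing else on the failed node's volume is involved, which is the granularity advantage described above.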
Similar to replication, erasure coding can be set either manually or by policy to survive a certain number of node failures before there is data loss. Many systems extend erasure coding between data centers, so that the data can be automatically distributed between data centers and nodes within those data centers.
Since it is parity-based, erasure coding does not create multiple, redundant copies of data the way replication does. This means the cost of additional capacity "overhead" for erasure coding is measured in fractions of the primary data set instead of multiples. An erasure-coded scheme designed to provide protection from the same number of failures as 3x replication requires approximately 25% additional capacity, whereas three full copies consume 300% of the primary data set's size (200% beyond the primary copy).
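The capacity comparison can be checked with a few lines of arithmetic. The sketch below assumes an 8+2 layout (8 data fragments plus 2 parity fragments), which is one common way to arrive at roughly 25% overhead while tolerating two failures; the source does not name a specific layout, and the function names are hypothetical.

```python
def replication_overhead(copies: int) -> float:
    """Extra capacity, as a fraction of primary data, for `copies` total copies."""
    return float(copies - 1)


def erasure_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Extra capacity, as a fraction of primary data, for a k+m erasure code."""
    return parity_fragments / data_fragments


def raw_capacity_tb(primary_tb: float, overhead: float) -> float:
    """Total raw capacity needed to store and protect the primary data."""
    return primary_tb * (1 + overhead)


# Both schemes below survive the loss of any two copies or fragments.
rep = replication_overhead(copies=3)   # 2.0, i.e. 200% extra, 300% total
ec = erasure_overhead(8, 2)            # 0.25, i.e. 25% extra (assumed 8+2 layout)

print(raw_capacity_tb(100, rep))       # raw TB for 100 TB under 3x replication
print(raw_capacity_tb(100, ec))        # raw TB for 100 TB under 8+2 erasure coding
```

Under these assumptions, 100 TB of primary data needs 300 TB of raw capacity with three copies but 125 TB with the erasure-coded layout.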
The downside to erasure coding is that it's not as lightweight as replication. It typically requires more CPU and RAM resources to manage the data and to calculate the parity. More importantly, every access requires data to be reconstituted, since erasure-coded data is split into fragments and stored across nodes. That reconstruction generates considerably more storage network traffic than replication, which can be designed to require little or none. The additional network traffic can be particularly troublesome in a WAN or cloud implementation, since the WAN adds latency to every access.
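A rough way to see the latency effect is to model a read under each scheme. The sketch below is a deliberately simplified model with hypothetical function names and made-up latency figures: it assumes fragments are fetched in parallel, that any k fragments are enough to rebuild the object, and it ignores transfer size and decode time.

```python
from typing import Sequence


def replicated_read_latency(copy_latencies_ms: Sequence[float]) -> float:
    """A replicated read can be served entirely by the nearest full copy."""
    return min(copy_latencies_ms)


def erasure_read_latency(fragment_latencies_ms: Sequence[float], k: int) -> float:
    """An erasure-coded read waits for the k-th fastest fragment to arrive."""
    if k > len(fragment_latencies_ms):
        raise ValueError("not enough fragments to rebuild the object")
    return sorted(fragment_latencies_ms)[k - 1]


# Illustrative round-trip times: local site 1 ms, two remote sites 40 ms and 80 ms.
copies = [1, 40, 80]                 # one full copy per data center
fragments = [1, 1, 40, 40, 80, 80]   # two fragments per data center, any 4 rebuild

local_read = replicated_read_latency(copies)
spread_read = erasure_read_latency(fragments, k=4)
```

With these assumed numbers the replicated read completes at local speed, while the erasure-coded read has to wait on at least one WAN round trip because the local site holds too few fragments on its own.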
In an attempt to deliver the best of both worlds, some vendors are creating blended models. The first form of this is one where replication is the method used within the data center, so that most accesses from storage have the benefit of LAN-like performance. Then, erasure coding is used for data distribution to the other data centers in the organization. While capacity consumption is still high, data integrity is equally high.
The other blended model is based solely on erasure coding, but the erasure coding is zoned by data center. In this model, erasure coding is used locally and across the WAN, but one copy of all the data remains in the data center that needs it most. Then data is erasure-coded remotely across the other data centers in the customer's ecosystem. While this method consumes more capacity than regular erasure coding, it is still more efficient than the other blended model.
What is the best data protection method?
As always, the answer is, "It depends." There is a lot to like about the simplicity of replication. It works well for data centers with less than 25 TB of data. But as data grows, continuing with a replication strategy for data protection becomes untenable.
For larger data centers, replication becomes too expensive from a capacity consumption perspective. If they have high bandwidth or short interconnection distances, erasure coding provides excellent storage efficiency and ideal data distribution. Data centers that have latency issues should consider one of the blended models, most likely the second, which provides almost the same storage efficiency and eliminates network latency issues for day-to-day data accesses.