I’ve always been a fan of scale-out storage solutions, and I’ve always preached about them, for example in a post I wrote almost a year ago, The future of storage is Scale Out.
As data volumes skyrocket, the most viable way to cope with this growth is a system that can be scaled accordingly, without the pain of data migrations involving TBs of data. One of the limits of scale-out systems, however, has always been the data protection techniques applied to them. The first iterations based their data protection on what was available at the time: RAID. But RAID applied to scale-out is largely inefficient in both of the ways it can be applied: you can have local RAID on each node, or RAID at the grid level, where a whole node is used in the same way a disk is used in a local RAID.
With local RAID protection, you “waste” disk space on each node of the grid (count how many disks you are giving up in, say, a 20-node cluster) and you pay the performance penalty of parity calculations twice: inside each node and again in the network RAID applied to the grid. Or you can choose to have no protection in the single node and rely only on protection at the grid level, but then your failure domain becomes the entire node. When you lose one node, the entire data set of that node needs to be replicated again somewhere else in the cluster: lose a 10TB node, and the whole 10TB has to be re-striped with RAID algorithms. And it takes ages.
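To give a rough idea of the first option, here is a back-of-the-envelope sketch with made-up but realistic numbers (20 nodes, 12 disks per node, local RAID-6 with 2 parity disks each); the figures are purely illustrative:

```python
# Hypothetical numbers: 20 nodes, 12 disks each, local RAID-6 (2 parity disks per node).
nodes = 20
disks_per_node = 12
parity_disks_per_node = 2

wasted_disks = nodes * parity_disks_per_node
total_disks = nodes * disks_per_node
print(f"Disks lost to local parity: {wasted_disks} of {total_disks} "
      f"({wasted_disks / total_disks:.0%})")
# -> Disks lost to local parity: 40 of 240 (17%)
# ...and this is before any grid-level (network RAID) protection is layered on top.
```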
Many scale-out solutions have chosen to rely on simple replication techniques instead. With no RAID involved, each piece (or chunk) of data is replicated multiple times across the grid. When a copy of the data is lost, because a single disk has failed or an entire node has failed, the replication factor (the number of copies) is re-established by creating another copy of the data somewhere else. This is more efficient than RAID from a performance perspective, but more expensive, since for a given amount of data you need to buy and operate X times that amount of space (where X is the replication factor).
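A quick sketch of what that overhead means in practice (the capacity figures below are just an example):

```python
# Hypothetical example: raw capacity required under simple replication.
usable_tb = 100          # data you actually want to store
replication_factor = 3   # number of copies kept in the grid

raw_tb = usable_tb * replication_factor
print(f"Raw capacity to buy and operate: {raw_tb} TB for {usable_tb} TB of data "
      f"(overhead {replication_factor}x)")
# -> Raw capacity to buy and operate: 300 TB for 100 TB of data (overhead 3x)
```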
The TCO (total cost of ownership) of such solutions is becoming unaffordable, and so new techniques are emerging.
One of the most widely used, especially in object storage solutions, is Erasure Coding. I’m not going to explain what Erasure Coding is, others have done it better than I ever could, so if you want to learn more, before going on with this article have a look at the sources I used myself. A great starting point is an excellent paper from James S. Plank. Originally published at USENIX, it’s freely available (at the time of this post) in PDF here; it’s maybe a little too deep if you lack some mathematics background, but nonetheless worth a read. Another great source to understand EC in a simpler way is this article, where Karan Singh from Ceph (now part of Red Hat) explains it with some easy-to-understand examples. Even if you are not a Ceph user, you can skip the sections related to EC configurations in Ceph and consume the intro as a good explanation of Erasure Coding.
As you can understand after looking at these sources, Erasure Coding can really be the perfect match for scale-out solutions, certainly more than RAID. Network RAID consumes too much space, both in the local node and in the network RAID itself, making it too expensive once the grid scales beyond a few dozen nodes, and the parity calculations create performance problems. Erasure Coding can also be better than pure replication: given 1 as the data to be saved, instead of consuming 2 or 3 times that amount of space for replica copies (plus the cost of power, cooling and rack space for the additional storage nodes…) it can reach the same level of reliability with an overhead of only 1.4 or 1.6, depending on the configured algorithm.
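Where do numbers like 1.4 and 1.6 come from? With k data chunks and m coding (parity) chunks, the raw space consumed is (k+m)/k, and up to m lost chunks can be tolerated. A small sketch with some illustrative combinations (the specific k+m layouts are examples, not recommendations):

```python
# Overhead of an erasure code with k data chunks and m coding (parity) chunks:
# raw space consumed = (k + m) / k, and up to m chunk losses can be tolerated.
def ec_overhead(k, m):
    return (k + m) / k

for k, m in [(10, 4), (10, 6), (6, 3)]:
    print(f"EC {k}+{m}: overhead {ec_overhead(k, m):.1f}x, survives {m} failures")
# EC 10+4: overhead 1.4x, survives 4 failures
# EC 10+6: overhead 1.6x, survives 6 failures
# EC 6+3: overhead 1.5x, survives 3 failures
```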
So, is Erasure Coding ready for prime time? Sadly not. As long as we talk about typical object storage use cases, it works very well; in fact many object storage solutions are using some sort of erasure coding, even if under proprietary names (and with proprietary algorithms). But managing large amounts of small files is not the same as managing the building blocks of virtualized environments where applications run: the virtual disks of the virtual machines. In the eyes of an object store, those are huge files (GBs or TBs) where write operations are mostly random, and each one involves a partial update of only a small area of the entire file. Not really the same situation as updating a Word document…
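To see why partial updates hurt, here is a rough sketch of what a naive read-modify-write of an erasure-coded stripe can cost for a tiny random write. The chunk and stripe sizes are hypothetical, and smarter implementations can update parity incrementally instead of re-encoding the whole stripe:

```python
# Hypothetical read-modify-write cost of a small random write landing inside
# one chunk of a k+m erasure-coded stripe (numbers are illustrative only).
k, m = 10, 4                  # data chunks, parity chunks
chunk_kb = 1024               # 1 MB chunks -> 10 MB of user data per stripe
write_kb = 4                  # the guest writes 4 KB inside one chunk

stripe_kb = k * chunk_kb
# A naive implementation reads the whole stripe, re-encodes it and rewrites
# the touched data chunk plus all m parity chunks.
read_kb = stripe_kb
written_kb = (1 + m) * chunk_kb
print(f"{write_kb} KB guest write -> read {read_kb} KB, write {written_kb} KB")
# -> 4 KB guest write -> read 10240 KB, write 5120 KB
```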
Erasure coding has an inherent trade-off: to rebuild the original file and present it back to the application (the hypervisor in our case), a large amount of CPU power is needed to read the data chunks and the protection (or parity) chunks and reconstruct the original file. Easy to do on small files, not so much on virtual disks.
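A toy example, using a single XOR parity instead of a real Reed-Solomon code, shows where the work goes when a chunk has to be rebuilt:

```python
# Toy illustration (single XOR parity, not a full Reed-Solomon code):
# rebuilding one lost chunk means reading every surviving chunk and
# recomputing it, which is where the CPU and I/O cost comes from.
import os
from functools import reduce

k = 4
chunks = [os.urandom(1024) for _ in range(k)]                 # data chunks
parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)

lost_index = 2
survivors = [c for i, c in enumerate(chunks) if i != lost_index] + [parity]
rebuilt = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)

assert rebuilt == chunks[lost_index]   # the lost chunk is recovered, at the
                                       # cost of reading and XORing k chunks
```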
Nonetheless, EC is still the best way to have effective data protection across the massive number of nodes used in scale-out architectures. There are two ways to solve the CPU problem. One, for sure, is brute force: as CPU compute capabilities improve, the EC algorithms can be executed faster, up to the point where their latency is small enough to use them in virtualized environments. And I’m expecting more integration with the CPU itself in the future: it has happened in the past for multimedia and encryption (think SSE and AES instructions), and it can happen for erasure coding too. Having accelerated libraries directly in the CPU would surely improve performance by orders of magnitude.
The other way is to further optimize the algorithms, and this is constantly happening. Robin Harris has posted (as usual) an interesting article about some new research in this area: Optimizing erasure-coded storage for latency and cost. Really interesting.
Final notes
There is no doubt Erasure Coding is one of the “new cool kids in town” among data protection techniques. Even if it’s nothing new, we finally have the computing power and disk speeds needed to make it mainstream, at least for general purpose storage solutions, and several storage vendors are starting to adopt it, rather than RAID, as their data protection solution. The first step for EC to become mainstream has already been taken in general purpose object storage, where the use case is a large amount of small files. I see another intermediate step before erasure-coded storage can be used in virtualized environments: secondary storage used for data protection. In virtualized environments, backup files are large (many TBs in many cases), and with technologies like in-place incremental writes and instant recovery, this is a great testbed to see how Erasure Coding performs with large files and random I/O. I work for Veeam, and I’m already planning some extensive tests in this space. Stay tuned!