The future of storage is Scale Out


The IT industry has always been dominated by trends. Each historical period has seen the rise and success of groundbreaking technologies, born to overcome the limits of their predecessors by introducing new ideas better suited to ever-evolving environments.

Storage is no different. Since the ’80s the rulers have been RAID-based monolithic arrays connected to servers via a SAN, but lately new solutions are rising, with different ideas but also some common features.

The limits of traditional storage

Today’s market is dominated by traditional storage systems: monolithic, and shared among several servers thanks to a SAN (storage area network). Small and medium environments rarely reach their inherent limits, but because of the incredible growth of managed data (Big Data, Cloud Computing, HPC…) they are becoming unsuitable in several situations. There are two main problems:
– computing capacity: as data grows, a traditional storage system can only be expanded by adding new disks, and only a few models can also add new controllers. Capacity growth has the side effect of reducing performance, since a single controller has a well-defined maximum computing capacity and has to use it to manage a larger amount of data.
– RAID is becoming an inefficient redundancy technology. First of all, redundancy is achieved by sacrificing disk space, and this is becoming more and more evident at the sizes we are seeing today. Think about RAID10 using 1 TB disks: a 20 TB array wastes 10 TB, but a 1 PB system wastes 500 TB only for redundancy. If you choose RAID5 or RAID6 instead, you reduce the wasted space but you also lower performance: the write penalty is higher on these RAID levels. Also, each time a new and bigger disk model is announced, the time needed to rebuild the data on a disk increases, and during this time the RAID set runs in a degraded state, with an additional and significant impact on performance.
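
To put numbers on the capacity trade-off described in the list above, here is a rough sketch. The 8-disk group size for RAID5/RAID6 and the write-penalty figures are assumptions (commonly quoted rules of thumb), not values taken from any specific array.

```python
# Rule-of-thumb write penalties: back-end I/Os per small random write (assumed values).
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}
# Number of disks per group whose capacity goes to parity.
PARITY_DISKS = {"raid5": 1, "raid6": 2}


def usable_tb(raw_tb, level, disks_per_group=8):
    """Usable capacity of an array built from identical RAID groups."""
    if level == "raid10":
        return raw_tb / 2  # every block is mirrored
    parity = PARITY_DISKS[level]
    return raw_tb * (disks_per_group - parity) / disks_per_group


def wasted_tb(raw_tb, level, disks_per_group=8):
    """Raw capacity spent on redundancy only."""
    return raw_tb - usable_tb(raw_tb, level, disks_per_group)


for raw in (20, 1000):  # 20 TB and 1 PB of raw disk space
    for level in ("raid10", "raid5", "raid6"):
        print(f"{raw:>5} TB raw, {level:<6}: "
              f"{wasted_tb(raw, level):6.1f} TB for redundancy, "
              f"write penalty {WRITE_PENALTY[level]}")
```

With these assumptions, parity RAID wastes less space than RAID10 but pays for it with more back-end I/O per write, which is the trade-off described above.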

Scale-out is better

Scale-out storage offers one huge advantage: elasticity. When we hear about scale-out (at least it always happens to me), we quickly feel its concepts are “right”. And the reason is simple: the whole world around us works this way, combining small building blocks rather than using huge single monoliths.

Some examples? Houses are made with bricks, not carved out of a single stone. LEGO bricks, even though they are really small, can be connected to create some impressive artworks. Railways are made of many iron rails; each is only a few meters long, but together they form a line that can be hundreds of kilometers long.

But the most fascinating example, right in front of us every day, comes as usual from nature: basically any living creature is made of millions or billions of microscopic cells, united in a complex living organism. This is scale-out!!

“Classic” scale-out storage


Scale-out storage is not a new idea of 2013. Several solutions have been around for many years (HP LeftHand and EMC Isilon are the best known), made of small units combined to create a large storage system. To quote a term they introduced themselves, they moved redundancy from RAID to RAIN (redundant array of independent nodes). With those systems, when you add a new node you increase not only the available capacity but also performance, because each node brings its own computing power into the cluster. Also, if one of the nodes crashes, it does not break the whole cluster, since the surviving nodes still have a copy of all the data and can serve it to clients. The idea is really interesting, and in fact many customers have already purchased this kind of storage system.

However, even if the performance problem listed before was solved, the inefficiency of RAID is still there. Since those systems were designed when SSDs and Flash memory were not yet available, they all based their redundancy on RAID, both locally inside a node and when distributing data across the cluster. A Network RAID 5 is still a RAID5, built from 3 or more nodes of a scale-out cluster and losing a percentage of the overall capacity. If you add this “overhead” to the one coming from the local RAID in every node, above a certain size the amount of wasted space becomes an economic problem for customers.

Also, a RAID system needs to completely rebuild all the data on a disk when it breaks and is swapped with a new one. With the latest disks, as large as 4 TB, rebuild times are extremely long, up to 8-10 hours. And during the whole rebuild period the RAID set runs in degraded mode, so its performance is lower than that of a healthy RAID.
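
As a back-of-the-envelope check on those rebuild times, the sketch below estimates the best case: a purely sequential write of the whole disk with no other I/O. The sustained write rates used here (110 to 140 MB/s) are assumptions chosen for illustration, not measured values.

```python
def rebuild_hours(disk_tb, write_mb_per_s):
    """Best-case rebuild time: the whole disk is rewritten sequentially."""
    total_bytes = disk_tb * 1e12          # decimal terabytes, as disk vendors count
    seconds = total_bytes / (write_mb_per_s * 1e6)
    return seconds / 3600


for rate in (110, 120, 140):              # assumed sustained write rates in MB/s
    print(f"4 TB at {rate} MB/s: {rebuild_hours(4, rate):.1f} hours")
```

Any production I/O competing with the rebuild lowers the effective write rate, so real rebuilds on a busy array can take much longer than this best case.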

Finally, replication inside a single node, and even more so replication between nodes, puts an upper limit on the growth of those clusters: at a certain point the load of replicating data becomes unbearable for the cluster, and the only solution is to limit the cluster to a certain number of nodes.

The new scale-out solutions

The latest scale-out storage systems have learned from the experience of all the previous generations and their limits, and they now use different approaches to overcome those limits. It’s interesting to see so many startups with different solutions that nonetheless share some common ideas; this is probably evidence that those ideas are the right ones. Let’s see them.

Flash memory

There is absolutely no doubt that Flash memory, regardless of the form factor in use, is the real driver of this new wave of innovation. Its performance makes these new storage systems faster and faster, and at the same time allows for many different data processing activities, most of them in real time, that were simply not possible with spinning disks. We now see systems using tiering between flash and disks, with real-time deduplication and compression, up to even more refined solutions offering guaranteed QoS to workloads. As Stephen Foskett summarized beautifully in this recent post: “I can’t see anyone ever designing another all-disk storage array for general-purpose use.”

Commodity Hardware

Google Servers

Trust me, you will hear this term more and more often. I don’t really care about the ongoing discussions about “software defined”, and I don’t even want to start trying to classify those new storage systems in one of those categories; the real difference from the past is the extensive use of hardware components that are not specifically designed for the storage system. This idea was introduced many years ago by Google (in the pic on the left you can see one of their “servers”), and the new tendency is to develop everything in software and then take advantage of common hardware to run it. When you put together fast CPUs that can address huge amounts of RAM, plus flash memory, the final result is high-performance storage. Apart from this, I see no big difference whether the final solution is sold as software only, with the customer free to choose the hardware to run it, or distributed as a hardware appliance with components chosen by the vendor.

This approach makes development cycles much faster: there is no need to swap or upgrade any hardware component in the storage; with a simple software upgrade a customer can gain better performance, new features, and so on. And this, in the end, lowers the TCO of the solution.

The new SAN is Ethernet

Here too, new solutions take advantage of the evolution of other components in the datacenter. Before, replication was limited partly by 1G networks, but now 10G connections have basically solved these problems, and the available bandwidth is enough to move huge amounts of data between nodes without becoming a bottleneck that “wastes” the performance of Flash memory. Also, an Ethernet network is cheaper than an FC fabric, and most of all it can be scaled linearly and quickly; it can scale out too, just like the storage connected to it.

Object Storage

Probably, together with Flash memory, this is the most important component of these new storage systems.
Have you ever noticed the complete distance between the logical elements of a storage system and those of a hypervisor? On one side you talk about disks, RAID pools, RAID levels, LUNs. On the other, we have small files (for configuration) and big ones (VMs’ virtual disks). They speak two different languages, and in order to make them talk to each other we need some kind of common abstraction layer. And abstraction brings complexity and latency. Wouldn’t it be better to have storage that speaks the language of the hypervisor, or at least something close to it? And most of all, to get rid of all the useless complexity of traditional storage? That’s where object storage comes in.
Each piece of data saved in the new storage is an object, whether it is the whole file when it is small, or a chunk of a big file (think about a virtual disk chopped into several small pieces). These objects are saved directly into the storage’s native file system, with nothing like LUNs in between. If the hypervisor still sees a LUN or an NFS share, it’s only because the storage also acts as a “gateway” that translates its native format into something the hypervisor understands. Even more, there are some integrated solutions (for example the device driver to connect Ceph to Linux) where the operating system sees the object storage directly without any abstraction. I’m sure in the near future someone will start to write these device drivers for other platforms like VMware ESXi too. Once the object is saved into the storage, the system can do many things with it. Snapshots are simple pointers to different versions of the object, multiple copies of the objects offer redundancy like a RAID, and a remote copy is enough to create a Disaster Recovery solution. And all these operations happen at the fine-grained level of the single object, not of the whole LUN.
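
The idea of chopping a big file into objects, with snapshots as simple pointers, can be illustrated with a toy in-memory sketch. Everything here is hypothetical: the `ObjectStore` class, the tiny 4-byte chunk size, and the choice of naming objects by their SHA-256 hash are assumptions made for the example, not how any particular product works.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; tiny on purpose, real systems use far larger chunks


class ObjectStore:
    """Toy in-memory object store: a flat map from object key to bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, data):
        key = hashlib.sha256(data).hexdigest()  # assumed naming scheme
        self.objects[key] = data
        return key

    def get(self, key):
        return self.objects[key]


def write_file(store, data, chunk_size=CHUNK_SIZE):
    """Chop data into chunks, store each as an object, return the list of keys."""
    return [store.put(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_file(store, keys):
    """Reassemble a file from its ordered list of object keys."""
    return b"".join(store.get(k) for k in keys)


store = ObjectStore()
disk = write_file(store, b"AAAABBBBCCCC")  # a "virtual disk" made of 3 objects
snapshot = list(disk)                      # a snapshot copies pointers, not data
disk[1] = store.put(b"XXXX")               # overwrite only the second chunk

print(read_file(store, disk))      # current version
print(read_file(store, snapshot))  # snapshot still reads the old chunk
```

Note that the overwrite touched a single object: the snapshot and the live disk still share the two unchanged chunks, which is what working at the level of the single object rather than the whole LUN looks like in practice.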

Ease of use

As a direct consequence of the above, all these systems are really easy to use. Since the usual elements of a classic storage system are gone, little knowledge is required to deploy and manage them. You do not need to know the best number of disks to place in a stripe, which block size to choose when formatting a LUN, or other information like this for initiates. Need 5 TB for your virtual machines? Just two clicks in the console and here is the new space, ready to be used by the hypervisor. All the technical stuff like redundancy, performance, and tiering is handled by the storage, and you do not need to bother about it.

Sounds like a dream? Well, it’s simply how these systems work. You only need to try one of them, and you will ask yourself how any of the “old” storage systems is still acceptable.

 

Limits?

So, are there no limits or problems? Yes, there are, but some of them are only a consequence of the young age of these systems, so they will probably go away in the future.

Obviously the first limit, a purely psychological one, is that none of these systems is sold by a big vendor; they all come from startups. But if you feel that buying one of these systems is a leap into the void, remember on the other side some of the killer policies adopted by those “trusted huge vendors”: they often suddenly remove mainstream products from their catalogs in order to force customers into forklift upgrades and make them spend more money. For sure there are many startups right now; many of them will disappear in the future and only a few will survive, by becoming big enough or by being acquired. If the solution they offer is so much better than their competitors’, and you are able to appreciate its value, then your investment makes total sense.

Second: at first sight, all those solutions seem to be tailored for big companies or service providers. They talk about petabytes like peanuts, and if you are an SMB you could argue these solutions are not for you. For some of them this is probably true, but the beauty of scale-out storage is also that you can start with a few nodes and never add others. Also, some startups have solutions mainly designed for SMBs.

Third: a “pure” object storage is redundant and reliable because it creates multiple copies of every object; the number of additional copies is known as the Replication Factor. 0 means there is only one copy, 1 is like a RAID1 (the most common), and so on. For sure a Replication Factor of 1 wastes 50% of the available space. But again, taking advantage of modern CPUs and Flash memory, these storage systems are able to deduplicate and compress data in real time, without a noticeable performance impact. So the overhead in the end is less than RAID’s, or at least the same. Even more, in situations where you don’t need high performance, some systems use erasure coding, which helps to further reduce the space overhead and guarantees higher resiliency at the same time. Probably, as time goes by and performance increases, those algorithms will be applied to primary storage systems as well.
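
The arithmetic behind this third point can be sketched as follows. The 2:1 data reduction ratio and the 10+4 erasure coding layout are illustrative assumptions; real ratios depend heavily on the data and on the product.

```python
def raw_needed_replication(logical_tb, replication_factor, reduction_ratio=1.0):
    """Raw space for replicated objects.

    replication_factor follows the convention used in this post:
    0 means a single copy, 1 means two copies, and so on.
    reduction_ratio is the combined dedup + compression ratio (2.0 means 2:1).
    """
    copies = replication_factor + 1
    return logical_tb * copies / reduction_ratio


def raw_needed_erasure(logical_tb, data_fragments, parity_fragments):
    """Raw space when each object is split into data + parity fragments."""
    return logical_tb * (data_fragments + parity_fragments) / data_fragments


logical = 100  # TB of data as seen by the hypervisor
print(raw_needed_replication(logical, 1))        # two full copies, no reduction
print(raw_needed_replication(logical, 1, 2.0))   # two copies after 2:1 reduction
print(raw_needed_erasure(logical, 10, 4))        # 10 data + 4 parity fragments
```

Under these assumptions, a 2:1 reduction cancels out the cost of the second copy, and the erasure-coded layout needs less raw space than two full copies while surviving the loss of more fragments.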

 

Final notes

The “new generation” scale-out storage systems are the future, in my opinion and in that of many industry experts.

All the features I explained make them fast, scalable, and super easy to use. Every old storage system you have to use after trying out one of the new ones will seem obsolete to you. Why? Because they ARE obsolete!

  • Hi Luca,
    good post pointing out some interesting thoughts.
    I think I can relate to most. I would like to add some remarks..

    Your RAID rebuild times seem very optimistic, as I’ve seen RAID rebuild times reach 48+ hours on 2TB SATA drives, on a busy array.

    On the point about defining a certain capacity for a new VM, for example: I would love to see the operating system’s requirement to know the LUN boundaries go away. If an OS does not need to know the geometry of a LUN, but just knows it has a certain amount of capacity (seen maybe as extents or chunks), you would no longer need to carve out something that resembles a LUN. The OS would claim more chunks when needed from the object storage providers it knows about.

    Objects would no longer need to be abstracted or translated into SCSI commands, making a lot of things simpler as well. I guess this is too distant in the future at the moment. But the principles of SCSI and LUNs are what is binding us to the old traditional ways of storing information. Even object storage is somewhere translated to and from SCSI commands when you look at the whole stack.

    But again, good post.

    • Luca Dell’Oca

      Hi Ilja, thanks for your remarks.

      Yep, I should have been more detailed: the 9-10 hours to rebuild a 4 TB disk is counted when there is no other I/O activity on the disk, so it can be a pure sequential write on it. Surely with other I/O coming in, those times are really going to increase.

      I totally agree on the “native protocol” between server and storage, and the removal of SCSI commands for systems that actually do not speak SCSI would probably be the next step. Since we are talking about object storage, what about having for example VMware ESXi writing its virtual disks already chopped into chunks, and using only CRUD commands? No abstraction layers, no latency, no translations. I think it’s going to be a long journey, but maybe we will eventually arrive there. If you think about VSAN, they are already using a different storage format in the background, and it sounds like an object storage to me.

      Luca.

  • Tom Sightler

    Hi Luca,

    Excellent post! I agree with pretty much everything written in this article. I’m convinced that we are seeing only the very early stages of what object storage can become, eventually breaking us from the binds of legacy block storage protocols.