In recent months I have talked with or looked at several storage vendors, and I have seen a topic becoming more and more important: QoS (Quality of Service). The list of vendors offering this feature (each with its own technology) is getting quite long: CloudByte, GridStore, Coho Data, SolidFire, HP 3Par, NetApp. And I’m surely forgetting someone.
As you can see from my short list, there are both startups that have had this feature since their first release and big names that added QoS to their existing products. A new trend is coming, and QoS is definitely becoming a “hot” topic in storage.
You are always overprovisioning
Without a doubt, Flash memory is the main reason for the “renaissance” the storage industry has seen in recent years. Thanks to its incredible speed, far beyond that of mechanical disks, together with really low latency, Flash is literally shaking up storage systems.
But, at the same time, the problem of overprovisioning is still here. In the past, with storage systems built only with mechanical disks, storage designers often ended up adding more disks than the capacity required, only because the extra disks were needed to reach the desired level of performance. Adding capacity to increase performance was indeed stupid, but there was no real alternative to this kind of design.
Flash entered the market with the promise of ending this way of designing storage: IOPS per gigabyte skyrocketed, and now you are able to design a storage system that has exactly the size you need, and nothing more. But, again, you are overprovisioning. In the past you overprovisioned capacity, now you are overprovisioning IOPS. The reason is simple: Flash memory is so fast that, if you design your storage based on the expected peak value for IOPS, you are going to reach that peak only for a few moments, while the rest of the time your storage is running at a fraction of what it is capable of. In the past you paid for unused disk space, now you are paying for unused IO.
Here comes QoS
QoS is nothing new in IT: networks have always had to deal with limited bandwidth, especially on internet connectivity, and network administrators use QoS to (try to) fairly distribute a limited amount of bandwidth among all the workloads using it.
But what is QoS in storage? Basically, as in networking, QoS is a policy you can apply to a storage workload to influence its performance. There are several parameters you can modify, and for each of them you can set a lower limit (minimum guaranteed value), an upper limit, and, if you want, also a burst value (how far the workload can go above the upper limit, and for how long). Those parameters are IOPS, bandwidth, and latency.
In many of the storage systems with QoS capabilities I have seen, these policies are, in my opinion, not so advanced, or at least not complete. The most common rule you can find is the maximum amount of IOPS a given workload (usually a LUN, or a VMDK in some NFS-based systems) can consume, and nothing more. For sure, it’s better than nothing, but is it enough? I don’t think so.
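To make the shape of such a policy concrete, here is a minimal Python sketch of the three parameters with their lower limit, upper limit, and burst. The class and field names (Limit, QosPolicy, burst_seconds) and the burst figures are my own assumptions for illustration; they are not taken from any vendor’s product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Limit:
    """Bounds for one metric; None means 'not set'. Names are hypothetical."""
    minimum: Optional[float] = None   # guaranteed lower limit
    maximum: Optional[float] = None   # normal upper limit
    burst: Optional[float] = None     # temporary ceiling above the upper limit
    burst_seconds: float = 0.0        # how long the burst may last

    def __post_init__(self):
        if (self.minimum is not None and self.maximum is not None
                and self.minimum > self.maximum):
            raise ValueError("minimum cannot exceed maximum")
        if self.burst is not None:
            if self.maximum is None or self.burst < self.maximum:
                raise ValueError("burst must sit above a defined maximum")

    def ceiling(self, seconds_bursting: float) -> Optional[float]:
        """Ceiling in force after the workload has spent
        `seconds_bursting` seconds above its normal upper limit."""
        if self.burst is not None and seconds_bursting < self.burst_seconds:
            return self.burst
        return self.maximum


@dataclass(frozen=True)
class QosPolicy:
    """One Limit per parameter named in the text: IOPS, bandwidth, latency."""
    iops: Limit = field(default_factory=Limit)
    bandwidth_mb_s: Limit = field(default_factory=Limit)
    latency_ms: Limit = field(default_factory=Limit)


# Illustrative values only: 200 IOPS guaranteed, 5000 normal ceiling,
# bursts up to 8000 for at most 30 seconds.
policy = QosPolicy(iops=Limit(minimum=200, maximum=5000,
                              burst=8000, burst_seconds=30))
```

With this policy, policy.iops.ceiling(10) still allows the burst value, while policy.iops.ceiling(45) falls back to the normal upper limit.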
If for example you are a Service Provider, you are surely happy to have some QoS in place, and thus be able to limit a customer’s workload and prevent it from impacting other customers; that’s what is usually known as the “noisy neighbors” problem. By applying a policy to a workload that says for example “you can only use at most 4,000 IOPS” in a storage array capable of 20,000 IOPS, you can be assured you will always have 16,000 IOPS for other customers.
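How an array enforces such a cap internally is vendor-specific and not something I can show here. As a generic illustration, a token bucket is one common way to rate-limit requests; the Python sketch below uses it to cap a noisy tenant at the figure from the example. The class name IopsCap, the bucket_size allowance, and the offered load of 10,000 requests per second are assumptions of mine.

```python
class IopsCap:
    """Token bucket: each I/O spends one token; tokens refill at max_iops/second."""

    def __init__(self, max_iops: float, bucket_size: float = 100.0):
        self.max_iops = max_iops
        self.bucket_size = bucket_size  # hypothetical allowance for short spikes
        self.tokens = 0.0               # start empty so the cap holds from second one
        self.last = 0.0                 # time of the previous request, in seconds

    def admit(self, now: float) -> bool:
        """Return True if an I/O arriving at time `now` may proceed."""
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never beyond the bucket size.
        self.tokens = min(self.bucket_size, self.tokens + elapsed * self.max_iops)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def admitted_in_one_second(cap: IopsCap, offered_iops: int) -> int:
    """Offer `offered_iops` evenly spaced requests over one second; count admissions."""
    return sum(cap.admit(i / offered_iops) for i in range(offered_iops))


ARRAY_IOPS = 20_000                       # array capability from the example
noisy = IopsCap(max_iops=4_000)           # the "at most 4000 IOPS" policy
used = admitted_in_one_second(noisy, offered_iops=10_000)
left_for_others = ARRAY_IOPS - used       # what remains for the other customers
```

Even though the tenant offers 10,000 requests in that second, the number admitted stays at or just below 4,000, so at least 16,000 IOPS of the array remain available to everyone else.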
Granular QoS rules
But, again, is this kind of policy enough? As you have probably guessed by now, no. I’m again using an example for Service Providers, because it is the sector I know best, and at the same time one that can take the most advantage of what I call “Granular QoS rules”.
First of all, Service Providers also need lower limits, at a minimum. The reason is simple: lower limits are billable! If an SP doesn’t have any QoS capability, the only thing he can sell is storage space. And in fact, this is what you usually see in public price lists. But if he is hosting a customer that needs at least 1000 IOPS for his virtual machines, and he is able to guarantee that value, he has a new selling point.
I’m not talking about the old and boring concept of different storage tiers, with the abused example of “bronze, silver, gold” levels of service. Those are usually built with 3 different storage systems, and there is no flexibility at all in this design: if a customer needs to move his workloads from one level to another, there is almost surely going to be a Storage vMotion to move virtual machines from one datastore to another. And Storage vMotion, although it is an awesome feature, is time and IO consuming.
But if the SP has a single storage system that can be “tuned” on the fly by simply applying different rules, he can really maximize his investment in storage, offer different “storage profiles” to customers, and ultimately bill them accordingly. Instead of “bronze, silver, gold”, he can for example create 3 different profiles like:
Tier 1: IOPS = 1,000 min, 20,000 max, Bandwidth = 300 MB/s max, Latency = 5 ms
Tier 2: IOPS = 200 min, 5,000 max, Bandwidth = 100 MB/s max, Latency = 15 ms
Tier 3: IOPS = no min, 1,000 max, Bandwidth = 50 MB/s max, Latency = 30 ms
The underlying storage obviously needs to be able to guarantee performance equal to “number of customers” × “minimum guaranteed values”. But since it’s a single storage system, customers can be dynamically migrated from one tier to another without any storage migration, by simply applying a different policy.
In the end, we can say that storage QoS means more revenue for a Service Provider!
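The capacity rule above is simple arithmetic, and a short Python sketch shows both the check and the “migration by policy change”. The three tiers use the values from the profiles listed earlier; the customer names, the function names, and the array capabilities passed to the check are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Tier:
    min_iops: int             # 0 means no guaranteed minimum
    max_iops: int
    max_bandwidth_mb_s: int
    latency_ms: int


TIERS: Dict[str, Tier] = {
    "Tier 1": Tier(min_iops=1000, max_iops=20000, max_bandwidth_mb_s=300, latency_ms=5),
    "Tier 2": Tier(min_iops=200, max_iops=5000, max_bandwidth_mb_s=100, latency_ms=15),
    "Tier 3": Tier(min_iops=0, max_iops=1000, max_bandwidth_mb_s=50, latency_ms=30),
}


def guaranteed_iops(assignments: Dict[str, str]) -> int:
    """Sum of the minimum IOPS promised to every customer."""
    return sum(TIERS[tier].min_iops for tier in assignments.values())


def can_assign(assignments: Dict[str, str], customer: str,
               tier: str, array_iops: int) -> bool:
    """True if the array can still honor all minimums after the change."""
    proposed = {**assignments, customer: tier}
    return guaranteed_iops(proposed) <= array_iops


# Hypothetical customers, one per tier.
assignments = {"customer-a": "Tier 1", "customer-b": "Tier 2", "customer-c": "Tier 3"}

# "Migrating" customer-c to Tier 1 is only a policy change, allowed
# as long as the sum of guaranteed minimums still fits the array.
if can_assign(assignments, "customer-c", "Tier 1", array_iops=20000):
    assignments["customer-c"] = "Tier 1"
```

No data moves in this sketch: the upgrade is a dictionary update, which is the whole point of having a single storage system with tunable rules.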
What can we expect in the future
I’ve seen some new storage solutions that come close to this level of granularity, for example SolidFire. But more can be done.
VM-Centric: First of all, storage systems should overcome the biggest dichotomy in virtualization storage: storage systems talk about volumes, the hypervisor talks about virtual machines. Only a few storage systems are able to actually “see” a single virtual machine, so every rule we apply to storage is inaccurate. Let me explain: if a customer has 5 virtual machines, but he really cares about one of them and wants the Tier 2 level I described before for it, in reality his policy (applied at the volume/datastore level) needs to be higher, because some of the IOPS and bandwidth will also be consumed by the other 4 virtual machines. The next step in the QoS evolution needs to address this problem, by using the single virtual machine as the unit of measurement. Modern storage systems using NFS can already see the single VMs, while for block storage we should probably wait for VMware vVols. In the article I linked above about SolidFire, there is at the end a nice demo of their QoS solution applied to vVols, and it seems like a really promising combination.
Multiple policies on the same workload: if there are different systems accessing the same volume (or virtual volume, or whatever the allocation unit is…), maybe not all of them need to have the same policy. Here is a quick example, probably the best one: there is a Tier 3 policy applied to a customer, so his virtual machines are limited to an overall value of 1,000 IOPS. But the service provider also has internal management tasks, like backups. If the policy is applied to that customer regardless of which system is accessing the storage, backups would also run at a maximum of 1,000 IOPS, even if the storage is capable of, let’s say, 500,000 IOPS. But if I can apply two different policies to the same workload, I can use an even more granular policy scheme like this:
Tier 3: IOPS = no min, 1,000 max, Bandwidth = 50 MB/s max, Latency = 30 ms, connecting hosts = VMware ESXi
Backup: IOPS = no min, 100,000 max, Bandwidth = 500 MB/s max, Latency = no min, connecting hosts = Backup Servers
In a shared storage system, identifying the different incoming hosts is easy: WWN for FC, IQN for iSCSI, IP address for NFS. With this additional level of granularity, the customer’s virtual machines are still limited by the policy he paid for, but backups executed by the Service Provider can run at full speed.
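As a rough sketch of how such a lookup could work, the Python below picks the policy from the pair (volume, connecting host group), where the group is resolved from the initiator’s identifier. Every identifier here (the IQN, the WWN, the IP address, the volume name) is made up for illustration, and the structure is my assumption, not the behavior of any existing array.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Policy(NamedTuple):
    name: str
    max_iops: int
    max_bandwidth_mb_s: int


# Hypothetical initiators, grouped by the kind of host they belong to.
HOST_GROUPS: Dict[str, str] = {
    "iqn.1998-01.com.vmware:esxi01": "VMware ESXi",   # iSCSI initiator (IQN)
    "10:00:00:00:c9:12:34:56": "VMware ESXi",         # FC initiator (WWN)
    "192.0.2.50": "Backup Servers",                   # NFS client (IP address)
}

# Two policies on the same volume, selected by who is connecting.
RULES: Dict[Tuple[str, str], Policy] = {
    ("customer-a-vol", "VMware ESXi"): Policy("Tier 3", 1000, 50),
    ("customer-a-vol", "Backup Servers"): Policy("Backup", 100000, 500),
}


def policy_for(volume: str, initiator: str) -> Optional[Policy]:
    """Return the policy for this volume and initiator, or None if no rule matches."""
    group = HOST_GROUPS.get(initiator)
    if group is None:
        return None
    return RULES.get((volume, group))


customer_io = policy_for("customer-a-vol", "iqn.1998-01.com.vmware:esxi01")
backup_io = policy_for("customer-a-vol", "192.0.2.50")
```

Here customer_io resolves to the Tier 3 policy and backup_io to the Backup policy, although both requests target the same volume.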
We are at the sunrise of Storage QoS. I can’t wait to see what awesome improvements will arrive in the coming months and years.