I’ve been invited as a delegate to Storage Field Day 4 in San Jose, California. The first vendor we met was CloudByte.
Who is CloudByte?
CloudByte is a company that is partly Indian and partly American, with offices in both Bangalore and Cupertino. Their main product is ElastiStor: a software-only solution able to transform a server with some local storage (or connected to remote storage) into a storage array. Several ElastiStor machines can be connected together to form a distributed, scale-out storage system.
The concept of a scale-out storage system built with commodity servers, all tied together by a software solution, is not new, but CloudByte has added capabilities specifically designed for Service Providers. Above all, QoS (Quality of Service) features to limit the different workloads you can run on this storage.
“Unpredictable performance” is the terror of every Service Provider: while in an enterprise environment an admin can (more or less) control the workloads running on the storage, a Service Provider has no idea what his customers are going to execute in their virtual machines. So, being able to manage shared storage and guarantee each customer the performance they need is a challenging task. Usually, the easiest solution is over-provisioning: if you fear you could have performance problems in the future, you simply design storage bigger than what’s needed, so you can survive I/O spikes that would otherwise create problems for the storage itself. But this is obviously highly inefficient from an economic standpoint for a Service Provider.
Death to “noisy neighbours”!
“Noisy neighbours” are a well-known issue among storage admins in virtualized environments. A single workload demanding a high IOPS value can impair all the other workloads hosted on the same storage. I’ve seen this in my daily job: it can even be a simple IOmeter run fired up by a customer who wants to know how fast the storage he’s hosted on can go…
CloudByte has based its architecture on the TSM, an acronym that reminds me of IBM, but here meaning Tenant Storage Machine. It’s a logical container where an admin can configure all the parameters belonging to a customer: capacity (just like any other storage vendor), but most of all performance.
A TSM starts by using a slice of the underlying storage, made up of all the disks and flash memory coming from all the ElastiStor machines in the cluster or pool. Storage is managed with the ZFS filesystem, and ElastiStor is based on the FreeBSD operating system. Each TSM can manage different volumes, and all of them are exposed to a single tenant via one of the available protocols: NFS, iSCSI, SMB, or FC. Each TSM can be dynamically reallocated to a different area of the storage, and you can also move a TSM entirely between different HA groups (this reminds me of Storage vMotion in VMware…).
Finally, you apply the QoS rules: basically a limit on the maximum IOPS per volume. While running, the system measures throughput, IOPS, and latency in real time. If one of these values exceeds the configured maximum, performance is throttled with a proper filter. You can also configure a desired latency, but it’s only measured, not enforced. If in the future you change some of those values, a TSM may be “migrated” to a different pool with more available performance. This activity certainly has a cost in IOPS, so you could find yourself in the odd situation of temporarily having worse performance because the TSM is under heavy migration activity.
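The mechanics of a per-volume IOPS cap can be pictured as a simple rate limiter. Here is a minimal sketch using a token-bucket approach; this is a hypothetical illustration of the concept, not CloudByte’s actual filter implementation, and the class and method names are my own invention:

```python
import time

class IopsLimiter:
    """Token-bucket limiter capping a volume at max_iops.
    Hypothetical sketch -- not ElastiStor's real QoS filter."""

    def __init__(self, max_iops):
        self.max_iops = max_iops
        self.tokens = float(max_iops)   # start with a full bucket
        self.last = time.monotonic()

    def try_io(self):
        """Return True if one I/O may proceed now, False if it must be throttled."""
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, never above one second's budget.
        self.tokens = min(self.max_iops,
                          self.tokens + (now - self.last) * self.max_iops)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A volume configured for 5000 IOPS would get `IopsLimiter(5000)`; once the bucket empties, further I/Os are deferred, which is exactly the “restriction with a proper filter” behaviour described above.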
I was really interested in a deep dive on QoS: right now, it’s only an upper limit on the IOPS value. This is certainly a differentiating feature, but I do think (and I told the CloudByte guys) they could do more.
Think about a Service Provider: an upper limit is really useful to manage simultaneous access to the storage, but it’s not much of a “billable” feature for customers. They are far more interested in a guaranteed minimum value that cannot be violated whatever the load on the storage. So, an upper limit is fine, but I’d wish for additional features like a guaranteed minimum, and also workload priority.
I also have another idea: since Service Providers are one of the best use cases for CloudByte, what about delegated management of a TSM? For example, speaking of IOPS, a customer would buy 5000 IOPS for his TSM; the Provider would only have to configure that one value. Then the customer himself would be able to create different volumes inside his TSM and assign different QoS values to each of them. The system would then guarantee that the sum of the QoS values of all volumes does not exceed the overall value of the TSM. The same goes for minimum IOPS and priority.
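The invariant behind this delegated-management idea is easy to state in code: volume budgets are carved out of the TSM budget, and any allocation that would overflow it is rejected. The sketch below is my own illustration of the proposal, with hypothetical names; nothing like it exists in ElastiStor today:

```python
class Tsm:
    """Hypothetical delegated-QoS model: the provider sets total_iops,
    the tenant carves volumes out of that budget on his own."""

    def __init__(self, total_iops):
        self.total_iops = total_iops
        self.volumes = {}   # volume name -> IOPS cap

    def allocated(self):
        """IOPS already committed to volumes."""
        return sum(self.volumes.values())

    def add_volume(self, name, iops):
        """Let the tenant create a volume, enforcing the TSM-wide budget."""
        if self.allocated() + iops > self.total_iops:
            raise ValueError(
                f"volume '{name}' would exceed the TSM budget "
                f"of {self.total_iops} IOPS")
        self.volumes[name] = iops
```

With `Tsm(5000)`, a tenant could create a 3000-IOPS database volume and a 2000-IOPS web volume, but any further allocation would be refused, keeping the provider’s commitment intact without any provider intervention.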
Another issue is how IOPS are measured. There is no test run against the storage; instead, a predicted value is configured at the beginning by looking at the specs of the underlying hardware. An admin can then change this value. The approach to defining those values is conservative, but again I would like to see a proper performance test, executed in the background, or maybe as a “burn-in” test before adding a new node to the cluster. This is one of the limits of software-only solutions: they cannot control the underlying hardware a customer chooses and uses.
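The kind of burn-in test I have in mind is nothing exotic: hammer the device with small random reads for a while and count how many complete. A rough sketch, assuming a plain file as the target (a real test would use `O_DIRECT` with aligned buffers to bypass the page cache, which inflates these numbers):

```python
import os
import random
import time

def measure_read_iops(path, duration=1.0, block=4096):
    """Crude burn-in: random 4 KiB reads against an existing file,
    returning measured IOPS. Illustrative only -- not an ElastiStor tool,
    and cached reads will report far higher numbers than raw disk."""
    size = os.path.getsize(path)
    blocks = max(1, size // block)
    fd = os.open(path, os.O_RDONLY)
    ops = 0
    start = time.monotonic()
    try:
        while time.monotonic() - start < duration:
            os.pread(fd, block, random.randrange(blocks) * block)
            ops += 1
    finally:
        os.close(fd)
    return ops / (time.monotonic() - start)
```

Run at node-join time, a measurement like this could replace (or at least sanity-check) the conservative spec-sheet estimate the admin would otherwise tune by hand.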
There are some good ideas in the CloudByte solution, but the most differentiating ones need to be further improved.
For example, ElastiStor is supposed to be scale-out storage, but at the moment it can only grow up to 4 nodes, so to me even calling it scale-out is somewhat of a stretch. It’s defensible because there is no shared cache between nodes and each of them is truly independent, but there are “classic” arrays that can scale well above 4 nodes.
Also, I would like to see guaranteed minimum values in the QoS, not only upper limits, plus the ability to manage priorities between workloads.
We will see in the next releases whether these features surface.
One thought on “Storage Field Day 4: CloudByte”
Thanks Luca for joining us at Storage Field Day and thanks for the detailed article. A quick clarification: QoS control on ElastiStor is based on resource allocation in terms of cache, disk bandwidth, CPU, and network bandwidth. By allocating appropriate resources and controlling their consumption, the defined IOPS is guaranteed for each volume/TSM; it is both a minimum and a maximum, with variations of a few percentage points either way. We also have an option called ‘grace’: if it is enabled for a given volume, its upper bound is not enforced as long as it is not impacting the other volumes, so the configured IOPS number turns out to be a minimum.
In terms of scale-out, our cluster is at the ElastiCenter level; TSMs within an ElastiCenter can move transparently between ElastiStor nodes in the cluster. Each ElastiCenter can scale to more than 100 nodes, and they can also be geographically distributed. The 4 nodes you referred to are just high-availability groups within the larger ElastiCenter cluster.