My adventures with Ceph Storage. Part 7: Add a node and expand the cluster storage


Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 3: Design the nodes
Part 4: deploy the nodes in the Lab
Part 5: install Ceph in the lab
Part 6: Mount Ceph as a block device on linux machines
Part 8: Veeam clustered repository
Part 9: failover scenarios during Veeam backups
Part 10: Upgrade the cluster
At the end of Part 6, we finally mounted our Ceph cluster as a block device on a Linux server and started to use it. I described how to create an RBD device and how this thin-provisioned volume can be expanded. There comes a moment, however, when all the available space of the existing cluster is consumed, and the only way to further increase its size is to add an additional node. This is where Ceph shows its scale-out capabilities. In this part you will see how quickly, and above all how transparently, you can add an additional node and rebalance the resources of the expanded cluster.

Prepare the new node

Obviously, we first need a new server to add as an additional OSD node. You can follow Part 4 to understand how to properly create, install and configure an OSD node based on CentOS 7. In my lab, I’m going to add a new virtual machine with these parameters:

osd4.skunkworks.local, frontend network 10.2.50.204, replication network 10.2.0.204

CentOS 7.0, 2 vCPU, 2 GB RAM, and this disk layout:
disk0 (boot) = 20 GB
disk1 (journal) = 30 GB (running over SSD)
disk2 (data) = 100 GB
disk3 (data) = 100 GB
disk4 (data) = 100 GB

Once CentOS 7 has been installed and configured as described in Part 4, we are ready to deploy Ceph on it. As before, from the admin console, you need to run:
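The original command listing was lost here; a sketch of the ceph-deploy installation command as used in Part 5, assuming the hostname osd4 defined above:

```shell
# Run from the admin node; installs Ceph and its dependencies on the new node.
ceph-deploy install osd4
```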

As explained in Part 5, this command will update the repositories if necessary (this will probably happen every time on a clean machine, since the main repository will be epel…) and install Ceph and all its dependencies. If during the installation you encounter an error like this:

It’s because, at some point at the beginning of 2015, yum changed and stopped respecting some priorities in custom repositories. The quick solution is to run this command on the osd4 machine before ceph-deploy:
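The original command was lost here. A common workaround at the time, assuming the error is the well-known yum repository-priorities issue, was to install the priorities plugin so custom repository priorities are honored again:

```shell
# Assumption: the error stems from missing yum priority handling.
sudo yum install -y yum-plugin-priorities
```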

Once the error is fixed, let the installation run; at the end of the process, you should see lines like these:

This will give you the confirmation Ceph is installed correctly on the new node. Let’s check first that ceph-deploy is able to see the disks of the new node:
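The check can be run from the admin node with ceph-deploy’s disk subcommand (a sketch; osd4 is the hostname used in this lab):

```shell
# List the disks that ceph-deploy can see on the new node.
ceph-deploy disk list osd4
```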

The output is what we are expecting:

Time to create the new OSDs and their journals. As we did with previous nodes, here is the command (read again Part 5 to learn the details about these commands):
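The original listing was lost here. A sketch of the ceph-deploy command, assuming the device names follow the disk layout above (sdb = SSD journal disk, sdc/sdd/sde = data disks); adjust to your actual devices:

```shell
# Create one OSD per data disk, all journaling on the SSD.
ceph-deploy osd create osd4:sdc:/dev/sdb osd4:sdd:/dev/sdb osd4:sde:/dev/sdb
```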

After the preparation, you should receive an output like this:

The new OSDs on server osd4 are ready to be used. The last operation is to add the administration keys to the node so it can be managed locally (otherwise you would have to run every command from the admin node):
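The original command was lost here; this is the usual ceph-deploy way to push the cluster configuration and admin keyring to a node (a sketch, assuming the osd4 hostname):

```shell
# Copy ceph.conf and the admin keyring to the new node.
ceph-deploy admin osd4
```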

Add the new node to the cluster

Well, in reality, there is nothing more to do on the cluster, since the previous procedure has already added osd4 to the running cluster! Just as a reminder, this was the situation before the addition of the 4th OSD node (use the command “ceph status”):

But if you run the same command after the preparation of osd4, this is the new output:

As you can see, the 3 new OSDs were added to the pool, which has grown from 9 to 12 OSDs, and the total available space is now 1200 GB, up from 900 GB; the difference is exactly the 300 GB available on osd4. Super easy!

Remove a node

A Ceph cluster can dynamically grow, but also shrink. Considerations about disk utilization should be made before ANY operation involving the decommissioning of a node: if the used space is larger than the capacity of the surviving nodes, we will end up with the cluster in a degraded state, since there will not be enough space to create the replicated copies of all objects, or, even worse, not enough space on the surviving OSDs to hold the actual volumes. So, be careful when dismissing a node.

Apart from these considerations, right now there is no ceph-deploy command to decommission a node (ceph-deploy destroy is in the works as of April 2015, when I’m writing this article), but you can reach the same result with a combination of commands. First, identify the OSDs running on a given node:
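The listing was lost here; the standard command for mapping OSDs to hosts is:

```shell
# Show the CRUSH tree: every OSD, the host it lives on, and its status.
ceph osd tree
```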

Say we want to remove the just-added node osd4. Its OSDs are osd.9, osd.10 and osd.11. For each of them, the commands to run directly on the OSD node are:
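The original command sequence was lost here. A sketch of the usual removal sequence for one OSD (shown for osd.9; repeat for osd.10 and osd.11; the sysvinit-style service command matches Ceph of that era on CentOS 7):

```shell
ceph osd out 9               # mark the OSD out so data migrates away from it
service ceph stop osd.9      # stop the OSD daemon
ceph osd crush remove osd.9  # remove it from the CRUSH map
ceph auth del osd.9          # delete its authentication key
ceph osd rm 9                # finally remove the OSD from the cluster
```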

The OSDs are removed from the cluster, and CRUSH will immediately rebalance data across the surviving OSDs to guarantee the replication rules are honored. After this, you can remove the node itself from the cluster using ceph-deploy:
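The listing was lost here; a sketch of the ceph-deploy commands typically used to wipe a decommissioned node (destructive, so double-check the hostname):

```shell
ceph-deploy purgedata osd4   # remove Ceph data from the node
ceph-deploy purge osd4       # remove the Ceph packages as well
```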

In the future, there will be a simple command to remove an OSD (still in the works):

Maintenance mode

Finally, a quick tip on how to properly manage a maintenance situation. Whenever a node is unavailable in a Ceph cluster, the CRUSH algorithm starts to rebalance the objects among the available nodes to guarantee consistency and availability. However, if you are planning maintenance activities on one of the OSD nodes and you know the node will come back later, there is no sense in spending a lot of I/O and network bandwidth, and thus reducing the performance of the cluster, to rebalance it; moreover, especially on large nodes holding many TBs of data, even a simple rebalance is a heavy operation.

Before working on a node, you simply run:
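The command was lost here; it is the standard noout flag:

```shell
# Tell the cluster not to mark any down OSD as "out" (no rebalancing).
ceph osd set noout
```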

This command doesn’t actually put a node in maintenance mode. What it does is prevent any OSD from being marked out of the cluster. Because of this, the PG replica count can no longer be properly honored: when an OSD goes down, the cluster will be in a degraded state, and recovery will not start. To stop an OSD, the command is:
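The command was lost here; a sketch, using osd.9 as an example and the sysvinit-style service command of that era:

```shell
# Stop a single OSD daemon on the node.
service ceph stop osd.9
```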

(remember you can get the list of OSDs using ceph osd tree). Once all the OSDs on a node are stopped (or you can even disable the entire ceph service if you are planning to have multiple reboots…), you are free to work on the stopped node; replication will not happen for those OSDs involved in the maintenance, while all other objects will still be replicated. Once the maintenance is over, you can restart the OSDs services:
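The command was lost here; the counterpart of the stop command above (again using osd.9 as an example):

```shell
# Restart the OSD daemon after maintenance.
service ceph start osd.9
```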

and finally remove the noout option:
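The command was lost here; it simply clears the flag set earlier:

```shell
# Re-enable normal out-marking and rebalancing behavior.
ceph osd unset noout
```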

After a while, the status of the cluster should be back to normal.