My adventures with Ceph Storage. Part 5: install Ceph in the Lab


When I started this series of posts, I didn’t realize how many posts it would take just to get to the actual Ceph installation. I could have written a quick and dirty guide with step-by-step instructions, but then you would have been stuck with my personal design choices. Instead, I preferred to start from the very beginning, explaining in detail what Ceph is, how it works, and how I prepared my lab to use it. This took me 4 blog posts.

However, I know this one is the post you were really waiting for. With all the lab correctly configured and ready, it’s now time to finally deploy Ceph!

Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 3: Design the nodes
Part 4: deploy the nodes in the Lab
Part 6: Mount Ceph as a block device on linux machines
Part 7: Add a node and expand the cluster storage
Part 8: Veeam clustered repository
Part 9: failover scenarios during Veeam backups
Part 10: Upgrade the cluster

Install ceph-deploy

In addition to the 6 dedicated Ceph machines I’ve created, there’s another Linux VM that I use as an administration console. As I explained in Part 4, this machine can log in via ssh without a password to any Ceph node, and from there use the “cephuser” user to elevate its rights to root. This machine will run every command against the Ceph cluster itself. It’s not a mandatory choice: you can have one of the Ceph nodes also acting as the management node; it’s up to you.


The Ceph administration node is mainly used for running ceph-deploy: this tool is specifically designed to provision Ceph clusters with ease. It’s not the only way to create a Ceph cluster, just the simplest.

Once all the nodes have been configured with the password-less, sudo-capable cephuser user, you need to verify that the administration node can reach every node by its hostname, since these will be the names of the nodes registered in Ceph. In other words, a command like “ping mon1” should succeed for each of the nodes. If not, check your DNS servers again and/or modify the hosts file.

With all the networking verified, install ceph-deploy using the cephuser user. On a CentOS 7 machine like mine, you first need to add the Ceph repository. If you are running Ubuntu, check the Ceph pre-flight page.
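For example, open a new repository file with your editor of choice (the path /etc/yum.repos.d/ceph.repo is the one the pre-flight instructions use):

    sudo vi /etc/yum.repos.d/ceph.repo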

This will be a new, empty file. In it, paste this text and replace the values in brackets:
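A sketch of the repository definition, modelled on the Ceph pre-flight page of that era (double-check the URLs and keys against the current page before copying):

    [ceph-noarch]
    name=Ceph noarch packages
    baseurl=http://ceph.com/rpm-{ceph-release}/{distro}/noarch
    enabled=1
    gpgcheck=1
    type=rpm-md
    gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc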

In my example, Ceph release will be “giant” and distro will be el7. Again, check the preflight page if you are using different distributions. Then, update your repositories and install ceph-deploy:
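On CentOS this boils down to:

    sudo yum update && sudo yum install ceph-deploy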

Finally, I prefer to create a dedicated directory on the admin node to collect all output files and logs while I use ceph-deploy. Simply run:
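The directory name is arbitrary; here I use my-cluster as an example:

    mkdir my-cluster
    cd my-cluster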

Remember: EVERY TIME you log in to the admin node to work on Ceph, you first need to move into this folder, since the configuration files and logs of the ceph-deploy commands are saved here. Ceph-deploy is ready! Time to install Ceph on our nodes.

Set up the cluster

The first operation is to set up the monitor nodes. In my case, they will be the three MON servers. So, my command will be:
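If the monitor hosts follow the naming used elsewhere in this series (mon1, mon2, mon3), the command looks roughly like this:

    ceph-deploy new mon1 mon2 mon3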

After a few seconds, if there are no errors, you should see the command ending successfully with lines like these:

The initial ceph.conf configuration file has been created. For now, we will simply add a few additional lines to reflect our public and cluster networks, as explained in Part 3, plus some other parameters to start with (all to be placed under the [global] section of the configuration file):
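A minimal sketch of what goes under [global]; the two subnets below are placeholders to be replaced with the public and cluster networks defined in Part 3 (any other starting parameters, such as replica counts or PG defaults, go in the same section):

    # example subnets, replace with your own networks
    public network = 192.168.100.0/24
    cluster network = 192.168.200.0/24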

I’m not going to explain here what a placement group is and how it should be configured. It is mandatory to choose the value of pg_num because it cannot be calculated automatically. For more information, read here.
Finally, as a quick visual reminder, this is what we are trying to achieve with the double network:

Ceph Cluster Network

Then, install Ceph on all the nodes of the cluster and on the admin node:
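Assuming the node names used throughout this series (ceph-admin, the three monitors and the three OSD servers), the command is along these lines:

    ceph-deploy install ceph-admin mon1 mon2 mon3 osd1 osd2 osd3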

The command will run for a while: on each node it will update the repositories if necessary (this will probably happen every time on a clean machine, since the main repository will be epel…) and install Ceph and all its dependencies. If you want to follow the process, look for these lines at the end of each node’s installation:

This will give you the confirmation that Ceph is installed correctly. Once all the nodes are installed, create the initial monitors and gather the keys:
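That is a single ceph-deploy command, run from the working directory created earlier:

    ceph-deploy mon create-initial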

(Note: if for any reason the command fails at some point, you will need to run it again, this time as ceph-deploy --overwrite-conf mon create-initial.)

Prepare OSDs and OSD Daemons

So far, we have installed Ceph on all the cluster nodes. We are still missing the most important part of a storage cluster like Ceph: the storage space itself! So, in this chapter we will configure it, by preparing the OSDs and OSD daemons.

Remember? We set up the OSD nodes with 4 disks: one for the journal and 3 for data. Let’s first check that ceph-deploy is able to see these disks:
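For example, against the first OSD node (assuming it is reachable as osd1):

    ceph-deploy disk list osd1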

As you can see from the output, each time you run it ceph-deploy connects remotely to the given node and, using sudo, runs a ceph command locally, in this case /usr/sbin/ceph-disk list. The output is what we expect:

An OSD can be created with these two commands, run one after the other:
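In their generic form, with placeholders in braces as in the ceph-deploy documentation of that era:

    ceph-deploy osd prepare {node}:{data-disk}[:{journal-disk}]
    ceph-deploy osd activate {node}:{data-partition}[:{journal-partition}]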

Or the combined command:
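Which, with the same placeholders, is:

    ceph-deploy osd create {node}:{data-disk}[:{journal-disk}]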

In any case, you can see that there is a 1:1 relationship between an OSD and its journal. So, even though in our case the sdb device is shared between all the OSDs, we have to define it as the journal of each OSD by specifying the single partition we created inside sdb. In my case, the commands will be:
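Assuming the three data disks on osd1 are sdc, sdd and sde (the exact device names depend on how the node was built) and the journal partitions carved out of sdb are sdb1, sdb2 and sdb3:

    ceph-deploy disk zap osd1:sdc osd1:sdd osd1:sde
    ceph-deploy osd create osd1:sdc:/dev/sdb1 osd1:sdd:/dev/sdb2 osd1:sde:/dev/sdb3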

After repeating the same commands for all the other nodes, our OSD daemons are ready to be used. Note: the first command, zap, wipes anything that may already be on the disk; since it erases all data, be sure you are firing it against the correct disk!

Finalizing

To have a functioning cluster, we just need to copy the different keys and configuration files from the admin node (ceph-admin) to all the nodes:
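With ceph-deploy this is a single command (node names as above), followed by making the admin keyring readable, at least on the admin node:

    ceph-deploy admin ceph-admin mon1 mon2 mon3 osd1 osd2 osd3
    sudo chmod +r /etc/ceph/ceph.client.admin.keyring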

The cluster is ready! You can check it from the admin-node using these commands:
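For example, ceph health for the one-line summary and ceph status for the full picture described below:

    ceph health
    ceph status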

Here, you can see the 3 monitors all participating in the quorum, the 9 OSDs we created (it’s important that they are all UP and IN), the 256 placement groups grouped in 1 pool, and the 900 GB available (3 * 100 GB disks per node * 3 nodes).

The most common warning you could see at this point, especially in labs where PG calculations are overlooked, is:

This typically happens with, for example, a count of 64 total PGs. Honestly, placement group calculation is something that still doesn’t totally convince me: I don’t get why it has to be configured manually by the Ceph admin, who then often gets told it is wrong. Anyway, as long as it cannot be configured automatically, the rule of thumb I’ve found to get rid of the error is that Ceph seems to expect between 20 and 32 PGs per OSD. A value below 20 gives you this warning, and a value above 32 gives another error:

So, since in my case there are 9 OSDs, the minimum value would be 9*20=180, and the maximum value 9*32=288. I chose 256 and configured it dynamically:
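The two commands act on the default rbd pool, the only pool at this point; pgp_num has to be raised together with pg_num for the change to take full effect:

    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256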

That’s it! The cluster is up and running, and you can also try to reboot some of the OSD servers and watch with ceph -w, in real time, how the overall cluster keeps running and dynamically adjusts its status:

I rebooted OSD1: the three OSDs it contains went down, and the pgmap started to update itself to reflect the new condition, with some PGs in degraded mode. When the server came up again, the 166 degraded PGs immediately started to resync, and in a few seconds the state came back to all 256 PGs in the active+clean state. But here is the important part: for the entire duration of the reboot of one node, the overall size of the cluster has always been 899 GB.

Next time, we will create an RBD volume and connect it to a linux machine as a local device!


31 thoughts on “My adventures with Ceph Storage. Part 5: install Ceph in the Lab”

  1. Thanks very much for the concise write up. Having an issue running ceph-deploy against the admin node. It works fine for the monitors and OSDs.

    [ceph-admin][WARNIN] ensuring that /etc/yum.repos.d/ceph.repo contains a high priority
    [ceph_deploy][ERROR ] RuntimeError: NoSectionError: No section: ‘ceph’

    I have not been able to find any info online outside of other folks reporting the same issue for hammer.

    • I still have to do a clean installation from hammer, mine was done with giant and then updated to hammer. If I find some differences I’ll post them here, thanks for the notes.

      • This is outlined in http://tracker.ceph.com/issues/12694

        You can resolve it by running “sudo mv /etc/yum.repos.d/ceph.repo /etc/yum.repos.d/ceph-deploy.repo”

        I’m having issues with the next step with “ceph-deploy mon create-initial” or “ceph-deploy --overwrite-conf mon create-initial”, still working through it though

        • Thanks Ben.
          Sadly Ceph is still a solution that requires a lot of tuning and hacking in the commands and config files to make it work. Calamari is the biggest example of this, after months I’m still fighting to have a working and repeatable installation and configuration process.

          • I just wanted to add in case anyone else had any issues. That “ceph-deploy mon create-initial” doesn’t like it if you have capitals in your hostname of your servers. When it runs the commands it appears to run them in lowercase. (You may have mentioned this, not sure)

            I should also say a big thanks Luca, this article has been fantastic, its a lot easier to follow than the official ceph documentation 🙂

          • Never faced this issue as I never create hosts with uppercase letters 🙂

            Thanks for the kind words, this series is the result of months of failing and retrying; I’m trying to spare others the same pain.

  3. This is a great write up on ceph.

    I keep running into the following error after running ceph-deploy mon create-initial

    [ceph_deploy][ERROR ] KeyNotFoundError: Could not find keyring file: /etc/ceph/ceph.client.admin.keyring on host

    Any idea what might be going on?

      • Hi Luca,
        Thanks for responding. I think I found the issue. Since I was using ceph-deploy from a monitor node, attached to the cluster/replication network (10.0.0.0/24), I thought all network traffic needed to happen on the replication network. Turns out that ceph-deploy mon create-initial without public IPs does not work. All is running well now.

        Also, I just wanted to mention that it may be required to add the mount points of the OSDs to fstab to survive an OSD node/daemon reboot.

        Thanks again for a very helpful blog post.

        • Hi Niels,

          I have the same problem, but my admin node is connected to both the public and private networks. I am using VirtualBox to run the nodes and NAT for Internet access. I don’t understand why I always get this error. My public network is 10.0.0.0/24 and my cluster network is 192.168.57.0/24.

          Can you tell me what have you changed to get rid of this error?

  4. Can you share the IOPS you got ? Did you have a separate OSD journal ?

    • Hi,
      No, this is my lab; every node runs as a virtual machine over 1 Gb links. In production you should look for 10 Gb connections, as replica and in/out traffic is going to be significant.

  5. Is it in production ? what is your networks speed/topology ? 1Gbps or 10 ?

  6. Many thanks @dellock6:disqus for this extremely informative series of blog.

    For anyone who encountered a “permission denied” when activating the OSD nodes:

    [centos-ceph1][WARNIN] 2016-01-23 02:57:01.408417 7f2625122900 -1 filestore(/var/lib/ceph/tmp/mnt.eSVUoD) mkjournal error creating journal on /var/lib/ceph/tmp/mnt.eSVUoD/journal: (13) Permission denied

    [centos-ceph1][WARNIN] 2016-01-23 02:57:01.408429 7f2625122900 -1 OSD::mkfs: ObjectStore::mkfs failed with error -13

    [centos-ceph1][WARNIN] 2016-01-23 02:57:01.408455 7f2625122900 -1 ** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.eSVUoD: (13) Permission denied

    a “chown ceph:ceph /dev/sd” is how I fix the problem.
    Hope that would help :p

    • That will only work temporarily. As soon as you reboot the system the permissions will change back. To fix it permanently you would need to create udev rules to set the permissions on boot.

      I created the following file: /etc/udev/rules.d/89-ceph-journal.rules

      Which contains the following rules:

      KERNEL=="sdj?", SUBSYSTEM=="block", OWNER="ceph", GROUP="disk", MODE="0660"
      KERNEL=="sdk?", SUBSYSTEM=="block", OWNER="ceph", GROUP="disk", MODE="0660"

      My journals are on /dev/sdj1, /dev/sdj2, /dev/sdj3, /dev/sdj4 and /dev/sdk1, /dev/sdk2, /dev/sdk3, /dev/sdk4.

  7. Hi @dellock6:disqus . Could you help me about error
    “** ERROR: error creating empty object store in /var/lib/ceph/tmp/mnt.Hc9xb0: (13) Permission denied ”
    in log file OSD /var/log/ceph/ceph-osd.1.log .
    Full log you can see at http://pastebin.com/xLuZSgf9

    • Hi,
      you better submit this issue to the ceph forums, there are higher chances to be helped. Sorry but I don’t have time to do support on these topics.

      Luca

    • Hey, did you find a solution for this issue. I ran into the same problem and was not yet able to figure it out.

      • Just discovered the post from @baiyiwang:disqus. This solved the issue for me.

  8. Do you have any document for version 9.2 on CentOS 7? I saw it’s very different. Is this setup using a separate disk for the journal from the data disks?

  9. Thank you for your great blog, but I found a small error here: the max PG per OSD isn't 32. The error "Error E2BIG: specified pg_num 256 is too large (creating 192 new PGs on ~3 OSDs exceeds per-OSD max of 32)" means that you can only create 32 new PGs per OSD at a time, so you could have max pg_num = 64 + 32 * {number of OSDs}

    • Oops, I think I was wrong, see below command

      [Thu Jul 07 18:27:42 root@ceph-slave-dev001-jylt.qiyi.virtual ~/ceph-jylt-dev01]# ceph osd pool get rbd pg_num

      pg_num: 64

      [Thu Jul 07 18:27:52 root@ceph-slave-dev001-jylt.qiyi.virtual ~/ceph-jylt-dev01]# ceph osd pool set rbd pg_num 256

      Error E2BIG: specified pg_num 256 is too large (creating 192 new PGs on ~3 OSDs exceeds per-OSD max of 32)

      [Thu Jul 07 18:28:04 root@ceph-slave-dev001-jylt.qiyi.virtual ~/ceph-jylt-dev01]# ceph osd pool set rbd pg_num 160

      set pool 0 pg_num to 160

      [Thu Jul 07 18:28:40 root@ceph-slave-dev001-jylt.qiyi.virtual ~/ceph-jylt-dev01]# ceph osd pool set rbd pg_num 256

      set pool 0 pg_num to 256

      [Thu Jul 07 18:35:18 root@ceph-slave-dev001-jylt.qiyi.virtual ~/ceph-jylt-dev01]# ceph osd pool set rbd pg_num 352

      set pool 0 pg_num to 352

  10. Thanks for the great writing. a couple of questions:
    when my virtual machines are in say 10.145.82.0/24, do I configure public network as the same?
    if I want to do the same as the post about the cluster network, do I need to create the VLAN first or would ceph do it automatically?

    how do you prepare the physical disks? when you talk about /dev/sdb, dev/sbc etc do you have those physical disks already?

    I know you use vCenter. I am starting from scratch and like to use free hypervisor, what should I use, Xen, KVM, Hyper-v?

    What about docker? can I use docker and how the physical disk work out with docker? Does docker just see host’s disk? What about network? how do you set cluster network? Or is cluster network optional?

    • I can’t advise on any specific design, but Ceph works at Layer 3, so as long as you have network connectivity any subnet is good. For the hypervisor, I just created some CentOS machines, you can run them wherever you want.
      For docker, search in Sebastian Han blog (https://www.sebastien-han.fr/blog) he has many articles about Ceph and Docker.

Comments are closed.