My adventures with Ceph Storage. Part 6: Mount Ceph as a block device on linux machines


Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 3: Design the nodes
Part 4: deploy the nodes in the Lab
Part 5: install Ceph in the lab
Part 7: Add a node and expand the cluster storage
Part 8: Veeam clustered repository
Part 9: failover scenarios during Veeam backups
Part 10: Upgrade the cluster
In Part 5, we ended up with our Ceph cluster up and running, perfectly replicating objects among the nodes. As explained at the beginning of the series, one of the goals of using Ceph has been to create a general-purpose storage for our datacenter, and one use case was to use it as a repository for Veeam backups. Ceph can be accessed in different ways, but as of today probably the best way for what I’d like to accomplish is to mount it as a local device on a linux machine. The Linux kernel has had native support for RBD (RADOS Block Device) since 2.6.34, so if you are planning to use older distributions like CentOS 6, please refer to the Ceph OS Recommendations. In my case, I’m using another CentOS 7 machine, as I did for all the nodes of the cluster: this distribution comes with kernel 3.10, which has full support for RBD.
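As a quick sanity check, you can compare the running kernel against the 2.6.34 minimum; this is just a sketch, assuming a standard `uname -r` output such as `3.10.0-123.el7.x86_64`:

```shell
# Sketch: confirm the running kernel is recent enough for native RBD support (>= 2.6.34)
kver=$(uname -r | cut -d- -f1)   # e.g. 3.10.0
min="2.6.34"
# sort -V orders version strings numerically; if $min sorts first (or ties), the kernel qualifies
if [ "$(printf '%s\n' "$min" "$kver" | sort -V | head -n1)" = "$min" ]; then
  echo "kernel $kver is new enough for rbd"
else
  echo "kernel $kver is too old, see the Ceph OS Recommendations"
fi
```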

We will deploy an additional machine that will act as a client for RBD and mount the block device. DO NOT run the RBD client on the same machines where the Ceph cluster is running: bad things can happen, especially on the OSD nodes. It is also good design to keep client and server on separate machines: for example, when you want to patch the cluster you can reboot one node at a time and still have the cluster in active+clean state, and the client will not be impacted by any maintenance operation.

So, first things first, let’s deploy a new linux machine like this one:

repo1.skunkworks.local
CentOS 7 minimal
10.2.50.161
2 vCPU, 2 GB RAM, 20 GB disk

Once the machine is up and running, quickly verify that it can reach the Ceph cluster by pinging the monitors and the OSDs. Connections will be initiated towards the monitors, but the data stream will flow between the client and the OSD nodes, on their public network. Remember, the OSD cluster network is only used between OSDs for replication. In my case, the client and all the Ceph nodes are connected on the same network.
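Beyond a plain ping, a quick sketch to check that each monitor also answers on the default Ceph monitor port (6789); the IPs are this lab’s monitors, substitute your own:

```shell
# Sketch: verify each monitor answers on the default Ceph monitor port (6789).
# Uses bash's /dev/tcp pseudo-device, so no extra tools are needed on a minimal install.
check_mon() {
  local ip=$1 port=${2:-6789}
  if timeout 2 bash -c "exec 3<>/dev/tcp/$ip/$port" 2>/dev/null; then
    echo "$ip:$port ok"
  else
    echo "$ip:$port unreachable"
  fi
}
for m in 10.2.50.211 10.2.50.212 10.2.50.213; do check_mon "$m"; done
```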

Create a new Ceph block volume

In the previous chapter, the end result of the activities was a working Ceph cluster; but in order to use it as a mounted volume, we first need to create an RBD (RADOS block device) image in the cluster. First, from the ceph-admin machine, let’s check again that the cluster is in a healthy state:

[ceph@ceph-admin ceph-deploy]$ ceph -s
cluster aa4d1282-c606-4d8d-8f69-009761b63e8f
health HEALTH_OK
monmap e1: 3 mons at {mon1=10.2.50.211:6789/0,mon2=10.2.50.212:6789/0,mon3=10.2.50.213:6789/0}, election epoch 8, quorum 0,1,2 mon1,mon2,mon3
osdmap e81: 9 osds: 9 up, 9 in
pgmap v7276: 256 pgs, 1 pools, 0 bytes data, 0 objects
333 MB used, 899 GB / 899 GB avail
256 active+clean

Remember, HEALTH_OK is the desired status, and all the PGs should be in active+clean state. Ok, time to create the block device:

rbd create veeamrepo --size 20480

In the command above, veeamrepo is the name we give to the RBD image, and 20480 is the size of the device in MB, so it will be 20 GB. As long as you run the command in the ceph-deploy directory, information about the monitor servers to connect to, and the keyring to authenticate with, will be picked up automatically by the rbd command. The creation is immediate, and you can check that the block device is there:

[ceph@ceph-admin ceph-deploy]$ rbd ls
veeamrepo
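Since --size is expressed in MB, a trivial conversion avoids off-by-1024 mistakes when you think in GB:

```shell
# --size for `rbd create` is in MB: convert the desired GB figure first
gb=20
echo $(( gb * 1024 ))   # prints 20480, the value passed to `rbd create` above
```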

To retrieve information about the block device, use this command instead:

[ceph@ceph-admin ceph-deploy]$ rbd --image veeamrepo info
rbd image 'veeamrepo':
size 20480 MB in 5120 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.10e9.2ae8944a
format: 1

RBD Kernel modules

Now, let’s move into the linux repository server.

As of today, the rbd module is not shipped in the stock CentOS kernel (even though the upstream 3.10 kernel includes it); it may appear in the future, and other distributions already ship it. To verify whether your kernel already has the rbd module, try to load it:

modprobe rbd

If it returns an error, the rbd module is not available, and we will proceed to install it. Some guides on the Internet say you have to install the complete ceph-common package to get the needed module, but in reality you just need two packages (kudos to Tamás Mészáros for his detailed blog post about using rbd on CentOS 7):

yum -y install https://ceph.com/rpm-testing/rhel7/x86_64/kmod-rbd-3.10-0.1.20140702gitdc9ac62.el7.x86_64.rpm https://ceph.com/rpm-testing/rhel7/x86_64/kmod-libceph-3.10-0.1.20140702gitdc9ac62.el7.x86_64.rpm

Once the new packages are installed, retry loading the RBD module; this time it should succeed.

Then, let’s map the block image to a block device. First, from one of the monitor nodes, retrieve the client admin name and key by looking at the file /etc/ceph/ceph.client.admin.keyring:

[client.admin]
key = AQBE5KtUUPcNLRAAg+O2wpGoCSONAsvkCleePQ==

Then, map the RBD device in the linux repository server:

echo "10.2.50.211,10.2.50.212,10.2.50.213 name=admin,secret=AQBE5KtUUPcNLRAAg+O2wpGoCSONAsvkCleePQ== rbd veeamrepo" > /sys/bus/rbd/add

where the IP addresses are those of the MON servers of the Ceph cluster, rbd is the name of the pool, and veeamrepo is the name we chose for the rbd image. As usual when copy/pasting, be careful that you are using straight quote characters ("), otherwise the command will not succeed. The final result will be a new device on the linux machine:

[root@repo1 ~]# ll /dev/rbd*
brw-rw----. 1 root disk 252, 0 Feb 16 12:00 /dev/rbd0
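To avoid quoting mishaps, it can help to assemble the add-string from variables first and inspect it before writing it to sysfs; a sketch using this lab’s monitor IPs and key:

```shell
# Sketch: build the /sys/bus/rbd/add line from variables, then inspect it before use.
MONS="10.2.50.211,10.2.50.212,10.2.50.213"          # monitor IPs of this lab
SECRET="AQBE5KtUUPcNLRAAg+O2wpGoCSONAsvkCleePQ=="    # from ceph.client.admin.keyring
POOL="rbd"
IMAGE="veeamrepo"
LINE="$MONS name=admin,secret=$SECRET $POOL $IMAGE"
echo "$LINE"                        # verify the string looks right
# echo "$LINE" > /sys/bus/rbd/add   # then run this as root to actually map the device
```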

Now you can format the device with XFS, applying a label to it:

[root@repo1 ~]# mkfs.xfs -L veeamrepo /dev/rbd0
log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/rbd0 isize=256 agcount=17, agsize=326656 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0
data = bsize=4096 blocks=5242880, imaxpct=25
= sunit=1024 swidth=1024 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

As you may have noticed, I used a label when I formatted the RBD device with XFS: now I can simply use the label to identify the correct RBD. This is especially important if you are going to have multiple RBD devices on the same machine.

Finally, create a mount point like /mnt/veeamrepo and mount the device:

mkdir /mnt/veeamrepo
mount /dev/rbd0 /mnt/veeamrepo

The new device is mounted and ready to be used!

[root@repo1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 18G 1.2G 17G 7% /
devtmpfs 913M 0 913M 0% /dev
tmpfs 921M 0 921M 0% /dev/shm
tmpfs 921M 12M 909M 2% /run
tmpfs 921M 0 921M 0% /sys/fs/cgroup
/dev/sda1 497M 184M 314M 37% /boot
/dev/rbd0 20G 33M 20G 1% /mnt/veeamrepo

When needed, you can unmount the filesystem and remove the RBD device with:

umount /mnt/veeamrepo
echo "0" > /sys/bus/rbd/remove

where 0 is the device id, i.e. the number in /dev/rbd0.

This will be useful in the future for the clustered version of the Veeam repository.
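If more than one image is mapped, the id to echo is not necessarily 0. A sketch that looks up the right id by image name, using the per-device directories the rbd kernel module exposes under /sys/bus/rbd/devices:

```shell
# Sketch: unmap an RBD image by name, looking up its numeric id under sysfs.
# Each mapped device has a directory /sys/bus/rbd/devices/<id> with a "name" file.
target="veeamrepo"
for d in /sys/bus/rbd/devices/*; do
  [ -e "$d/name" ] || continue
  if [ "$(cat "$d/name")" = "$target" ]; then
    id=${d##*/}                      # strip the path, keep only the numeric id
    echo "$id" > /sys/bus/rbd/remove
  fi
done
```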

Automount at boot

So far, we tested the manual mount of the RBD device. On a production system, we obviously want to mount the device automatically each time the linux machine restarts. Being a CentOS 7 system, we will leverage systemd.

Create a new systemd service unit (e.g. /etc/systemd/system/rbd-veeamrepo.service) for each of your remote rbd images. In it, place text like this:

[Unit]
Description=RADOS block device mapping for rbd/veeamrepo
Conflicts=shutdown.target
Wants=network-online.target
After=NetworkManager-wait-online.service
[Service]
Type=oneshot
ExecStart=/sbin/modprobe rbd
ExecStart=/bin/sh -c "/bin/echo 10.2.50.211,10.2.50.212,10.2.50.213 name=admin,secret=AQBE5KtUUPcNLRAAg+O2wpGoCSONAsvkCleePQ== rbd veeamrepo > /sys/bus/rbd/add"
ExecStart=/bin/mount -L veeamrepo /mnt/veeamrepo
TimeoutSec=0
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Start the service and check that /dev/rbd0 has been created:

systemctl start rbd-veeamrepo.service
systemctl status rbd-veeamrepo.service

If everything seems to be fine, enable the service to start on boot:

systemctl enable rbd-veeamrepo.service

At each reboot, the new service will load the rbd kernel module, connect to the block image on the Ceph cluster, and mount it locally into /mnt/veeamrepo. Now you can use the new disk space!

In Veeam, the linux machine will be added as a linux repository, and the /mnt/veeamrepo will be used as the repository space:

[Image: Linux repo ceph]

Online expansion

One of the nice things about Ceph, as I have stressed from the beginning, is the scale-out nature of the solution. In a future article I will show you how to add a new node and expand the pool, but even inside an existing cluster it is useful to know how to expand an existing RBD image. Block devices are always thin-provisioned, and they only consume the space that is actually used. It’s pretty easy to expand a block device to consume additional space available on the Ceph cluster; remember that Ceph can technically be oversubscribed, but you should always check actual consumption before ending up in a risky situation. It takes just two commands from the admin machine to check the available space. First, the physical space:

[ceph@ceph-admin ceph-deploy]$ ceph -s
cluster aa4d1282-c606-4d8d-8f69-009761b63e8f
health HEALTH_OK
356 MB used, 899 GB / 899 GB avail

Second, the space assigned to the RBD image:

[ceph@ceph-admin ceph-deploy]$ rbd --image veeamrepo info
rbd image 'veeamrepo':
size 20480 MB in 5120 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.10e9.2ae8944a
format: 1

Finally, the real consumption of the thin image:

[ceph@ceph-admin ceph-deploy]$ rbd diff rbd/veeamrepo | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
14.4062 MB

As you can see, the repository has not been used yet, so it consumes just 14 MB out of the 20 GB assigned. Also, remember Ceph uses replicas to protect blocks, so as long as the replication factor is configured to keep 2 copies of each block, a 20 GB device will, when fully loaded, actually consume 40 GB on the cluster. The 900 GB available in my cluster will allow at most 450 GB of block devices, unless I go for oversubscription.
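The arithmetic behind that estimate, as a quick sketch (raw capacity taken from ceph -s, replica count 2 as in this lab):

```shell
# Sketch: usable block-device capacity is roughly raw capacity / replica count
raw_gb=899      # "899 GB avail" reported by `ceph -s`
replicas=2      # copies kept of each object in this lab
echo $(( raw_gb / replicas ))   # prints 449 (GB), before any oversubscription
```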

Say I now want to expand my block device. The command on the admin node is simply:

rbd --pool=rbd --size=51200 resize veeamrepo

Where 51200 MB (50 GB) is the new desired size. After this, on the linux repository server you can grow the filesystem on the fly with:

xfs_growfs /mnt/veeamrepo

The change will immediately be reflected in the linux machine, and after a rescan also in the Veeam repository:

[Image: Linux repo ceph resize]

  • Sumit Gaur

Great write-up, very clean steps and up to date on the concept. Thanks for sharing this.

  • emik0

Thanks for the article! Can you please also describe best practice for accessing data on all nodes? For example I’d like to create repo2 and repo3 hosts and connect them via rbd to the same pool, but all I get is that data is updated on rbd only after a remount.

  • Boris Markov

Great work! Can you also tell why it’s not recommended to run the RBD client on the same machines where the Ceph cluster is running?

    • Some libraries are shared between the client and the server, so updating or changing configurations to one of them can impact also the other one.

  • Niels Maumenee

Just a note to anyone who happens to find that they are unable to complete the step “map the RBD device in the linux repository server” due to permissions errors. Check and make sure you do not already have the rbd module loaded (check with lsmod). I happened across this error by combining the ceph instructions http://docs.ceph.com/docs/master/start/quick-rbd/#install-ceph with this very helpful blog post.

  • Webber

    Hi Luca,

Thanks for sharing this about backing up the Ceph datastore via RBD. As far as I am aware, in your illustration the Ceph datastore is exclusively mounted on the datamover where your backup client is installed and running. Let’s say I have a production Ceph cluster to which my OpenStack Compute cluster is connected. I also have another Ceph cluster configured and enabled as a backup datastore where the production datastore is being backed up to. What would be the architecture of backing up the production datastore over to the backup one? Can a server running as a datamover mount the production datastore in the way of split-mirroring, so that backups can be made without imposing storage downtime on OpenStack? Is this way of making off-host backups officially supported by Veeam?