My adventures with Ceph Storage. Part 9: failover scenarios during Veeam backups


Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 3: Design the nodes
Part 4: deploy the nodes in the Lab
Part 5: install Ceph in the lab
Part 6: Mount Ceph as a block device on linux machines
Part 7: Add a node and expand the cluster storage
Part 8: Veeam clustered repository
Part 10: Upgrade the cluster

In Part 8, I showed you how to create the clustered front-end for our repository. In this part, we’ll look at different failover scenarios and at what happens to running Veeam jobs.

Configure the cluster in Veeam

Connecting the new repository in Veeam is as easy as usual; the only difference is that you point to the virtual IP or the virtual hostname instead of a single physical node. When we configure it as a Linux repository, Veeam first shows us the SSH fingerprint:

Ssh fingerprint

There’s a simple trick to verify that both servers expose the same public SSH key. On a Linux or Mac computer, connect to both nodes via SSH. On each first connection, the SSH client asks us to trust the host key and then stores it in the known_hosts file. By looking at this file with a command like this:

cat /Users/luca/.ssh/known_hosts | grep -i ssh-rsa
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPoLdOCqm9QA+133DsNuc5yKUjRqGR9dU/TuB7BIE5sMIqxEUeZI1N9TWLdXyhYPk1dET/g/SAYdozF1Bf5qw/vwaiv2Dw5KNe39JkePriVp8//Ceod9XEpJ+Y6TxRe4d6+/1ypGsW6sMflFetxdBwtmQzkymrdaoQ9atrdd5b8cw+ft+cONRBw0Eln4KAKQnEuhwM0/pK5UUPExdL4LkmNGM1MJ3oWUurBfb+Mtk5KywuWp5M1V9bwrdFN2dn/pHCaF8xN/h85/lptV++skTr0RgfUsy5MQkgGX9pI01Mw9XXsBL+2RNuWkGDkbeGJSwufpVC8P9fiIL07+/z9Dz/
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPoLdOCqm9QA+133DsNuc5yKUjRqGR9dU/TuB7BIE5sMIqxEUeZI1N9TWLdXyhYPk1dET/g/SAYdozF1Bf5qw/vwaiv2Dw5KNe39JkePriVp8//Ceod9XEpJ+Y6TxRe4d6+/1ypGsW6sMflFetxdBwtmQzkymrdaoQ9atrdd5b8cw+ft+cONRBw0Eln4KAKQnEuhwM0/pK5UUPExdL4LkmNGM1MJ3oWUurBfb+Mtk5KywuWp5M1V9bwrdFN2dn/pHCaF8xN/h85/lptV++skTr0RgfUsy5MQkgGX9pI01Mw9XXsBL+2RNuWkGDkbeGJSwufpVC8P9fiIL07+/z9Dz/

we will see that both nodes are using the same SSH key.
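The same check can be scripted instead of comparing the keys by eye. This is only a minimal sketch: it assumes the two nodes are reachable under the names repo1 and repo2 used in this post, and that ssh-keyscan (part of OpenSSH) is installed on the workstation. The helper name count_distinct_keys is made up for this example.

```shell
# Count how many distinct host keys appear in known_hosts-style lines read
# from stdin. Each line is "<host> <key-type> <base64-key>", so fields 2 and 3
# identify the key regardless of which host presented it.
count_distinct_keys() {
    awk 'NF >= 3 && $1 !~ /^#/ { print $2, $3 }' | sort -u | wc -l | tr -d ' '
}

# Usage (not run here, it needs network access to both nodes):
#   ssh-keyscan -t rsa repo1 repo2 2>/dev/null | count_distinct_keys
# A result of 1 means both nodes expose the same RSA host key.
```

If the result is 2, the nodes present different keys, and Veeam would see a changed fingerprint after a failover.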

Moving on in the wizard, the active node (repo2 in my case) will allow us to connect and see the available space of the RBD block device:

Ceph repo cluster

If for any reason I fail over the active node to repo1 and rescan the repository, Veeam completes the operation successfully:

Ceph repo rescan

This is obviously a static situation, where no activity is running. It is far more interesting to see what happens when a failover occurs while a Veeam backup job is running.

Run the first backup

As I’ve shown in Part 5, a Ceph cluster can continue its operations even when a node fails, as long as enough surviving nodes can handle all the objects of the cluster itself. To test this failover in a “pseudo-production” environment, I first configured a backup job that uses the Ceph cluster as its repository. I’m backing up a single 50 GB VM with no additional guest processing. The backup mode is forever forward incremental, but I run an active full backup every time and delete the backup file after each run to keep the Ceph cluster empty. First, I executed the job without failing any cluster component, just to check its behaviour:

Ceph job standard

Ignore the performance: my Ceph cluster is made of VMs running on the same storage array as the rest of my infrastructure, including the protected VM and the Veeam components executing the job. What’s interesting is the changes happening in the Ceph cluster. You can use ceph -w to monitor the cluster in real time, but since the job is going to last for a while, your shell buffer may not be big enough; alternatively, you can open the log afterwards on one of the MON servers. In my case, I retrieved the interesting parts from mon1, reading the log saved in /var/log/ceph/ceph-mon.mon1.log. You just need to filter the lines by searching for “pgmap”:
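One possible way to do this filtering is a plain grep on the monitor log. This is a sketch, not necessarily the exact command I used: the helper name pgmap_lines is made up for this example, and the default path is the mon1 log mentioned above.

```shell
# Print only the pgmap status lines from a monitor log.
# The default path is the mon1 log used in this post; pass another file as $1.
pgmap_lines() {
    grep 'pgmap' "${1:-/var/log/ceph/ceph-mon.mon1.log}"
}

# Usage on the monitor node:
#   pgmap_lines | less
#   pgmap_lines /var/log/ceph/ceph-mon.mon1.log | tail -n 20
```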

This is the first file ever to hit the cluster, so the volume was using just 50 MB for some filesystem metadata. As the job starts, data is ingested by the cluster up to 8.5 GB, which is the final size of the backup file:

The used size is double, since I’ve configured Ceph with a replication factor of 2, that is, 2 copies of each block/object. Finally, for the entire duration of the job the cluster stayed in the active+clean state, which means all OSDs were up and running and contributing to the overall cluster.
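The doubling is simple arithmetic: raw space used equals the logical size multiplied by the number of replicas. A small sketch, with a helper name (raw_usage_gb) invented for this example:

```shell
# Raw space consumed = logical data size x number of replicas.
# $1 = logical size in GB, $2 = replication factor
raw_usage_gb() {
    # LC_ALL=C keeps the decimal separator a dot regardless of the locale.
    LC_ALL=C awk -v size="$1" -v copies="$2" 'BEGIN { printf "%.1f\n", size * copies }'
}

raw_usage_gb 8.5 2   # 8.5 GB backup file, 2 replicas: prints 17.0
```

With a replication factor of 3, the same backup file would consume 25.5 GB of raw space.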

Back-End failure

Once the cluster has been tested in a stable scenario, let’s see what happens when there’s a failure in the back-end. I repeated the exact same job, again as an active full, and in the middle of the job I stopped one of the OSD nodes, osd3 in my case, for 15 minutes:

Ceph job backend

In this Veeam report, you just see the job completing successfully as before, only taking longer. This is, first and foremost, proof that the failure in the back-end of the Ceph cluster was invisible to the front-end! But what happened on the Ceph cluster? The job started at 11:00, and the same log as before was recorded:

At 11:10, I powered down the node osd3, and immediately the monitor nodes traced this:

All three OSDs from osd3 were now missing, and the placement groups were no longer in the active+clean state:

The cluster was degraded by 25%, precisely because 1 node out of 4 was missing. For the following minutes, the rest of the log was a mix of information like this:

Basically, Ceph was already rebalancing the cluster, now down from 1200 GB to 900 GB of available space, by replicating the unprotected objects to other OSDs. There was still data activity on the nodes, this time created both by the Veeam backup job and by the background resync of the cluster. At 11:25 the cluster was again completely balanced even with a missing node:

At 11:25, I powered osd3 on again:

And again, Ceph started to rebalance the cluster. I’ll skip another bunch of log lines this time; just look at this one:

The size of the cluster was back to 1200 GB, and Ceph was again using the entire available space to balance and protect all 256 placement groups. By the end of the backup job, the state was back to normal:

So, as long as the surviving nodes still hold a copy of every placement group and have enough space to re-protect the data, a Ceph cluster can survive failures of entire nodes and still serve the front-end!
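The capacity side of this test can be checked with back-of-the-envelope arithmetic. The sketch below uses the numbers from my lab (1200 GB across 4 equally sized nodes, roughly 17 GB of raw data, that is 8.5 GB times 2 replicas, rounded because shell arithmetic is integer-only). The function names are invented for this example, and the check deliberately ignores Ceph’s own full ratios and CRUSH placement rules, so treat it as a rough estimate only.

```shell
# Capacity left when some equally sized nodes are down.
# $1 = total raw capacity in GB, $2 = total nodes, $3 = failed nodes
remaining_capacity_gb() {
    echo $(( $1 * ($2 - $3) / $2 ))
}

# Succeeds if the surviving nodes can hold all raw data (replicas included).
# $1 = total raw capacity in GB, $2 = total nodes, $3 = failed nodes,
# $4 = raw data stored in GB
can_absorb_failure() {
    [ "$(remaining_capacity_gb "$1" "$2" "$3")" -ge "$4" ]
}

remaining_capacity_gb 1200 4 1    # prints 900
if can_absorb_failure 1200 4 1 17; then
    echo "3 nodes can hold 17 GB of raw data"
fi
```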

Front-End failure

The previous test would also have been possible with a single front-end mounting the Ceph storage. Now it’s time to test a front-end failure. Again, the job was started as an active full, and the Ceph cluster was exposed to Veeam from the Linux front-end repo1:

Ten minutes after the start of the Veeam job, I powered down repo1. Both the virtual IP and the mount point of the Ceph cluster failed over to repo2:

The job, however, failed, as the Veeam binaries running on repo1 were no longer able to see the back-end storage:

Ceph job frontend fail

For this job, I configured a schedule, so the default option is to have 3 retries every 10 minutes:

Ceph job retry
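To make the retry behaviour concrete, here is a generic sketch of a "retry N times, waiting between attempts" loop. It only illustrates the semantics of the setting; it is not how Veeam implements it, and run_with_retries and some_backup_command are hypothetical names used for this example.

```shell
# Run a command, retrying on failure.
# $1 = max retries, $2 = wait in seconds between attempts, rest = command
run_with_retries() {
    max_retries=$1; wait_seconds=$2; shift 2
    attempt=0
    until "$@"; do
        attempt=$((attempt + 1))
        if [ "$attempt" -gt "$max_retries" ]; then
            echo "giving up after $max_retries retries" >&2
            return 1
        fi
        echo "attempt failed, retry $attempt of $max_retries in ${wait_seconds}s" >&2
        sleep "$wait_seconds"
    done
}

# The defaults described above (3 retries, 10 minutes apart) would be:
#   run_with_retries 3 600 some_backup_command
```

The important point for the failover test is that each retry starts from scratch, so it connects to whichever node currently owns the virtual IP.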

At the next retry cycle, the job restarted and completed successfully using repo2. In fact, you can see the Veeam executables among the running processes:

This is the main reason to use this kind of failover: Veeam binaries are deployed on Linux at runtime and executed directly from the /tmp folder. Because they are not permanently deployed and registered as daemons, there’s no way to “clusterize” them. With this configuration, however, a running job may fail on a clustered front-end, but each following retry will be attempted on the other node. If you do not change the default retry configuration of your jobs, all of them will complete even after a front-end failure. My job had just one VM in it, so it was retried in full; in regular jobs with multiple VMs, only the VMs that were not completed are retried, since the VMs already processed are stored in the backup file.