My adventures with Ceph Storage. Part 3: Design the nodes


In previous parts of this series, I gave you an introduction to what Ceph is, and an overview of its architecture. Now it’s time to start designing the Lab and its servers.

Also available in this series:
Part 1: Introduction
Part 2: Architecture for Dummies
Part 4: deploy the nodes in the Lab
Part 5: install Ceph in the lab
Part 6: Mount Ceph as a block device on linux machines
Part 7: Add a node and expand the cluster storage
Part 8: Veeam clustered repository
Part 9: failover scenarios during Veeam backups
Part 10: Upgrade the cluster


The OSD Servers

Sometimes Linux users take the concept of “runs on any x86 machine” too literally, using old servers, laptops, and generally underpowered devices, and then blame the operating system. As many Linux admins say, “the fact that you can doesn’t mean you should!”. Linux is indeed a powerful operating system, but it does require adequate resources to run properly. Ceph, being Linux software, follows the same principles, even more so if you ask me, since it has to deal with storage and therefore a lot of I/O operations. For these reasons, properly sizing the OSD servers is mandatory!

Ceph has a nice webpage about Hardware Recommendations, and we can use it as a great starting point. As explained in Part 2, the building block of RBD in Ceph is the OSD. A single OSD should ideally map to a disk, an SSD, or a RAID group; in general, to a single block device as seen by the Linux server. In my view, creating RAID groups locally on each server of a scale-out solution like Ceph makes no sense: redundancy is achieved by replicating blocks across different locations, so RAID is not only redundant, but also a waste of disk space and of the money spent on a RAID controller. A good HBA adapter (I said good, please don’t save money on this component!) is enough.

I’m going to have an OSD server designed like this:

OSD Server design

Each server will have 5 disks. In production deployments, the number of OSD disks will probably be much higher, using JBOD machines with 12 or more disks to optimize costs. Also, each OSD disk would be 1 TB or more, while in my lab, because of space constraints, I will use this sizing:
OS Disk: 20 GB
Journal:  30 GB
OSD:    100 GB (* 3)

Each disk will be formatted with XFS: since btrfs is not yet stable enough to run Ceph, XFS will be my filesystem.

The journal disk

The journal disk deserves a dedicated section, since configuring it properly is vital to guarantee good performance. Ceph OSDs use a journal for two reasons: speed and consistency.

From the Ceph Documentation:
Speed: The journal enables the Ceph OSD Daemon to commit small writes quickly. Ceph writes small, random I/O to the journal sequentially, which tends to speed up bursty workloads by allowing the backing filesystem more time to coalesce writes. The Ceph OSD Daemon’s journal, however, can lead to spiky performance with short spurts of high-speed writes followed by periods without any write progress as the filesystem catches up to the journal.

Consistency: Ceph OSD Daemons require a filesystem interface that guarantees atomic compound operations. Ceph OSD Daemons write a description of the operation to the journal and apply the operation to the filesystem. This enables atomic updates to an object (for example, placement group metadata). Every few seconds the Ceph OSD Daemon stops writes and synchronizes the journal with the filesystem, allowing Ceph OSD Daemons to trim operations from the journal and reuse the space. On failure, Ceph OSD Daemons replay the journal starting after the last synchronization operation.

Without performance optimization, Ceph stores the journal on the same disk as the Ceph OSD Daemon’s data. This can become a bottleneck: when a write has to be destaged from the journal to the OSD, the same disk has to read from the journal and write to the OSD, thus doubling the I/O. For these reasons, the usual suggestion is to use a separate disk for the journal, typically an SSD for better performance, and that’s what I will do in my lab too, by creating a dedicated vmdk and placing it on an SSD datastore.
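In ceph.conf this is wired up with the filestore-era journal options; a minimal sketch, where the partition path (a GPT partition label on the SSD vmdk) is an assumption of this lab, not a default:

```ini
[osd]
; journal size in MB (10 GB per journal in this lab)
osd journal size = 10240

[osd.0]
; point this OSD's journal at a partition on the SSD disk
; (by-partlabel path is an assumed naming scheme for the lab)
osd journal = /dev/disk/by-partlabel/journal-1
```

If `osd journal` is not set, the journal defaults to a file inside the OSD's data directory, which is exactly the co-located layout described above.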

For its size, many parameters have to be considered, like the speed of the OSD disks backed by the journal, the expected I/O, and so on. As a starting point, I will assign 10 GB to each journal. Since I will have 3 OSDs, the SSD vmdk will be sized at 30 GB, and on it I will create 3 partitions, one for each journal.
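The Ceph documentation also gives a lower-bound formula: the journal should be at least twice the product of the expected disk throughput and `filestore max sync interval`. A quick sketch of the arithmetic, with an assumed 100 MB/s spinning disk and the default-ish 5 second sync interval:

```shell
#!/bin/sh
# Rule of thumb from the Ceph docs:
#   journal size >= 2 * (expected throughput * filestore max sync interval)
throughput_mb_s=100   # assumed sequential write speed of the OSD disk, MB/s
sync_interval_s=5     # filestore max sync interval, seconds
journal_mb=$((2 * throughput_mb_s * sync_interval_s))
echo "minimum journal size: ${journal_mb} MB"
```

With these numbers the minimum comes out around 1 GB, so the 10 GB per journal chosen here leaves plenty of headroom for burstier workloads.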

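The three journal partitions can be laid out with parted; this sketch only prints the commands for review (the device name /dev/sdX is a placeholder, and an initial `parted -s /dev/sdX mklabel gpt` is assumed to have been run first):

```shell
#!/bin/sh
# Generate parted commands for 3 equal journal partitions on a 30 GB SSD
disk_gb=30
parts=3
size_gb=$((disk_gb / parts))
i=1
while [ "$i" -le "$parts" ]; do
  start=$(( (i - 1) * size_gb ))
  end=$(( i * size_gb ))
  echo "parted -s /dev/sdX mkpart journal-$i ${start}GB ${end}GB"
  i=$((i + 1))
done
```

Naming the partitions journal-1..3 makes them addressable via /dev/disk/by-partlabel/, which keeps the ceph.conf journal paths stable across reboots.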
CPU and RAM

Following Ceph recommendations, you will need 1 GB of RAM for each TB of OSD space. In my lab, I’m going to have 300 GB of OSDs, so technically I would need only 300 MB of RAM. Accounting for the Linux OS as well, I will use 2 GB of RAM.
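The conversion is easy to get wrong, so here is the arithmetic spelled out: 1 GB of RAM per TB of OSD works out to 1 MB per GB.

```shell
#!/bin/sh
# Ceph rule of thumb: 1 GB RAM per TB of OSD space == 1 MB RAM per GB
osd_gb=300                 # total OSD capacity on this node, in GB
ram_mb=$((osd_gb * 1))     # 1 MB of RAM per GB of OSD
echo "minimum OSD RAM: ${ram_mb} MB"
```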

For the CPU, I will assign 2 cores to each Linux server.

MON servers

As you’ve seen, most of the sizing considerations were made for the OSD servers. MON (Monitor) servers are important too, but their needs are lower. For my lab, these 3 servers will use 2 cores and 2 GB of RAM like the OSDs, but only the Linux OS disk, slightly increased in size to 30 GB.

In production environments, you will have to take into consideration the load on the monitors for their proper sizing: a high volume of activity means the MONs will carry a fair load too, because of the operations they run, and especially during rebalancing the load will be quite notable. Again, the webpage about Hardware Recommendations is your best resource for this information.

Networking

A scale-out cluster uses its network a lot. For each I/O operation, especially writes, multiple nodes are involved. As you can understand from the block placement depicted in Part 2, when a write is stored in the cluster, multiple blocks on different nodes are involved. Also, nodes have to constantly rebalance their content and replicate blocks.

For these reasons, in production environments a 10 Gbit network is becoming really common for Ceph deployments. In my lab I’m not planning to have intensive I/O, so my 1 Gbit network will be enough.

Finally, each node will have two network connections: one will be the frontend, used by the hosts consuming the storage resources of the cluster, and the other will be the backend, used by the Ceph nodes to replicate to each other. This way, there will be no bottleneck limiting or harming the replication activities of the cluster.
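Ceph maps this split onto its `public network` and `cluster network` settings in ceph.conf; a sketch, where the two subnets are assumptions for this lab:

```ini
[global]
; frontend: clients and hosts reach the MONs and OSDs here
public network = 192.168.1.0/24
; backend: OSD-to-OSD replication and heartbeat traffic
cluster network = 192.168.2.0/24
```

If `cluster network` is left unset, replication traffic simply shares the public network, which is exactly the bottleneck this two-NIC design avoids.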

Final notes

Now that we have decided how to build the different servers, and defined the network topology, it’s finally time to build the lab. In the next chapter we will see how.

  • Stefan Eriksson

    I see you dont use a raid1 for OS do you recommend to use this on a production system?

    • Hi Stefan,
I’ve built the series using VMs, so indeed you’d want some RAID for the OS. To be honest, I usually prefer a hardware RAID1 for the OS rather than software RAID.

  • Webber

I wonder if there’s a sizing formula or rule of thumb for sizing OSDs in relation to the size of the journaling disks. It appears to me the size of the journaling disks is at least 10% of that of the OSDs, isn’t it? And that piqued my interest, as the proportion VMware recommends for journaling versus anticipated consumed storage capacity is coincidentally the same as the one you referred to – 10%.

• Yep, reading around it seems that in Ceph too 10% is a good ratio for the journal; my guess is that the working set of many loaded virtual machines has this size, so when dealing with OpenStack, for example, 10% is a good rule of thumb. Maybe for other usages like ingesting backups you can increase it even more, but then it also becomes a matter of the total price of the nodes.

  • Thelo Gaultier

    Hi Lucas, How would you determine how many OSD’s journal one SSD should take care of?