It’s 2016: and BTRFS could really be your next filesystem

Switching to a new filesystem is never a task undertaken lightly. We all have our own trusted, good old filesystem, which may be limited in features and performance but has never let us down. New filesystems are available, and they promise wonderful things. But as much as we are fascinated by them, the big question, “Should I trust it?”, comes to mind as soon as we start thinking about moving to a new one. In Linux, this question arises every time BTRFS is involved.

A new filesystem that is not that new anymore

Btrfs is aging

I loved this screenshot as soon as I saw it. The guy here is Chris Mason; he now works at Facebook (a heavy user of BTRFS, by the way), and he started developing BTRFS around 2007, back when he was working for Oracle. So it’s already 8 years old and, as the picture says, it’s not only “not new anymore”, it’s aging 🙂

There are many misconceptions around BTRFS on the Internet. Some come from real problems the filesystem had in its early days, others from the fact that people usually don’t check the date of the information they read. Yes, BTRFS was really unstable at the beginning, but if you read about huge data corruption problems in a blog post written in 2010, well, maybe things have changed since then.

The most important part of a file system is its on-disk format, that is, the format used to store data on the underlying media. The BTRFS on-disk format is no longer unstable, and it’s not expected to change unless there are strong reasons to do so. This alone should be enough to tell people that BTRFS is stable.

So, why is it considered unstable by many?

There are a few reasons. First, as I said, people are scared of change when it comes to filesystems. Why change from a trusted and known one to something new? And it’s not just Linux: the same is happening at Microsoft with its move from NTFS to ReFS. But I’ve always seen a paradox here: fine for XFS, which has 20 years of stable development behind it, but ext4, the “trusted” default file system, was developed as a fork of ext3 in 2006. So it’s just 1 year older than BTRFS!

The second reason is probably the fast development cycle. While the on-disk format is finalized, the code base is still under heavy development to keep improving performance and introducing new features. This, together with management tools that were stabilized only recently, made people think that the entire project wasn’t stable.

Confirmations come from the field

The screenshot I took comes from this video.

https://www.youtube.com/watch?v=W3QRWUfBua8

First important note: it’s not another ranting post from 8 years ago, it’s Chris Mason himself speaking at NYLUG in May 2015, so just a few months ago. And the examples he brings are the best proof of BTRFS’s maturity: Facebook uses the filesystem to store production data. The nice part is that, precisely because BTRFS is used in production at Facebook, the sheer amount of storage involved helps in testing and fixing the code at a pace that wouldn’t be possible in smaller installations.

And if you watch the video, you’ll see how some real heavyweights in the industry are supporting and working to improve BTRFS: Facebook, SuSE, RedHat, Oracle, Intel… And the results are showing: starting with SuSE Linux Enterprise Server 12, released in October 2014, BTRFS has become the default file system of this distribution. Kudos to the guys at SuSE, because the best way to push its adoption is surely to make a statement like this: “we are a for-profit company, not a group of Linux geeks, and we trust this filesystem to the point that it’s going to become our default one”.

Why is BTRFS awesome?

Ok, so BTRFS is stable enough to be trusted. Or at least I trust it, together with people whose judgement carries far more weight than mine, like the Facebook and SuSE Linux experts. At this point, if you still don’t trust it, stop reading this post and keep using ext4 or xfs, no problem.

But if you are thinking “maybe I can use BTRFS on my next Linux deployment”, why should you consider it? Well, because it has some great features! The page linked at the beginning has the complete list; here I’m going to cover the ones I like the most.

BTRFS has been designed from the beginning to deal with modern data sources, and in fact it is able to manage modern large hard disks and large disk groups, up to 2^64 bytes. That number means a maximum file size and file system size of 16 EiB, and yes, the E stands for the exa-scale (strictly speaking EiB is the exbibyte, the binary cousin of the exabyte). This is possible thanks to the way it consumes space: other file systems use disks in a contiguous manner, layering their structure in a single space from the beginning to the end of the disk. This makes the rebuild of a disk, especially a large one, extremely slow, and there’s also no internal protection mechanism, as each disk is seen as a single entity by the filesystem itself.
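
To put the 2^64 figure in perspective, here is a quick back-of-the-envelope check in Python. It is only arithmetic on the number quoted above; the constant names are mine, not anything taken from BTRFS.

```python
# Back-of-the-envelope check of the 2^64 byte limit quoted above.
# Constant names are illustrative only.
MAX_BYTES = 2 ** 64   # theoretical maximum size, in bytes
PIB = 2 ** 50         # one pebibyte (binary unit)
EIB = 2 ** 60         # one exbibyte (binary unit)

print(MAX_BYTES // EIB)               # 16
print(MAX_BYTES // PIB)               # 16384
print(round(MAX_BYTES / 10 ** 18, 2)) # 18.45 (decimal exabytes)
```

So 16 EiB is a little over 18 decimal exabytes, which is why the two units should not be mixed up when comparing filesystem limits.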

BTRFS instead uses “chunks”. Each disk, regardless of its size, is divided into pieces (the chunks) that are either 1 GiB in size (for data) or 256 MiB (for metadata). Chunks are then grouped into block groups, with each chunk of a group stored on a different device. The number of chunks used in a block group depends on its RAID level. And here comes another awesome feature of BTRFS: the volume manager is directly integrated into the filesystem, so it doesn’t need anything like hardware or software RAID, or volume managers like LVM. Data protection and striping are done directly by the filesystem, so you can have different volumes that have inner redundancy:

Btrfs allocation tree

For example, Block group 2 is configured for RAID1 redundancy. So, a chunk is consumed on Disk1, and its mirror is stored on another device, Disk2 in the picture. In this way, if we lose Disk1, another copy of the block is still available on Disk2, and a new copy can be immediately recreated, for example on Disk3, using a free chunk. You can configure BTRFS for file striping, file mirroring, file striping plus mirroring, and striping with single or dual parity.
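
The RAID1 example above can be sketched as a toy model in Python. This is not BTRFS code: the `Pool`, `Device` and `BlockGroup` names are hypothetical, and the rule used to pick devices (most free chunks first) is an assumption made for the illustration. It only shows the idea of a block group holding one chunk on each of two devices, and of a mirror being recreated on a third device when one is lost.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    free_chunks: int  # number of unallocated 1 GiB data chunks


@dataclass
class BlockGroup:
    profile: str
    devices: list  # names of the devices holding one chunk (copy) each


class Pool:
    """Toy pool of devices; hypothetical model, not the real allocator."""

    def __init__(self, devices):
        self.devices = {d.name: d for d in devices}
        self.block_groups = []

    def _pick(self, count, exclude=()):
        # Assumed policy: prefer devices with the most free chunks.
        candidates = [d for d in self.devices.values()
                      if d.free_chunks > 0 and d.name not in exclude]
        candidates.sort(key=lambda d: (-d.free_chunks, d.name))
        if len(candidates) < count:
            raise RuntimeError("not enough devices with free chunks")
        return candidates[:count]

    def allocate_raid1(self):
        # RAID1: one chunk on each of two different devices.
        chosen = self._pick(2)
        for dev in chosen:
            dev.free_chunks -= 1
        group = BlockGroup("RAID1", [dev.name for dev in chosen])
        self.block_groups.append(group)
        return group

    def fail_device(self, name):
        # Drop the device, then re-mirror every block group that used it.
        del self.devices[name]
        for group in self.block_groups:
            if name in group.devices:
                group.devices.remove(name)
                (replacement,) = self._pick(1, exclude=group.devices)
                replacement.free_chunks -= 1
                group.devices.append(replacement.name)


pool = Pool([Device("Disk1", 4), Device("Disk2", 4), Device("Disk3", 4)])
group = pool.allocate_raid1()
print(group.devices)   # ['Disk1', 'Disk2']
pool.fail_device("Disk1")
print(group.devices)   # ['Disk2', 'Disk3']
```

The point of the sketch is that redundancy is decided per block group, not per disk, which is what allows the filesystem to rebuild only the chunks that were actually in use.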

Another aspect of BTRFS is its performance. Because of its modern design and its b-tree structure, BTRFS is damn fast. If you haven’t already, watch the video above starting from 30:30. They ran a test against the same storage, formatted in turn with XFS, EXT4 and BTRFS, writing around 24 million files of different sizes and layouts. XFS took 430 seconds to complete the operations and was bound by its log system; EXT4 took 200 seconds, and its limit comes from the fixed inode locations. Both limits are the result of their design, and overcoming them was one of the original goals of BTRFS. Did they succeed? The same test took 62 seconds on BTRFS, and the limit was the CPU and memory of the test system, while both XFS and EXT4 were able to use only around 25% of the available CPU because they quickly became IO bound.
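
For those who like ratios more than raw seconds, the timings quoted above work out as follows. The numbers are the ones from the talk as reported in this post; the script is just arithmetic on them.

```python
# Wall-clock times (seconds) for the same test, as quoted above.
timings = {"XFS": 430, "EXT4": 200, "BTRFS": 62}


def speedup(slower: str, faster: str) -> float:
    """How many times faster `faster` completed the test than `slower`."""
    return timings[slower] / timings[faster]


print(f"{speedup('XFS', 'BTRFS'):.1f}x")   # 6.9x
print(f"{speedup('EXT4', 'BTRFS'):.1f}x")  # 3.2x
```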

Other features are worth a mention:
– Writable and read-only snapshots
– Checksums on data and metadata (crc32c): this is great in my view, as every stored block is verified when it is read, so data corruption is detected immediately and can be corrected whenever a redundant copy is available
– Compression (zlib and LZO)
– SSD (Flash storage) awareness: another sign of a modern filesystem. BTRFS identifies SSD devices, and changes its behaviour automatically. First, it uses TRIM/Discard for reporting free blocks for reuse, and also has some optimisations like avoiding unnecessary seek optimisations, sending writes in clusters, even if they are from unrelated files. This results in larger write operations and faster write throughput.
– Background scrub process for finding and fixing errors on files with redundant copies
– Online filesystem defragmentation. Being a COW (copy-on-write) filesystem, each time a block is updated the block itself is not overwritten but written to a different location on the device, leaving the old block in place. If the old block at some point is not needed anymore (for example if it’s not part of any snapshot) BTRFS marks the space as available and ready to be reused.
– In-place conversion of existing ext3/4 file systems
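
To make the checksum and scrub items above a bit more concrete, here is a minimal sketch in Python. The `crc32c` function is a plain table-driven implementation of the CRC-32C (Castagnoli) checksum; the `MirroredBlock` class and its `scrub` method are hypothetical names of mine and only mimic the idea of repairing a bad copy from a good one. This is not how BTRFS is implemented internally.

```python
def _build_table():
    # Lookup table for CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes) -> int:
    """CRC-32C of `data` (initial value and final XOR are 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


class MirroredBlock:
    """Hypothetical block stored twice, with one checksum kept aside."""

    def __init__(self, payload: bytes):
        self.copies = [bytearray(payload), bytearray(payload)]
        self.checksum = crc32c(payload)

    def scrub(self) -> int:
        """Rewrite every corrupted copy from an intact one; return the count."""
        good = next((c for c in self.copies
                     if crc32c(bytes(c)) == self.checksum), None)
        if good is None:
            raise IOError("no intact copy left, cannot repair")
        repaired = 0
        for index, copy in enumerate(self.copies):
            if crc32c(bytes(copy)) != self.checksum:
                self.copies[index] = bytearray(good)
                repaired += 1
        return repaired


block = MirroredBlock(b"hello btrfs")
block.copies[0][0] ^= 0xFF   # simulate silent corruption of one copy
print(block.scrub())         # 1
print(block.scrub())         # 0
```

Note what happens when no intact copy exists: the corruption is still detected, but it cannot be repaired. That is why the scrub item in the list talks about files with redundant copies.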

Final notes

Like any technology, BTRFS is not perfect. For example, it suffers when there is heavy write activity in the middle of existing files, so it’s probably not the best candidate for virtualization (virtual disks are updated in place at each write). But as always, you have to decide if the features available in a given technology are worth the migration to it, and if the (few) limits are going to affect you.
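
Why do writes in the middle of a file hurt a copy-on-write filesystem? A toy model in Python gives the intuition. The `CowFile` class is hypothetical and greatly simplified (real BTRFS works with extents and a much smarter allocator); it only shows that each rewritten block lands somewhere new, so a file that started as one contiguous run ends up in several fragments.

```python
def count_fragments(mapping):
    """Number of contiguous physical runs in a logical-to-physical block map."""
    if not mapping:
        return 0
    breaks = sum(1 for a, b in zip(mapping, mapping[1:]) if b != a + 1)
    return 1 + breaks


class CowFile:
    """Toy copy-on-write file: every overwrite goes to a fresh block."""

    def __init__(self, n_blocks: int):
        # Initially written sequentially: logical block i is physical block i.
        self.mapping = list(range(n_blocks))
        self.next_free = n_blocks

    def overwrite(self, logical: int) -> None:
        # The old physical block is left alone; the new data goes elsewhere.
        self.mapping[logical] = self.next_free
        self.next_free += 1


vm_disk = CowFile(8)
print(count_fragments(vm_disk.mapping))  # 1
vm_disk.overwrite(3)
print(count_fragments(vm_disk.mapping))  # 3
vm_disk.overwrite(5)
print(count_fragments(vm_disk.mapping))  # 5
```

A virtual disk image receives exactly this kind of small random rewrite all day long, which is where the online defragmentation mentioned earlier becomes relevant.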

For all these reasons, I’m definitely going to use BTRFS more and more in my next Linux deployments.

  • Gian Domenico Bonazzoli

    Well done ! But…. what about btrfs as the internal file system for ceph ?

    • For sure BTRFS is considerably faster than xfs when used in a Ceph RBD volume, last year I wrote to use xfs, but if I had to rebuild my Ceph cluster today, I’d try at least a couple of volumes with BTRFS.

  • ErikTheRed

    I’m not sure it’s done cooking yet. We stress-tested it on a secondary backup server that is running Ubuntu 12.04 LTS and uses rsync and hard links to maintain space-efficient daily file backups. This is not hugely storage-intensive (physical volume size is ~ 4TB), but it is massively metadata intensive – millions of file entries per day, most of which are unchanged. We had had some metadata issues with EXT4 and wanted to see how BTRFS would handle this in production.

    Short story: BTRFS fails spectacularly in this environment. File system operations slow to an abysmal crawl. Unfortunately I can’t do an exact apples-to-apples comparison with EXT4, but deleting a day’s worth of old data takes at least 8 times longer. The worst part is filesystem maintenance. “btrfs check” would keep crashing due to lack of RAM. This was running on a generously provisioned host, so I was able to keep upping the guest RAM until we could complete the operation – with a mere 220 GiB of RAM required and just over 5.5 hours of wall clock time with a reasonably fast, dedicated SAN physical volume (no major IO latency).

    Granted this is not a normal use case, but compared to other filesystems it was a complete choke.

    • Svein Engelsgjerd

      Btrfs is not properly stable even now in 2016 but if you base your views on an operating system with a kernel from about 2012 you are not really fair to btrfs.

    • aaarrrgggh

      “You are doing it wrong.” I experienced the same thing on a Synology DiskStation 1515+ with rsnapshot… Until I converted the backups to sub-volumes. It went from two hours per backup down to 3 minutes on average. Only been running for a week, but speed has been great. I chose to wipe my previous backups before starting, moving one backup into an hourly.0 subvolumes manually.

    • PTK65

      BTRFS already does what you are duplicating with rsync. All you need to do is take snapshots.

  • Svein Engelsgjerd

    While btrfs is getting more and more stable it is still far from good enough to use for mirroring not to mention raid5/6 like config. If a disk drops out btrfs still tries to write to it. If you wreck your array you may only have one chance to repair it or else the filesystem is stuck in read only mode. The conclusion that ext4 is only one year older than btrfs is also not quite right since ext4 is the continuation of an already stable codebase (ext3) and btrfs is started from scratch. While I would love to see btrfs complete with all its cool features it is still not stable, error prone (just try with some real life bad disks) and still has too many loose threads. Cool to play with but should not be used without backups. It probably needs 4-5 more years!

    • Even with mirroring the filesystem has serious issues with performance under many workloads and with running out of disk space on RAID1 volumes despite ‘df’ showing plenty of space available (this because its COW system seems to be incapable of reclaiming space in a timely manner). It also seems to have the same multitude of filesystem-destroying failures that typified ReiserFS, probably due to the same btree structures that made ReiserFS so prone to lose.

      Part of this is the fault of the Linux buffer cache and block subsystem, which was created for and optimized for non-COW filesystems and despite many hacks intended to improve BTRFS performance continues to be in dire need of rewrite to join the modern era (said as someone who recently read the source code and traced code execution of a modern cleanly designed block subsystem, which the Linux one ain’t). You’ll notice that ZFSonLinux bypasses many of the reliability issues BTRFS has by simply not using the Linux buffer cache, it uses its own (yet another reason why Linus would never allow it to be included as part of the kernel even if Oracle relaxed the licensing).

      And part of it is just plain architectural failure — ReiserFS should have made it clear that btree-flavored filesystems need multiple layers of redundancy to avoid the near-inevitability of filesystem loss, yet BTRFS by default doesn’t do that. Similarly, COW under database-type loads or virtual machine filesystem type loads requires special architectural considerations to improve random rewrite performance and garbage-collect the now-redundant COW sectors, considerations that ZFS incorporates, but BTRFS appears to have utterly ignored those lessons.

      A shame, really. The concept of BTRFS — that you should be able to just throw drives into a pool (and remove drives from a pool) without a management nightmare — is a fine one. If I have a 12-bay storage server with four 12-bay JBODs attached to it, managing those 60 drives by hand and manually slicing and dicing them to give slices of them to virtual machines is a major PITA, much easier to just plop them all in a pool and allocate qcow2 files to give to the virtual machines. But when the performance difference between giving a VM a LVM logical volume as a RAW filesystem and giving a VM a RAW file on a BTRFS filesystem as a filesystem is literally a factor of 20 under common workloads, *plus* unlike LVM the BTRFS filesystem regularly loses its mind and loses all those files, well. FAIL.

  • S.P.

    EXT4 is about as slow as NTFS for Windows, BTRFS was way faster for me

  • tom bunyon

    btrfs runs buttery fast (lol) on my old netbook, although bootup is slower than ext4

  • cat1092

    I don’t use RAID & never have, if I want to run faster (though this isn’t the only reason some chooses RAID), I purchase faster storage.

    And alas I have, in a 512GB Samsung 950 Pro M.2 that burns top line SATA-3 SSD’s by 3x easily, and that’s with it’s just over 1,500 max writes, and at just a little over the same for reads, still close to 1000 MB/sec slower than top speed. That may be a limitation of the Z97 chipset, I don’t know yet.

    What I do want to know, is if I install root (about 40GB) to the SSD, and use btrfs, is this going to net me more speed than ext4, plus easier on wear & tear on the drive (the cost on promo was $309 at Newegg). Funny thing, sometime back, had a Linux Mint install as btrfs & didn’t know it until I clean installed the next LTS, though can’t recall performance levels/differences.

    Can anyone tell me which is best for the Samsung 950 Pro M.2 SSD? And why, if possible. I’m not ‘afraid of change’, as long as the change is beneficial to me.;-)

    Cat

    • cat1092

      BTW, my /home & Swap partitions will be on a 500GB WD RE4, and since I use virtual machines heavily, will as the article implies, run ext4 for /home on that drive. This is how all of my Linux Mint installs are setup, to make things super easy for installing a new LTS, can still choose the current /home & Swap, though not format. Having one’s data or /home on an entire separate drive is also a good practice. Even the best of SSD’s can fail, though haven’t had the first one to.

      However, I’ll only ‘recycle’ the /home partition for a total of two LTS releases, after that, will copy all of the sub-folders inside of Home to an external (NTFS is fine for this), and after total clean install of the OS, delete the folders inside of Home bearing the same name, and copy mine back over. I also perform this task monthly as a backup strategy, this way should the drive fail, and eventually all does, will have a backup to fall back on.

      Cat

  • dotmagic

    I started using BTRFS on my laptop 3 years ago (2013) and have never had a single issue. I use it for development, so many small files are regularly deleted and recreated. My biggest concern was the userspace tools; luckily I never needed them 🙂
    If you wanna try it, use the latest Kernel and not any outdated Debian or Ubuntu LTS.

    Before I’ve used reiserfs3, ext4 and also XFS for many years. XFS was great for big files, but deleting many small files is extremely slow. With all i had minor issues.

  • nnyan

    What is the current state of parity-based redundancy in BTRFS? Raid 5/6 was pretty unstable.

  • Hi, it’s a specific “childhood” issue of btrfs. COW is not bad; on the contrary, it’s a good mechanism to protect against power loss, for example, as the original data is not overwritten but only removed once the new write is committed. BTRFS keeps improving, and it will continue to; this article is one year old, so things may have changed.