bcache

What is bcache?

From the bcache wiki:

Bcache is a Linux kernel block layer cache. It allows one or more fast disk drives such as flash-based solid state drives (SSDs) to act as a cache for one or more slower hard disk drives.

Hard drives are cheap and big, SSDs are fast but small and expensive. Wouldn't it be nice if you could transparently get the advantages of both? With Bcache, you can have your cake and eat it too.

Bcache patches for the Linux kernel allow one to use SSDs to cache other block devices. It's analogous to L2Arc for ZFS, but Bcache also does writeback caching (besides just write through caching), and it's filesystem agnostic. It's designed to be switched on with a minimum of effort, and to work well without configuration on any setup. By default it won't cache sequential IO, just the random reads and writes that SSDs excel at. It's meant to be suitable for desktops, servers, high end storage arrays, and perhaps even embedded.

The design goal is to be just as fast as the SSD and cached device (depending on cache hit vs. miss, and writethrough vs. writeback writes) to within the margin of error. It's not quite there yet, mostly for sequential reads. But testing has shown that it is emphatically possible, and even in some cases to do better - primarily random writes.

It's also designed to be safe. Reliability is critical for anything that does writeback caching; if it breaks, you will lose data. Bcache is meant to be a superior alternative to battery backed up raid controllers, thus it must be reliable even if the power cord is yanked out. It won't return a write as completed until everything necessary to locate it is on stable storage, nor will writes ever be seen as partially completed (or worse, missing) in the event of power failure. A large amount of work has gone into making this work efficiently.

Bcache is designed around the performance characteristics of SSDs. It's designed to minimize write inflation to the greatest extent possible, and never itself does random writes. It turns random writes into sequential writes - first when it writes them to the SSD, and then with writeback caching it can use your SSD to buffer gigabytes of writes and write them all out in order to your hard drive or raid array. If you've got a RAID6, you're probably aware of the painful random write penalty, and the expensive controllers with battery backup people buy to mitigate them. Now, you can use Linux's excellent software RAID and still get fast random writes, even on cheap hardware.

It has existed for a few years, but was recently considered stable enough to merge into the Linux kernel (in 3.10).

Features

  • attaches an SSD cache to an HDD device via hooks in the kernel block layer, so you can add and remove it at any time (as opposed to being another layer that you'd have to set up at mount time)
  • allows streaming reads/writes (which HDDs are good at) to pass directly through to the HDDs.
  • can turn random writes (which cause write amplification on SSDs) into sequential writes (which SSDs are good at)
  • also can turn random writes (which HDDs are bad at) into sequential writes (which HDDs are good at)
  • can run in either write-through or write-back mode (see the sysfs sketch after this list).
  • in write-back mode it effectively acts like an expensive battery-backed-RAM write-back cache controller, the kind that IBM/HP/Dell sell to make their RAID array performance not totally suck. BUT instead of just having ~1GB of cache it can have whatever size SSD you throw at it!
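A minimal sketch of runtime tuning through sysfs, assuming a bcache device already exists as /dev/bcache0 (the device name and cutoff value here are just illustrative):

    # Switch the cache mode at runtime; reading the file shows all modes with
    # the current one in brackets, e.g. "writethrough [writeback] writearound none"
    echo writeback > /sys/block/bcache0/bcache/cache_mode
    cat /sys/block/bcache0/bcache/cache_mode

    # Sequential IO larger than this cutoff bypasses the SSD and goes straight
    # to the HDD (the default is 4MB; 0 would cache everything)
    echo 4M > /sys/block/bcache0/bcache/sequential_cutoff

    # The cache can be detached from the backing device at any time
    echo 1 > /sys/block/bcache0/bcache/detach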

Background reading

Upstream site
Documentation

LWN.net has had a series of articles on bcache

Thoughts

  • Currently "multiple caches per set isn't supported yet but will allow for mirroring of metadata and dirty data in the future." So right now, to ensure redundancy, we will probably want to use two SSD devices in a RAID1 md array. Once the multiple-cache feature is implemented, only the metadata and dirty data will need to be mirrored across the SSDs (which won't be too much data), and the rest of the space on the devices can be used as read cache, which will be a huge boost.
  • One of the main advantages of bcache is that it makes the underlying HDD block device perform much better, so you can actually use it on top of RAID5/6 and maximize the space. But our allergies might still prevent us from doing that.
  • We have often opted to set up disks in pairs of RAID1 arrays and put dedicated data on them in order to ensure performance, but at the cost of some flexibility. With bcache it might be nice to just set them all up as pv's in the same vg and allocate lvols from there. Then we can add pv's as we grow, and the bcache device will still sit happily on the vg with no adjustments needed.
  • With those things in mind, we'd probably do something like this (a rough command sketch follows the list):
    • pairs of HDDs set up as RAID1 md arrays → dmcrypt/luks → pv's in a lvm vg → lvols
    • a pair of SSDs set up as a RAID1 md array → dmcrypt/luks → bcache device attached to the above vg
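A rough shell sketch of that stack. This is hedged: the device names (sdb/sdc/sdd/sde), md numbers, and vg/lv names are made up, and it assumes the bcache devices themselves become the pv's, since bcache attaches a cache set to individual backing devices rather than to a vg as a whole:

    # HDD side: pair of HDDs -> RAID1 -> LUKS -> bcache backing device
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    cryptsetup luksFormat /dev/md0
    cryptsetup luksOpen /dev/md0 hdd0
    make-bcache -B /dev/mapper/hdd0        # creates /dev/bcache0

    # SSD side: pair of SSDs -> RAID1 -> LUKS -> bcache cache set
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdd /dev/sde
    cryptsetup luksFormat /dev/md1
    cryptsetup luksOpen /dev/md1 ssd0
    make-bcache -C /dev/mapper/ssd0

    # Attach the cache set to the backing device by its cset UUID
    bcache-super-show /dev/mapper/ssd0 | grep cset.uuid
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach

    # LVM on top of the cached device; grow later by adding more bcache pv's
    pvcreate /dev/bcache0
    vgcreate vg0 /dev/bcache0
    lvcreate -L 100G -n backups vg0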

Uses

  • on hoopoe it would help a lot with backups
  • some people are using it to speed up MySQL; MySQL does its own caching, but bcache can still help when the MySQL cache misses
  • help get rid of iowait on the BEMS
  • set up fucking awesome tahoe disk arrays, lots of fast storage
  • set up storage arrays for LEAP stuff

Other stuff

  • Facebook wrote something called "flashcache" (wikipedia, release announcement, github). It is a write-back/write-through dm-layer cache and uses a system of buckets for keeping track of dirty pages. It's considered a less generic solution than bcache.
  • EnhanceIO is derived from Facebook's flashcache, but doesn't use dm and has some other changes.
  • dm-cache (lwn) is another block cache target; it was merged into Linux 3.9.
  • May 1, 2013 LWN comparison "LSFMM: Caching — dm-cache and bcache": https://lwn.net/Articles/548348/
  • email comparing bcache, dm-cache, EnhanceIO performance