Improving on ZFS: BlobFS

I have a love-hate relationship with ZFS. On one hand, I love how easy it is to operate and how it makes it difficult to damage the array. On the other hand, I hate how it interoperates poorly with Linux and how there are some really crucial operations (like vdev removal or array growing) that aren’t implemented and haven’t been for a long time, despite pleas from the community.

I think it could be easier and better. ‘Final word in filesystems’? Yeah, and 640KB should be enough for anybody.

This is intended for home file servers - big files, mostly reads, performance not a big factor. An assumption is that there is no backup, so the data needs to be as safe and correct and recoverable as possible.

The idea

  • You have some disks. BlobFS will make them a big filesystem with a specified level of redundancy.
  • You choose how many disks the array can lose without losing data (the ‘safety margin’), which sets the level of redundancy. Changes to it are applied automatically.
  • If you pull out a disk, the data will rebalance across the remaining disks automatically. You’ll have a capacity loss (if the data still fits at the current safety margin) or a safety margin loss (if there’s too much data for that).
  • If the array does lose too many disks, BlobFS will attempt to allow access to anything that’s still readable. This lets you attempt a partial recovery.
  • All data (including filesystem metadata) is end-to-end checksummed, so you can be fairly certain that flaky equipment is not going to silently corrupt your data.
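To make the checksum bullet a bit more concrete, here is a minimal sketch of a verified read. Everything in it is an assumption for illustration: BlobFS doesn't exist yet, SHA-256 is just a convenient hash, and read_verified is a made-up name. The point is only that a read compares each copy against the stored checksum and falls through to the next copy when one is bad.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an assumed choice; any strong hash would do for the sketch.
    return hashlib.sha256(data).hexdigest()


def read_verified(replicas, expected):
    """Return the first copy matching the stored checksum.

    Also returns the indexes of copies that failed verification, so a
    caller could rewrite them from the good copy.
    """
    bad = []
    for index, data in enumerate(replicas):
        if checksum(data) == expected:
            return data, bad
        bad.append(index)
    raise IOError("no replica matches the stored checksum")


block = b"napoleon dynamite, reel 1"
stored = checksum(block)  # recorded when the block was written

# The first copy has been silently corrupted by flaky hardware.
copies = [b"napoleon dynamite, reel 7", block]
data, bad = read_verified(copies, stored)
print(data == block, bad)
```

The list of bad copies is what would drive a repair: the good copy gets written back over the damaged ones.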

The basic idea is that you give it some disks and some data, and it will take care of the data as best it can. You leave the low-level decisions about RAID algorithms and device recovery up to BlobFS. It makes the best moves it can with the resources it has available.

This has a subtle but significant benefit in terms of reliability. It’s removing human intervention from the recovery process. You don’t have to say “OK, I’m going to offline that disk and add another one in its place and restripe the array” while hoping that you’re offlining the correct disk and restriping is the right thing to do and your replacement disk is in fact empty and not another disk in the array that contains the last redundant copy of your data. All of that logic is in the software. It takes care of your data as best it can.


Some other things that may help reliability:

  • You can ask to remove a disk and BlobFS will move the data off that disk so you can remove it without risking array degradation.
  • If there’s spare space on a disk, automatically fill it with more redundant data. This implies that at any given moment, the array will be (from the perspective of the programmer) completely full. When you add more data, you’re actually removing an old redundant copy of something and replacing it with the new data.
  • Spare space can also be used for journalling and backtracking. One of the big criticisms of RAID arrays without a separate backup is that you can destroy everything very quickly through software - there’s no way to back out of deletions or a virus attack. With journalling and backtracking, however, this concern disappears - if you can backtrack to yesterday’s state before you ran that dodgy porn dialer, you don’t need backup tapes quite so much.
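The "always full" idea in the second bullet can be sketched as a small eviction routine. This is a toy model under my own assumptions: blocks are all the same size, the array tracks a copy count per block, and evict_surplus is a hypothetical name. Writing new data asks for some number of free slots, and the routine gives up surplus copies of the most-replicated blocks first, never dropping any block below the minimum the safety margin requires.

```python
def evict_surplus(replica_counts, slots_needed, min_copies):
    """Free slots by dropping surplus copies, most-replicated blocks first.

    replica_counts: dict of block id -> number of copies currently stored.
    min_copies: safety margin + 1; no block may go below this.
    Returns (new counts, list of block ids that lost a copy).
    """
    counts = dict(replica_counts)  # leave the caller's dict untouched
    evicted = []
    while len(evicted) < slots_needed:
        if not counts:
            raise OSError("array is genuinely full")
        block = max(counts, key=counts.get)
        if counts[block] <= min_copies:
            # Only required copies are left: there is no spare space.
            raise OSError("array is genuinely full")
        counts[block] -= 1
        evicted.append(block)
    return counts, evicted


# Safety margin of 1 means every block needs at least 2 copies.
before = {"a": 3, "b": 3, "c": 2}
after, evicted = evict_surplus(before, slots_needed=2, min_copies=2)
print(after, evicted)
```

Once every block is down to its minimum number of copies, the next write fails, which is the point where the array is full in the usual sense.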

Use case

Bob has a collection of movies and music burned on CD that doesn’t infringe copyright at all. He’s worried about CDs degrading over time - already some of them have holes and bad sectors.

He installs a shiny new 400GB hard drive in his computer, sets up BlobFS on it and copies all of his stuff across. As he ‘obtains’ new media, he just saves them to the BlobFS volume.

Bob starts to worry about his hard drive failing, so he buys another 400GB drive and adds it to the BlobFS volume. BlobFS silently replicates all of his data to the new drive. He still has 400GB (ish) of capacity, but now he can tolerate the loss of either drive (safety margin of 1).

Bob discovers the movie Napoleon Dynamite and has to download every copy in every language. He runs out of space. By now, hard drives are bigger and faster, so he adds a 750GB drive to the volume. BlobFS does two things: it makes another 750GB of space available to Bob, and it uses the space (currently free) to add even more redundancy to the array. Immediately after adding the drive, Bob’s collection of illicit movies has a safety factor of 2 - he can lose any two drives in the volume and not lose any data.

Time passes, and one of the original 400GB drives fails. Bob receives an email from BlobFS telling him so. He pulls out the drive, replaces it with a 1.25TB drive, and goes back to his popcorn. BlobFS rebalances the data automatically and makes more free space available if it can.
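Bob's numbers can be checked with a rough capacity model. The big assumption here is that redundancy means whole extra copies of each block, each on a different disk (no parity tricks), so a safety margin of k needs k+1 copies. Under that assumption, a disk bigger than its fair share can hold at most one copy of everything, so it is set aside as "one full copy" and the rest is shared out among the remaining disks. The function names are made up for the sketch, and a parity-based design would give different figures.

```python
def usable_capacity(disk_sizes, safety_margin):
    """Largest amount of data storable with safety_margin + 1 copies,
    each copy of a block on a different disk (sizes in GB)."""
    copies = safety_margin + 1
    disks = sorted(disk_sizes, reverse=True)
    if copies > len(disks):
        return 0.0  # not enough disks to hold that many distinct copies
    while True:
        share = sum(disks) / copies
        if disks[0] > share:
            # This disk is oversized: it holds one full copy, nothing more.
            disks.pop(0)
            copies -= 1
        else:
            return share


def best_margin(disk_sizes, data_size, wanted_margin):
    """Highest safety margin (up to wanted_margin) at which the data fits."""
    for margin in range(wanted_margin, -1, -1):
        if usable_capacity(disk_sizes, margin) >= data_size:
            return margin
    return None  # data does not fit even with a single copy


print(usable_capacity([400, 400], 1))        # two drives, margin 1
print(usable_capacity([400, 400, 750], 2))   # three drives, margin 2
print(usable_capacity([400, 400, 750], 1))   # three drives, margin 1
print(best_margin([400, 750], 400, 2))       # one 400GB drive has failed
print(usable_capacity([400, 750, 1250], 1))  # after the 1.25TB replacement
```

In this model Bob's 400GB collection does fit at a safety margin of 2 on the three-drive array, and it drops to a margin of 1 while the failed drive is out. The capacity on offer at margin 1 comes out smaller than the raw size of the newest drive, because the smaller drives limit how much of it can be mirrored.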

Risks

There are undoubtedly some major technical hurdles to deal with:

  • Automatic array restriping and maintaining safety in the event of a power loss
  • Redundancy across different-sized devices
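For the second hurdle, one plausible starting point (an assumption on my part, not a worked-out design) is a greedy placement rule: put each copy of a block on the disks that currently have the most free space, never two copies on the same disk. The sketch below uses unit-sized blocks and a hypothetical place_replicas function to show that this rule keeps mismatched disks draining evenly.

```python
def place_replicas(free_space, blocks, copies):
    """Place `copies` replicas of each unit-sized block on distinct disks.

    free_space: dict of disk name -> free units.
    Returns (layout, remaining free space), where layout maps each block
    number to the list of disks holding its copies.
    """
    free = dict(free_space)  # work on a copy
    layout = {}
    for block in range(blocks):
        # Disks that still have room, emptiest first.
        candidates = sorted(
            (disk for disk in free if free[disk] > 0),
            key=lambda disk: free[disk],
            reverse=True,
        )
        if len(candidates) < copies:
            raise OSError("not enough disks with free space")
        chosen = candidates[:copies]
        for disk in chosen:
            free[disk] -= 1
        layout[block] = chosen
    return layout, free


# Two 400-unit disks and one 750-unit disk, two copies of every block.
disks = {"old1": 400, "old2": 400, "new": 750}
layout, remaining = place_replicas(disks, blocks=775, copies=2)
print(remaining)
```

With these sizes the big disk takes one copy of almost every block while the two small disks alternate, and all three run out at the same time. A real filesystem would also have to cope with variable-sized extents and with moving data that has already been placed, which this ignores.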

This list is dangerously short. Obviously, I don’t understand the problem well enough yet to be able to make an accurate assessment of its difficulty. I’ve never written a filesystem. I love the idea of doing so; it’s a meaty problem with some fun algorithms/data structures/performance decisions. It has actual engineering elements to it.

I think I’d really enjoy this project. It is a very large project for one person, though. Without having done any systematic research, I think there’s tremendous demand for something like this. It’s just a big project with some big technical challenges and only me potentially working on it.

I would love to work on this, I really would. But it’s not going to pay the bills for a long time. I think I’d have to be self-sufficient from other income sources first before I could work on this (and expect to make a living from it).

Monetization

  • Sell personal licenses. This will impede widespread use; I’m sure there are already commercial systems that have similar characteristics, but nobody wants to pay for them.
  • Sell commercial licenses. This is similar to what ReiserFS does - you can use it free at home, but make any money off it and you’re supposed to pay for it. Which seems fair enough. I’d probably attempt this and see how well it works out.
  • Sell storage hardware. This seems pretty inevitable, and I love embedded work. 2TB of reliable disks at a low low price? Sold.
