Status report, 27 Dec 2007

December 27th, 2007

Things are steaming along pretty nicely.

Marketing

After listening to a podcast feature Greg Gianforte at a MySQL conference, I realised that I should be putting effort into marketing now rather than after my product was ready. It seems somewhat obvious now, but I was getting excited about B*-trees and block allocation algorithms without realising that I should do more to establish demand ahead of time.

The filesystem has a marketing name, a website, and is being filled with content. I’m really excited about the idea, actually. It’s a new area of research; I get to build something that I believe in; it hasn’t been done before; there are lots of good monetization options. It is a big product, and so I’m rapidly making my ’six products in one year’ slogan irrelevant. I’m not too bothered. I have some real passion for this idea, and I think that’ll be more important in the long run.

Filling in the website makes me think about what features I intend to support - it’s like a simple spec. And it makes me more excited about what I’m going to build.

I have done a little bit of coding - mostly exploratory stuff with FUSE. It all seems to be making sense. I’ve done a bit of kernel work before, so there’s nothing too scary there.

I haven’t started doing any marketing for the file synchronizer. This is partly because I’m not thinking about it so much with the filesystem. This is a trap that I frequently fall into - I work on something while I’m passionate about it, then forget about it before completion. I do have a good body of code there, however, so I should start working on that at some point.

The big lesson that I’m learning right now is that there is an awful lot of work to be done. I can’t just sprint to a release. It’s going to take consistent effort over a period of months to get these things out the door, and probably more effort to get people to actually use them (let alone buy something!) I need to just be calm and patient and keep putting in consistent effort.

AdWords

I’ve set up a bunch of AdWords ads for the filesystem. They’ll go active when I have enough website content to justify it. They’re always a good way to generate more ideas - I had suspected that Windows users might be interested in something like this, but was interested to learn that a lot of people have been searching for ways to get ZFS to run on Windows. This is encouraging to me - it’s a big market opening. Getting ZFS running on Linux is a bit dodgy right now, and Windows is not even under consideration. Along with the general confusion and problems running RAID arrays under Windows, there’s big demand for highly reliable storage on commodity Windows machines.

Office space

I’ve got some office space. Hurrah! It gets me out of the house and the others I’m sharing it with are great fun. I don’t think I’d go back to working from my bedroom for any length of time. There’s some productivity cost in terms of distractions, but there’s big benefits to my moods and motivation which more than makes up for it.

I am still experiencing difficulty concentrating after about six hours of solid work and finding that it’s somewhat diet-related. Along with a recent Slashdot article on nootropics, I’m going to have to investigate dietary supplements more. Now that I’m actually interested in working longer hours, it may be useful.

Day job

The day job is still there, I guess, but on holidays. I haven’t done much work for them for a while. I have had to start transferring money from my investment accounts, which is a bit scary. But it’s early days yet. I always have the option of seeking out more paying work; I just don’t need to right now.

Improving on ZFS: BlobFS

December 13th, 2007

I have a love-hate relationship with ZFS. On one hand, I love how easy it is to operate and how it makes it difficult to damage the array. On the other hand, I hate how it interoperates poorly with Linux and how there are some really crucial operations (like vdev removal or array growing) that aren’t implemented and haven’t been for a long time, despite pleas from the community.

I think it could be easier and better. ‘Final word in filesystems’? Yeah, and 640kb should be enough for anybody.

This is intended for home file servers - big files, mostly reads, performance not a big factor. An assumption is that there is no backup, so the data needs to be as safe and correct and recoverable as possible.

The idea

  • You have some disks. BlobFS will make them a big filesystem with a specified level of redundancy.
  • You can select the number of disks you can lose and hence the level of redundancy (the ’safety margin’). Changes are implemented automatically.
  • If you pull out a disk, the data will rebalance across the array automatically. You’ll have a capacity loss (if the data will fit on the disks) or a safety margin loss (if there’s too much data).
  • If the array does lose too many disks, BlobFS will attempt to allow access to anything that’s still readable. This lets you attempt a partial recovery.
  • All data (including filesystem metadata) is end-to-end checksummed, so you can be fairly certain that flakey equipment is not going to silently corrupt your data.

The basic idea is that you give it some disks and some data, and it will take care of the data as best it can. You leave the low-level decisions about RAID algorithms and device recovery up to BlobFS. It makes the best moves it can with the resources it has available.

This has a subtle but significant benefit in terms of reliability. It’s removing human intervention from the recovery process. You don’t have to say “OK, I’m going to offline that disk and add another one in its place and restripe the array” while hoping that you’re offlining the correct disk and restriping is the right thing to do and your replacement disk is in fact empty and not another disk in the array that contains the last redundant copy of your data. All of that logic is in the software. It takes care of your data as best it can.

disk-drive.jpeg

Some other things that may help reliability:

  • You can ask to remove a disk and BlobFS will move the data off that disk so you can remove it without risking array degradation.
  • If there’s spare space on a disk, automatically fill it with more redundant data. This implies that at any given moment, the array will be (from the perpective of the programmer) completely full. When you add more data, you’re actually removing an old redundant copy of something and replacing it with the new data.
  • Spare space can also be used for journalling and backtracking. One of the big criticisms of RAID arrays without a separate backup is that you can destroy everything very quickly through software - there’s no way to back out of deletions or a virus attack. With journalling and backtracking, however, this concern disappears - if you can backtrack to yesterday’s state before you ran that dodgy porn dialer, you don’t need backup tapes quite so much.

Use case

Bob has a collection of movies and music burned on CD that doesn’t infringe copyright at all. He’s worried about CD’s degrading over time - alreay some of them have holes and bad sectors.

He installs a shiny new 400GB hard drive in his computer, sets up BlobFS on it and copies all of his stuff across. As he ‘obtains’ new media, he just saves them to the BlobFS volume.

Bob starts to worry about his hard drive failing, so he buys another 400GB drive and adds it to the BlobFS volume. BlobFS silently replicates all of his data to the new drive. He still has 400GB (ish) of capacity, but now he can tolerate the loss of either drive (safety margin of 1).

Bob discovers the movie Napoleon Dynamite and has to download every copy in every language. He runs out of space. By now, hard drives are bigger and faster, so he adds a 750GB drive to the volume. BlobFS does two things: it makes another 750GB of space available to Bob, and it uses the space (currently free) to add even more redundancy to the array. Immediately after adding the drive, Bob’s collection of illicit movies has a safety factor of 2 - he can lose any two drives in the volume and not lose any data.

Time passes, and one of the original 400GB drives fails. Bob recieves an email from BlobFS telling him so. He pulls out the drive, replaces it with a 1.25TB drive, and goes back to his popcorn. BlobFS rebalances the data automatically and makes more free space available if it can.

Risks

There are undoubtedly some major technical hurdles to deal with:

  • Automatic array restriping and maintaining safety in the event of a power loss
  • Redundancy across different-sized devices

This list is dangerously short. Obviously, I don’t understand the problem well enough yet to be able to make an accurate assessment of its difficulty. I’ve never written a filesystem. I love the idea of doing so; it’s a meaty problem with some fun algorithms/data structures/performance decisions. It has actual engineering elements to it.

I think I’d really enjoy this project. It is a very large project for one person, though. Without having done any systematic research, I think there’s tremendous demand for something like this. It’s just a big project with some bigtechnical challenges and only me potentially working on it.

I would love to work on this, I really would. But it’s not going to pay the bills for a long time. I think I’d have to be self-sufficient from other income sources first before I could work on this (and expect to make a living from it).

Monetization

  • Sell personal licenses. This will impede widespread use; I’m sure there are already commercial systems that have similar characteristics, but nobody wants to pay for them.
  • Sell commercial licenses. This is similar to what ReiserFS does - you can use it free at home, but make any money off it and you’re supposed to pay for it. Which seems fair enough. I’d probably attempt this and see how well it works out.
  • Sell storage hardware. This seems pretty inevitable, and I love embedded work. 2TB of reliable disks at a low low price? Sold.

Uh-oh. I can’t remember where I left the current TPS reports!

December 5th, 2007

I use lots of different computers. I have a laptop which goes with me everywhere. I have a desktop at home with a nice monitor and great performance. I have a desktop at one client’s office. Then there’s the backup server, mobile phone and iPod.

I have data that I use on all of these machines. Keeping it all up-to-date everywhere is a hassle.

What I want is a program that will link up all of these different computers and keep my data the same everywhere. If I’m working on my desktop and want to go out and work at a cafe, I should be able to just pick up my laptop and go, and all of my work and all of my email will be right there, completely up to date.

When I go to my client’s office, I want all of my work to be up-to-date and ready to go. If I put some new music on my laptop, I want it to appear on my iPod automatically. If I disappear on a holiday, I still want to have my most crucial emails and contacts available on my mobile phone.

I don’t want my personal emails to be saved on the client’s office computer, because I don’t control it. I don’t want my MP3 collection to be saved on my mobile phone, because it doesn’t have enough space. I don’t want my source code to be saved on my iPod, because I’ll just lose it anyway.

I need some magic to make all of this happen. This is what my file synchronizer will do.

I’m concentrating on syncing a laptop to a desktop right now. That solves my biggest day-to-day hassle - I like to work on my desktop (it’s faster and more pleasant) but I do need the portability afforded my my laptop fairly often. Among the people I’ve spoken to, it’s also their biggest sync-related problem - a lot of people use their laptops for everything (”having two computers is too much of a hassle! I’d rather just have one computer and carry it around with me.“) but hate the performance and the constant reconfiguration hassle (”I have to plug in all of the peripherals and change the screen resolution and the font size and the volume and the network setup and my wallpaper looks bad on the big monitor and then I have to go out again anyway and set it all up again.“).

So that’s the dream. Laptop, desktop. Your stuff is the same on both. Always. You don’t have to think about it or copy files. It just is.

Doing things right vs. doing things fast

December 2nd, 2007

I’m encountering a very strong conflict between doing things right and doing things fast.

I want to get something working ASAP. The sync algorithm is difficult and too large to hold in my head at one time. I want to make sure all of my theory is correct. This is the push for fast.

Many of the decisions I’m making now will hurt me later on. I’m already hitting a lot of points where the existing sync algorithms just won’t work when you’re feeding in live and/or disconnected updates. And that’s before I even consider multi-way syncing (e.g. laptop/desktop/server/work computer). And so, I have a big urge to do things right.

Psychologically, doing things right is not going to work for me. I don’t work like that. I need results now and I need to think yeah, that’s cool. I don’t do well slogging away on something for long periods, especially with the technical risks that exist at the moment in this project.

I’m very much a cowboy coder: I’ll slap something together fast, and it’ll do the job, but it won’t be pretty. I get the biggest rush when something does its job for the first time. There’s a little bit that comes from incrementally adding functionality or fixing bugs - I also have a strong perfectionist streak. But I find some tasks extremely tedious - GUI programming (it’s been done a thousand times, and it’s so hard to get right) and web programming come to mind. They seem pointless - they’ve been done before, they’re uninteresting, there’s no rush that comes from seeing them come alive for the first time. They’re analogous to just drawing things on paper, and doing so on a computer is painful. I’d rather just draw them on paper.

I will tend to build up a lot of technical debt. This is why I like working for startups: they need stuff fast more urgently than they need it right. Assuming the startup survives long enough it will need stuff done right, but by that time they’ve usually hired other programmers that can do that instead of me. Usually my startup clients haven’t even known what they want to build in the beginning, so coming up with a perfect all-encompassing design and full test suites is pointless. It’ll probably just get thrown away.

It’s the same situation for my MicroISV. I will need stuff done right - eventually. But my top priority is to get enough income coming in, and that means getting a product out the door ASAP. And because I don’t yet know how this particular product is going to work - I don’t completely understand the problem - there’s an even greater incentive to ignore all of these scary problems that I can see looming (performance, cross-platform compatibility, correctness under wierd conditions). I’ll just push on and get something working well enough.

Concrete examples

I’m using an SQLite database to store file metadata. I’m a big fan of SQLite - it’s just so easy to use and implement. But it’s very complicated for what I want, which is essentially a tree of modtimes using filenames as a primary key. Oh, and with another dimension covering the ‘other machines’ data. Ordered by a journal id. Handling a million records. I don’t know of any relational database that is going to give acceptable performance under these conditions. It’s not an easy data structure to design in C, either, but I’m sure I can beat SQLite’s performance.

Thing is, there’s no need to right now. I have some horrible table joins and ORDER BY clauses, but they do the job. I haven’t run into performance problems yet, probably because:

  • I’m not doing live updates. I’m working under Unison conditions, which is ’scan for changes at the same time on both sides and sync immediately’. This simplifies things tremendously.
  • I’m not using large datasets. I have a few test directories with half a dozen files in each. Eventually, I want to be able to handle up to a million files (I have a Linux kernel tree and a Buildroot in my syncable data, for example).

I’m implementing this in Python right now. Python lets me crank out ideas fairly quickly. Eventually, I expect I’ll have to move to C for performance reasons. The amount of work that the Python implementation does every time it touches a file is enormous - it has to create an object which contains an object for each sync peer, as well as a database query or two. There’s just no need to optimize yet. Premature optimization is the root of all evil, and all that.

File synchronization algorithms: part 3 of monkey

November 26th, 2007

Algorithm 2 (a.k.a. ‘what everyone is using’) has some shortcomings:

  • Detecting changes takes a long time
  • It won’t detect renames or directory moves
  • There are still some cases where you need to resolve conflicts and/or merge files

There are also some usability issues:

  • You need to manually initiate a sync. You can’t just pick up your laptop and go anytime.
  • Performance sucks. I may have mentioned that a dozen or so times.
  • There’s nothing to stop you modifying a file on both sides; you have to remember which is the most recent and remember to sync before working on the other machine.

Here’s how I’ll fix these problems.

Constantly monitor for changes

The existing tools require you to manually initiate a sync, at which point you’ll have a few minutes of disk grinding. I’d rather have the program running constantly and being notified of changes as they happen. The common case is that only a few files will change between syncs - reading all of the files is inefficient.

I covered this in filesystem annotation. What I want is an API that notifies me when files change (or are created or deleted). I think inotify will do this, perhaps FAM. I have no idea what to use on Windows or OSX yet. On a technical level, this is an unsolved problem.

There is a risk here that if files are modified while the application is not running (and hence not receiving notifications) the modifications could be lost.

The fallback option is to scan the file trees while the machine is idle. If you’re checksumming files to detect changes, this can happen during idle time as well.

I think idle time is a grossly underutilized resource right now - we could be doing virus scanning, file indexing, backups and the like constantly instead of at intervals (3am cronjob) or while the user is trying to use the system (like most on-demand virus scanners).

Constantly synchronize changes

If you’re going to scan all of the time, you might as well copy files straight away rather than waiting until the user requests a sync. This will cut down the odds of a merge conflict somewhat, since the files are less likely to be modified simultaneously on both sides. This introduces the idea of a pair of machines being connected; while they are connected, their files are always synchronized. Since you’re probably modifying small amounts of data at a time, this will work reasonably well over a slow network connection.

Lock in-use files

Another way to prevent merge conflicts is to lock a file on machine A if it’s being written to on machine B. This prevents an application on machine A from modifying it at the same time.

Identify machines by a UUID rather than IP address

A common situation is to have a laptop and a desktop that you want synchronized together. You have the laptop at home and sync the files. You take the laptop to work, but because you’re on a different IP the sync program thinks it’s a different machine. If you give each machine a UUID or name, you can be (reasonably) sure of its identity and hence use the right indexes or file trees.

Checksum files or their metadata in order to detect renames

If you’ve got a checksum of each file (or just the modification time and size) and you detect a file deletion, you can look through any new files and see if they’re actually the same file. You can then infer that a file was moved or renamed rather than deleted and a new file created, saving time and bandwidth during the synchronization. It may be possible to optimize this further by looking at inode numbers or their equivalent on whatever filesystem is in use.

File synchronization algorithms, part 2 of elephant

November 21st, 2007

Don’t think about elephants.

The next refinement is to store history data when you look at the file trees. Every time you perform a synchronization you record some metadata for each file. You want to store the filename and the modification time. That way, when you do the next synchronization, you look at what changed between time X and time Y and apply those changes to the remote file tree, somewhat like generating a diff and then patching a tree. You do this twice - once for each direction (A to B and B to A). You can get conflicts, of course.

Conceptually, this looks like:

Compare this with the algorithm from part 1, which looks like this:

Note that if you have no history data, Algorithm 2 works exactly like Algorithm 1. Badly.

This all operates much like a version control system and has similar problems and implications. A VCS usually can’t detect renames of files or directories - you have to explicitly tell the VCS what you’ve done. When you want to perform a synchronization you have to traverse the entire directory tree to find out what’s changed - and this can be very time-consuming. The metadata has to be stored somewhere. Merges almost always require manual intervention and will often be unresolvable (either the user won’t know what to do and will just overwrite one side, or the file format won’t support lines-of-text style merging).

Also note the similar distinction between traditional client-server VCS (e.g. CVS, Perforce) and modern distributed VCS (Mercurial, git). Client-server VCS and propagates the nodes (or the actual files being worked on). Distributed VCS propagates the edges (or diffs). Algorithm 1 is looking purely at the file data and attempting to match it on both sides; algorithm 2 is looking at the changes between the ’sync points’ (or nodes) and propagating the changes.

The actions table for each file looks something like:

File A change File B change Action
 
Created (checksum P) Created (checksum P) Nothing
Created (checksum P) Created (checksum Q) Merge
Deleted No change Delete
Deleted Deleted Nothing
No change No change Nothing
Modified No change Use file A
Modified Modified Merge

(The actions for File A and File B can be interchanged - I didn’t feel like writing out those cases twice.)

If you include the possibility of renames (and horror of horrors, renames with modifies) then you can get a whole lot more combinations and it gets really nasty. I must give kudos to SourceGear for Vault for this: it does handle all of those nasty cases, a headache which I can do without.

Detecting what’s happened between time X and time Y is similarly mechanical. For a given file:

Time X Time Y Change
Does not exist Exists Created
Exists Does not exist Deleted
Checksum P Checksum P Nothing
Checksum P Checksum Q Modified

Without having looked at the source code, I’d say this is the algorithm that Unison uses. I’d also guess that most ‘proper’ synchronization programs use this. It’s the simplest thing that works in most cases.

Note that you also need to be able to reliably detect a change in a file. The (almost) infallible way to do this is to hash the file. I say almost because hash collisions do happen - they’re just extremely rare. ‘Extremely rare’ becomes a lot more common when you’re talking about a million files (32 bits of hash is not enough).

The other option is to look at the modification time of the file. Software can and does manipulate the modtime, however, and you might miss changes. Users might change the system time and confuse your sync program (if a change was made a long time ago). You might not be syncing to a device that has a real-time clock (some mobile phones, notably). You also have to sync the times between the two systems, but that’s not too hard.

Aaaanyway, the gist of it is:

  • Checksums: reliable, slow (you have to read the entire contents of every file)
  • Modification time: less reliable, much faster

Some filesystems such as JFFS2 keep a revision number on each block (roughly). If the revision number goes up, you can be assured that a write has happened regardless of what the modification time says. This is not a common feature, however, and probably not accessible to userspace programs anyway. There’s no easy solution here.

Next up: It still sucks. How to make it usable.

File synchronization algorithms, part 1 of lots

November 21st, 2007

“wherein I work out my design in public.”

You have two filesystem trees, A and B. You want the files on both sides to be the same.

Cases that you need to handle:

  • File exists on A but not on B (and vice-versa)
  • File exists on both and is identical
  • File exists on both and is different

Right about this point in time, you’re in trouble. (That was fast!) Only one of those situations can be handled automatically, and that’s if the file is identical on both sides. You need a lot of user input to figure out what the directories should look like, and users tend to say “too hard!” Unison assumes that if a file is present on one side and not on the other, it has just been created. So it copies it across. Already we’re in dangerous territory because this is frequently not what you want to do.

If the file exists and is different, you have to ask the user how to merge them or which one to pick. Asking regular users how to merge files is a bad idea. (Asking developers how to merge files is usually a bad idea.)

Sigh.

This algorithm is not going to work very well. It doesn’t handle any common cases, makes a lot of mistakes in its assumptions, and asks users too much information (which will probably be wrong anyway). Anyone using this algorithm in their synchronization product (*cough* Microsoft *cough*) is going to have a lousy product.

(Don’t get me wrong. I like Office. I like many Microsoft games. I’m not anti-Microsoft at all. It’s just Sturgeon’s Law: 90% of everything is crap.)

Unfortunately, this case is unavoidable on the very first synchronization of a pair of trees. We have no history data - even disconnected history data - and so cannot make informed decisions about what’s new, deleted or changed. The files just are or they are not and we can’t say which of the two trees is correct.

Read the thrilling (!) algorithm in Part Two for a possible solution (!!!11!)

Status report, 21 Nov 2007

November 21st, 2007

The day job is giving me serious grief, so very little project work is getting done. I’m stressed out and spending a lot of time resting. I do badly need a holiday, but if I wasn’t working in this environment I wouldn’t need a holiday nearly so much.

I’m doing a lot of thinking and planning for the file synchronizer - it’s my biggest ongoing computing problem right now - but not a lot of code is getting written. I’m also a bit nervous that I won’t have such a need for it in a few months and hence will lose interest in the project. I’d like to get something working ASAP, but the day job sucks up all of my energy. Without the day job I won’t need the synchronizer so much - so it’s a bit of a Catch-22.

I have Rentacoder and Elance bids out for some filesystem traversal code. I’m disappointed so far. I’ve been bid between USD$80 and $350. I sat down in front of the TV today, watched an episode of Daria and an episode of Futurama and had it done. Half of that was remembering that Python’s sqlite module doesn’t automatically commit to the database and wondering why there was no data in the database. Plus the hour I spent writing the spec and the hour I spent setting up accounts and the $10 Elance setup fee. There are a few bids which are attractive because the programmers obviously know their stuff, but I just can’t bring myself to spend $200 on an hour-long job which I’ve already finished.

There’s an obvious flipside here - if Rentacoder and Elance bids are so high, I can probably earn a bit of pocket money through them. Earning money by bidding on online jobs is very attractive to me. Of course, people go on there expecting to pay the coders crap. It might just be a matter of sorting through the bid requests and picking ones that I can do profitably and ignoring the requests with very low prices.

On the upside, the USD is in the shitter so stuff is cheap for me to buy. On the downside, one of my income sources pays in USD, so that’s suddenly become a lot less valuable.

I’ll be moving to Melbourne in January, right around linux.conf.au. I’m pretty excited. Rents are so much cheaper than Sydney, so it’s likely that I’ll be able to afford something with a nice view. Rent is still my biggest ongoing expense. Once I cut out expenses for crap cafe food at work and car fuel+maintenance (urgh! the damn thing keeps breaking!) it’ll account for 60% of my expenses. Living lean has its advantages.

Just thinking about the file synchronizer is useful at the moment, though. There’s a lot of unknowns that I haven’t worked out yet - it’s the sort of problem that can be described succinctly, but working out all of the intricacies of the solution (and a UI to suit) is fairly complicated. I’m torn between working out every nuance or just jumping in and getting something 80% working. The first milestone will be to write something that can replace Unison, and I’ve got that pretty well worked out. It’s just unsuitable for later features that I want to add (backups, Time Machine and flash-drive metadata exchange).

S&M (Sales & Marketing) and other things I swore I’d never do

November 13th, 2007

I used to say “I’ll never get involved with fundraising for a startup. I’ll write code, support users, write marketing material, even sell stuff. But I won’t go begging for cash.”

I’ve started begging for cash. I’ve been looking at government grants. This is just the latest in the a saga of Things I Swore I’d Never Do But Am Now Doing.

It’s a slippery slope. You start out as an elitist coder, laughing at those sales and marketing jocks and how useless they are. I mean, they speak in terms of synergies and paradigms! It’s the NEW PARADIGM that creates SYNERGIES to increase our ROI and improve RETURN RATES so we can INCREASE INVESTOR CONFIDENCE.

Oh yes. We did laugh. And now, such a statement makes perfect sense.

Why do coders start out with such a negative attitude towards sales and marketing? I recall my two big complaints being “they don’t do any real work!” and “how come they get commmissions and I don’t get squat?”

Of course, I’ve learned now. The work is just as real - it’s just in a different form.

There’s a big cultural difference. Sales, in particular, is about building relationships. Developers generally find this sort of approach disconcerting - I think we’re a distrustful bunch in general. So when sales guys come up and are all friendly for no particular reason, we’re sitting there thinking “What is this guy up to? Why is he being my friend?” The sales guy is just being a sales guy.

As for commissions, I think there are two factors here. Salespeople have always received commissions. That’s just how it is. Additionally, them reaching their goal is very easy to detect and measure - they get a purchase order (and preferably some cash). If I fix a dozen bugs, that doesn’t obviously result in increased revenues for the company.

This doesn’t work perfectly in the real world, of course. If programmers were paid by commission - say, they earn $5 for each bug that they fix - then they’d probably start writing a whole lot more bugs. Likewise, salespeople will aim to get just the incentive instead of looking at the bigger picture. They’ll start making crazy claims in order to get a sale at any cost because they’re not the ones who’ll have to make it happen. They’ll get the sale, but it’ll cost the company in the long run due to misallocated development resources.

I think for a second that maybe you could pay programmers for each external support call that they resolve - that way, they can’t just generate cash on their own. But then, they have no incentive to avoid the bugs in the first place - they’ll just ship them out, wait for the support calls to roll in and then ship out a fix. So you want to be paying them for not creating bugs - maybe start with a $500 fixed bonus and dock them $5 for each bug that comes up. Then you’ll see bugs not getting reported and some really cranky programmers. You’re now penalizing them for being imperfect. Most programmers I know already beat themselves up when a bug arises - docking their pay isn’t going to help at all.

I have seen sales commissions work against salespeople, however. It seems to be widely accepted that if you are going to get commissions you’ll also get paid less base salary. This is fine if you’re good at sales and not so fine if you’re bad at it. It also means that the quality of the product you’re selling and plain old luck has a big impact on your livelihood. Happen to find someone who already loves your product and wants to buy a thousand of them? Great! You didn’t have to do any work, but you profit just the same. Some doctor at a medical centre gets regular cash from me due to a referral she wrote a year ago - because she happened to be the first doctor available on the day. In this sense, the commission isn’t really earned.

So now I’m a developer who is familiar with the languages of sales and marketing and business. Just being able to understand what !developers are saying comes in handy, and they definitely notice. Being the optimist I am (hah!) I keep thinking “what’s the worst that could happen” with my business, and being able to communicate well with everyone in an organisation is one of the big plusses.

Filesystem annotation and things that suck. Also, procrastination.

November 7th, 2007

Today was a ‘designated project day’. I’m supposed to do no client work - just product development.

I didn’t get much work done.

I seem to have spent the entire day doing admin. This is a good thing - it’d been piling up quite badly over the last few months. I found a forgotten credit card statement. I shuffled a lot of money between accounts. I still haven’t filed all of my invoices or done all of my phone calls. I still haven’t attacked the pile of dishes in the sink - there’s been literally nothing clean for about three days now.

I need a personal assistant or something.

I did bolt on a new feature to the planner - you can now set reminders on tasks. It’s bolted on in the most literal sense possible - I added a new column to my task tree and display the reminder date there. It makes absolutely no sense from a UI point of view, but it solves my immediate problem.

I did have the chance to look over the code again with fresh eyes, seeing as I haven’t touched it in weeks. I’m embarrassed to release it publicly. It works, but the design is so awful that I’d be afraid a future employer or client would see it. It looks an awful lot like a VB application I wrote back in high school. Not good.

I’m also having immediate problems with file synchronization (surprise, surprise) and am eager to start on that. Unison is sucking a whole lot less now that I’ve completed a synchronization (previously, it would crap out because it was taking too long) but it still sucks in fundamental ways, such as not being able to sync multiple directories at once. As one might commonly do, given we have lots of stuff.

Tracker is also sucking, but a lot of people have complained about that. Sorry, Jamie, it’s broken. ‘Churns disk constantly’ is not an acceptable characteristic for any piece of desktop software.

Google Desktop for Linux seemed to suck a lot less in the constant-disk-churning department, but it also did something nasty that made every file access slow. And my Thunderbird would just lock hard for thirty seconds at a time every few minutes. Probably while it was reindexing. Sigh.

This is a weak segue into a more general problem area that I’ve been thinking about which is filesystem annotation. Essentially, what Tracker is doing is reading each file and creating some sort of index. Presumably, it also checksums each file so that it knows when it’s changed. It also sets inotify(s) so that it gets a message when files change.

Well, so does a file synchronizer. I just want to know if a file has changed, which means checksum. If we were sensible and using filesystems that actually knew when disk corruption had occurred like ZFS, I could just query some sort of checksum metadata in the filesystem, in a completely non-portable and non-futureproofed fashion. It would be fast, though. Rambling aside, I want to perform some actions when a file is changed, and I want to keep a little bit of data alongside each file.

Incremental backups (and rsync) have the same problem. You just want to know which files have changed. You don’t want to touch every file on the system. Making things even worse, Linux systems enable atime by default, which means that you’re writing to every file’s metadata every time you traverse the filesystem.

(You can’t just look at file modification times because a) lots of applications don’t change them, and b) they differ between non ntpdated systems. It’s a good way to get your files out of sync without realizing.)

And every one of these applications has the attribute that you want to traverse the filesystem as often as possible, either for reliability or accuracy.

Synergies are brewing here, I tells ya.

Basically, traversing a filesystem is dumb. But we have no better mechanism right now. We want to add file metadata after the fact and the filesystem designer has no way to know what you might want to add and when you might want to add it.

Because of this fundamental problem, I am without a good backup system. I am without file indexing and I am without file synchronization. Multi-gigahertz quad-core processors don’t help with any of that.

How do we fix this, then? (I’m more or less out of premeditated content here and into the exploratory). I think that if each of these applications could get a list of files that have been modified and the datetime on which they were modified, that would help a lot. I assume something like this has to be baked into either the kernel or libc; maybe Windows has an API for this already. I do recall Google Desktop sucking a lot less on there. The list would need to be expired at some point; you can’t track every modification forever or you’ll run out of RAM. Filesystems can probably do this much more efficiently as well, given they already track modification time. And you still might have to do a full rescan in case something modifies the filesystem while your program isn’t running.

If I had my way, it’d be a filesystem option. Every block or inode is checksummed; the block checksum and modtime are stored side-by-side and there’s a fast way to query these checksums/modtimes. A full dump would probably be too slow (multi-gigabyte Maildirs seem to be a good way to show up performance problems in all sorts of software), so a list list of recent changes would be good too. It has the RAM problem, however. Keeping a queue in RAM is more desirable than acting on them immediately, especially in these sort of applications where timeliness is good, but overall system performance is more important. You do the processing while the machine is idle.

Unfortunately, I’m at the limits of what idle speculation can gain me, and more to the point I have to solve these problems using the systems we have now, not some magic pixie-dust filesystem with freaking checksums (come on! seriously! how could every common filesystem in existence not checksum its data! do you not care about reliability at all?) So it looks like inotify, ionice and load detection for now. Undoubtedly, once I start building this stuff I’ll start to understand the problems better.

Back to the dishwashing.