Archive for November, 2007

File synchronization algorithms: part 3 of monkey

Monday, November 26th, 2007

Algorithm 2 (a.k.a. ‘what everyone is using’) has some shortcomings:

  • Detecting changes takes a long time
  • It won’t detect renames or directory moves
  • There are still some cases where you need to resolve conflicts and/or merge files

There are also some usability issues:

  • You need to manually initiate a sync. You can’t just pick up your laptop and go anytime.
  • Performance sucks. I may have mentioned that a dozen or so times.
  • There’s nothing to stop you modifying a file on both sides; you have to remember which is the most recent and remember to sync before working on the other machine.

Here’s how I’ll fix these problems.

Constantly monitor for changes

The existing tools require you to manually initiate a sync, at which point you’ll have a few minutes of disk grinding. I’d rather have the program running constantly and being notified of changes as they happen. The common case is that only a few files will change between syncs - reading all of the files is inefficient.

I covered this in filesystem annotation. What I want is an API that notifies me when files change (or are created or deleted). I think inotify will do this, perhaps FAM. I have no idea what to use on Windows or OSX yet. On a technical level, this is an unsolved problem.

There is a risk here that if files are modified while the application is not running (and hence not receiving notifications) the modifications could be lost.

The fallback option is to scan the file trees while the machine is idle. If you’re checksumming files to detect changes, this can happen during idle time as well.

I think idle time is a grossly underutilized resource right now - we could be doing virus scanning, file indexing, backups and the like constantly instead of at intervals (3am cronjob) or while the user is trying to use the system (like most on-demand virus scanners).

Constantly synchronize changes

If you’re going to scan all of the time, you might as well copy files straight away rather than waiting until the user requests a sync. This will cut down the odds of a merge conflict somewhat, since the files are less likely to be modified simultaneously on both sides. This introduces the idea of a pair of machines being connected; while they are connected, their files are always synchronized. Since you’re probably modifying small amounts of data at a time, this will work reasonably well over a slow network connection.

Lock in-use files

Another way to prevent merge conflicts is to lock a file on machine A if it’s being written to on machine B. This prevents an application on machine A from modifying it at the same time.

Identify machines by a UUID rather than IP address

A common situation is to have a laptop and a desktop that you want synchronized together. You have the laptop at home and sync the files. You take the laptop to work, but because you’re on a different IP the sync program thinks it’s a different machine. If you give each machine a UUID or name, you can be (reasonably) sure of its identity and hence use the right indexes or file trees.

Checksum files or their metadata in order to detect renames

If you’ve got a checksum of each file (or just the modification time and size) and you detect a file deletion, you can look through any new files and see if they’re actually the same file. You can then infer that a file was moved or renamed rather than deleted and a new file created, saving time and bandwidth during the synchronization. It may be possible to optimize this further by looking at inode numbers or their equivalent on whatever filesystem is in use.

File synchronization algorithms, part 2 of elephant

Wednesday, November 21st, 2007

Don’t think about elephants.

The next refinement is to store history data when you look at the file trees. Every time you perform a synchronization you record some metadata for each file. You want to store the filename and the modification time. That way, when you do the next synchronization, you look at what changed between time X and time Y and apply those changes to the remote file tree, somewhat like generating a diff and then patching a tree. You do this twice - once for each direction (A to B and B to A). You can get conflicts, of course.

Conceptually, this looks like:

Compare this with the algorithm from part 1, which looks like this:

Note that if you have no history data, Algorithm 2 works exactly like Algorithm 1. Badly.

This all operates much like a version control system and has similar problems and implications. A VCS usually can’t detect renames of files or directories - you have to explicitly tell the VCS what you’ve done. When you want to perform a synchronization you have to traverse the entire directory tree to find out what’s changed - and this can be very time-consuming. The metadata has to be stored somewhere. Merges almost always require manual intervention and will often be unresolvable (either the user won’t know what to do and will just overwrite one side, or the file format won’t support lines-of-text style merging).

Also note the similar distinction between traditional client-server VCS (e.g. CVS, Perforce) and modern distributed VCS (Mercurial, git). Client-server VCS and propagates the nodes (or the actual files being worked on). Distributed VCS propagates the edges (or diffs). Algorithm 1 is looking purely at the file data and attempting to match it on both sides; algorithm 2 is looking at the changes between the ’sync points’ (or nodes) and propagating the changes.

The actions table for each file looks something like:

File A change File B change Action
 
Created (checksum P) Created (checksum P) Nothing
Created (checksum P) Created (checksum Q) Merge
Deleted No change Delete
Deleted Deleted Nothing
No change No change Nothing
Modified No change Use file A
Modified Modified Merge

(The actions for File A and File B can be interchanged - I didn’t feel like writing out those cases twice.)

If you include the possibility of renames (and horror of horrors, renames with modifies) then you can get a whole lot more combinations and it gets really nasty. I must give kudos to SourceGear for Vault for this: it does handle all of those nasty cases, a headache which I can do without.

Detecting what’s happened between time X and time Y is similarly mechanical. For a given file:

Time X Time Y Change
Does not exist Exists Created
Exists Does not exist Deleted
Checksum P Checksum P Nothing
Checksum P Checksum Q Modified

Without having looked at the source code, I’d say this is the algorithm that Unison uses. I’d also guess that most ‘proper’ synchronization programs use this. It’s the simplest thing that works in most cases.

Note that you also need to be able to reliably detect a change in a file. The (almost) infallible way to do this is to hash the file. I say almost because hash collisions do happen - they’re just extremely rare. ‘Extremely rare’ becomes a lot more common when you’re talking about a million files (32 bits of hash is not enough).

The other option is to look at the modification time of the file. Software can and does manipulate the modtime, however, and you might miss changes. Users might change the system time and confuse your sync program (if a change was made a long time ago). You might not be syncing to a device that has a real-time clock (some mobile phones, notably). You also have to sync the times between the two systems, but that’s not too hard.

Aaaanyway, the gist of it is:

  • Checksums: reliable, slow (you have to read the entire contents of every file)
  • Modification time: less reliable, much faster

Some filesystems such as JFFS2 keep a revision number on each block (roughly). If the revision number goes up, you can be assured that a write has happened regardless of what the modification time says. This is not a common feature, however, and probably not accessible to userspace programs anyway. There’s no easy solution here.

Next up: It still sucks. How to make it usable.

File synchronization algorithms, part 1 of lots

Wednesday, November 21st, 2007

“wherein I work out my design in public.”

You have two filesystem trees, A and B. You want the files on both sides to be the same.

Cases that you need to handle:

  • File exists on A but not on B (and vice-versa)
  • File exists on both and is identical
  • File exists on both and is different

Right about this point in time, you’re in trouble. (That was fast!) Only one of those situations can be handled automatically, and that’s if the file is identical on both sides. You need a lot of user input to figure out what the directories should look like, and users tend to say “too hard!” Unison assumes that if a file is present on one side and not on the other, it has just been created. So it copies it across. Already we’re in dangerous territory because this is frequently not what you want to do.

If the file exists and is different, you have to ask the user how to merge them or which one to pick. Asking regular users how to merge files is a bad idea. (Asking developers how to merge files is usually a bad idea.)

Sigh.

This algorithm is not going to work very well. It doesn’t handle any common cases, makes a lot of mistakes in its assumptions, and asks users too much information (which will probably be wrong anyway). Anyone using this algorithm in their synchronization product (*cough* Microsoft *cough*) is going to have a lousy product.

(Don’t get me wrong. I like Office. I like many Microsoft games. I’m not anti-Microsoft at all. It’s just Sturgeon’s Law: 90% of everything is crap.)

Unfortunately, this case is unavoidable on the very first synchronization of a pair of trees. We have no history data - even disconnected history data - and so cannot make informed decisions about what’s new, deleted or changed. The files just are or they are not and we can’t say which of the two trees is correct.

Read the thrilling (!) algorithm in Part Two for a possible solution (!!!11!)

Status report, 21 Nov 2007

Wednesday, November 21st, 2007

The day job is giving me serious grief, so very little project work is getting done. I’m stressed out and spending a lot of time resting. I do badly need a holiday, but if I wasn’t working in this environment I wouldn’t need a holiday nearly so much.

I’m doing a lot of thinking and planning for the file synchronizer - it’s my biggest ongoing computing problem right now - but not a lot of code is getting written. I’m also a bit nervous that I won’t have such a need for it in a few months and hence will lose interest in the project. I’d like to get something working ASAP, but the day job sucks up all of my energy. Without the day job I won’t need the synchronizer so much - so it’s a bit of a Catch-22.

I have Rentacoder and Elance bids out for some filesystem traversal code. I’m disappointed so far. I’ve been bid between USD$80 and $350. I sat down in front of the TV today, watched an episode of Daria and an episode of Futurama and had it done. Half of that was remembering that Python’s sqlite module doesn’t automatically commit to the database and wondering why there was no data in the database. Plus the hour I spent writing the spec and the hour I spent setting up accounts and the $10 Elance setup fee. There are a few bids which are attractive because the programmers obviously know their stuff, but I just can’t bring myself to spend $200 on an hour-long job which I’ve already finished.

There’s an obvious flipside here - if Rentacoder and Elance bids are so high, I can probably earn a bit of pocket money through them. Earning money by bidding on online jobs is very attractive to me. Of course, people go on there expecting to pay the coders crap. It might just be a matter of sorting through the bid requests and picking ones that I can do profitably and ignoring the requests with very low prices.

On the upside, the USD is in the shitter so stuff is cheap for me to buy. On the downside, one of my income sources pays in USD, so that’s suddenly become a lot less valuable.

I’ll be moving to Melbourne in January, right around linux.conf.au. I’m pretty excited. Rents are so much cheaper than Sydney, so it’s likely that I’ll be able to afford something with a nice view. Rent is still my biggest ongoing expense. Once I cut out expenses for crap cafe food at work and car fuel+maintenance (urgh! the damn thing keeps breaking!) it’ll account for 60% of my expenses. Living lean has its advantages.

Just thinking about the file synchronizer is useful at the moment, though. There’s a lot of unknowns that I haven’t worked out yet - it’s the sort of problem that can be described succinctly, but working out all of the intricacies of the solution (and a UI to suit) is fairly complicated. I’m torn between working out every nuance or just jumping in and getting something 80% working. The first milestone will be to write something that can replace Unison, and I’ve got that pretty well worked out. It’s just unsuitable for later features that I want to add (backups, Time Machine and flash-drive metadata exchange).

S&M (Sales & Marketing) and other things I swore I’d never do

Tuesday, November 13th, 2007

I used to say “I’ll never get involved with fundraising for a startup. I’ll write code, support users, write marketing material, even sell stuff. But I won’t go begging for cash.”

I’ve started begging for cash. I’ve been looking at government grants. This is just the latest in the a saga of Things I Swore I’d Never Do But Am Now Doing.

It’s a slippery slope. You start out as an elitist coder, laughing at those sales and marketing jocks and how useless they are. I mean, they speak in terms of synergies and paradigms! It’s the NEW PARADIGM that creates SYNERGIES to increase our ROI and improve RETURN RATES so we can INCREASE INVESTOR CONFIDENCE.

Oh yes. We did laugh. And now, such a statement makes perfect sense.

Why do coders start out with such a negative attitude towards sales and marketing? I recall my two big complaints being “they don’t do any real work!” and “how come they get commmissions and I don’t get squat?”

Of course, I’ve learned now. The work is just as real - it’s just in a different form.

There’s a big cultural difference. Sales, in particular, is about building relationships. Developers generally find this sort of approach disconcerting - I think we’re a distrustful bunch in general. So when sales guys come up and are all friendly for no particular reason, we’re sitting there thinking “What is this guy up to? Why is he being my friend?” The sales guy is just being a sales guy.

As for commissions, I think there are two factors here. Salespeople have always received commissions. That’s just how it is. Additionally, them reaching their goal is very easy to detect and measure - they get a purchase order (and preferably some cash). If I fix a dozen bugs, that doesn’t obviously result in increased revenues for the company.

This doesn’t work perfectly in the real world, of course. If programmers were paid by commission - say, they earn $5 for each bug that they fix - then they’d probably start writing a whole lot more bugs. Likewise, salespeople will aim to get just the incentive instead of looking at the bigger picture. They’ll start making crazy claims in order to get a sale at any cost because they’re not the ones who’ll have to make it happen. They’ll get the sale, but it’ll cost the company in the long run due to misallocated development resources.

I think for a second that maybe you could pay programmers for each external support call that they resolve - that way, they can’t just generate cash on their own. But then, they have no incentive to avoid the bugs in the first place - they’ll just ship them out, wait for the support calls to roll in and then ship out a fix. So you want to be paying them for not creating bugs - maybe start with a $500 fixed bonus and dock them $5 for each bug that comes up. Then you’ll see bugs not getting reported and some really cranky programmers. You’re now penalizing them for being imperfect. Most programmers I know already beat themselves up when a bug arises - docking their pay isn’t going to help at all.

I have seen sales commissions work against salespeople, however. It seems to be widely accepted that if you are going to get commissions you’ll also get paid less base salary. This is fine if you’re good at sales and not so fine if you’re bad at it. It also means that the quality of the product you’re selling and plain old luck has a big impact on your livelihood. Happen to find someone who already loves your product and wants to buy a thousand of them? Great! You didn’t have to do any work, but you profit just the same. Some doctor at a medical centre gets regular cash from me due to a referral she wrote a year ago - because she happened to be the first doctor available on the day. In this sense, the commission isn’t really earned.

So now I’m a developer who is familiar with the languages of sales and marketing and business. Just being able to understand what !developers are saying comes in handy, and they definitely notice. Being the optimist I am (hah!) I keep thinking “what’s the worst that could happen” with my business, and being able to communicate well with everyone in an organisation is one of the big plusses.

Filesystem annotation and things that suck. Also, procrastination.

Wednesday, November 7th, 2007

Today was a ‘designated project day’. I’m supposed to do no client work - just product development.

I didn’t get much work done.

I seem to have spent the entire day doing admin. This is a good thing - it’d been piling up quite badly over the last few months. I found a forgotten credit card statement. I shuffled a lot of money between accounts. I still haven’t filed all of my invoices or done all of my phone calls. I still haven’t attacked the pile of dishes in the sink - there’s been literally nothing clean for about three days now.

I need a personal assistant or something.

I did bolt on a new feature to the planner - you can now set reminders on tasks. It’s bolted on in the most literal sense possible - I added a new column to my task tree and display the reminder date there. It makes absolutely no sense from a UI point of view, but it solves my immediate problem.

I did have the chance to look over the code again with fresh eyes, seeing as I haven’t touched it in weeks. I’m embarrassed to release it publicly. It works, but the design is so awful that I’d be afraid a future employer or client would see it. It looks an awful lot like a VB application I wrote back in high school. Not good.

I’m also having immediate problems with file synchronization (surprise, surprise) and am eager to start on that. Unison is sucking a whole lot less now that I’ve completed a synchronization (previously, it would crap out because it was taking too long) but it still sucks in fundamental ways, such as not being able to sync multiple directories at once. As one might commonly do, given we have lots of stuff.

Tracker is also sucking, but a lot of people have complained about that. Sorry, Jamie, it’s broken. ‘Churns disk constantly’ is not an acceptable characteristic for any piece of desktop software.

Google Desktop for Linux seemed to suck a lot less in the constant-disk-churning department, but it also did something nasty that made every file access slow. And my Thunderbird would just lock hard for thirty seconds at a time every few minutes. Probably while it was reindexing. Sigh.

This is a weak segue into a more general problem area that I’ve been thinking about which is filesystem annotation. Essentially, what Tracker is doing is reading each file and creating some sort of index. Presumably, it also checksums each file so that it knows when it’s changed. It also sets inotify(s) so that it gets a message when files change.

Well, so does a file synchronizer. I just want to know if a file has changed, which means checksum. If we were sensible and using filesystems that actually knew when disk corruption had occurred like ZFS, I could just query some sort of checksum metadata in the filesystem, in a completely non-portable and non-futureproofed fashion. It would be fast, though. Rambling aside, I want to perform some actions when a file is changed, and I want to keep a little bit of data alongside each file.

Incremental backups (and rsync) have the same problem. You just want to know which files have changed. You don’t want to touch every file on the system. Making things even worse, Linux systems enable atime by default, which means that you’re writing to every file’s metadata every time you traverse the filesystem.

(You can’t just look at file modification times because a) lots of applications don’t change them, and b) they differ between non ntpdated systems. It’s a good way to get your files out of sync without realizing.)

And every one of these applications has the attribute that you want to traverse the filesystem as often as possible, either for reliability or accuracy.

Synergies are brewing here, I tells ya.

Basically, traversing a filesystem is dumb. But we have no better mechanism right now. We want to add file metadata after the fact and the filesystem designer has no way to know what you might want to add and when you might want to add it.

Because of this fundamental problem, I am without a good backup system. I am without file indexing and I am without file synchronization. Multi-gigahertz quad-core processors don’t help with any of that.

How do we fix this, then? (I’m more or less out of premeditated content here and into the exploratory). I think that if each of these applications could get a list of files that have been modified and the datetime on which they were modified, that would help a lot. I assume something like this has to be baked into either the kernel or libc; maybe Windows has an API for this already. I do recall Google Desktop sucking a lot less on there. The list would need to be expired at some point; you can’t track every modification forever or you’ll run out of RAM. Filesystems can probably do this much more efficiently as well, given they already track modification time. And you still might have to do a full rescan in case something modifies the filesystem while your program isn’t running.

If I had my way, it’d be a filesystem option. Every block or inode is checksummed; the block checksum and modtime are stored side-by-side and there’s a fast way to query these checksums/modtimes. A full dump would probably be too slow (multi-gigabyte Maildirs seem to be a good way to show up performance problems in all sorts of software), so a list list of recent changes would be good too. It has the RAM problem, however. Keeping a queue in RAM is more desirable than acting on them immediately, especially in these sort of applications where timeliness is good, but overall system performance is more important. You do the processing while the machine is idle.

Unfortunately, I’m at the limits of what idle speculation can gain me, and more to the point I have to solve these problems using the systems we have now, not some magic pixie-dust filesystem with freaking checksums (come on! seriously! how could every common filesystem in existence not checksum its data! do you not care about reliability at all?) So it looks like inotify, ionice and load detection for now. Undoubtedly, once I start building this stuff I’ll start to understand the problems better.

Back to the dishwashing.

File synchronizers suck! Guess what, I’m procrastinating!

Friday, November 2nd, 2007

Yup, I’ve been ’slacking off’ in the sense that I haven’t worked on any business projects. Saying I’ve been snowed under with client work sounds like an excuse - I’ll always be snowed under with client work. There’s more of it than I can handle. The market economist in my head says that I should increase my hourly rate, but so far that’s been met with very poor responses.

I went to the Sydney OpenCoffee meet yesterday and had a blast. Finally, entrepreneurial types in Sydney! There was a lot of talking about mentoring and quipped that I just needed someone to kick me in the ass and tell me to do some work. Seems I’m not the only one with that problem.

I’ve decided what project 3 will be. A file sychronizer. Yup, another one. The problem that has become sorely obvious in the last couple of days is that all current file synchronizers suck. They either:

  • use timestamps to detect changes. This isn’t particularly accurate.
  • use checksums to detect changes. This is accurate, but a synchronization takes forever if you have any reasonable amount of data. My typical ‘working set’ is on the order of 4 gigabytes across a few hundred thousand files. What I want is to just sync my entire home directory, and that’s around 40 gigs.

I tried out Unison and jfilesync. Unison seems to work alright but is very slow - slow enough that I haven’t actually bothered completing a sync yet. jfilesync craps itself if you have non-files in your directory (such as named pipes) or if you have directory loops (such as many build environments) or if you breathe on it wrong, so that’s out too.

I’ve been doing the incremental rsync backup thing for a while and like it. I’d like my synchronizer to be able to handle backups too.

I don’t know how I’d achieve both performance and accuracy, yet. Obviously, checksumming every file on the system is not acceptable on a regular basis. That’s the sort of task a filesystem should be doing. Scarily enough, ZFS is the only modern filesystem which performs end-to-end checksumming; I consider that something of an outrage considering the abundance of CPU power we have nowadays. I’m hoping something like FAM can tell me when files change without requiring me to poll them or install magic kernel modules. I’ll do kernel modules, but they’re a massive support burden.

Stepping back for a moment, what I really want isn’t file synchronization. I want my working environment to be wherever I am. Carrying a laptop and doing all of your work on there solves a good chunk of the problem, but laptops are physically limited in what they’re capable of.

Location awareness is a hassle; I’m forever changing resolutions and font sizes and performance settings depending on where I am. Software such as MarcoPolo solves some of this by helping the computer figure out where it’s located automatically.

I like Philip Greenspun’s idea for using a mobile phone as a computer; you take your phone, plug it in at one location and all of your stuff pops up. Programs run on the larger CPU at your location. When you unplug, everything is somehow still on the phone, you take it to another location, plug it in, and everything pops up again. The technology is part of the way there; what’s still lacking is probably storage space on the phone. It’s certainly solvable if you paid enough money.

An idea I’ve been kicking around for a while is ‘process hibernation’. When you hibernate a process, its memory image and external references (files, sockets, etc) are saved to disk. You can then take that disk image, stick it on another computer and fire it up again exactly as if it never stopped running. That would be convenient, especially if you could do it for an entire X or Windows session.

You can sort of do it with X forwarding, but that’s too slow over a network. You can sort of do it by running out of a VMWare session saved to removable storage, but that’s too slow as well. I’d rather just carry a laptop.

I really have to get some work done on the planner, too. My ‘next task’ for a while has been to get the Windows version to work using py2exe, but I might push that back to do a Reminders module and a Scheduler module, given I actually need them now and don’t personally need a Windows port for the foreseeabe future.