Archive for the ‘Uncategorized’ Category

File synchronization algorithms: part 3 of monkey

Monday, November 26th, 2007

Algorithm 2 (a.k.a. ‘what everyone is using’) has some shortcomings:

  • Detecting changes takes a long time
  • It won’t detect renames or directory moves
  • There are still some cases where you need to resolve conflicts and/or merge files

There are also some usability issues:

  • You need to manually initiate a sync. You can’t just pick up your laptop and go anytime.
  • Performance sucks. I may have mentioned that a dozen or so times.
  • There’s nothing to stop you modifying a file on both sides; you have to remember which is the most recent and remember to sync before working on the other machine.

Here’s how I’ll fix these problems.

Constantly monitor for changes

The existing tools require you to manually initiate a sync, at which point you’ll have a few minutes of disk grinding. I’d rather have the program running constantly and being notified of changes as they happen. The common case is that only a few files will change between syncs - reading all of the files is inefficient.

I covered this in filesystem annotation. What I want is an API that notifies me when files change (or are created or deleted). I think inotify will do this, perhaps FAM. I have no idea what to use on Windows or OSX yet. On a technical level, this is an unsolved problem.

There is a risk here that if files are modified while the application is not running (and hence not receiving notifications) the modifications could be lost.

The fallback option is to scan the file trees while the machine is idle. If you’re checksumming files to detect changes, this can happen during idle time as well.

I think idle time is a grossly underutilized resource right now - we could be doing virus scanning, file indexing, backups and the like constantly instead of at intervals (3am cronjob) or while the user is trying to use the system (like most on-demand virus scanners).

Constantly synchronize changes

If you’re going to scan all of the time, you might as well copy files straight away rather than waiting until the user requests a sync. This will cut down the odds of a merge conflict somewhat, since the files are less likely to be modified simultaneously on both sides. This introduces the idea of a pair of machines being connected; while they are connected, their files are always synchronized. Since you’re probably modifying small amounts of data at a time, this will work reasonably well over a slow network connection.

Lock in-use files

Another way to prevent merge conflicts is to lock a file on machine A if it’s being written to on machine B. This prevents an application on machine A from modifying it at the same time.

Identify machines by a UUID rather than IP address

A common situation is to have a laptop and a desktop that you want synchronized together. You have the laptop at home and sync the files. You take the laptop to work, but because you’re on a different IP the sync program thinks it’s a different machine. If you give each machine a UUID or name, you can be (reasonably) sure of its identity and hence use the right indexes or file trees.

Checksum files or their metadata in order to detect renames

If you’ve got a checksum of each file (or just the modification time and size) and you detect a file deletion, you can look through any new files and see if they’re actually the same file. You can then infer that a file was moved or renamed rather than deleted and a new file created, saving time and bandwidth during the synchronization. It may be possible to optimize this further by looking at inode numbers or their equivalent on whatever filesystem is in use.

File synchronization algorithms, part 2 of elephant

Wednesday, November 21st, 2007

Don’t think about elephants.

The next refinement is to store history data when you look at the file trees. Every time you perform a synchronization you record some metadata for each file. You want to store the filename and the modification time. That way, when you do the next synchronization, you look at what changed between time X and time Y and apply those changes to the remote file tree, somewhat like generating a diff and then patching a tree. You do this twice - once for each direction (A to B and B to A). You can get conflicts, of course.

Conceptually, this looks like:

Compare this with the algorithm from part 1, which looks like this:

Note that if you have no history data, Algorithm 2 works exactly like Algorithm 1. Badly.

This all operates much like a version control system and has similar problems and implications. A VCS usually can’t detect renames of files or directories - you have to explicitly tell the VCS what you’ve done. When you want to perform a synchronization you have to traverse the entire directory tree to find out what’s changed - and this can be very time-consuming. The metadata has to be stored somewhere. Merges almost always require manual intervention and will often be unresolvable (either the user won’t know what to do and will just overwrite one side, or the file format won’t support lines-of-text style merging).

Also note the similar distinction between traditional client-server VCS (e.g. CVS, Perforce) and modern distributed VCS (Mercurial, git). Client-server VCS and propagates the nodes (or the actual files being worked on). Distributed VCS propagates the edges (or diffs). Algorithm 1 is looking purely at the file data and attempting to match it on both sides; algorithm 2 is looking at the changes between the ’sync points’ (or nodes) and propagating the changes.

The actions table for each file looks something like:

File A change File B change Action
 
Created (checksum P) Created (checksum P) Nothing
Created (checksum P) Created (checksum Q) Merge
Deleted No change Delete
Deleted Deleted Nothing
No change No change Nothing
Modified No change Use file A
Modified Modified Merge

(The actions for File A and File B can be interchanged - I didn’t feel like writing out those cases twice.)

If you include the possibility of renames (and horror of horrors, renames with modifies) then you can get a whole lot more combinations and it gets really nasty. I must give kudos to SourceGear for Vault for this: it does handle all of those nasty cases, a headache which I can do without.

Detecting what’s happened between time X and time Y is similarly mechanical. For a given file:

Time X Time Y Change
Does not exist Exists Created
Exists Does not exist Deleted
Checksum P Checksum P Nothing
Checksum P Checksum Q Modified

Without having looked at the source code, I’d say this is the algorithm that Unison uses. I’d also guess that most ‘proper’ synchronization programs use this. It’s the simplest thing that works in most cases.

Note that you also need to be able to reliably detect a change in a file. The (almost) infallible way to do this is to hash the file. I say almost because hash collisions do happen - they’re just extremely rare. ‘Extremely rare’ becomes a lot more common when you’re talking about a million files (32 bits of hash is not enough).

The other option is to look at the modification time of the file. Software can and does manipulate the modtime, however, and you might miss changes. Users might change the system time and confuse your sync program (if a change was made a long time ago). You might not be syncing to a device that has a real-time clock (some mobile phones, notably). You also have to sync the times between the two systems, but that’s not too hard.

Aaaanyway, the gist of it is:

  • Checksums: reliable, slow (you have to read the entire contents of every file)
  • Modification time: less reliable, much faster

Some filesystems such as JFFS2 keep a revision number on each block (roughly). If the revision number goes up, you can be assured that a write has happened regardless of what the modification time says. This is not a common feature, however, and probably not accessible to userspace programs anyway. There’s no easy solution here.

Next up: It still sucks. How to make it usable.

File synchronization algorithms, part 1 of lots

Wednesday, November 21st, 2007

“wherein I work out my design in public.”

You have two filesystem trees, A and B. You want the files on both sides to be the same.

Cases that you need to handle:

  • File exists on A but not on B (and vice-versa)
  • File exists on both and is identical
  • File exists on both and is different

Right about this point in time, you’re in trouble. (That was fast!) Only one of those situations can be handled automatically, and that’s if the file is identical on both sides. You need a lot of user input to figure out what the directories should look like, and users tend to say “too hard!” Unison assumes that if a file is present on one side and not on the other, it has just been created. So it copies it across. Already we’re in dangerous territory because this is frequently not what you want to do.

If the file exists and is different, you have to ask the user how to merge them or which one to pick. Asking regular users how to merge files is a bad idea. (Asking developers how to merge files is usually a bad idea.)

Sigh.

This algorithm is not going to work very well. It doesn’t handle any common cases, makes a lot of mistakes in its assumptions, and asks users too much information (which will probably be wrong anyway). Anyone using this algorithm in their synchronization product (*cough* Microsoft *cough*) is going to have a lousy product.

(Don’t get me wrong. I like Office. I like many Microsoft games. I’m not anti-Microsoft at all. It’s just Sturgeon’s Law: 90% of everything is crap.)

Unfortunately, this case is unavoidable on the very first synchronization of a pair of trees. We have no history data - even disconnected history data - and so cannot make informed decisions about what’s new, deleted or changed. The files just are or they are not and we can’t say which of the two trees is correct.

Read the thrilling (!) algorithm in Part Two for a possible solution (!!!11!)

Status report, 21 Nov 2007

Wednesday, November 21st, 2007

The day job is giving me serious grief, so very little project work is getting done. I’m stressed out and spending a lot of time resting. I do badly need a holiday, but if I wasn’t working in this environment I wouldn’t need a holiday nearly so much.

I’m doing a lot of thinking and planning for the file synchronizer - it’s my biggest ongoing computing problem right now - but not a lot of code is getting written. I’m also a bit nervous that I won’t have such a need for it in a few months and hence will lose interest in the project. I’d like to get something working ASAP, but the day job sucks up all of my energy. Without the day job I won’t need the synchronizer so much - so it’s a bit of a Catch-22.

I have Rentacoder and Elance bids out for some filesystem traversal code. I’m disappointed so far. I’ve been bid between USD$80 and $350. I sat down in front of the TV today, watched an episode of Daria and an episode of Futurama and had it done. Half of that was remembering that Python’s sqlite module doesn’t automatically commit to the database and wondering why there was no data in the database. Plus the hour I spent writing the spec and the hour I spent setting up accounts and the $10 Elance setup fee. There are a few bids which are attractive because the programmers obviously know their stuff, but I just can’t bring myself to spend $200 on an hour-long job which I’ve already finished.

There’s an obvious flipside here - if Rentacoder and Elance bids are so high, I can probably earn a bit of pocket money through them. Earning money by bidding on online jobs is very attractive to me. Of course, people go on there expecting to pay the coders crap. It might just be a matter of sorting through the bid requests and picking ones that I can do profitably and ignoring the requests with very low prices.

On the upside, the USD is in the shitter so stuff is cheap for me to buy. On the downside, one of my income sources pays in USD, so that’s suddenly become a lot less valuable.

I’ll be moving to Melbourne in January, right around linux.conf.au. I’m pretty excited. Rents are so much cheaper than Sydney, so it’s likely that I’ll be able to afford something with a nice view. Rent is still my biggest ongoing expense. Once I cut out expenses for crap cafe food at work and car fuel+maintenance (urgh! the damn thing keeps breaking!) it’ll account for 60% of my expenses. Living lean has its advantages.

Just thinking about the file synchronizer is useful at the moment, though. There’s a lot of unknowns that I haven’t worked out yet - it’s the sort of problem that can be described succinctly, but working out all of the intricacies of the solution (and a UI to suit) is fairly complicated. I’m torn between working out every nuance or just jumping in and getting something 80% working. The first milestone will be to write something that can replace Unison, and I’ve got that pretty well worked out. It’s just unsuitable for later features that I want to add (backups, Time Machine and flash-drive metadata exchange).

Filesystem annotation and things that suck. Also, procrastination.

Wednesday, November 7th, 2007

Today was a ‘designated project day’. I’m supposed to do no client work - just product development.

I didn’t get much work done.

I seem to have spent the entire day doing admin. This is a good thing - it’d been piling up quite badly over the last few months. I found a forgotten credit card statement. I shuffled a lot of money between accounts. I still haven’t filed all of my invoices or done all of my phone calls. I still haven’t attacked the pile of dishes in the sink - there’s been literally nothing clean for about three days now.

I need a personal assistant or something.

I did bolt on a new feature to the planner - you can now set reminders on tasks. It’s bolted on in the most literal sense possible - I added a new column to my task tree and display the reminder date there. It makes absolutely no sense from a UI point of view, but it solves my immediate problem.

I did have the chance to look over the code again with fresh eyes, seeing as I haven’t touched it in weeks. I’m embarrassed to release it publicly. It works, but the design is so awful that I’d be afraid a future employer or client would see it. It looks an awful lot like a VB application I wrote back in high school. Not good.

I’m also having immediate problems with file synchronization (surprise, surprise) and am eager to start on that. Unison is sucking a whole lot less now that I’ve completed a synchronization (previously, it would crap out because it was taking too long) but it still sucks in fundamental ways, such as not being able to sync multiple directories at once. As one might commonly do, given we have lots of stuff.

Tracker is also sucking, but a lot of people have complained about that. Sorry, Jamie, it’s broken. ‘Churns disk constantly’ is not an acceptable characteristic for any piece of desktop software.

Google Desktop for Linux seemed to suck a lot less in the constant-disk-churning department, but it also did something nasty that made every file access slow. And my Thunderbird would just lock hard for thirty seconds at a time every few minutes. Probably while it was reindexing. Sigh.

This is a weak segue into a more general problem area that I’ve been thinking about which is filesystem annotation. Essentially, what Tracker is doing is reading each file and creating some sort of index. Presumably, it also checksums each file so that it knows when it’s changed. It also sets inotify(s) so that it gets a message when files change.

Well, so does a file synchronizer. I just want to know if a file has changed, which means checksum. If we were sensible and using filesystems that actually knew when disk corruption had occurred like ZFS, I could just query some sort of checksum metadata in the filesystem, in a completely non-portable and non-futureproofed fashion. It would be fast, though. Rambling aside, I want to perform some actions when a file is changed, and I want to keep a little bit of data alongside each file.

Incremental backups (and rsync) have the same problem. You just want to know which files have changed. You don’t want to touch every file on the system. Making things even worse, Linux systems enable atime by default, which means that you’re writing to every file’s metadata every time you traverse the filesystem.

(You can’t just look at file modification times because a) lots of applications don’t change them, and b) they differ between non ntpdated systems. It’s a good way to get your files out of sync without realizing.)

And every one of these applications has the attribute that you want to traverse the filesystem as often as possible, either for reliability or accuracy.

Synergies are brewing here, I tells ya.

Basically, traversing a filesystem is dumb. But we have no better mechanism right now. We want to add file metadata after the fact and the filesystem designer has no way to know what you might want to add and when you might want to add it.

Because of this fundamental problem, I am without a good backup system. I am without file indexing and I am without file synchronization. Multi-gigahertz quad-core processors don’t help with any of that.

How do we fix this, then? (I’m more or less out of premeditated content here and into the exploratory). I think that if each of these applications could get a list of files that have been modified and the datetime on which they were modified, that would help a lot. I assume something like this has to be baked into either the kernel or libc; maybe Windows has an API for this already. I do recall Google Desktop sucking a lot less on there. The list would need to be expired at some point; you can’t track every modification forever or you’ll run out of RAM. Filesystems can probably do this much more efficiently as well, given they already track modification time. And you still might have to do a full rescan in case something modifies the filesystem while your program isn’t running.

If I had my way, it’d be a filesystem option. Every block or inode is checksummed; the block checksum and modtime are stored side-by-side and there’s a fast way to query these checksums/modtimes. A full dump would probably be too slow (multi-gigabyte Maildirs seem to be a good way to show up performance problems in all sorts of software), so a list list of recent changes would be good too. It has the RAM problem, however. Keeping a queue in RAM is more desirable than acting on them immediately, especially in these sort of applications where timeliness is good, but overall system performance is more important. You do the processing while the machine is idle.

Unfortunately, I’m at the limits of what idle speculation can gain me, and more to the point I have to solve these problems using the systems we have now, not some magic pixie-dust filesystem with freaking checksums (come on! seriously! how could every common filesystem in existence not checksum its data! do you not care about reliability at all?) So it looks like inotify, ionice and load detection for now. Undoubtedly, once I start building this stuff I’ll start to understand the problems better.

Back to the dishwashing.

File synchronizers suck! Guess what, I’m procrastinating!

Friday, November 2nd, 2007

Yup, I’ve been ’slacking off’ in the sense that I haven’t worked on any business projects. Saying I’ve been snowed under with client work sounds like an excuse - I’ll always be snowed under with client work. There’s more of it than I can handle. The market economist in my head says that I should increase my hourly rate, but so far that’s been met with very poor responses.

I went to the Sydney OpenCoffee meet yesterday and had a blast. Finally, entrepreneurial types in Sydney! There was a lot of talking about mentoring and quipped that I just needed someone to kick me in the ass and tell me to do some work. Seems I’m not the only one with that problem.

I’ve decided what project 3 will be. A file sychronizer. Yup, another one. The problem that has become sorely obvious in the last couple of days is that all current file synchronizers suck. They either:

  • use timestamps to detect changes. This isn’t particularly accurate.
  • use checksums to detect changes. This is accurate, but a synchronization takes forever if you have any reasonable amount of data. My typical ‘working set’ is on the order of 4 gigabytes across a few hundred thousand files. What I want is to just sync my entire home directory, and that’s around 40 gigs.

I tried out Unison and jfilesync. Unison seems to work alright but is very slow - slow enough that I haven’t actually bothered completing a sync yet. jfilesync craps itself if you have non-files in your directory (such as named pipes) or if you have directory loops (such as many build environments) or if you breathe on it wrong, so that’s out too.

I’ve been doing the incremental rsync backup thing for a while and like it. I’d like my synchronizer to be able to handle backups too.

I don’t know how I’d achieve both performance and accuracy, yet. Obviously, checksumming every file on the system is not acceptable on a regular basis. That’s the sort of task a filesystem should be doing. Scarily enough, ZFS is the only modern filesystem which performs end-to-end checksumming; I consider that something of an outrage considering the abundance of CPU power we have nowadays. I’m hoping something like FAM can tell me when files change without requiring me to poll them or install magic kernel modules. I’ll do kernel modules, but they’re a massive support burden.

Stepping back for a moment, what I really want isn’t file synchronization. I want my working environment to be wherever I am. Carrying a laptop and doing all of your work on there solves a good chunk of the problem, but laptops are physically limited in what they’re capable of.

Location awareness is a hassle; I’m forever changing resolutions and font sizes and performance settings depending on where I am. Software such as MarcoPolo solves some of this by helping the computer figure out where it’s located automatically.

I like Philip Greenspun’s idea for using a mobile phone as a computer; you take your phone, plug it in at one location and all of your stuff pops up. Programs run on the larger CPU at your location. When you unplug, everything is somehow still on the phone, you take it to another location, plug it in, and everything pops up again. The technology is part of the way there; what’s still lacking is probably storage space on the phone. It’s certainly solvable if you paid enough money.

An idea I’ve been kicking around for a while is ‘process hibernation’. When you hibernate a process, its memory image and external references (files, sockets, etc) are saved to disk. You can then take that disk image, stick it on another computer and fire it up again exactly as if it never stopped running. That would be convenient, especially if you could do it for an entire X or Windows session.

You can sort of do it with X forwarding, but that’s too slow over a network. You can sort of do it by running out of a VMWare session saved to removable storage, but that’s too slow as well. I’d rather just carry a laptop.

I really have to get some work done on the planner, too. My ‘next task’ for a while has been to get the Windows version to work using py2exe, but I might push that back to do a Reminders module and a Scheduler module, given I actually need them now and don’t personally need a Windows port for the foreseeabe future.

Making it easy for people to try your software

Monday, October 15th, 2007

I suspect that a great deal of the success of web-based applications is that it’s easy to try them. All you have to do is go to the website; no installing things, no dependencies, no virus scanners, no saving files. There are very low barriers to entry; there’s very little friction.

Personally, I find most web-based applications to be absolutely awful to use. Every UI is different. There aren’t any keyboard shortcuts. I like to have control over my data, and I like to be able to work offline.

For most people, these aren’t major problems. Witness the almost complete domination of webmail over traditional desktop email programs. Given that an ‘ordinary user’ doesn’t have a very reliable computer or backups, keeping all of your email online overseas is probably not a big problem. They’re going online to read email anyway, so it might as well live online.

Imagine, for a moment, the last time you installed some software on your computer. The last installation I did on Windows was the Symbian SDK. I had to choose between a number of download options, download a few hundred megabytes of files, do the installation, consult Google to find out why it wouldn’t work after being installed, go and download some dependencies, install them, blah blah blah blee blee blah blah.

The last installation I did on Linux was FlightGear, a flight simulator. I fired up aptitude, selected the flightgear package and told it to install. There was still a long wait for the download (also a few hundred megabytes) but all of the dependencies and installation happened automatically. The actual time I spent performing the installation was quite small; I just went on with other tasks while it happened. This is a much better situation.

The next best thing for Windows is usually to download a zip file which uncompresses and runs from the folder - it doesn’t have an install program. These programs also run nicely off thumb drives if you’re into that. I like these as a developer - writing and testing install programs is a royal pain in the behind.

Better still would be to just download an executable and run that. Fog Creek Copilot does this for their user agent. It even deletes itself when it’s done, which is extremely considerate. If you’ve got a program with lots of files - like the Python apps I’m planning to distribute - you could have it extract to a temporary directory and execute from there.

Can a desktop app get any easier? An even easier situation might be to click a link on a website and have the program fire up on your desktop. Java Web Start has the right idea, but the implementation is sloppy. The idea is that you click the magic JWS button on a website and the program is downloaded without your knowledge and executed on your computer. Great. You do a caching scheme so you don’t have to download repeatedly; you can set up a Start/GNOME/K/whatever menu item so people can come back later.

Thinking about this last option against a web application, the web application still seems smoother. I think it’s the download time. There’s also the scary “do you want to download this executable” box which most people see but click through anyway. The webapp still feels like it has less friction. The desktop app has a large ’speed bump’ where you have to wait for a download, and during that you start thinking about other things or mentally task switch. You have to think about where you’re going to store the file and do I have to uncompress it and does it have any viruses and how am I going to arrange the windows. The web app is just another tab, and even if I task switch away, it’s just there waiting. Web apps are fairly predictable, even if they are a bit boring at times.