File synchronization algorithms: part 3 of monkey

Algorithm 2 (a.k.a. ‘what everyone is using’) has some shortcomings:

  • Detecting changes takes a long time
  • It won’t detect renames or directory moves
  • There are still some cases where you need to resolve conflicts and/or merge files

There are also some usability issues:

  • You need to manually initiate a sync. You can’t just pick up your laptop and go anytime.
  • Performance sucks. I may have mentioned that a dozen or so times.
  • There’s nothing to stop you modifying a file on both sides; you have to remember which is the most recent and remember to sync before working on the other machine.

Here’s how I’ll fix these problems.

Constantly monitor for changes

The existing tools require you to manually initiate a sync, at which point you’ll have a few minutes of disk grinding. I’d rather have the program running constantly and being notified of changes as they happen. The common case is that only a few files will change between syncs - reading all of the files is inefficient.

I covered this in filesystem annotation. What I want is an API that notifies me when files change (or are created or deleted). I think inotify will do this, perhaps FAM. I have no idea what to use on Windows or OSX yet. On a technical level, this is an unsolved problem.

There is a risk here that if files are modified while the application is not running (and hence not receiving notifications) the modifications could be lost.

The fallback option is to scan the file trees while the machine is idle. If you’re checksumming files to detect changes, this can happen during idle time as well.

I think idle time is a grossly underutilized resource right now - we could be doing virus scanning, file indexing, backups and the like constantly instead of at intervals (3am cronjob) or while the user is trying to use the system (like most on-demand virus scanners).

Constantly synchronize changes

If you’re going to scan all of the time, you might as well copy files straight away rather than waiting until the user requests a sync. This will cut down the odds of a merge conflict somewhat, since the files are less likely to be modified simultaneously on both sides. This introduces the idea of a pair of machines being connected; while they are connected, their files are always synchronized. Since you’re probably modifying small amounts of data at a time, this will work reasonably well over a slow network connection.

Lock in-use files

Another way to prevent merge conflicts is to lock a file on machine A if it’s being written to on machine B. This prevents an application on machine A from modifying it at the same time.

Identify machines by a UUID rather than IP address

A common situation is to have a laptop and a desktop that you want synchronized together. You have the laptop at home and sync the files. You take the laptop to work, but because you’re on a different IP the sync program thinks it’s a different machine. If you give each machine a UUID or name, you can be (reasonably) sure of its identity and hence use the right indexes or file trees.

Checksum files or their metadata in order to detect renames

If you’ve got a checksum of each file (or just the modification time and size) and you detect a file deletion, you can look through any new files and see if they’re actually the same file. You can then infer that a file was moved or renamed rather than deleted and a new file created, saving time and bandwidth during the synchronization. It may be possible to optimize this further by looking at inode numbers or their equivalent on whatever filesystem is in use.

Leave a Reply