Archive for the ‘Uncategorized’ Category

Improving on ZFS: BlobFS

Thursday, December 13th, 2007

I have a love-hate relationship with ZFS. On one hand, I love how easy it is to operate and how it makes it difficult to damage the array. On the other hand, I hate how it interoperates poorly with Linux and how there are some really crucial operations (like vdev removal or array growing) that aren’t implemented and haven’t been for a long time, despite pleas from the community.

I think it could be easier and better. ‘Final word in filesystems’? Yeah, and 640kb should be enough for anybody.

This is intended for home file servers - big files, mostly reads, performance not a big factor. An assumption is that there is no backup, so the data needs to be as safe and correct and recoverable as possible.

The idea

  • You have some disks. BlobFS will make them a big filesystem with a specified level of redundancy.
  • You can select the number of disks you can lose and hence the level of redundancy (the ’safety margin’). Changes are implemented automatically.
  • If you pull out a disk, the data will rebalance across the array automatically. You’ll have a capacity loss (if the data will fit on the disks) or a safety margin loss (if there’s too much data).
  • If the array does lose too many disks, BlobFS will attempt to allow access to anything that’s still readable. This lets you attempt a partial recovery.
  • All data (including filesystem metadata) is end-to-end checksummed, so you can be fairly certain that flakey equipment is not going to silently corrupt your data.

The basic idea is that you give it some disks and some data, and it will take care of the data as best it can. You leave the low-level decisions about RAID algorithms and device recovery up to BlobFS. It makes the best moves it can with the resources it has available.

This has a subtle but significant benefit in terms of reliability. It’s removing human intervention from the recovery process. You don’t have to say “OK, I’m going to offline that disk and add another one in its place and restripe the array” while hoping that you’re offlining the correct disk and restriping is the right thing to do and your replacement disk is in fact empty and not another disk in the array that contains the last redundant copy of your data. All of that logic is in the software. It takes care of your data as best it can.

disk-drive.jpeg

Some other things that may help reliability:

  • You can ask to remove a disk and BlobFS will move the data off that disk so you can remove it without risking array degradation.
  • If there’s spare space on a disk, automatically fill it with more redundant data. This implies that at any given moment, the array will be (from the perpective of the programmer) completely full. When you add more data, you’re actually removing an old redundant copy of something and replacing it with the new data.
  • Spare space can also be used for journalling and backtracking. One of the big criticisms of RAID arrays without a separate backup is that you can destroy everything very quickly through software - there’s no way to back out of deletions or a virus attack. With journalling and backtracking, however, this concern disappears - if you can backtrack to yesterday’s state before you ran that dodgy porn dialer, you don’t need backup tapes quite so much.

Use case

Bob has a collection of movies and music burned on CD that doesn’t infringe copyright at all. He’s worried about CD’s degrading over time - alreay some of them have holes and bad sectors.

He installs a shiny new 400GB hard drive in his computer, sets up BlobFS on it and copies all of his stuff across. As he ‘obtains’ new media, he just saves them to the BlobFS volume.

Bob starts to worry about his hard drive failing, so he buys another 400GB drive and adds it to the BlobFS volume. BlobFS silently replicates all of his data to the new drive. He still has 400GB (ish) of capacity, but now he can tolerate the loss of either drive (safety margin of 1).

Bob discovers the movie Napoleon Dynamite and has to download every copy in every language. He runs out of space. By now, hard drives are bigger and faster, so he adds a 750GB drive to the volume. BlobFS does two things: it makes another 750GB of space available to Bob, and it uses the space (currently free) to add even more redundancy to the array. Immediately after adding the drive, Bob’s collection of illicit movies has a safety factor of 2 - he can lose any two drives in the volume and not lose any data.

Time passes, and one of the original 400GB drives fails. Bob recieves an email from BlobFS telling him so. He pulls out the drive, replaces it with a 1.25TB drive, and goes back to his popcorn. BlobFS rebalances the data automatically and makes more free space available if it can.

Risks

There are undoubtedly some major technical hurdles to deal with:

  • Automatic array restriping and maintaining safety in the event of a power loss
  • Redundancy across different-sized devices

This list is dangerously short. Obviously, I don’t understand the problem well enough yet to be able to make an accurate assessment of its difficulty. I’ve never written a filesystem. I love the idea of doing so; it’s a meaty problem with some fun algorithms/data structures/performance decisions. It has actual engineering elements to it.

I think I’d really enjoy this project. It is a very large project for one person, though. Without having done any systematic research, I think there’s tremendous demand for something like this. It’s just a big project with some bigtechnical challenges and only me potentially working on it.

I would love to work on this, I really would. But it’s not going to pay the bills for a long time. I think I’d have to be self-sufficient from other income sources first before I could work on this (and expect to make a living from it).

Monetization

  • Sell personal licenses. This will impede widespread use; I’m sure there are already commercial systems that have similar characteristics, but nobody wants to pay for them.
  • Sell commercial licenses. This is similar to what ReiserFS does - you can use it free at home, but make any money off it and you’re supposed to pay for it. Which seems fair enough. I’d probably attempt this and see how well it works out.
  • Sell storage hardware. This seems pretty inevitable, and I love embedded work. 2TB of reliable disks at a low low price? Sold.

Uh-oh. I can’t remember where I left the current TPS reports!

Wednesday, December 5th, 2007

I use lots of different computers. I have a laptop which goes with me everywhere. I have a desktop at home with a nice monitor and great performance. I have a desktop at one client’s office. Then there’s the backup server, mobile phone and iPod.

I have data that I use on all of these machines. Keeping it all up-to-date everywhere is a hassle.

What I want is a program that will link up all of these different computers and keep my data the same everywhere. If I’m working on my desktop and want to go out and work at a cafe, I should be able to just pick up my laptop and go, and all of my work and all of my email will be right there, completely up to date.

When I go to my client’s office, I want all of my work to be up-to-date and ready to go. If I put some new music on my laptop, I want it to appear on my iPod automatically. If I disappear on a holiday, I still want to have my most crucial emails and contacts available on my mobile phone.

I don’t want my personal emails to be saved on the client’s office computer, because I don’t control it. I don’t want my MP3 collection to be saved on my mobile phone, because it doesn’t have enough space. I don’t want my source code to be saved on my iPod, because I’ll just lose it anyway.

I need some magic to make all of this happen. This is what my file synchronizer will do.

I’m concentrating on syncing a laptop to a desktop right now. That solves my biggest day-to-day hassle - I like to work on my desktop (it’s faster and more pleasant) but I do need the portability afforded my my laptop fairly often. Among the people I’ve spoken to, it’s also their biggest sync-related problem - a lot of people use their laptops for everything (”having two computers is too much of a hassle! I’d rather just have one computer and carry it around with me.“) but hate the performance and the constant reconfiguration hassle (”I have to plug in all of the peripherals and change the screen resolution and the font size and the volume and the network setup and my wallpaper looks bad on the big monitor and then I have to go out again anyway and set it all up again.“).

So that’s the dream. Laptop, desktop. Your stuff is the same on both. Always. You don’t have to think about it or copy files. It just is.

File synchronization algorithms: part 3 of monkey

Monday, November 26th, 2007

Algorithm 2 (a.k.a. ‘what everyone is using’) has some shortcomings:

  • Detecting changes takes a long time
  • It won’t detect renames or directory moves
  • There are still some cases where you need to resolve conflicts and/or merge files

There are also some usability issues:

  • You need to manually initiate a sync. You can’t just pick up your laptop and go anytime.
  • Performance sucks. I may have mentioned that a dozen or so times.
  • There’s nothing to stop you modifying a file on both sides; you have to remember which is the most recent and remember to sync before working on the other machine.

Here’s how I’ll fix these problems.

Constantly monitor for changes

The existing tools require you to manually initiate a sync, at which point you’ll have a few minutes of disk grinding. I’d rather have the program running constantly and being notified of changes as they happen. The common case is that only a few files will change between syncs - reading all of the files is inefficient.

I covered this in filesystem annotation. What I want is an API that notifies me when files change (or are created or deleted). I think inotify will do this, perhaps FAM. I have no idea what to use on Windows or OSX yet. On a technical level, this is an unsolved problem.

There is a risk here that if files are modified while the application is not running (and hence not receiving notifications) the modifications could be lost.

The fallback option is to scan the file trees while the machine is idle. If you’re checksumming files to detect changes, this can happen during idle time as well.

I think idle time is a grossly underutilized resource right now - we could be doing virus scanning, file indexing, backups and the like constantly instead of at intervals (3am cronjob) or while the user is trying to use the system (like most on-demand virus scanners).

Constantly synchronize changes

If you’re going to scan all of the time, you might as well copy files straight away rather than waiting until the user requests a sync. This will cut down the odds of a merge conflict somewhat, since the files are less likely to be modified simultaneously on both sides. This introduces the idea of a pair of machines being connected; while they are connected, their files are always synchronized. Since you’re probably modifying small amounts of data at a time, this will work reasonably well over a slow network connection.

Lock in-use files

Another way to prevent merge conflicts is to lock a file on machine A if it’s being written to on machine B. This prevents an application on machine A from modifying it at the same time.

Identify machines by a UUID rather than IP address

A common situation is to have a laptop and a desktop that you want synchronized together. You have the laptop at home and sync the files. You take the laptop to work, but because you’re on a different IP the sync program thinks it’s a different machine. If you give each machine a UUID or name, you can be (reasonably) sure of its identity and hence use the right indexes or file trees.

Checksum files or their metadata in order to detect renames

If you’ve got a checksum of each file (or just the modification time and size) and you detect a file deletion, you can look through any new files and see if they’re actually the same file. You can then infer that a file was moved or renamed rather than deleted and a new file created, saving time and bandwidth during the synchronization. It may be possible to optimize this further by looking at inode numbers or their equivalent on whatever filesystem is in use.

File synchronization algorithms, part 2 of elephant

Wednesday, November 21st, 2007

Don’t think about elephants.

The next refinement is to store history data when you look at the file trees. Every time you perform a synchronization you record some metadata for each file. You want to store the filename and the modification time. That way, when you do the next synchronization, you look at what changed between time X and time Y and apply those changes to the remote file tree, somewhat like generating a diff and then patching a tree. You do this twice - once for each direction (A to B and B to A). You can get conflicts, of course.

Conceptually, this looks like:

Compare this with the algorithm from part 1, which looks like this:

Note that if you have no history data, Algorithm 2 works exactly like Algorithm 1. Badly.

This all operates much like a version control system and has similar problems and implications. A VCS usually can’t detect renames of files or directories - you have to explicitly tell the VCS what you’ve done. When you want to perform a synchronization you have to traverse the entire directory tree to find out what’s changed - and this can be very time-consuming. The metadata has to be stored somewhere. Merges almost always require manual intervention and will often be unresolvable (either the user won’t know what to do and will just overwrite one side, or the file format won’t support lines-of-text style merging).

Also note the similar distinction between traditional client-server VCS (e.g. CVS, Perforce) and modern distributed VCS (Mercurial, git). Client-server VCS and propagates the nodes (or the actual files being worked on). Distributed VCS propagates the edges (or diffs). Algorithm 1 is looking purely at the file data and attempting to match it on both sides; algorithm 2 is looking at the changes between the ’sync points’ (or nodes) and propagating the changes.

The actions table for each file looks something like:

File A change File B change Action
 
Created (checksum P) Created (checksum P) Nothing
Created (checksum P) Created (checksum Q) Merge
Deleted No change Delete
Deleted Deleted Nothing
No change No change Nothing
Modified No change Use file A
Modified Modified Merge

(The actions for File A and File B can be interchanged - I didn’t feel like writing out those cases twice.)

If you include the possibility of renames (and horror of horrors, renames with modifies) then you can get a whole lot more combinations and it gets really nasty. I must give kudos to SourceGear for Vault for this: it does handle all of those nasty cases, a headache which I can do without.

Detecting what’s happened between time X and time Y is similarly mechanical. For a given file:

Time X Time Y Change
Does not exist Exists Created
Exists Does not exist Deleted
Checksum P Checksum P Nothing
Checksum P Checksum Q Modified

Without having looked at the source code, I’d say this is the algorithm that Unison uses. I’d also guess that most ‘proper’ synchronization programs use this. It’s the simplest thing that works in most cases.

Note that you also need to be able to reliably detect a change in a file. The (almost) infallible way to do this is to hash the file. I say almost because hash collisions do happen - they’re just extremely rare. ‘Extremely rare’ becomes a lot more common when you’re talking about a million files (32 bits of hash is not enough).

The other option is to look at the modification time of the file. Software can and does manipulate the modtime, however, and you might miss changes. Users might change the system time and confuse your sync program (if a change was made a long time ago). You might not be syncing to a device that has a real-time clock (some mobile phones, notably). You also have to sync the times between the two systems, but that’s not too hard.

Aaaanyway, the gist of it is:

  • Checksums: reliable, slow (you have to read the entire contents of every file)
  • Modification time: less reliable, much faster

Some filesystems such as JFFS2 keep a revision number on each block (roughly). If the revision number goes up, you can be assured that a write has happened regardless of what the modification time says. This is not a common feature, however, and probably not accessible to userspace programs anyway. There’s no easy solution here.

Next up: It still sucks. How to make it usable.

File synchronization algorithms, part 1 of lots

Wednesday, November 21st, 2007

“wherein I work out my design in public.”

You have two filesystem trees, A and B. You want the files on both sides to be the same.

Cases that you need to handle:

  • File exists on A but not on B (and vice-versa)
  • File exists on both and is identical
  • File exists on both and is different

Right about this point in time, you’re in trouble. (That was fast!) Only one of those situations can be handled automatically, and that’s if the file is identical on both sides. You need a lot of user input to figure out what the directories should look like, and users tend to say “too hard!” Unison assumes that if a file is present on one side and not on the other, it has just been created. So it copies it across. Already we’re in dangerous territory because this is frequently not what you want to do.

If the file exists and is different, you have to ask the user how to merge them or which one to pick. Asking regular users how to merge files is a bad idea. (Asking developers how to merge files is usually a bad idea.)

Sigh.

This algorithm is not going to work very well. It doesn’t handle any common cases, makes a lot of mistakes in its assumptions, and asks users too much information (which will probably be wrong anyway). Anyone using this algorithm in their synchronization product (*cough* Microsoft *cough*) is going to have a lousy product.

(Don’t get me wrong. I like Office. I like many Microsoft games. I’m not anti-Microsoft at all. It’s just Sturgeon’s Law: 90% of everything is crap.)

Unfortunately, this case is unavoidable on the very first synchronization of a pair of trees. We have no history data - even disconnected history data - and so cannot make informed decisions about what’s new, deleted or changed. The files just are or they are not and we can’t say which of the two trees is correct.

Read the thrilling (!) algorithm in Part Two for a possible solution (!!!11!)

Status report, 21 Nov 2007

Wednesday, November 21st, 2007

The day job is giving me serious grief, so very little project work is getting done. I’m stressed out and spending a lot of time resting. I do badly need a holiday, but if I wasn’t working in this environment I wouldn’t need a holiday nearly so much.

I’m doing a lot of thinking and planning for the file synchronizer - it’s my biggest ongoing computing problem right now - but not a lot of code is getting written. I’m also a bit nervous that I won’t have such a need for it in a few months and hence will lose interest in the project. I’d like to get something working ASAP, but the day job sucks up all of my energy. Without the day job I won’t need the synchronizer so much - so it’s a bit of a Catch-22.

I have Rentacoder and Elance bids out for some filesystem traversal code. I’m disappointed so far. I’ve been bid between USD$80 and $350. I sat down in front of the TV today, watched an episode of Daria and an episode of Futurama and had it done. Half of that was remembering that Python’s sqlite module doesn’t automatically commit to the database and wondering why there was no data in the database. Plus the hour I spent writing the spec and the hour I spent setting up accounts and the $10 Elance setup fee. There are a few bids which are attractive because the programmers obviously know their stuff, but I just can’t bring myself to spend $200 on an hour-long job which I’ve already finished.

There’s an obvious flipside here - if Rentacoder and Elance bids are so high, I can probably earn a bit of pocket money through them. Earning money by bidding on online jobs is very attractive to me. Of course, people go on there expecting to pay the coders crap. It might just be a matter of sorting through the bid requests and picking ones that I can do profitably and ignoring the requests with very low prices.

On the upside, the USD is in the shitter so stuff is cheap for me to buy. On the downside, one of my income sources pays in USD, so that’s suddenly become a lot less valuable.

I’ll be moving to Melbourne in January, right around linux.conf.au. I’m pretty excited. Rents are so much cheaper than Sydney, so it’s likely that I’ll be able to afford something with a nice view. Rent is still my biggest ongoing expense. Once I cut out expenses for crap cafe food at work and car fuel+maintenance (urgh! the damn thing keeps breaking!) it’ll account for 60% of my expenses. Living lean has its advantages.

Just thinking about the file synchronizer is useful at the moment, though. There’s a lot of unknowns that I haven’t worked out yet - it’s the sort of problem that can be described succinctly, but working out all of the intricacies of the solution (and a UI to suit) is fairly complicated. I’m torn between working out every nuance or just jumping in and getting something 80% working. The first milestone will be to write something that can replace Unison, and I’ve got that pretty well worked out. It’s just unsuitable for later features that I want to add (backups, Time Machine and flash-drive metadata exchange).

Filesystem annotation and things that suck. Also, procrastination.

Wednesday, November 7th, 2007

Today was a ‘designated project day’. I’m supposed to do no client work - just product development.

I didn’t get much work done.

I seem to have spent the entire day doing admin. This is a good thing - it’d been piling up quite badly over the last few months. I found a forgotten credit card statement. I shuffled a lot of money between accounts. I still haven’t filed all of my invoices or done all of my phone calls. I still haven’t attacked the pile of dishes in the sink - there’s been literally nothing clean for about three days now.

I need a personal assistant or something.

I did bolt on a new feature to the planner - you can now set reminders on tasks. It’s bolted on in the most literal sense possible - I added a new column to my task tree and display the reminder date there. It makes absolutely no sense from a UI point of view, but it solves my immediate problem.

I did have the chance to look over the code again with fresh eyes, seeing as I haven’t touched it in weeks. I’m embarrassed to release it publicly. It works, but the design is so awful that I’d be afraid a future employer or client would see it. It looks an awful lot like a VB application I wrote back in high school. Not good.

I’m also having immediate problems with file synchronization (surprise, surprise) and am eager to start on that. Unison is sucking a whole lot less now that I’ve completed a synchronization (previously, it would crap out because it was taking too long) but it still sucks in fundamental ways, such as not being able to sync multiple directories at once. As one might commonly do, given we have lots of stuff.

Tracker is also sucking, but a lot of people have complained about that. Sorry, Jamie, it’s broken. ‘Churns disk constantly’ is not an acceptable characteristic for any piece of desktop software.

Google Desktop for Linux seemed to suck a lot less in the constant-disk-churning department, but it also did something nasty that made every file access slow. And my Thunderbird would just lock hard for thirty seconds at a time every few minutes. Probably while it was reindexing. Sigh.

This is a weak segue into a more general problem area that I’ve been thinking about which is filesystem annotation. Essentially, what Tracker is doing is reading each file and creating some sort of index. Presumably, it also checksums each file so that it knows when it’s changed. It also sets inotify(s) so that it gets a message when files change.

Well, so does a file synchronizer. I just want to know if a file has changed, which means checksum. If we were sensible and using filesystems that actually knew when disk corruption had occurred like ZFS, I could just query some sort of checksum metadata in the filesystem, in a completely non-portable and non-futureproofed fashion. It would be fast, though. Rambling aside, I want to perform some actions when a file is changed, and I want to keep a little bit of data alongside each file.

Incremental backups (and rsync) have the same problem. You just want to know which files have changed. You don’t want to touch every file on the system. Making things even worse, Linux systems enable atime by default, which means that you’re writing to every file’s metadata every time you traverse the filesystem.

(You can’t just look at file modification times because a) lots of applications don’t change them, and b) they differ between non ntpdated systems. It’s a good way to get your files out of sync without realizing.)

And every one of these applications has the attribute that you want to traverse the filesystem as often as possible, either for reliability or accuracy.

Synergies are brewing here, I tells ya.

Basically, traversing a filesystem is dumb. But we have no better mechanism right now. We want to add file metadata after the fact and the filesystem designer has no way to know what you might want to add and when you might want to add it.

Because of this fundamental problem, I am without a good backup system. I am without file indexing and I am without file synchronization. Multi-gigahertz quad-core processors don’t help with any of that.

How do we fix this, then? (I’m more or less out of premeditated content here and into the exploratory). I think that if each of these applications could get a list of files that have been modified and the datetime on which they were modified, that would help a lot. I assume something like this has to be baked into either the kernel or libc; maybe Windows has an API for this already. I do recall Google Desktop sucking a lot less on there. The list would need to be expired at some point; you can’t track every modification forever or you’ll run out of RAM. Filesystems can probably do this much more efficiently as well, given they already track modification time. And you still might have to do a full rescan in case something modifies the filesystem while your program isn’t running.

If I had my way, it’d be a filesystem option. Every block or inode is checksummed; the block checksum and modtime are stored side-by-side and there’s a fast way to query these checksums/modtimes. A full dump would probably be too slow (multi-gigabyte Maildirs seem to be a good way to show up performance problems in all sorts of software), so a list list of recent changes would be good too. It has the RAM problem, however. Keeping a queue in RAM is more desirable than acting on them immediately, especially in these sort of applications where timeliness is good, but overall system performance is more important. You do the processing while the machine is idle.

Unfortunately, I’m at the limits of what idle speculation can gain me, and more to the point I have to solve these problems using the systems we have now, not some magic pixie-dust filesystem with freaking checksums (come on! seriously! how could every common filesystem in existence not checksum its data! do you not care about reliability at all?) So it looks like inotify, ionice and load detection for now. Undoubtedly, once I start building this stuff I’ll start to understand the problems better.

Back to the dishwashing.

File synchronizers suck! Guess what, I’m procrastinating!

Friday, November 2nd, 2007

Yup, I’ve been ’slacking off’ in the sense that I haven’t worked on any business projects. Saying I’ve been snowed under with client work sounds like an excuse - I’ll always be snowed under with client work. There’s more of it than I can handle. The market economist in my head says that I should increase my hourly rate, but so far that’s been met with very poor responses.

I went to the Sydney OpenCoffee meet yesterday and had a blast. Finally, entrepreneurial types in Sydney! There was a lot of talking about mentoring and quipped that I just needed someone to kick me in the ass and tell me to do some work. Seems I’m not the only one with that problem.

I’ve decided what project 3 will be. A file sychronizer. Yup, another one. The problem that has become sorely obvious in the last couple of days is that all current file synchronizers suck. They either:

  • use timestamps to detect changes. This isn’t particularly accurate.
  • use checksums to detect changes. This is accurate, but a synchronization takes forever if you have any reasonable amount of data. My typical ‘working set’ is on the order of 4 gigabytes across a few hundred thousand files. What I want is to just sync my entire home directory, and that’s around 40 gigs.

I tried out Unison and jfilesync. Unison seems to work alright but is very slow - slow enough that I haven’t actually bothered completing a sync yet. jfilesync craps itself if you have non-files in your directory (such as named pipes) or if you have directory loops (such as many build environments) or if you breathe on it wrong, so that’s out too.

I’ve been doing the incremental rsync backup thing for a while and like it. I’d like my synchronizer to be able to handle backups too.

I don’t know how I’d achieve both performance and accuracy, yet. Obviously, checksumming every file on the system is not acceptable on a regular basis. That’s the sort of task a filesystem should be doing. Scarily enough, ZFS is the only modern filesystem which performs end-to-end checksumming; I consider that something of an outrage considering the abundance of CPU power we have nowadays. I’m hoping something like FAM can tell me when files change without requiring me to poll them or install magic kernel modules. I’ll do kernel modules, but they’re a massive support burden.

Stepping back for a moment, what I really want isn’t file synchronization. I want my working environment to be wherever I am. Carrying a laptop and doing all of your work on there solves a good chunk of the problem, but laptops are physically limited in what they’re capable of.

Location awareness is a hassle; I’m forever changing resolutions and font sizes and performance settings depending on where I am. Software such as MarcoPolo solves some of this by helping the computer figure out where it’s located automatically.

I like Philip Greenspun’s idea for using a mobile phone as a computer; you take your phone, plug it in at one location and all of your stuff pops up. Programs run on the larger CPU at your location. When you unplug, everything is somehow still on the phone, you take it to another location, plug it in, and everything pops up again. The technology is part of the way there; what’s still lacking is probably storage space on the phone. It’s certainly solvable if you paid enough money.

An idea I’ve been kicking around for a while is ‘process hibernation’. When you hibernate a process, its memory image and external references (files, sockets, etc) are saved to disk. You can then take that disk image, stick it on another computer and fire it up again exactly as if it never stopped running. That would be convenient, especially if you could do it for an entire X or Windows session.

You can sort of do it with X forwarding, but that’s too slow over a network. You can sort of do it by running out of a VMWare session saved to removable storage, but that’s too slow as well. I’d rather just carry a laptop.

I really have to get some work done on the planner, too. My ‘next task’ for a while has been to get the Windows version to work using py2exe, but I might push that back to do a Reminders module and a Scheduler module, given I actually need them now and don’t personally need a Windows port for the foreseeabe future.

Making it easy for people to try your software

Monday, October 15th, 2007

I suspect that a great deal of the success of web-based applications is that it’s easy to try them. All you have to do is go to the website; no installing things, no dependencies, no virus scanners, no saving files. There are very low barriers to entry; there’s very little friction.

Personally, I find most web-based applications to be absolutely awful to use. Every UI is different. There aren’t any keyboard shortcuts. I like to have control over my data, and I like to be able to work offline.

For most people, these aren’t major problems. Witness the almost complete domination of webmail over traditional desktop email programs. Given that an ‘ordinary user’ doesn’t have a very reliable computer or backups, keeping all of your email online overseas is probably not a big problem. They’re going online to read email anyway, so it might as well live online.

Imagine, for a moment, the last time you installed some software on your computer. The last installation I did on Windows was the Symbian SDK. I had to choose between a number of download options, download a few hundred megabytes of files, do the installation, consult Google to find out why it wouldn’t work after being installed, go and download some dependencies, install them, blah blah blah blee blee blah blah.

The last installation I did on Linux was FlightGear, a flight simulator. I fired up aptitude, selected the flightgear package and told it to install. There was still a long wait for the download (also a few hundred megabytes) but all of the dependencies and installation happened automatically. The actual time I spent performing the installation was quite small; I just went on with other tasks while it happened. This is a much better situation.

The next best thing for Windows is usually to download a zip file which uncompresses and runs from the folder - it doesn’t have an install program. These programs also run nicely off thumb drives if you’re into that. I like these as a developer - writing and testing install programs is a royal pain in the behind.

Better still would be to just download an executable and run that. Fog Creek Copilot does this for their user agent. It even deletes itself when it’s done, which is extremely considerate. If you’ve got a program with lots of files - like the Python apps I’m planning to distribute - you could have it extract to a temporary directory and execute from there.

Can a desktop app get any easier? An even easier situation might be to click a link on a website and have the program fire up on your desktop. Java Web Start has the right idea, but the implementation is sloppy. The idea is that you click the magic JWS button on a website and the program is downloaded without your knowledge and executed on your computer. Great. You do a caching scheme so you don’t have to download repeatedly; you can set up a Start/GNOME/K/whatever menu item so people can come back later.

Thinking about this last option against a web application, the web application still seems smoother. I think it’s the download time. There’s also the scary “do you want to download this executable” box which most people see but click through anyway. The webapp still feels like it has less friction. The desktop app has a large ’speed bump’ where you have to wait for a download, and during that you start thinking about other things or mentally task switch. You have to think about where you’re going to store the file and do I have to uncompress it and does it have any viruses and how am I going to arrange the windows. The web app is just another tab, and even if I task switch away, it’s just there waiting. Web apps are fairly predictable, even if they are a bit boring at times.