Archive for December, 2007

MicroISVs in Sydney

Monday, December 31st, 2007

Here’s a quick list of MicroISVs that I know of in Sydney. I know there are lots that I’ve missed. Email me if I’ve forgotten you!

(I’m considering a bootstrapped software company of four people or fewer to be a MicroISV.)

  • tiinker by Deep Grey Labs: an intelligent news aggregator that uses datamining techniques to find the best stories for you
  • pulse by zutubi: a really slick continuous integration system

Honourable mentions to people who gave it a try and are getting back up to try again:

Oddly enough, I can name a whole lot more Brisbane MicroISVs, despite having only spent three days there.

Status report, 28 Dec 2007

Saturday, December 29th, 2007

Another day, another status update.

I’m getting a lot done. What’s amazing me is the sheer volume of work that needs to be completed. I feel like I could keep someone busy just writing material and investigating sales opportunities and gathering requirements and so on.

Yesterday and today, I:

  • Bought domains for my filesystem and file synchronizer.
  • Set up WordPress on integrifs.com and syncdroid.com.
  • Came up with something to share a single WordPress installation between multiple sites. Instead of WordPress-MU (which assumes you’re hosting all sites under the same domain) I put a switch statement in index.php to change database names based on the hostname in use. This is a whole lot easier, especially when the time comes to upgrade WordPress.
  • Wrote a bunch of marketing material for IntegriFS: overview, how it works, download and a few drafts.
  • Cranked out a dozen pages that I have to write for SyncDroid. I’m just putting the titles into WordPress right now and leaving the pages themselves as drafts, so I can log in whenever and wherever and fill them in.
  • Ditto for IntegriFS. There’s a lot of writing to do.
  • Set up some Google AdWords for SyncDroid. There’s still a lot more work to be done there.
  • Finally found a use for all of those random news aggregators: finding potential customers! I’ve been tracking interesting discussions and gathering a lot of useful information.
  • Made some good progress on a WordPress theme that I’ll share across all of my sites. Each site will have its own colour scheme to differentiate it.
  • Explored a bunch of competitors to SyncDroid. Without exception, they look complicated. They have fancy conflict resolution screens. Complexity is what I want to avoid with SyncDroid. I want it to be as automatic as possible. I want to put in an Undo feature so that if it does make a mistake, you can back it out anyway.
  • Importantly, got a lot of the IntegriFS design and strategy out of my head and onto paper. This is the crux of Getting Things Done: get it out of your head. For the last week or so I’ve been buzzing with ideas and thoughts, and my mind is finally clear to look at other things that are important in the short(er) term. I need to get some bigger pieces of paper, because at the moment it looks something like this:

integrifs-design.jpg

Note that I’ve written no code at all in the last two days. This would have surprised me a week ago. Right now, my focus is to get a bit of attention from people. This should help me work out what sort of demand there is, whether my marketing efforts are working and hopefully even get a bit of feedback from potential users. If it turns out that I’m on the wrong track or there’s no interest at all, I won’t waste time writing code.

There are a few different directions I could push in for SyncDroid. The base technology - detecting, logging and propagating file changes - is useful to several applications. The first ones that come to mind are:

  • Automated real-time backups
  • Synchronization via USB drives
  • Time Machine (snapshotting and rewinding a filesystem)
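The "detecting and logging" half of that base technology can be sketched in a few lines of Python. This is only an illustration, assuming a simple poll-and-compare approach based on modification time and size; the function names `snapshot` and `diff_snapshots` are made up for this example and are not taken from SyncDroid itself.

```python
import os

def snapshot(root):
    """Map each file path (relative to root) to its (mtime_ns, size)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            state[os.path.relpath(full, root)] = (st.st_mtime_ns, st.st_size)
    return state

def diff_snapshots(old, new):
    """Return a sorted change log of (action, path) tuples."""
    changes = []
    for path in new.keys() - old.keys():
        changes.append(("created", path))
    for path in old.keys() - new.keys():
        changes.append(("deleted", path))
    for path in old.keys() & new.keys():
        if old[path] != new[path]:  # mtime or size differs
            changes.append(("modified", path))
    return sorted(changes)
```

Taking a snapshot before and after some work and diffing the two gives a change log that a backup, USB sync or rewind feature could all consume. A real-time version would need filesystem notifications instead of polling, which this sketch does not attempt.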

I wanted to implement all of these, but at this stage I’m focusing on laptop-desktop synchronization. With the new office, however, my own use case has changed. I’m synchronizing over the Internet using Unison, but any large amount of data will need to go via a USB drive. Hence, using a USB drive as the go-between becomes attractive. Another nifty idea relating to this is to use a mobile phone (connected via Bluetooth) instead of a USB drive.

The Time Machine opportunity is interesting. It’s a very early market with few competitors. There are no compelling implementations for Windows, yet. I have no doubt that there will be an effective free one in time. With the base technology already in place, though, it’s not a big technical challenge to toss another UI over the top. There’s still a huge marketing effort to do, but I’m getting better at that all the time.

Status report, 27 Dec 2007

Thursday, December 27th, 2007

Things are steaming along pretty nicely.

Marketing

After listening to a podcast featuring Greg Gianforte at a MySQL conference, I realised that I should be putting effort into marketing now rather than after my product was ready. It seems somewhat obvious now, but I was getting excited about B*-trees and block allocation algorithms without realising that I should do more to establish demand ahead of time.

The filesystem has a marketing name and a website, which is being filled with content. I’m really excited about the idea, actually. It’s a new area of research; I get to build something that I believe in; it hasn’t been done before; there are lots of good monetization options. It is a big product, and so I’m rapidly making my ‘six products in one year’ slogan irrelevant. I’m not too bothered. I have some real passion for this idea, and I think that’ll be more important in the long run.

Filling in the website makes me think about what features I intend to support - it’s like a simple spec. And it makes me more excited about what I’m going to build.

I have done a little bit of coding - mostly exploratory stuff with FUSE. It all seems to be making sense. I’ve done a bit of kernel work before, so there’s nothing too scary there.

I haven’t started doing any marketing for the file synchronizer. This is partly because I’m not thinking about it so much now that the filesystem has my attention. This is a trap that I frequently fall into - I work on something while I’m passionate about it, then forget about it before completion. I do have a good body of code there, however, so I should start working on that at some point.

The big lesson that I’m learning right now is that there is an awful lot of work to be done. I can’t just sprint to a release. It’s going to take consistent effort over a period of months to get these things out the door, and probably more effort to get people to actually use them (let alone buy something!) I need to just be calm and patient and keep putting in consistent effort.

AdWords

I’ve set up a bunch of AdWords ads for the filesystem. They’ll go active when I have enough website content to justify it. They’re always a good way to generate more ideas - I had suspected that Windows users might be interested in something like this, but was interested to learn that a lot of people have been searching for ways to get ZFS to run on Windows. This is encouraging to me - it’s a big market opening. Getting ZFS running on Linux is a bit dodgy right now, and Windows is not even under consideration. Along with the general confusion and problems running RAID arrays under Windows, there’s big demand for highly reliable storage on commodity Windows machines.

Office space

I’ve got some office space. Hurrah! It gets me out of the house and the others I’m sharing it with are great fun. I don’t think I’d go back to working from my bedroom for any length of time. There’s some productivity cost in terms of distractions, but there are big benefits to my moods and motivation which more than make up for it.

I am still experiencing difficulty concentrating after about six hours of solid work, and I’m finding that it’s somewhat diet-related. Prompted by a recent Slashdot article on nootropics, I’m going to have to investigate dietary supplements more. Now that I’m actually interested in working longer hours, it may be useful.

Day job

The day job is still there, I guess, but on holidays. I haven’t done much work for them for a while. I have had to start transferring money from my investment accounts, which is a bit scary. But it’s early days yet. I always have the option of seeking out more paying work; I just don’t need to right now.

Improving on ZFS: BlobFS

Thursday, December 13th, 2007

I have a love-hate relationship with ZFS. On one hand, I love how easy it is to operate and how it makes it difficult to damage the array. On the other hand, I hate how it interoperates poorly with Linux and how there are some really crucial operations (like vdev removal or array growing) that aren’t implemented and haven’t been for a long time, despite pleas from the community.

I think it could be easier and better. ‘Final word in filesystems’? Yeah, and 640kb should be enough for anybody.

This is intended for home file servers - big files, mostly reads, performance not a big factor. An assumption is that there is no backup, so the data needs to be as safe and correct and recoverable as possible.

The idea

  • You have some disks. BlobFS will make them a big filesystem with a specified level of redundancy.
  • You can select the number of disks you can lose and hence the level of redundancy (the ‘safety margin’). Changes are implemented automatically.
  • If you pull out a disk, the data will rebalance across the array automatically. You’ll have a capacity loss (if the data will fit on the disks) or a safety margin loss (if there’s too much data).
  • If the array does lose too many disks, BlobFS will attempt to allow access to anything that’s still readable. This lets you attempt a partial recovery.
  • All data (including filesystem metadata) is end-to-end checksummed, so you can be fairly certain that flaky equipment is not going to silently corrupt your data.
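To make the checksumming point concrete, here is a toy Python sketch. It assumes SHA-256 as the checksum and an in-memory dictionary standing in for the disks; neither choice comes from the BlobFS design, and the class name `ChecksummedStore` is invented for this example.

```python
import hashlib

class ChecksummedStore:
    """Toy block store that keeps a SHA-256 digest next to every block."""

    def __init__(self):
        self._blocks = {}  # block id -> (digest, data)

    def write(self, block_id, data):
        # Compute the digest at write time, before the data hits "disk".
        self._blocks[block_id] = (hashlib.sha256(data).digest(), bytes(data))

    def read(self, block_id):
        digest, data = self._blocks[block_id]
        # Verify on every read so corruption is reported, never returned.
        if hashlib.sha256(data).digest() != digest:
            raise IOError(f"checksum mismatch in block {block_id}")
        return data
```

With redundant copies available, a failed check like this would be the cue to fetch the block from another disk and rewrite the bad one, rather than just raising an error.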

The basic idea is that you give it some disks and some data, and it will take care of the data as best it can. You leave the low-level decisions about RAID algorithms and device recovery up to BlobFS. It makes the best moves it can with the resources it has available.

This has a subtle but significant benefit in terms of reliability. It’s removing human intervention from the recovery process. You don’t have to say “OK, I’m going to offline that disk and add another one in its place and restripe the array” while hoping that you’re offlining the correct disk and restriping is the right thing to do and your replacement disk is in fact empty and not another disk in the array that contains the last redundant copy of your data. All of that logic is in the software. It takes care of your data as best it can.

Some other things that may help reliability:

  • You can ask to remove a disk and BlobFS will move the data off that disk so you can remove it without risking array degradation.
  • If there’s spare space on a disk, automatically fill it with more redundant data. This implies that at any given moment, the array will be (from the perspective of the programmer) completely full. When you add more data, you’re actually removing an old redundant copy of something and replacing it with the new data.
  • Spare space can also be used for journalling and backtracking. One of the big criticisms of RAID arrays without a separate backup is that you can destroy everything very quickly through software - there’s no way to back out of deletions or a virus attack. With journalling and backtracking, however, this concern disappears - if you can backtrack to yesterday’s state before you ran that dodgy porn dialer, you don’t need backup tapes quite so much.

Use case

Bob has a collection of movies and music burned on CD that doesn’t infringe copyright at all. He’s worried about CDs degrading over time - already some of them have holes and bad sectors.

He installs a shiny new 400GB hard drive in his computer, sets up BlobFS on it and copies all of his stuff across. As he ‘obtains’ new media, he just saves them to the BlobFS volume.

Bob starts to worry about his hard drive failing, so he buys another 400GB drive and adds it to the BlobFS volume. BlobFS silently replicates all of his data to the new drive. He still has 400GB (ish) of capacity, but now he can tolerate the loss of either drive (safety margin of 1).

Bob discovers the movie Napoleon Dynamite and has to download every copy in every language. He runs out of space. By now, hard drives are bigger and faster, so he adds a 750GB drive to the volume. BlobFS does two things: it makes another 750GB of space available to Bob, and it uses the space (currently free) to add even more redundancy to the array. Immediately after adding the drive, Bob’s collection of illicit movies has a safety margin of 2 - he can lose any two drives in the volume and not lose any data.

Time passes, and one of the original 400GB drives fails. Bob receives an email from BlobFS telling him so. He pulls out the drive, replaces it with a 1.25TB drive, and goes back to his popcorn. BlobFS rebalances the data automatically and makes more free space available if it can.
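The arithmetic behind Bob’s safety margin can be sketched as follows. This assumes plain replication (whole extra copies, no parity) with every copy of a block on a different disk, and a made-up figure of 300GB for Bob’s collection; the post itself leaves the redundancy algorithm up to BlobFS, so treat this as one possible reading rather than the design.

```python
def max_copies(disk_sizes_gb, data_gb):
    """Largest number of full copies of the data that fit on the disks.

    Each copy of a block must live on a different disk, so one disk can
    usefully hold at most data_gb of this data, however large it is.
    """
    if data_gb <= 0:
        raise ValueError("data_gb must be positive")
    usable = sum(min(size, data_gb) for size in disk_sizes_gb)
    return usable // data_gb

def safety_margin(disk_sizes_gb, data_gb):
    """Disks that can fail without data loss; -1 means the data won't fit."""
    return max_copies(disk_sizes_gb, data_gb) - 1

# Bob's story, assuming a 300GB collection (the size is hypothetical):
print(safety_margin([400], 300))             # one drive
print(safety_margin([400, 400], 300))        # second 400GB drive added
print(safety_margin([400, 400, 750], 300))   # 750GB drive added
print(safety_margin([400, 750, 1250], 500))  # after the swap, bigger collection
```

Note how a big drive does not help on its own: with 500GB of data, the 1.25TB drive can still only contribute one copy of each block, so the margin is limited by the other two disks.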

Risks

There are undoubtedly some major technical hurdles to deal with:

  • Automatic array restriping and maintaining safety in the event of a power loss
  • Redundancy across different-sized devices

This list is dangerously short. Obviously, I don’t understand the problem well enough yet to be able to make an accurate assessment of its difficulty. I’ve never written a filesystem. I love the idea of doing so; it’s a meaty problem with some fun algorithms/data structures/performance decisions. It has actual engineering elements to it.

I think I’d really enjoy this project. It is a very large project for one person, though. Without having done any systematic research, I think there’s tremendous demand for something like this. It’s just a big project with some big technical challenges and only me potentially working on it.

I would love to work on this, I really would. But it’s not going to pay the bills for a long time. I think I’d have to be self-sufficient from other income sources first before I could work on this (and expect to make a living from it).

Monetization

  • Sell personal licenses. This will impede widespread use; I’m sure there are already commercial systems that have similar characteristics, but nobody wants to pay for them.
  • Sell commercial licenses. This is similar to what ReiserFS does - you can use it free at home, but make any money off it and you’re supposed to pay for it. Which seems fair enough. I’d probably attempt this and see how well it works out.
  • Sell storage hardware. This seems pretty inevitable, and I love embedded work. 2TB of reliable disks at a low low price? Sold.

Uh-oh. I can’t remember where I left the current TPS reports!

Wednesday, December 5th, 2007

I use lots of different computers. I have a laptop which goes with me everywhere. I have a desktop at home with a nice monitor and great performance. I have a desktop at one client’s office. Then there’s the backup server, mobile phone and iPod.

I have data that I use on all of these machines. Keeping it all up-to-date everywhere is a hassle.

What I want is a program that will link up all of these different computers and keep my data the same everywhere. If I’m working on my desktop and want to go out and work at a cafe, I should be able to just pick up my laptop and go, and all of my work and all of my email will be right there, completely up to date.

When I go to my client’s office, I want all of my work to be up-to-date and ready to go. If I put some new music on my laptop, I want it to appear on my iPod automatically. If I disappear on a holiday, I still want to have my most crucial emails and contacts available on my mobile phone.

I don’t want my personal emails to be saved on the client’s office computer, because I don’t control it. I don’t want my MP3 collection to be saved on my mobile phone, because it doesn’t have enough space. I don’t want my source code to be saved on my iPod, because I’ll just lose it anyway.

I need some magic to make all of this happen. This is what my file synchronizer will do.
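One way to picture the rules above is as a per-device exclusion list. The sketch below is purely illustrative: the device names, folder layout and the `should_sync` helper are all hypothetical, and glob patterns are just one possible way to express the rules.

```python
import fnmatch

# Hypothetical exclusions matching the examples above; every name and
# path here is invented for illustration.
EXCLUDES = {
    "client-desktop": ["mail/personal/*"],  # I don't control this machine
    "phone": ["music/*"],                   # not enough space
    "ipod": ["src/*"],                      # I'd only lose it anyway
}

def should_sync(device, path):
    """True unless the path matches one of the device's exclusions."""
    patterns = EXCLUDES.get(device, [])
    # fnmatchcase keeps matching case-sensitive on every platform;
    # its '*' also matches '/' so whole subtrees are excluded.
    return not any(fnmatch.fnmatchcase(path, pat) for pat in patterns)
```

For example, `should_sync("phone", "music/album/track.mp3")` is false, while the same path on the laptop (which has no exclusions) syncs as normal.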

I’m concentrating on syncing a laptop to a desktop right now. That solves my biggest day-to-day hassle - I like to work on my desktop (it’s faster and more pleasant) but I do need the portability afforded by my laptop fairly often. Among the people I’ve spoken to, it’s also their biggest sync-related problem - a lot of people use their laptops for everything (“having two computers is too much of a hassle! I’d rather just have one computer and carry it around with me.”) but hate the performance and the constant reconfiguration hassle (“I have to plug in all of the peripherals and change the screen resolution and the font size and the volume and the network setup and my wallpaper looks bad on the big monitor and then I have to go out again anyway and set it all up again.”).

So that’s the dream. Laptop, desktop. Your stuff is the same on both. Always. You don’t have to think about it or copy files. It just is.

Doing things right vs. doing things fast

Sunday, December 2nd, 2007

I’m encountering a very strong conflict between doing things right and doing things fast.

I want to get something working ASAP. The sync algorithm is difficult and too large to hold in my head at one time. I want to make sure all of my theory is correct. This is the push for fast.

Many of the decisions I’m making now will hurt me later on. I’m already hitting a lot of points where the existing sync algorithms just won’t work when you’re feeding in live and/or disconnected updates. And that’s before I even consider multi-way syncing (e.g. laptop/desktop/server/work computer). And so, I have a big urge to do things right.

Psychologically, doing things right is not going to work for me. I don’t work like that. I need results now and I need to think yeah, that’s cool. I don’t do well slogging away on something for long periods, especially with the technical risks that exist at the moment in this project.

I’m very much a cowboy coder: I’ll slap something together fast, and it’ll do the job, but it won’t be pretty. I get the biggest rush when something does its job for the first time. There’s a little bit that comes from incrementally adding functionality or fixing bugs - I also have a strong perfectionist streak. But I find some tasks extremely tedious - GUI programming (it’s been done a thousand times, and it’s so hard to get right) and web programming come to mind. They seem pointless - they’ve been done before, they’re uninteresting, there’s no rush that comes from seeing them come alive for the first time. They’re analogous to just drawing things on paper, and doing so on a computer is painful. I’d rather just draw them on paper.

I will tend to build up a lot of technical debt. This is why I like working for startups: they need stuff fast more urgently than they need it right. Assuming the startup survives long enough it will need stuff done right, but by that time they’ve usually hired other programmers that can do that instead of me. Usually my startup clients haven’t even known what they want to build in the beginning, so coming up with a perfect all-encompassing design and full test suites is pointless. It’ll probably just get thrown away.

It’s the same situation for my MicroISV. I will need stuff done right - eventually. But my top priority is to get enough income coming in, and that means getting a product out the door ASAP. And because I don’t yet know how this particular product is going to work - I don’t completely understand the problem - there’s an even greater incentive to ignore all of these scary problems that I can see looming (performance, cross-platform compatibility, correctness under weird conditions). I’ll just push on and get something working well enough.

Concrete examples

I’m using an SQLite database to store file metadata. I’m a big fan of SQLite - it’s just so easy to use and implement. But it’s very complicated for what I want, which is essentially a tree of modtimes using filenames as a primary key. Oh, and with another dimension covering the ‘other machines’ data. Ordered by a journal id. Handling a million records. I don’t know of any relational database that is going to give acceptable performance under these conditions. It’s not an easy data structure to design in C, either, but I’m sure I can beat SQLite’s performance.

Thing is, there’s no need to right now. I have some horrible table joins and ORDER BY clauses, but they do the job. I haven’t run into performance problems yet, probably because:

  • I’m not doing live updates. I’m working under Unison conditions, which is ‘scan for changes at the same time on both sides and sync immediately’. This simplifies things tremendously.
  • I’m not using large datasets. I have a few test directories with half a dozen files in each. Eventually, I want to be able to handle up to a million files (I have a Linux kernel tree and a Buildroot in my syncable data, for example).
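For reference, the metadata store described above (modtimes keyed by filename, a second dimension for the other machines, ordered by a journal id) might be laid out roughly like this. The table and column names are invented for this sketch and are not the actual schema; in particular, using a composite key of path and peer is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
conn.executescript("""
CREATE TABLE file_state (
    path       TEXT    NOT NULL,  -- filename relative to the sync root
    peer       TEXT    NOT NULL,  -- which machine this row describes
    modtime    INTEGER NOT NULL,
    journal_id INTEGER NOT NULL,  -- position in the change journal
    PRIMARY KEY (path, peer)
);
CREATE INDEX file_state_journal ON file_state (journal_id);
""")

def record(path, peer, modtime, journal_id):
    """Insert or update the state of one file as seen on one peer."""
    conn.execute(
        "INSERT OR REPLACE INTO file_state VALUES (?, ?, ?, ?)",
        (path, peer, modtime, journal_id),
    )

def changes_since(peer, journal_id):
    """Files on a peer touched after the given journal id, oldest first."""
    rows = conn.execute(
        "SELECT path, modtime FROM file_state "
        "WHERE peer = ? AND journal_id > ? ORDER BY journal_id",
        (peer, journal_id),
    )
    return rows.fetchall()
```

The index on `journal_id` is there for the ORDER BY; whether it holds up at a million rows with live updates is exactly the open question.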

I’m implementing this in Python right now. Python lets me crank out ideas fairly quickly. Eventually, I expect I’ll have to move to C for performance reasons. The amount of work that the Python implementation does every time it touches a file is enormous - it has to create an object which contains an object for each sync peer, as well as a database query or two. There’s just no need to optimize yet. Premature optimization is the root of all evil, and all that.