Data storage and performance

I’m finally getting SyncDroid to the point where I can start testing some of my initial assumptions about the project:

  • Scanning while idle will not cause perceptible slowdown to the user
  • There is enough idle time on a typical desktop session to get useful work done

Without these, I’m stuck writing an ordinary file synchronizer (or a slow one) and that’s not really worthwhile to me.

PySQLite Bad. APSW Good.

One of the big question marks has always been ‘how am I going to store the data on disk?’ Most other programs with this requirement use Berkeley DB or SQLite. As I’ve previously mentioned, SQLite can be fast if you’re intelligent about it. That said, I’ve never used a large SQLite database - anything past about 10,000 rows. Performance issues will show up well before that if you’re doing something silly (like starts-with queries on a string column).

I want to use an external database for the data integrity features. People will unplug USB drives mid-transaction and I don’t want that to trash my entire data store.

I started out using SQLite and common ORMs to simplify the database implementation. Skipping over a few frustrating days of development, I came to the following conclusions:

  • Most ORMs will not let you access multiple physical databases at the same time - they implicitly store a reference back to the database. This is no good for me, because I have databases on the host and on any USB drives that get connected. This ruled out Autumn, which was looking beautifully simple.
  • Storm’s SQLite transaction behavior is a bit weird - weird enough to be unusable. They do document it, thankfully, and the codebase is very clean, so it’s likely I’ll use it in future (non-SQLite) projects. These problems are mostly because:
  • PySQLite completely mangles SQLite’s transaction model. Seriously, guys. SQLite is not typical as far as SQL databases go, but its behaviour is very well documented and not difficult to understand. If you’d just butt out and let SQLite do its thing, you’d be fine, but noooo, you gotta be clever about it…

Aaanyway, this rules out anything based on PySQLite as an ORM. Which means everything. So I wrote my own using APSW (Another Python SQLite Wrapper). This turned out to not be a big deal - the operations I perform on the database are quite simple by design. Once I worked out a schema bug, this worked beautifully.

Performance

My use case for this application is:

  • My maildirs (170,000 files)
  • A Buildroot repository (240,000 files)
  • A new OpenEmbedded repository (80,000 files)
  • My git repositories (15,000 files)
  • Personal data and records (15,000 files)

I do not intend to be kind to my own software.

Right away, I noticed I was hitting the 120-transactions-per-second issue in SQLite (linked to fsync() and the rotational rate of 7200RPM drives - this is on my laptop). No big deal. It’s still pretty quick.

(I actually have concerns about USB flash drives due to their extremely slow fsync() performance - depending on the filesystem, it might require a few erases/rewrites. Erases are horrendously slow on flash devices, which is why a lot of people are seeing poor write performance on SSDs).

I left it overnight to scan my data. When I got up, we were up to 140,000 files and adding maybe three per second instead of thirty. Grr. (On the upside, Python is not immediately a performance issue. Take that, C++ coders!)

In a past life, I was a performance geek. I heart oprofile. The results:

      samples|      %|
    ------------------
        66522 88.5084 libsqlite3.so.0.8.6
         3963  5.2728 libpthread-2.7.so
         3247  4.3202 python2.5

I took a shower, realized that I hadn’t added appropriate indexes to the tables, and retested:

  samples|      %|
------------------
    81119 99.5264 no-vmlinux
      105  0.1288 python2.5
    CPU_CLK_UNHALT...|
      samples|      %|
    ------------------
           40 38.0952 libsqlite3.so.0.8.6
           30 28.5714 libpthread-2.7.so
           18 17.1429 libc-2.7.so
           17 16.1905 python2.5

The vast majority of the time is now spent in the kernel, not in the SQLite library.

I’m not sure what conclusion to draw from this, yet. Perhaps ’showers are good’. So far, it does not appear that either Python or SQLite will cause performance problems.

3 Responses to “Data storage and performance”

  1. Jared Kuolt Says:

    It seems rather uncommon to support multiple concurrent databases, though I do see the need. Thanks for at least considering Autumn and the “…beautifully simple” comment. That’s almost exactly the point!

    I’d like to know your thoughts on how best to support this, however. Autumn’s development is ramping up and if I were to support this I prefer to do so early on.

    Thanks!

  2. Seun Osewa Says:

    What’s wrong with pysqlite, to be more specific? Is it slower?

  3. Ian Says:

    I can’t comment on pysqlite’s performance - I didn’t use it for long enough.

    My main complaint with it is the way it delays and modifies user transactions. I’m guessing this is in an attempt to prevent deadlocks, which are easy to produce *if* you don’t understand sqlite’s concurrency model. Unfortunately, I found that pysqlite’s modifications were doing more harm than good in my application, which makes extensive use of concurrency and is not able to tolerate multi-second timeout delays.

    I’m comfortable with sqlite’s concurrency model (it’s not hard!) and so made much better progress by managing it myself. In this case, all of my transactions are designed to be very short-lived (as they always should be) and so sqlite handles it without additional ‘trickery’.

Leave a Reply