I released multigtfs 0.4.0 on June 21, 2014, then 0.4.1 on July 11th, and 0.4.2 on July 20th.
multigtfs is a Django app that supports importing and exporting of GTFS feeds, a standard format for describing transit schedules. multigtfs allows multiple feeds to be stored in the database at once.
These releases are a big step forward. The highlights are:
- Much faster imports and exports. For one large GTFS feed, the import went from 12 hours to 41 minutes. In a more typical feed, the import went from 29 minutes to 1 minute, and export from 5 minutes to 24 seconds.
- Support for extra columns, used by many agencies to extend GTFS.
- Extra configuration to make the Django admin usable after importing lots of data.
- Python 3 support.
Breaking Changes
There are a couple of breaking changes for 0.3 users. I’ve dropped support for South 0.7, since it doesn’t work with Python 3. Trips used to be associated with multiple services, because I found a (badly written) test feed with that feature years ago. There’s now just one service per trip.
Speed
The speed-up of imports and exports required two changes: batching database writes, and pre-loading related item data. After batching 1000 writes at a time with bulk_create, importing my test feed dropped from 28 to 9 minutes. Creating a lookup table for related items sped it up to 1 minute.
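The two changes can be sketched in plain Python. This is a minimal illustration, not the multigtfs code: the function names, the row keys, and the bulk_create argument are placeholders I made up for the example. In a Django app, the callable would typically be a model manager's bulk_create, and the lookup would be built from a single query.

```python
from itertools import islice


def batches(items, size=1000):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def import_stop_times(rows, trips, bulk_create, batch_size=1000):
    """Import rows in batches, resolving trips from an in-memory lookup.

    rows: iterable of dicts with 'trip_id' and 'arrival_time' keys
    trips: iterable of (trip_id, primary_key) pairs, loaded up front
    bulk_create: hypothetical callable that writes a list of records at once
    """
    # Lookup table: built once, instead of one query per imported row.
    trip_pk_by_id = dict(trips)
    written = 0
    for chunk in batches(rows, batch_size):
        records = [
            {'trip_pk': trip_pk_by_id[row['trip_id']],
             'arrival_time': row['arrival_time']}
            for row in chunk
        ]
        bulk_create(records)  # one write per batch, not one per row
        written += len(records)
    return written
```

With 2,500 rows and the default batch size, the writer is called three times (1000, 1000, and 500 records) rather than 2,500 times.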
Exports weren’t as bad as imports, but they are now faster too, dropping from 5 minutes to 24 seconds.
There are further optimizations possible. Two-thirds of the import time is spent creating GeoDjango equivalents of shapes, trips, and routes. This could be skipped for users that don’t need them, or possibly sped up by a developer more familiar with GeoDjango. Know anybody?
Extra Columns
It is rare, but not too rare, for an agency to add additional columns to their GTFS files that aren’t in the specification. Sometimes this extra data supports their specific needs, and other additions are generally useful and become part of the spec.
Previously, the import would halt on these extra columns. The user would have to unzip the feed, load the file into a spreadsheet application, remove the column, zip the modified feed, and try again. Painful. Now, these extra columns are stored in a JSON field, which means they can be imported, manipulated, and exported.
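Here is a rough sketch of the idea, assuming a stops file with one non-standard column. The column name wheelchair_ramp, the KNOWN_COLUMNS tuple, and both helper functions are invented for illustration, and multigtfs may organize this differently.

```python
import csv
import io
import json

# Columns the importer understands (a small, illustrative subset).
KNOWN_COLUMNS = ('stop_id', 'stop_name', 'stop_lat', 'stop_lon')


def split_row(row, known=KNOWN_COLUMNS):
    """Separate known columns from extras; extras become a JSON string."""
    fields = {k: v for k, v in row.items() if k in known}
    extra = {k: v for k, v in row.items() if k not in known}
    return fields, json.dumps(extra, sort_keys=True)


def merge_row(fields, extra_json):
    """Rebuild the full row for export from the fields and stored JSON."""
    row = dict(fields)
    row.update(json.loads(extra_json))
    return row


# 'wheelchair_ramp' is a made-up extension column, not part of GTFS.
text = (
    "stop_id,stop_name,stop_lat,stop_lon,wheelchair_ramp\n"
    "S1,Main St,36.15,-95.99,yes\n"
)
row = next(csv.DictReader(io.StringIO(text)))
fields, extra_json = split_row(row)
```

Because the extras are kept alongside the regular fields, merge_row can write the same columns back out on export, so a round trip does not lose the agency's additions.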
Django Admin
The Django admin mostly worked, but would slow to a crawl when loading lots of real-world data. I selectively configured the admin with raw_id_fields, and now the admin is much more useful.
Python 3 and Unicode
No one asked for Python 3, but all the cool kids are doing it. I picked the path of a single code base that runs in either 2 or 3, using the Django guide and the six library bundled since Django 1.5. The conversion was smooth but tedious, with almost every file requiring a change.
The painful part was the Unicode changes in Python 3. I expected some Unicode pain, but there’s some additional ugliness around encoding. In Python 2, the csv module expects to work with bytes, but in Python 3, it expects Unicode strings. That’s OK, since the builtin open() returns bytes in Python 2 and strings in Python 3. However, the zipfile module has an open() method that returns bytes in both Python 2 and 3. This requires different code for Python 2 and 3, on both import and export.
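For the Python 3 side, one way to bridge the gap is to wrap the zip member in io.TextIOWrapper on import, and to encode a text buffer on export. This is a sketch under my own assumptions (UTF-8 feeds, and helper names that are not from multigtfs); the Python 2 branch would hand the bytes to csv directly.

```python
import csv
import io
import zipfile


def read_feed_table(feed, name, encoding='utf-8'):
    """Read one CSV file from an open ZipFile as a list of dicts (Python 3)."""
    # ZipFile.open() yields bytes, so decode before handing it to csv.
    with io.TextIOWrapper(feed.open(name), encoding=encoding,
                          newline='') as text:
        return list(csv.DictReader(text))


def write_feed_table(feed, name, fieldnames, rows, encoding='utf-8'):
    """Write rows as a CSV file into an open ZipFile (Python 3)."""
    # csv writes text, so build it in memory and encode once at the end.
    buffer = io.StringIO(newline='')
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    feed.writestr(name, buffer.getvalue().encode(encoding))


# Round trip through an in-memory zip, including a non-ASCII stop name.
archive = io.BytesIO()
stops = [{'stop_id': 'S1', 'stop_name': 'Café Central'}]
with zipfile.ZipFile(archive, 'w') as feed:
    write_feed_table(feed, 'stops.txt', ['stop_id', 'stop_name'], stops)
with zipfile.ZipFile(archive) as feed:
    loaded = read_feed_table(feed, 'stops.txt')
```

The encoding is a parameter because a feed is not guaranteed to be clean UTF-8 in practice, which is part of why non-ASCII test feeds are useful.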
I’m not confident that multigtfs has solid support for non-ASCII feeds yet. If you have a feed that correctly uses non-ASCII characters, please let me know, and I’ll make a new test feed.
What’s Next
The next version of multigtfs will have actual documentation (the README is getting out of control), and will continue adding all the bits a modern Django app is expected to have.
Or, Django 1.7 will be released, and I’ll rush to add support. Release candidate 2 was pushed out this week, so it’s getting close. This is potentially a tricky update, since migrations have been brought into Django, replacing South.
Whichever way it goes, I plan to publish the blog post announcing the release at the same time. These retroactive release announcements get way too long.