Archive for the ‘Uncategorized’ Category

multigtfs 0.4.0 released (and 0.4.1, and 0.4.2)

August 2, 2014

I released multigtfs 0.4.0 on June 21, 2014, then 0.4.1 on July 11th, and 0.4.2 on July 20th.

multigtfs is a Django app that supports importing and exporting of GTFS feeds, a standard format for describing transit schedules. multigtfs allows multiple feeds to be stored in the database at once.

These releases are a big step forward. The highlights are:

  • Much faster imports and exports. For one large GTFS feed, the import went from 12 hours to 41 minutes. In a more typical feed, the import went from 29 minutes to 1 minute, and export from 5 minutes to 24 seconds.
  • Support for extra columns, used by many agencies to extend GTFS.
  • Extra configuration to make the Django admin usable after importing lots of data.
  • Python 3 support.



Django management commands and out-of-memory kills

June 19, 2014

tl;dr: Smart developers run database-heavy operations as management commands or celery tasks.  If DEBUG=True, Django records every database query in memory.  Ensure DEBUG=False in those processes to avoid running out of memory.

I’m working on issue 28 for django-multi-gtfs, speeding up the import of a GTFS feed.  My experience has been with Tulsa Transit’s GTFS feed, a 1.3 MB zip file, which takes about 30 minutes to import on my laptop.  arichev was importing the Southeast Queensland GTFS feed, a 21 MB feed which takes 9-12 hours to import.  It shouldn’t take half a day to import 3 million rows, even if many of them include PostGIS geometries.  So, it is time to optimize.

My optimization procedure is to:

  1. Measure the performance of the existing code against real-world data, to establish the benchmark,
  2. Create a representative data set that will demonstrate the problem but allow for reasonable iteration,
  3. Make changes and measure improvements against the test data, until significant improvements are gained,
  4. Measure against the real-world data again, and report the gains.

I was having a lot of trouble with step 1.  First, I had to clear ticket 10, which was preventing me from importing the Southeast Queensland data at all.  Next, I started an import on my MacBook Pro (2.4 GHz Intel Core i7 w/ 8 GB RAM) overnight, and woke to find my laptop unusable, requiring a hard reboot.  I spent some pennies and moved to the cloud.  A t1.micro instance was unable to import the Tulsa feed; it was killed by the out-of-memory (OOM) killer.  That seemed odd, but I knew the free-tier instances were memory-constrained (600 MB RAM), with no swap.  I then moved to an m3.medium instance (3.75 GB RAM).  The Tulsa feed imported in 50 minutes (vs. 30 minutes on my laptop, which isn’t surprising).  However, importing the Southeast Queensland feed also triggered the OOM killer.

How can importing 165 MB of text use up almost 4 GB of memory?

My suspicion was a memory leak, or some out-of-control caching.  Memory leaks are rare in Python, but possible.  Caching is a more likely danger – some helpful code that speeds up a web request could cause an offline task to eat up memory.  I remembered a recent Planet Python post on memory leaks, and quickly found mem_top.  There are a few variations that all do the same basic thing: ask Python about current memory usage and identify the top memory users.  mem_top is an easy pip install, and easy to call:

import logging
from mem_top import mem_top

logging.debug(mem_top())  # log a snapshot of the biggest reference holders

I added this code in several places, tested it against a small sample feed, and then ran it against a small but real feed. The output looks like this:

multigtfs.models.base - DEBUG - Memory hogs: 
refs:
102371	<type 'list'> [{u'time': u'0.003', u'sql': u'SELECT postgis_lib_version()'}, {u'time': u'0.001', u'sql': u'INSERT 
2510	<type 'dict'> {'_multiprocessing': <module '_multiprocessing' from '/Users/john/.virtualenvs/explore/lib/python2.7
444	<type 'dict'> {'SocketType': <class 'socket._socketobject'>, 'getaddrinfo': , 'AI_N
440	<type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
437	<type 'list'> [,  {'SocketType': <type '_socket.socket'>, 'getaddrinfo': , 'AI_NUMERICS
364	<type 'dict'> {'SocketType': <type '_socket.socket'>, 'getaddrinfo': , 'AI_NUMERICS
330	<type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
330	<type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
328	<type 'dict'> {'empty_provider': <pkg_resources.EmptyProvider instance at 0x105781ab8>, 'ExtractionError': 
5067	 <type 'tuple'>
4765	 <type 'dict'>
2736	 <type 'list'>
1862	 <type 'cell'>
1857	 <type 'weakref'>
1423	 <type 'builtin_function_or_method'>
1326	 <type 'wrapper_descriptor'>
1200	 <type 'type'>
1072	 <type 'getset_descriptor'>

The top memory consumer, which grew as the feed was imported, was that list of dictionaries with timing and SQL queries.  There isn’t enough here to nail down the issue, but it was enough to trigger my memory of the common advice to never run Django with DEBUG=True in production.  The Django FAQ lists it under “Why is Django leaking memory?”, which links to “How can I see the raw SQL queries Django is running?”  When DEBUG=True in your settings, you can access the SQL queries used by the connection:

>>> from django.db import connection
>>> connection.queries
[{'sql': 'SELECT polls_polls.id, polls_polls.question, polls_polls.pub_date FROM polls_polls',
'time': '0.002'}]

This is done in db/backends/__init__.py:

    def cursor(self):
        """
        Creates a cursor, opening a connection if necessary.
        """
        self.validate_thread_sharing()
        if (self.use_debug_cursor or
            (self.use_debug_cursor is None and settings.DEBUG)):
            cursor = self.make_debug_cursor(self._cursor())
        else:
            cursor = util.CursorWrapper(self._cursor(), self)
        return cursor

make_debug_cursor returns CursorDebugWrapper from db/backends/util.py, which does some extra work:

class CursorDebugWrapper(CursorWrapper):
    def execute(self, sql, params=None):
        start = time()
        try:
            return super(CursorDebugWrapper, self).execute(sql, params)
        finally:
            stop = time()
            duration = stop - start
            sql = self.db.ops.last_executed_query(self.cursor, sql, params)
            self.db.queries.append({
                'sql': sql,
                'time': "%.3f" % duration,
            })
            logger.debug('(%.3f) %s; args=%s' % (duration, sql, params),
                extra={'duration': duration, 'sql': sql, 'params': params}
            )
When you are running Django as a web server, the request_started signal will reset the queries to an empty list, so you’ll get fresh data for the next request. This is exactly what you want for single-machine debugging, and it’s how tools like debug toolbar give you valuable information about your SQL queries. Lots of SQL queries are usually the thing making your requests slow. The solution is usually to add smarter indexes, combine similar queries to reduce total database requests, read from a cache instead of the database, and defer writes to an async task runner like celery. I usually leave these optimizations until I’m pretty sure the business team is happy with the design.

This same query caching logic runs in offline tasks. Loading an image from disk into a BLOB? The base64-encoded version is in memory. Crunching usage data for engagement metrics? Here’s a handful of SQL queries for every instance in the system. And, surprising to me, running a management command also logs all of the SQL queries.
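
If you want to keep DEBUG=True but stop that list from growing without bound in a long-running task, one option is to do what the web server does between requests: django.db.reset_queries() is the function Django connects to the request_started signal, and you can call it yourself. A minimal sketch, with a hypothetical save_row() standing in for the real per-row work:

from django.db import reset_queries

def import_rows(rows):
    for i, row in enumerate(rows, start=1):
        save_row(row)  # hypothetical helper doing the real per-row database work
        if i % 1000 == 0:
            # Empty the per-connection query log, just as the
            # request_started signal does between web requests.
            reset_queries()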

Setting DEBUG=False will disable this behavior. I suspect New Relic acts like DEBUG="Yes and lots of it". For one project, I was running my celery tasks under the New Relic Python agent, but I’m going to stop doing that, and see if some memory issues around stats crunching go away.

I want DEBUG=True for local testing, but I want to be able to run database-intensive management commands without having to remember to turn it off.  My solution is to add this snippet to the management command:

        # Disable database query logging
        from django.conf import settings
        if settings.DEBUG:
            from django.db import connection
            connection.use_debug_cursor = False

This uses the undocumented use_debug_cursor attribute, so it may not work with exotic database backends. It only disables debugging on the default connection, so you’ll have to do more work if you use multiple database connections. Iterating through django.db.connections should do the trick (see the reset_queries code).
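
For the multi-connection case, a sketch along the same lines (same undocumented attribute, so the same caveats apply):

        # Disable database query logging on every configured connection
        from django.conf import settings
        from django.db import connections

        if settings.DEBUG:
            for alias in connections:
                connections[alias].use_debug_cursor = False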

Also, it looks like changes are in progress for Django 1.7. The “memory leak” question has been removed from the FAQ, and the queries code has recently been updated to use a limited queue, and to always turn on database query recording when DEBUG=True. Maybe this change will take care of the issue, but I’ll miss the ability to disable this for management commands, since I just discovered it last night.  I’ve added ticket #22873; we’ll see if they agree with my patch.

Update: The Django developer doesn’t want to change the behavior back.  The attribute use_debug_cursor should probably be named force_debug_cursor, since the intention is to force SQL logging for support of assertNumQueries.  I’ve got ideas for supporting my use case in Django 1.7, but the code will get more complex.  Oh well.

Where the time goes

November 2, 2009

I want to share a new thing I started today: I’m keeping a record of where my time goes. When I start a new task, I write down the time and what I’m doing. It is quite simple, but I’m seeing benefits already.

[Photo: the time notebook]

A few months ago, I was listening to the Manager Tools podcast One Less Thing, and they had a useful idea for anyone – use your calendar to record what you’ve done, and review it monthly to check that you are doing what is important, and find something to stop doing. It seemed almost like an engineering process – by measuring where the time went, you can adjust where the new time goes, and create a feedback loop.

I thought of how to do it electronically, using Google Calendar and my new work-provided Blackberry. I could use colors to designate what I considered important and unimportant. I could add notes for future searching. I could write an app to extract a summary of my month. I could write an app to automatically add entries based on the window where I was using the mouse the most!

Yeah, none of that happened.

As I was watching Merlin Mann’s latest monologue on productivity, I remembered the idea, and decided right now was the time to start, and I needed to start low-tech.

I grabbed a legal pad, and wrote “3:23 Play w/ Issac”, because my kid was crying and I was the only parent around. Also, I don’t know how to spell his name (sorry, Isaac). After a minute, I realized this was going nowhere, and decided instead to put him in the stroller and walk him and the dog down the street to get some coffee. Maybe he’d go to sleep?

It was a very nice Fall day, just the right level of chill and absolutely beautiful. Lots of people were out walking their babies and much better behaved dogs. Isaac enjoyed it all, and fell asleep on the walk back. Mission accomplished! However, I was a bit distressed to discover that it took 30 minutes, plus 8 minutes prep time, to go for the walk. Maybe next time I’ll go to the closer Starbucks rather than the superior Ristretto Roasters.

I swapped to a Moleskine Cahier, because it fits in a pocket more nicely than a legal pad, and the tiny space keeps me from writing in too much detail.  I’ll see if I can keep it up.

Two wireless networks with WRT54G and OpenWrt

July 8, 2009

When I first moved to Portland, it was just me and my laptop and a few days until Comcast turned on the Internet.  Luckily, some neighbors had open wireless access points, and I was able to get maps, check bus routes, and move up my appointment for high-speed internet.  I wanted to extend the same courtesy to my neighbors and guests, so I configured my own open access point.  If Bruce Schneier does it, why shouldn’t I?

A friend of mine strongly disagreed. His argument was simple – I’m not Bruce Schneier, and neither is anyone in my family. Good point. He convinced me to put our computers on an encrypted channel, but I still wanted to provide easy access for guests and neighbors.  Luckily, there are some open source projects that make this possible.

I have a Linksys WRT54GL wireless router, which I bought for the express purpose of installing an open-source firmware and accessing the additional features of the Broadcom BCM5352 chip that drives it. A friend suggested the X-Wrt project, which adds administration web pages to the OpenWrt firmware. The installation wasn’t too difficult, once I figured out which image to use, and the web interface made the configuration painless.  It was easy to set up a WPA2-protected network, which worked great for our laptops and the Wii (the Xbox and TiVo don’t support WPA2).

Once I had a working encrypted wireless network, I started configuring the guest wireless network, and it was really tricky.  Tricky, but possible.  Isn’t that the way it is with Linux?  I’ve documented all the painful details, including 19 screenshots, after the jump.


アキラ, a Successful Long Project

March 29, 2009

I recently finished Akira Club, the companion to Katsuhiro Ōtomo’s long Akira series (アキラ in Japanese). It collects all the original title pages from the serialized version that appeared in Young Magazine, as well as sketches, recollections, and other Akira memorabilia.

[Image: cover of Akira Club]

I enjoyed the peek at the creative process. Sketches are paired with the final drawings as they appear in the magazine. Katsuhiro would often redraw pages for the collected works. The conversion from the Japanese right-to-left originals to the Western left-to-right presentation was rarely as easy as flipping the image. A complete redraw was required to get the text bubbles right, and for details like keeping the colonel’s suit buttoned on the right side.

With each episode, he would include a short report on his life, such as the June 20th, 1983 entry:

I went to Kanazawa and saw the elephant that opened its toes. Tatsuhiko Yamagami’s manga was true!

Here’s the September 3rd, 1984 entry:

I’ve done nothing but work this summer. I hate tanned people!

Amazingly, Katsuhiro was using Twitter-style updates 25 years ago.

The Akira project took thirteen years, from 1982 to 1995. This included 120 episodes of 16 to 20 pages, six Japanese collections, six international collections, an animated motion picture, and finally Akira Club. Katsuhiro’s final words are:

With this book, at long last, I can finally put the final period on the final sentence of Akira. Looking back on it now, it seemed very long and very short… And for every time I felt “Oh, I drew that well,” there was a time I felt I hadn’t drawn enough. Well, that’s how these things are, I guess… Several years ago — I forget how to describe my feelings exactly — I found this expression scrawled in a corner of my drawing board:

“If you eat, whether it makes you laugh or cry, it all comes out shit.” I apologize to those who were eating.

I am fascinated with long projects like Akira. At the start, did Katsuhiro know it would take more than 2,000 pages to finish? If he knew, would he have started? How much of his life did he have to put on hold? There are 19 months between episode 87 and 88 – how hard was it to take up the project again after that long of a pause? Was his ill health caused in part by the project? How much of the work was driven by the artist’s impulse, and how much was the necessity of paying the bills?

I’ve seldom heard or read of an artist in the middle of a long project express contentment about the work. They just aren’t pleased with what they produce – it must be improved. This must be combined with real pressures to release, often before the artist is ready. Successful projects require a useful vision and deadlines. Take away the vision, and you get an unfocused mess (see Dreaming in Code). Take away the deadlines, and you get endless polishing and starting over for little gain (see Chinese Democracy).

So, what keeps your long projects going?  What are they lacking?

First Post

September 24, 2008

This isn’t the blog I wanted.

The blog I wanted was more of a blog / wiki hybrid, like Martin Fowler’s Bliki.  I’d carefully craft articles over several days, publishing the drafts in a special area for the preview comments.  When an article was fully cooked, it would be posted to the blog for wider consumption.  Links would bring together articles written at different times, with placeholders for articles still to be written.  Over time, it would evolve into an encyclopedia of my interests, with a generous helping of working code examples.  I even planned special markup for code, so that samples could be automatically tested and flagged for updating when they broke. It would be a glorious expression of my philosophy on work and life.

That blog will never get written.

Visions of perfection either remain visions, or they get imperfectly created.  I like to think about what is possible, but the huge amount of work to “get it right” means that I never start.  For me, procrastination isn’t laziness, it’s an unwillingness to risk imperfection.

It’s been about two years since I named my blog and conceived of the design.  All I have to show for it is a coming soon page. I’ll never have the time to implement the best blog I can imagine, so I’ll have to settle for the one I can do.

I’m already failing.  It’s been several hours since I decided to do it now.  I had to stop researching hosting providers (do it now, wordpress.com is good enough, we can always move the content later).  I had to close the themes pages (do it now, the theme is pointless without content).  I threw away several drafts of this post (do it now, no one’s first post looks any good).

I’ll continue to fail.  But that’s the point.  Failing but doing is better than not failing and not doing.

Plus, the next post will look awesome compared to this garbage.