tl;dr: Smart developers run database-heavy operations as management commands or celery tasks. If DEBUG=True, you’ll be recording each database access in memory. Ensure DEBUG=False to avoid running out of memory.
I’m working on issue 28 for django-multi-gtfs, speeding up importing a GTFS feed. My experience has been with Tulsa Transit’s GTFS feed, a 1.3 MB zip file, which takes about 30 minutes to import on my laptop. arichev was importing Southeast Queensland’s GTFS feed, a 21 MB feed which takes 9-12 hours to import. It shouldn’t take half a day to import 3 million rows, even if many of them include PostGIS geometries. So, it is time to optimize.
My optimization procedure is to:
- Measure the performance of the existing code against real-world data, to establish the benchmark,
- Create a representative data set that will demonstrate the problem but allow for reasonable iteration,
- Make changes and measure improvements against the test data, until significant improvements are gained,
- Measure against the real-world data again, and report the gains.
I was having a lot of trouble with step 1. First, I had to clear ticket 10, which was preventing me from importing the Southeast Queensland data at all. Next, I started an import on my MacBook Pro (2.4 GHz Intel Core i7 w/ 8 GB RAM) overnight, and woke to find my laptop unusable, requiring a hard reboot. I spent some pennies and moved to the cloud. A t1.micro instance was unable to import the Tulsa feed, killed by the out-of-memory (OOM) killer. That seemed odd, but I knew the free tier instances were memory constrained (600 MB RAM), with no swap. I then moved to an m3.medium instance (3.75 GB RAM). The Tulsa feed imported in 50 minutes (vs. 30 minutes on my laptop, which isn’t surprising). However, importing the Southeast Queensland feed also triggered the OOM killer.
How can importing 165MB of text use up almost 4 GB of memory?
My suspicion was a memory leak, or some out-of-control caching. Memory leaks are rare in Python, but possible. Caching is a more likely danger – some helpful code that speeds up a web request could cause an offline task to eat up memory. I remembered a recent Planet Python post on memory leaks, and quickly found mem_top. There are a few variations that all do the same basic thing: ask Python about current memory usage and identify the top memory users. mem_top is an easy pip install, and easy to call:
from mem_top import mem_top
logging.debug(mem_top())
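Here’s roughly how I sprinkled it through the import code; the reader loop and the import_row helper below are illustrative stand-ins rather than the real multigtfs internals:

import logging

from mem_top import mem_top

logger = logging.getLogger(__name__)

def import_stops(reader):
    # Stand-in for a per-file import loop: log the top memory users
    # every few thousand rows so growth shows up over time in the log.
    for i, row in enumerate(reader):
        import_row(row)  # hypothetical per-row import helper
        if i % 5000 == 0:
            logger.debug('Memory hogs: %s', mem_top())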
I added this code in several places, tested it against a small sample feed, and then ran it against a small but real feed. The output looks like this:
multigtfs.models.base - DEBUG - Memory hogs:
refs:
102371 <type 'list'> [{u'time': u'0.003', u'sql': u'SELECT postgis_lib_version()'}, {u'time': u'0.001', u'sql': u'INSERT
2510 <type 'dict'> {'_multiprocessing': <module '_multiprocessing' from '/Users/john/.virtualenvs/explore/lib/python2.7
444 <type 'dict'> {'SocketType': <class 'socket._socketobject'>, 'getaddrinfo': , 'AI_N
440 <type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
437 <type 'list'> [, {'SocketType': <type '_socket.socket'>, 'getaddrinfo': , 'AI_NUMERICS
364 <type 'dict'> {'SocketType': <type '_socket.socket'>, 'getaddrinfo': , 'AI_NUMERICS
330 <type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
330 <type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
328 <type 'dict'> {'empty_provider': <pkg_resources.EmptyProvider instance at 0x105781ab8>, 'ExtractionError':
types:
5067 <type 'tuple'>
4765 <type 'dict'>
2736 <type 'list'>
1862 <type 'cell'>
1857 <type 'weakref'>
1423 <type 'builtin_function_or_method'>
1326 <type 'wrapper_descriptor'>
1200 <type 'type'>
1072 <type 'getset_descriptor'>
The top memory consumer, which grew as the feed was imported, was that list of dictionaries with timing and SQL queries. There’s not enough here to nail down the issue, but it was enough to trigger my memory of the common advice to never run Django with DEBUG=True in production. The Django FAQ lists it under “Why is Django leaking memory?”, which then links to “How can I see the raw SQL queries Django is running?” When DEBUG=True in your settings, you can access the SQL queries used by the connection:
>>> from django.db import connection
>>> connection.queries
[{'sql': 'SELECT polls_polls.id, polls_polls.question, polls_polls.pub_date FROM polls_polls',
'time': '0.002'}]
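That list is handy during a web request, but in a long-running task it never gets cleared. A quick way to see it grow, using the polls model from the Django tutorial purely as an illustration, with DEBUG=True:

from django.db import connection
from polls.models import Poll  # the Django tutorial model, used only for illustration

before = len(connection.queries)
for poll in Poll.objects.all():
    poll.save()  # each UPDATE (plus the initial SELECT) is recorded
print(len(connection.queries) - before)  # grows by roughly one entry per row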
This is done in db/backends/__init__.py:
def cursor(self):
    """
    Creates a cursor, opening a connection if necessary.
    """
    self.validate_thread_sharing()
    if (self.use_debug_cursor or
            (self.use_debug_cursor is None and settings.DEBUG)):
        cursor = self.make_debug_cursor(self._cursor())
    else:
        cursor = util.CursorWrapper(self._cursor(), self)
    return cursor
make_debug_cursor returns CursorDebugWrapper from db/backends/util.py, which does some extra work:
class CursorDebugWrapper(CursorWrapper):

    def execute(self, sql, params=None):
        start = time()
        try:
            return super(CursorDebugWrapper, self).execute(sql, params)
        finally:
            stop = time()
            duration = stop - start
            sql = self.db.ops.last_executed_query(self.cursor, sql, params)
            self.db.queries.append({
                'sql': sql,
                'time': "%.3f" % duration,
            })
            logger.debug('(%.3f) %s; args=%s' % (duration, sql, params),
                extra={'duration': duration, 'sql': sql, 'params': params}
            )
When you are running Django as a web server, the request_started signal will reset the queries to an empty list, so you’ll get fresh data for the next request. This is exactly what you want for single-machine debugging, and it’s how tools like the Django Debug Toolbar give you valuable information about your SQL queries. Lots of SQL queries are usually the thing making your requests slow. The solution is usually to add smarter indexes, combine similar queries to reduce total database requests, read from a cache instead of the database, and defer writes to an async task runner like celery. I usually leave these optimizations until I’m pretty sure the business team is happy with the design.
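For reference, this is roughly how Django 1.6 wires that up in django/db/__init__.py: the request_started signal simply empties each connection’s queries list.

# django/db/__init__.py (Django 1.6, abridged)
def reset_queries(**kwargs):
    for conn in connections.all():
        conn.queries = []
signals.request_started.connect(reset_queries)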
This same query caching logic runs in offline tasks. Loading an image from disk into a BLOB? The base64-encoded version is in memory. Crunching usage data for engagement metrics? Here’s a handful of SQL queries for every instance in the system. And, surprising to me, running a management command also logs all of the SQL queries.
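If you don’t need the recorded queries in an offline job, one blunt mitigation is to call django.db.reset_queries() yourself every so often, the same way the request cycle does. It isn’t the fix I settled on, but it caps the growth; here’s a sketch with a made-up celery task and a hypothetical update_metrics helper:

from celery import shared_task
from django.contrib.auth.models import User
from django.db import reset_queries

@shared_task
def crunch_engagement_metrics():
    # Illustrative task: update_metrics is a hypothetical helper.
    for user in User.objects.iterator():
        update_metrics(user)
        reset_queries()  # drop the accumulated SQL log before it piles up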
Setting DEBUG=False will disable this behavior. I suspect New Relic acts like DEBUG="Yes and lots of it". For one project, I was running my celery tasks under the New Relic Python agent, but I’m going to stop doing that and see if some memory issues around stats crunching go away.
I want DEBUG=True for local testing, but I want to be able to run database-intensive management commands without having to remember to turn it off. My solution is to add this snippet to the management command:
# Disable database query logging
from django.conf import settings
if settings.DEBUG:
    from django.db import connection
    connection.use_debug_cursor = False
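In context, the snippet sits at the top of the command’s handle() method, before the database-heavy work starts; the command below is just a placeholder:

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Import a GTFS feed"  # illustrative

    def handle(self, *args, **options):
        # Disable database query logging
        from django.conf import settings
        if settings.DEBUG:
            from django.db import connection
            connection.use_debug_cursor = False
        # ... the long, database-heavy import runs here ...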
This uses the undocumented use_debug_cursor attribute, so it may not work with exotic database backends. It only disables debugging on the default connection, so you’ll have to do more work if you use multiple database connections. Iterating through django.db.connections should do the trick (see the reset_queries code).
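A sketch of that multiple-connection variant, modeled on the reset_queries loop and still relying on the undocumented attribute:

from django.conf import settings
from django.db import connections

if settings.DEBUG:
    for conn in connections.all():  # every configured database connection
        conn.use_debug_cursor = False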
Also, it looks like changes are in progress for Django 1.7. The “memory leak” question has been removed from the FAQ, and the queries code has recently been updated to use a limited queue, and to always turn on database query recording when DEBUG=True. Maybe this change will take care of the issue, but I’ll miss the ability to disable this for management commands, since I just discovered it last night. I’ve added ticket #22873; we’ll see if they agree with my patch.
Update: The Django developer doesn’t want to change the behavior back. The attribute use_debug_cursor should probably be named force_debug_cursor, since the intention is to force SQL logging in support of assertNumQueries. I’ve got ideas for supporting my use case in Django 1.7, but the code will get more complex. Oh well.