Archive for the ‘Uncategorized’ Category

One More Week: Designing the next Taskcluster monitoring system

January 22, 2022

I’ve got a date for the team transition: Wednesday February 9th, 4½ weeks after my original date. I was also told in confidence the reason, and given a chance to continue on my current team. I’m still learning toward the new team, but I’m taking the weekend to think about it.

Our dogs Asha, Link, and Finn enjoy a light snow

Monday was Martin Luther King Jr. Day and a Mozilla holiday, so it was a short work week. I spent a little time helping with some tests in PRs on the Relay project. The tests are not as mature as I’d like, but things are improving in January, such as PR 1493, Domain recipient tests and coverage, by groovecoder. I look forward to talking to the current team and determining what code quality issues are slowing down progress.

The Taskcluster team is focusing on reliability and observability this quarter. I put together a summary of Taskcluster, what monitoring is available, and a suggestion for a monitoring strategy. Most of the summary was based on the Reference section of the Taskcluster docs, which describes the 3 platform services and the 7 core services, as well as the API, log messages produced, and Pulse / AMQP messages emitted and consumed. Existing monitoring involves an (unused) New Relic hook, watching log messages in structured logs, and smoke tests.

The docs do not cover the breakdown of 10 services to 20 or more micro-services – usually a “front-end” HTTP API and one or more backend processing tasks running in a loop. In my experience, the API services are reliable and well-tested, while the backend processesing tasks cause problems and are difficult to (automatically) monitor and debug. My belief may be influenced by the worker-manager scanner, which syncs status with the cloud providers and is prone to slow execution cycles (see bug 1736329, gecko-3/decision workers not taking tasks, for analysis via logs, and bug 1735411, windows 2004 test worker backlog (gecko-t/win10-64-2004, for a analysis of one incident and a historical code review). The provisioner, which provisions new on-demand servers, has its own issues, such as a work strike due to bad recovery from cloud provider issues.

My recommendation is to use the Pulse message bus to periodically broadcast status of backend services, and add a monitoring service that listens to these messages, gathers status from the micro-services with a Web API, and aggregates the data into a unified JSON status report. This can be consumed by Telescope, which can set limits and alert thresholds. The feedback has been that AMQP can be hard to debug and reason about, that there can be issues with message volume (bug 1717128, Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster, may be a data point), and that a new core monitoring service may be overkill. Some counter-proposals have been per-service health endpoints, or using an existing monitoring system like statsd or Prometheus.

Next week we’ll refine the recommendations and discuss it some more. Once we get consensus next week or the following, I plan to summarize as a public Taskcluster RFC. If the team move goes as planned, that may be where my active involvement ends. I’m optimistic that the team (including new Berlin hire Yaraslau Kurmyza) can move Taskcluster to the next level.

We use community-tc to build Taskcluster, and our usage of Taskcluster helps align us with the users of FirefoxCI. Pete Moore recently released v44.4.0, which took a while due to intermittent errors in the automated release process. These broken releases broke one of the CI build steps (issue 5067) due to missing images, and then when then new code fully shipped, the build broke in a different way due to lingering Python 2.7 code (issue 5070). Matt Boris was also tripped by an intermittent error on PR 5069, Fix styling of label on GitHub Quickstart Page, and Taskcluster doesn’t fully support re-runs (issue 4950). Testing PRs and releasing new versions is far too fragile, but is not the focus of the reliability effort in Q1. Hopefully the team will find some time to improve these aspects as well.

The three-day weekend gave me time to finish Learned Optimism by Martin Seligman. My sister recommended this in summer 2021, and it took me to November to start it. I got ⅔ of the way through, and then started over from chapter 1, taking notes as I read, 13,000 words worth. I found it interesting and helpful, and I’m trying the flexible optimism exercises to alter my thought patterns. My review on Goodreads has more details.

Asha’s Graduation Ceremony

The kids had an even shorter 3 day week. The COVID numbers dropped (0.82% of staff and kids, down from 1.28% last week), but we won’t know until next week if this is a true downward trend. We’re hearing daily stories of COVID positive tests – a neighbor on Wednesday, my brother-in-law on Friday, and plenty through the work networks. A COVID scare cancelled our training class last week, but our 24-week-old, 70lb puppy graduated on Friday.

Recommendations and links:

  • I’ve been using the comment
    <!-- vim:set wrap linebreak breakat&vim: -->
    in my Markdown notes, as suggested by this stackoverflow answer. It has made editing long lines more pleasant than regular wrapping.
  • OmSave for Safari has been a big improvement over OmniFocus’s default “Sent To” method, with the ability to set a project and tag, place the URL in the note, and even add some of the content.
  • We’re working through Season 3 of Buffy the Vampire Slayer, and it is still good fun. Our Senior finds some of the Senior year discussions too stressful to binge it. The creator Joss Whedon has gone from nerd hero to a problematic figure in the last few years, and the Vulture feature The Undoing of Joss Whedon goes into the details.
  • I found that Whedon profile link through Matt Enloe’s “What’s Good” newsletter. Newsletters in my email are the new RSS feed, and OmniFocus my new “Read Later” service.

One More Week: gpg key refresh

January 14, 2022

I’m going to blog about what happened in a week, instead of only blogging when I think I have something significant and permanent to say and I’m in the mood to blog.

I’m still moving teams within Mozilla, but the transition is now pending without a date. I’m trying to make the best of the this in-between time.

The transition continues to be slow.
“Snail” by jamieanne, CC BY-ND 2.0

At Mozilla, we use GPG for signing commits. We sometimes use it for encrypting secrets at rest, but the current favorite secret sharing tools are 1Password and magic-wormhole. When I was setting up my keys, I was convinced that an expiring key was a good idea, because if I lose access to the private key and can’t revoke it, at least it will expire eventually. However, this means that I need to refresh the expiration date. The article How to change the expiration date of a GPG key by George Notaras was published in 2010, but GPG doesn’t change much, so it is still relevant.

Signed commits with “Verified” tag

I schedule expiration for February, and set myself a reminder to refresh in January, along with the steps I take to do it. I publish keys 9ECA 5960 3107 8B1E and 082C 735D 154F B750 to keys.openpgp.org, after gpg.mozilla.org was taken down after a June 2019 attack. I also sync then to Keybase and recreate them in my Github account.

I cleaned up the MLS / Ichnaea documentation. PR 1764 includes a refresh of the deployment docs which is a couple of years overdue. I also updated the Location entry on the wiki and some other internal documents. This the end of my “quick” list of MLS updates, so I’m moving on to Taskcluster: fixing some build issues, reviewing and merging dependency updates, and thinking about how to monitor deployment health. I got a little stuck with building the generic-worker on my M1 MacBook, and a failing CI test, but both are surmountable.

For my next team, I read through the fx-private-relay codebase. I found this tip for getting the list of files tracked in a repo:

git ls-tree --full-tree --name-only -r HEAD

I then manipulated the output to to turn it into a list of Markdown links to the GitHub repository, and checked off each file as I viewed it in SourceTree, or on GitHub if it was more than 50 lines of code. Most of the action is in the email app, and a lot of that in emails/models.py and emails/views.py. There’s not as many tests as I would expect, and some integration tests may cover a bunch of functionality.

In my non-work life, the schools are struggling with Covid-19. Isaac was in remote learning one day, and got two notices of exposure, meaning he was in class with a Covid-positive person. Ainsley has two remote learning days. I’m so glad they are both vaccinated, and wear their masks. I’m so disappointed with Oklahoma’s leadership.

I got a promotion for a local pizza place’s 17th anniversary special, and got it for delivery. It was delivered almost 5 hours after ordering. It was a bad combination of a fantastic promotion, an online ordering system that didn’t turn itself off, and using a gig economy service for delivery. I don’t want to shame anyone, so I’m avoiding names. I went from a hero for ordering pizza, to an adequate dad for making spaghetti.

Finally, my grandmother turned 93 this week! I’m so grateful for her, and that she and my grandfather have stayed safe through this pandemic.

Recommendations and links:

  • magic-wormhole is magic for transferring files securely. If you need to resend, change the filename first, to avoid problems conflicting with the original file.
  • I’m taking notes in Markdown, and started using Marked 2 again to view them. The plugin itspriddle/vim-marked adds a :MarkedOpen command, that opens a rendered version and updates when saved.
  • I printed a few photos and a board-backed poster with Walgreens Photo. The photo webapp is OK, but had a couple of hiccups. If you can, crop to the desired size in a desktop photo app first. I got them printed the same day, and the results were decent and competitive with printing myself.
  • You can support the authors of some of the best MDN Web Docs content through Open Collective.

One More Week: Slow Transitions, Multi-Platform Image Creation on CircleCI

January 7, 2022

I’m going to blog about what happened in a week, instead of only blogging when I think I have something significant and permanent to say and I’m in the mood to blog.

I’m moving teams within Mozilla. This was supposed to be my first week on the Privacy and Security team, working on Firefox Relay, an email aliasing service. I spent some time talking with Se Yeon Kim, my on-boarding buddy, about the team, the Slack channels, and the other side projects we continue to maintain (she has shavar, I have ichnaea). The transfer process between teams is slow, so I have unexpected bonus time to work on old projects for the System Engineering team.

turtle
The mascot for team transitions
“turtle” by Jazminator

For Relay, I got the development environment working, mostly following the README. The tricky bit was getting the requirements installed with Python 3.7.12 via pyenv. I was getting errors for cryptography ('openssl/opensslv.h' file not found) and psycopg2 (ld: library not found for -lssl). The error message led to the solution from Building cryptography on macOS:

LDFLAGS="-L"(brew --prefix openssl@1.1)"/lib" CFLAGS="-I"(brew --prefix openssl@1.1)"/include" pip install -r requirements.txt

I didn’t contribute a change because I feel this is standard for developing on the native system versus a Docker container – you need to figure out stuff like this. However, I did notice a warning, which lead to re-opening issue #1201, django.urls.path warning when calling runserver function, and PR #1447, Fix swagger API schema endpoint. I reported a further warning to axnsan12/drf-yasg as issue #765, ruamel.yaml 0.17.0 has deprecated dump(), load().

For my SysEng work, I’m tackling small stuff that I can work to completion, but not anything that can risk breaking production when I’m gone. I have plenty of these things, too small to make it to quarterly planning but too large to work on in the gaps between other projects.

I spent some time on generating multi-platform docker images for Ichnaea, following the guide Building Docker images for multiple operating system architectures. This required moving from the docker executor to a machine executor, to enable buildx and installing different engines for QEMU emulation. This was the easy part, and the guide was very useful. However, building just linux/amd64 and linux/arm64 took 35 minutes, versus 5 minutes for linux/amd64 in a docker executor. I decided slower builds was not worth it, since I’ve worked out the kinks in M1 MacBook development, and we’re not planning to ship on ARM. I closed the experiment in PR 1762, made some config changes in PR 1763, and added notes to issue 1636. If I try again, I think I will build the images on native machines (i.e., ARM images on ARM machines), and combine the Docker manifests in a final step.

There’s persistent slowness in provisioning Azure workers, and I’ve looked into it between other projects. I used SourceTree to walk through the history of the Taskcluster Azure provider, and added my notes to bug 1735411. It looks like the incremental provisioning and de-provisioning of Azure workers is required, and has been built up over months. The problem is each increment requires a worker scanner loop, and these loops can be as long as 90 minutes, meaning it takes hours to get a worker up and running. The next step is to speed up the worker scanner loop, ideally a minute or less, so that each of those iterations is shorter. That could be optimization, or it could be time-limiting the work done in each loop. It will hopefully be interesting work for the next person.

Ichnaea dependency updates are batched at the start of the month, and I was able to get through them quickly with pmac (PR #1759 and PR #1761). None of the updates were significant. We’re still blocked on the SQLAlchemy update (unknown work to update database code) and Python 3.10 (packaging issue with a dependency).

In my non-work life, the kids were back in school Tuesday for a 4-day week. Masks are now recommended again, but we’re not as bad as NYC (yet). Both kids missed some time for health reasons, but not Covid. Asha, our Great Dane puppy, is up to 60 lbs, and will be back in training class tonight after a winter break.

Asha and Finn hanging out on the couch

Random recommendations from me and others:

  • The family has gotten into Project Zomboid, a top-down zombie survival game where you always eventually die, but you feel like you’re getting better.
  • I continue to play Castlevania: Grimore of Souls, a free-to-play-but-pay-to-win game that was ported to Apple Arcade, lost all the pay-to-win features, but continues to have an addictive upgrade cycle. It takes 5-10 minutes to play a level, but I find it easy to repeat for hours.
  • I’m slowly watching Get Back, the epic Beatles documentary. It’s 8 hours total, but each day is 15-30 minutes, so I’m watching a day at a time. I like the Beatles, but I am also fascinated by the team dynamics between the band members, and with those that support them and rely on them for their job.
  • My client-side Sentry event filtering code in Socorro silently broke. WillKG fixed it, and launched a new project kent as a fake Sentry server for testing integrations.
  • Will KG recommends CPython Internals, still likes Introduction to Algorithms, but thinks the Dragon book is showing its age.
  • Pete Moore recommends tup as a faster alternative to make, which is capable of building Firefox.

multigtfs 0.4.0 released (and 0.4.1, and 0.4.2)

August 2, 2014

I released multigtfs 0.4.0 on June 21, 2014, then 0.4.1 on July 11th, and 0.4.2 on July 20th.

multigtfs is a Django app that supports importing and exporting of GTFS feeds, a standard format for describing transit schedules. multigtfs allows multiple feeds to be stored in the database at once.

These releases are a big step forward. The highlights are:

  • Much faster imports and exports. For one large GTFS feed, the import went from 12 hours to 41 minutes. In a more typical feed, the import went from 29 minutes to 1 minute, and export from 5 minutes to 24 seconds.
  • Support for extra columns, used by many agencies to extend GTFS.
  • Extra configuration to make the Django admin usable after importing lots of data.
  • Python3 support

(more…)

Django management commands and out-of-memory kills

June 19, 2014

tl;dr: Smart developers run database-heavy operations as management commands or celery tasks.  If DEBUG=True, you’ll be recording each database access in memory.  Ensure DEBUG=False to avoid running out of memory.

I’m working on issue 28 for django-multi-gtfs, speeding up importing a GTFS feed.  My experience has been with Tulsa Transit’s GTFS feed, a 1.3 MB zip file, which takes about 30 minutes to import on my laptop.  arichev was importing the Southeast Queensland’s GTFS feed, a 21 MB feed which takes 9-12 hours to import.  It shouldn’t take half-a-day to import 3 million rows, even if many of them include PostGIS geometries.  So, it is time to optimize.

My optimization procedure is to:

  1. Measure the performance of the existing code against real-world data, to establish the benchmark,
  2. Create a representative data set that will demonstrate the problem but allow for reasonable iteration,
  3. Make changes and measure improvements against the test data, until significant improvements are gained,
  4. Measure against the real-world data again, and report the gains.

I was having a lot of trouble with step 1.  First, I had to clear ticket 10, which was preventing me from importing the Southeast Queensland data at all.  Next, I started an import on my MacBook Pro (2.4 GHz Intel Core i7 w/ 8 GB RAM) overnight, and woke to find my laptop unusable, requiring a hard reboot.  I spent some pennies and moved to the cloud.  A t1.micro instance was unable to import the Tulsa feed, killed by the out of memory killer.  That seemed odd, but I knew the free tier instances were memory constrained (600 MB RAM), with no swap.  I then moved to a m3.medium instance (3.75 GB RAM).  The Tulsa feed imported in 50 minutes (vs. 30 minutes on my laptop, which isn’t surprising).  However, importing the Southeast Queensland feed also trigger the OOM killer.

How can importing 165MB of text use up almost 4 GB of memory?

My suspicion was a memory leak, or some out-of-control caching.  Memory leaks are rare in Python, but possible.  Caching is a more likely danger – some helpful code that speeds up a web request could cause an offline task to eat up memory.  I remembered a recent Planet Python post on memory leaks, and quickly found mem_top.  There’s a few variations that all do the same basic thing, asking Python about current memory usage and identify the top memory users.  mem_top is an easy pip install, and easy to call:

from mem_top import mem_top
logging.debug(mem_top())

I added this code in several places, tested it against a small sample feed, and then ran it against a small but real feed. The output looks like this:

multigtfs.models.base - DEBUG - Memory hogs: 
refs:
102371	<type 'list'> [{u'time': u'0.003', u'sql': u'SELECT postgis_lib_version()'}, {u'time': u'0.001', u'sql': u'INSERT 
2510	<type 'dict'> {'_multiprocessing': <module '_multiprocessing' from '/Users/john/.virtualenvs/explore/lib/python2.7
444	<type 'dict'> {'SocketType': <class 'socket._socketobject'>, 'getaddrinfo': , 'AI_N
440	<type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
437	<type 'list'> [,  {'SocketType': <type '_socket.socket'>, 'getaddrinfo': , 'AI_NUMERICS
364	<type 'dict'> {'SocketType': <type '_socket.socket'>, 'getaddrinfo': , 'AI_NUMERICS
330	<type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
330	<type 'dict'> {'WTERMSIG': , 'lseek': , 'EX_IOERR': 74, 'EX_N
328	<type 'dict'> {'empty_provider': <pkg_resources.EmptyProvider instance at 0x105781ab8>, 'ExtractionError': 
5067	 <type 'tuple'>
4765	 <type 'dict'>
2736	 <type 'list'>
1862	 <type 'cell'>
1857	 <type 'weakref'>
1423	 <type 'builtin_function_or_method'>
1326	 <type 'wrapper_descriptor'>
1200	 <type 'type'>
1072	 <type 'getset_descriptor'>

The top memory consumer, which grew as the feed was imported, was that list of dictionaries, with timing and SQL queries.  There’s not enough here to nail down the issue, but it was enough to trigger my memory of the common advice to never run Django with DEBUG=True in production.  They’ve listed it in the FAQ under “Why is Django leaking memory?”, which then links to “How can I see the raw SQL queries Django is running?”  When DEBUG=True in your settings, you can access the SQL queries used by the connection:

>>> from django.db import connection
>>> connection.queries
[{'sql': 'SELECT polls_polls.id, polls_polls.question, polls_polls.pub_date FROM polls_polls',
'time': '0.002'}]

This is done in db/backends/__init__py:

    def cursor(self):
        """
        Creates a cursor, opening a connection if necessary.
        """
        self.validate_thread_sharing()
        if (self.use_debug_cursor or
            (self.use_debug_cursor is None and settings.DEBUG)):
            cursor = self.make_debug_cursor(self._cursor())
        else:
            cursor = util.CursorWrapper(self._cursor(), self)
        return cursor

make_debug_cursor returns CursorDebugWrapper from db/backend/util.py, which does some extra work:

class CursorDebugWrapper(CursorWrapper):
    def execute(self, sql, params=None):
        start = time()
        try:
            return super(CursorDebugWrapper, self).execute(sql, params)
        finally:
            stop = time()
            duration = stop - start
            sql = self.db.ops.last_executed_query(self.cursor, sql, params)
            self.db.queries.append({
                'sql': sql,
                'time': "%.3f" % duration,
            })
            logger.debug('(%.3f) %s; args=%s' % (duration, sql, params),
                extra={'duration': duration, 'sql': sql, 'params': params}

When you are running Django as a web server, the request_started signal will reset the queries to an empty list, so you’ll get fresh data for the next request. This is exactly what you want for single-machine debugging, and it’s how tools like debug toolbar give you valuable information about your SQL queries. Lots of SQL queries are usually the thing making your requests slow. The solution is usually to add smarter indexes, combine similar queries to reduce total database requests, read from a cache instead of the database, and defer writes to an async task runner like celery. I usually leave these optimizations until I’m pretty sure the business team is happy with the design.

This same query caching logic runs in offline tasks. Loading an image from disk into a BLOB? The base64-encoded version is in memory. Crunching usage data for engagement metrics? Here’s a handful of SQL queries for every instance in the system. And, surprising to me, running a management command also logs all of the SQL queries.

Setting DEBUG=False will disable this behavior. I suspect New Relic acts like DEBUG="Yes and lots of it". For one project, I was running my celery tasks under the New Relic Python agent, but I’m going to stop doing that, and see if some memory issues around stats crunching go away.

I want DEBUG=True for local testing, but I want to be able to run database-intensive management commands without having to remember to turn it off.  My solution is to add this snippet to the management command:

        # Disable database query logging
        from django.conf import settings
        if settings.DEBUG:
            from django.db import connection
            connection.use_debug_cursor = False

This uses the undocumented use_debug_cursor attribute, so it may not work with exotic database backends. It only disables debugging on the default connection, so you’ll have to do more work if you use multiple database connections. Iterating through django.db.connections should do the trick (see the reset_queries code).

Also, it looks like changes are in progress for Django 1.7. The “memory leak” question has been removed from the FAQ, and the queries code has recently been updated to use a limited queue, and to always turn on database query recording when DEBUG=True. Maybe this change will take care of the issue, but I’ll miss the ability to disable this for management commands, since I just discovered it last night.  I’ve added ticket #22873, we’ll see if they agree with my patch.

Update: The Django developer doesn’t want to change the behavior back.  The attribute use_debug_cursor should probably be named force_debug_cursor, since the intention is to force SQL logging for support of assertNumQueries.  I’ve got ideas for supporting my use case in Django 1.7, but the code will get more complex.  Oh well.

Where the time goes

November 2, 2009

I want to share a new thing I started today: I’m keeping a record of where my time goes. When I start a new task, I write down the time and what I’m doing. It is quite simple, but I’m seeing benefits already.

TimeNotebook.jpg

A few months ago, I was listening to the Manager Tools podcast One Less Thing, and they had a useful idea for anyone – use your calendar to record what you’ve done, and review it monthly to check that you are doing what is important, and find something to stop doing. It seemed almost like an engineering process – by measuring where the time went, you can adjust where the new time goes, and create a feedback loop.

I thought of how to do it electronically, using Google Calendar and my new work-provided Blackberry. I could use colors to designate what I considered important and unimportant. I could add notes for future searching. I could write an app to extract a summary of my month. I could write an app to automatically add entries based on the window where I was using the mouse the most!

Yeah, none of that happened.

As I was watching Merlin Mann’s latest monologue on productivity, I remembered the idea, and decided right now was the time to start, and I needed to start low-tech.

I grabbed a legal pad, and wrote “3:23 Play w/ Issac”, because my kid was crying and I was the only parent around. Also, I don’t know how to spell his name (sorry, Isaac). After a minute, I realized this was going nowhere, and decided instead to put him in the stroller and walk him and the dog down the street to get some coffee. Maybe he’d go to sleep?

It was a very nice Fall day, just the right level of chill and absolutely beautiful. Lots of people were out walking their babies and much better behaved dogs. Isaac enjoyed it all, and fell asleep on the walk back. Mission accomplished! However, I was a bit distressed to discover that it took 30 minutes, plus 8 minutes prep time, to go for the walk. Maybe next time I’ll go to the closer Starbucks rather than the superior Ristretto Roasters.

I swapped to a Moleskine cashier, because it fits in a pocket more nicely than a legal pad, and the tiny space keeps me from writing in too much detail.  I’ll see if I can keep it up.

Two wireless networks with WRT54G and OpenWrt

July 8, 2009

When I first moved to Portland, it was just me and my laptop and a few days until Comcast turned on the Internet.  Luckily, some neighbors had open wireless access points, and I was able to get maps, check bus routes, and move up my appointment for high-speed internet.  I wanted to extend the same courtesy to my neighbors and guests, so I configured my own open access point.  If Bruce Schneier does it, why shouldn’t I?

A friend of mine strongly disagreed. His argument was simple – I’m not Bruce Schneier, and neither is anyone in my family. Good point. He convinced me to put our computers on a encrypted channel, but I still wanted to provide easy access for guests and neighbors.  Luckily, there are some open source projects that make this possible.

I have a Linksys WRT54GL wireless router, which I bought for the express purpose of installing an open-source firmware and accessing the additional features of the Broadcom BCM5352 chip that drives it. A friend suggested the X-Wrt project, which adds administration web pages to the OpenWRT firmware. The installation wasn’t too difficult, once I figured out which image to use, and the web interface made the configuration painless.  It was easy to setup a WPA2-protected network, which worked great for our laptops and the Wii (the X-Box and TiVo don’t support WPA2).

Once I had a working encrypted wireless network, I started configuring the guest wireless network, and it was really tricky.  Tricky, but possible.  Isn’t that the way it is with Linux?  I’ve documented all the painful details, including 19 screenshots, after the jump.

(more…)

アキラ, a Successful Long Project

March 29, 2009

I recently finished Akira Club, the companion to Katsuhiro Ōtomo’s long Akira series (アキラ in Japanese). It collects all the original title pages from the serialized version that appeared in Young Magazine, as well as sketches, recollections, and other Akira memorabilia.

Cover of Akira Club

I enjoyed the peek at the creative process. Sketches are paired with the final drawings as they appear in the magazine. Katsuhiro would often redraw pages for the collected works. The conversion from the Japanese right-to-left originals to the Western left-to-right presentation was rarely as easy as flipping the image. A complete redraw was required to get the text bubbles right, and for details like keeping the colonel’s suit buttoned on the right side.

With each episode, he would include a short report on his life, such as the June 20th, 1983 entry:

I went to Kanazawa and saw the elephant that opened its toes. Tatsuhiko Yamigami’s manga was true!

Here’s the September 3rd, 1984 entry:

I’ve done nothing but work this summer. I hate tanned people!

Amazingly, Katsuhiro was using Twitter-style updates 25 years ago.

The Akira project took thirteen years, from 1982 to 1995. This included 120 episodes of 16 to 20 pages, six Japanese collections, six international collections, an animated motion picture, and finally Akira Club. Katsuhiro’s final words are:

With this book, at long last, I can finally put the final period on the final sentence of Akira. Looking back on it now, it seemed very long and very short… And for every time I felt “Oh, I drew that well,” there was a time I felt I hadn’t drawn enough. Well, that’s how these things are, I guess… Several years ago — I forget how to describe my feelings exactly — I found this expression scrawled in a corner of my drawing board:

“If you eat, whether it makes you laugh or cry, it all comes out shit.” I apologize to those who were eating.

I am fascinated with long projects like Akira. At the start, did Katsuhiro know it would take 200 pages to finish? If he knew, would he have started? How much of his life did he have to put on hold? There are 19 months between episode 87 and 88 – how hard was it to take up the project again after that long of a pause? Was his ill health caused in part by the project? How much of the work was driven by the artist’s impulse, and how much was the necessity of paying the bills?

I’ve seldom heard or read of an artist in the middle of a long project express contentment about the work. They just aren’t pleased with what they produce – it must be improved. This must be combined with real pressures to release, often before the artist is ready. Successful projects require a useful vision and deadlines. Take away the vision, and you get an unfocused mess (see Dreaming in Code). Take away the deadlines, and you get endless polishing and starting over for little gain (see Chinese Democracy).

So, what keeps your long projects going?  What are they lacking?

First Post

September 24, 2008

This isn’t the blog I wanted.

The blog I wanted was more of a blog / wiki hybrid, like Martin Fowler’s Bliki.  I’d carefully craft articles over several days, publishing the drafts in a special area for the preview comments.  When an article was fully cooked, it would be posted to the blog for wider consumption.  Links would bring together articles written at different times, with placeholders for articles still to be written.  Over time, it would evolve into an encyclopedia of my interests, with a generous helping of working code examples.  I even planned special markup for code, so that samples could be automatically tested and flagged for updating when they broke. It would be a glorious expression of my philosophy on work and life.

That blog will never get written.

Visions of perfection either remain visions, or they get imperfectly created.  I like to think about what is possible, but the huge amount of work to “get it right” means that I never start.  For me, procrastination isn’t laziness, it’s an unwillingness to risk imperfection.

It’s been about two years since I named my blog and conceived of the design.  All I have to show for it is a coming soon page. I’ll never have the time to implement the best blog I can imagine, so I’ll have to settle for the one I can do.

I’m already failing.  It’s been several hours since I decided to do it now.  I had to stop researching hosting providers (do it now, wordpress.com is good enough, we can always move the content later).  I had to close the themes pages (do it now, the theme is pointless without content).  I threw away several drafts of this post (do it now, no one’s first post looks any good).

I’ll continue to fail.  But that’s the point.  Failing but doing is better than not failing and not doing.

Plus, the next post will look awesome compared to this garbage.