One More Week: Terrier Programming

I spent most of my work week chasing down a bug. I worked late, took short meals, and was generally absorbed in the problem, even when I was in an unrelated meeting or “not at work”. It’s a kind of flow where I’m making progress, and it feels like I’m minutes away from a breakthrough, for hours and days. It’s like I’m a terrier that has caught the scent of vermin, and I’m sure it is around the next corner.

Our rat terrier Link, watching for rodents at a state park in 2020.

The bug was part of the Taskcluster update from Node 14 to 16, started in PR 5095. Everything passed but a single test. The other parts were easy – a test that changed due to Node 15, and taking the next step suggested by a well-commented TODO. Surely this would be similar, maybe a morning of work. However, the string just got longer the more I pulled.

The details are a bit boring (feel free to skim or skip!), and similar to past efforts when I’ve updated platforms:

  • On Monday, I narrowed CI to just test the one broken test, in a failed effort to simplify debugging. I did notice the broken test was the only one using a particular class. This led me to another Taskcluster repository, taskcluster/docker-exec-websocket-client.
  • On Tuesday, I followed that repo to another, taskcluster/docker-exec-websocket-server, which seemed to be the source of the issue. It had a similar test that worked on v14 and failed on v16. I spent the day trying and failing to get Taskcluster’s Community TC deployment to run the tests, so I could demonstrate the fix once I had one. This was also the day Taskcluster switched to Dependabot, and by the end of the day Community TC was too busy building other updates to build my next attempt at a CI configuration.
  • On Wednesday, I switched to a different worker type, and got CI running by the afternoon. I showed that minor dependency updates did not fix the issue. I also narrowed the issue to Node v16.0.0, the first v16 release.
  • On Thursday, I tried major dependency updates, in both the client and server libraries. This required reading some release notes and careful updates that slowed progress. Something went very wrong with the last update, requiring a reboot. I called it a night with an OS update.
  • On Friday, I updated the last dependency, but still the bug persisted. This meant the issue was probably in our code and how it interacted with v16. I got into a cycle of adding debug statements, re-running tests, comparing the output under v14 and v16, and starting over. While tests ran, I read copious release notes for v16, looking for clues. After lunch, I gave myself a 4 PM deadline, to give me a chance to catch up on email. At 3:30 PM, I noticed a side-note in the dockerode README, that a “hijacking” feature in Docker Engine v1.22 (February 2016!) allows stdin, stdout, and stderr on the same websocket. I tried the setting, and it fixed the error! I spent the next two hours removing all my debug statements and packaging it up for PR 35.

There’s more to do Monday, but it is just details and sequencing code merges.

This is typical of bug work. When I started last week, I thought it would take most of an afternoon to update to Node v16. Instead of 4 hours, it is going to be closer to 80 hours. If I’d known, I might have spent my last weeks on the SysEng team doing something else. At the same time, it will be a real benefit to be on Node v16, and to have this bug fixed (which is also broken in Node v14 and earlier, just in a different way that wasn’t tested). If I’d stopped at 3 PM, 30 minutes from a solution, there is a real chance the next developer would need to spend 2 weeks to get to the same point, or the team may have decided to accept a known bug. And maybe that would have been the right choice, but I’m glad to have the win!

Finn catches a snowball while Asha watches

I’m too familiar with terrier programming mode. My focus was on chasing down the bug, and I was confident I could, but I started to neglect other things. Kourosh Dini’s Weekly Wind Down newsletter was on this topic, which he calls the Dark Side of Flow. It is seductive to get focused on a single task to the exclusion of others, justifying it with a deadline, but this can lead to hyper-focus on urgent tasks followed by extended procrastination for the non-urgent tasks of life and work, and a feeling of despair and being out of control.

It was not as bad as last year’s CTMS projects, where I spent day after day plowing through the task backlog without pause to make the deadlines. This time, I read my email, most days. I stopped around 6, most days. I did some laundry, did some dishes, made a few meals, played some games, and watched some TV. We had a February snow that cancelled school for three days, and I took time on Thursday to go sledding with the kid, then shovel the driveway. I got behind on email, my OmniFocus projects, and this blog, but I caught up today (Saturday).

It is fun to be in terrier programming mode, and fun to catch the vermin and kill it. It is important to know the goal before I start, and to agree that it is worth the unknown time it will take. Even in the middle of a multi-day flow, there’s still time to keep up with the important stuff.

Recommendations and links:

  • Kourosh Dini writes about productivity as a way to live the good life, and I appreciate his tips and his insights. Sign up for his newsletter for a taste.
  • why-is-node-running is a useful package to figure out why Node is waiting, which I found in a Stack Overflow comment. We were using mocha --exit, but this helped me quickly figure out that we needed to shut down some more resources.
  • I’m a fan of integration-level tests as documentation. A test should verify that a feature works, and maybe tell why that feature is important. One hard part of this bug was that it was clear in retrospect that one aspect (the exit code) was working, but not another (the output of the command). That test should have broken years ago, and someone should have been able to read the test to determine what feature just broke. A test should be more useful than a Check Engine light.
  • Splendor is a good game for 2 to 4 players. It is easy enough to learn, the coins have a weight like good poker chips, and a game finishes quickly. By the time someone reaches 15 points, everyone feels like they were two turns from winning. The kids enjoy it, but it still takes some persuasion to get them to play.
  • Billy Preston is a bright presence in the Beatles’ Get Back. One great moment in Part 3 involves him getting fascinated by Lennon’s new Stylophone. My kid got his Stylophone recreation last year, and enjoyed playing it for longer than almost any other instrument.
