Archive for February, 2022

One More Week: My first Relay production incident

February 28, 2022

We had a short outage in production this week. Exciting! The system recovered, no data was lost, and we learned a bit about our environment. Maybe.

Se Yeon checked the website, and found that it was unresponsive. The health checkpoints were not responding, and Kubernetes was shutting down app nodes until there wasn’t anything left. Aaron scaled them up, and the system recovered. But why did it go down?

We use several Amazon services for Relay. The Simple Email Service (SES) receives emails from the world, and stores them in the Simple Storage Service (S3). The Simple Notification Service (SNS) tells the Relay app about the new message, by calling the /emails/sns-inbound endpoint. To process the request, the Relay app then fetches the email from S3, analyzes it, forwards an email using SES, and then tells SNS that it is done. If there’s an error, the SNS will retry again, over and over, for an hour. It then sticks the email in another bucket, handled by a different process.

A delay in getting data from S3 results in SNS giving up. This is my first Mermaid sequence diagram, I’ll get better.

My best guess is that reading from S3 was slower than usual, taking longer than SNS was willing to wait. While Relay continued to wait on S3, SNS gave up, which is treated like an error, and tried calling /emails/sns-inbound again. This repeat request was also slow, and tied up another Relay inbound connection. Soon, the Relay app was doing nothing but waiting on S3, and there was no more connection capacity, including for the health check. With health checks not responding, Kubernetes did what we told it – terminated the hung apps and started new ones. Maybe it helped, or maybe they proceeded to fill up all the connections and wait on S3.

The good news is that the system did not lose data. SNS churned for a bit, but when the issue resolved itself, it retried the messages and they were successfully processed.

This did expose a few issues. An S3 error handler had an error itself, causing more issues. The default S3 client automatically retries (a “legacy retry mode” with exponential backoff). We switched to the “standard retry mode” and turned off retries for the S3 client, since we want to fail quick and let SNS retry (PR 1567). We also tuned our alerting to tell us things are bad before our users do.

There are some long-term issues as well. I’m guessing at the cause of the incident, based on timing data for the /emails/sns-inbound endpoint, the S3 client configuration, and some previous experience. However, we don’t have detailed metrics on external API calls, much less trends leading up to the incident, to say for sure what happened. We could use some better telemetry and automated reporting. We’re doing a lot of work in an endpoint, and instead we could capture the data and process it in a different job, but that could just move the issue to somewhere less visible. I prefer to add metrics, measure, and then adjust the infrastructure, but I think implementing more scalable infrastructure first is valid as well.

Leaky Bore, Montecollina” by Photoma*, CC BY-NC-ND 2.0

This is pretty typical for a project that glues together third-party services. It is hard to tell what data you’ll get and how fast it will come. Some issues don’t appear until you’re at production load, or at 2x growth, or 10x growth. I’m pretty happy with the scope of this incident, and that the stream is still a trickle and not a flood.

I had Monday off, but the kids had a whole week without school. We also had a winter storm, starting with hail that resembled Sonic ice cubes, and would have shut down school if it was in session. Everything around me was screaming “Winter Break!”, and it was cold in my basement office. I didn’t take time off, because it was a short work week anyway, but made sure to spend some evenings with the kids and in front of the fireplace.

The dogs didn’t get a lot of walks due to the winter storm. My wife and I took them to a local park, to train the puppy and get the older dogs some exercise. The ground alternated between mud and ice, and the dogs were ungovernable. Very little training was done, Finn cracked the ice and swam a bit, and many squirrels were threatened. February is (hopefully!) the end of winter weather, and very much overstays its welcome.

Some recommendations:

  • Don’t commit to doing something weekly, unless you really enjoy it. Days late with this one!
  • I’m enjoying Babylon 5 on HBO Max. I watched it back in the 90’s, but missed the last seasons in college. I think they remastered the CGI, but I don’t watch it for the effects. The characters are excellent, and they slowly tease out the “long plot” over several episodes. The aliens steal the show – Peter Jurasik’s Londo, Andreas Katsula’s G’Kar, and Bill Mumy’s Lennier are fun characters, and well acted. The action is slow enough that I can fold laundry and still catch what’s happening.

One More Week: First weeks on Firefox Relay

February 22, 2022

I joined the Firefox Relay branch of the Privacy and Security team on February 9th. It was quite shocking to join a new team in the middle of a sprint, and that has two to three times the people working on it as my previous team. It has been nearly two work weeks since I started, and I’m starting to find my place.

Finn the dog looks up from the floor in the back seat of a car
Finn found his own spot for our car trip. Not shown: two other dogs and four humans.

Relay allow users to create an email alias that they can submit to online services instead of their regular email. Emails sent by the service are forwarded to their “real” email, and they can reply to these emails but not send new ones, which reduces a bunch of possible abuses. Users can create five aliases for free, and can pay for more aliases, and for additional features such as a custom domain. If they no longer want email from that service, they can turn off that alias and stop getting email. In addition, the service can’t use their primary email to figure out who they are on other services, or relationships to other people via email.

Mozilla has a few initiatives where we’re trying to get users to pay for a service directly. The Mozilla VPN is developed by the Privacy and Security team as well, and Pocket is still going strong. The team is growing, with many new hires, and I’m one of the people that have been at Mozilla the longest. There’s also a start-up mentality about announcing new features as part of releases, rather than as they are in development. This is a switch from the “develop in the open” strategy of Firefox and other Mozilla products, and I’m going to err on the side of silence about my feature work as I figure it our. The code remains open source, so the truly curious can see features a few days in advance.

I started preparing to join the new team back in December. I created a team document from our company phonebook, with everyone’s name, profile picture, location, and other information. If they had a GitHub profile, I followed them. I continued working on this after joining the team, adding what they looked like in Zoom as well as their “Video off” images, collecting facts and information, and starting a “interview” list of things I want to know (and can ask, since I’m new). It felt creepy at first, but I believe it has sped the process of putting names to faces, and has helped me navigate meetings with a dozen or more participants.

I am still exhausted at the end of most days. It will take time for me to build my social muscles up again, and be used to multiple meetings a day with several people.

I read through the code back in January, and haven’t repeated that this month. I’m not thrilled with the development process, which requires setting up external services to test everything. At the moment, the application is some light Django plus a bunch of glue between several cloud services, and you need the configured cloud services to do much of anything. Also, the application requires incoming email, which gets forwarded to another email address. Email is hard to fake, but email addresses are easy to acquire, so most testing is integration testing. Difficult development comes with the territory.

I’m worried bugs are slipping through the cracks since most wierdness will only happen in production, and it can be challenging to reproduce bugs locally, even when you have details of the data. I’ve tackled issues like this before, but it should be worth surveying the current development environment and see if there are better tools and techniques for working with code that relies heavily on external services. I also think we’ll need some intense logging and event processing, to find those sharp corners of the internet as we hit them. I’m forming a presentation framework in my head, which means I should start prototyping soon rather than get stuck polish my pitch.

An orchid with 4 or 5 blooms, the bottom one fully open
The third bloom for this orchid

Outside of work, the weather has been predictably unpredictable for February. Some days are crisp and pleasant, others like today are below freezing and windy. The kids had a three-day winter weather break, which seconded as a Covid-19 break, and they are now on a “President’s Week” break. The weather co-operated on Sunday, when we drove an hour to Greenleaf State Park with the dogs and took a long hike near the lake. My wife twisted her ankle a little due to Asha pulling, and my teen got a bit worn down, but otherwise it went very well. We stopped at Donna’s Malt Shop in Bragg, OK, so none of us lost any weight.

Some recommendations:

  • Piranesi by Susanna Clarke – This was a pleasant, atmospheric book, with a narrator that figures things out a little slower than the reader, or even comes to the opposite conclusions. It reminded me of The Slow Regard of Silent Things by Patrick Rothfuss.
  • Maintain an orchid – We picked out an orchid for our anniversary a few years ago, and I got a second for an occasion I can’t remember. I’ve thought of orchids as tender hothouse flowers, and learned they instead are hardy as weeds, and will re-bloom if given the occasional drink and time. Our anniversary orchid is on the third bloom, and will be due for a re-potting in a couple of months.
  • sparkmeter/sentry2csv – It exports Sentry issues to a CSV file, which I can then upload to Google sheets for collaborating and sorting. It worked as advertised, even on Mozilla’s self-hosted, sometimes-buggy Sentry instance.
  • Working Effectively with Legacy Code by Michael Feathers – The VPN team is looking at standards for code coverage, a favorite topic of mine. I got to recommend this book, and was thrilled by the table of contents again:
    • Chapter 13: I Need to Make a Change, but I Don’t Know What Tests to Write
    • Chapter 14: Dependencies on Libraries Are Killing Me
    • Chapter 15: My Application is All API Calls
    • Chapter 16: I Don’t Understand the Code Well Enough to Change It
    • Chapter 17: My Application Has No Structure
    • Chapter 18: My Test Code Is in the Way
    • … and it just goes on. It really addresses the stress of testing and development.
  • K Lars Lohn (Two Braids) has posted a new maze, The Ant in the Sunflower, and is starting a Patreon to support his work.

One More Week: Fixing a 7-Year-Old Bug

February 13, 2022

This was my last few days with the System Engineering team, and the first few with the Privacy and Security team. I barely finished the Node 16 update before switching, and was already a bit tired before being overwhelmed by new faces and a ton of documentation.

Our rat terrier Link, attacking his brother under the couch, August 2021

I spent last week working in two neglected libraries, taskcluster/docker-exec-websocket-client and taskcluster/docker-exec-websocket-server, getting them in CI, updating to Node v16, and eventually finding a setting to fix an issue. It felt good to find that bug before my weekend!

The code review of PR 35 and the follow-up (in PR 36 and 37) took a lot of the day, then releasing the 3.0.0 versions of each library. It was late Monday before I bumped the libraries in the original Taskcluster PR 5095. When the updated code ran in CI, I was devastated – the test was still broken. Well, broken in a different way. Previously, it timed out, and now it completed, but now it said the input (a random megabyte of data) didn’t match the output (that data piped through cat in the container). I now had one day left to fix it.

I spent a bit of time Tuesday trying to run it locally, but that had been a dead-end a few weeks ago as well. I tried just the library updates with the existing Node v14 version, and the test failed in exactly the same way! I narrowed the CI jobs to just this test, and added some debugging, and then noticed that the test code was:

assert(data.compare(buf), 'buffer mismatch');

The string .compare() function is in a lot of languages (I first encountered it as strcmp in C), and is often used when sorting. It returns -1 if the first item “sorts” earlier, 1 if the second item sorts earlier, and 0 if they are equal. With the library updates, the strings were now equal, .compare() was returning 0, and the assert was failing on a falsy value. Reading the code, the author of this test wanted to use .equals(), returning true if equal or false if not. I’ve made a similar mistake, and seen others as well. I appreciate that Python 3 dropped the cmp operator and __cmp__ magic method, and eliminated a lot of developer confusion.

Digging in a little, this change was made 7 years ago, around Node 0.10.x or 0.12.x, and close to when .compare() and .equals() were added. Before that, a manual byte-by-byte method was used to compare the buffers, which was probably correct. So, this code has been broken for 7 years, and it took Node v16 to break it in a new way and get some attention.

Luckily, this just took a day to figure out, and we were able to merge Node v16 on my last day. I then transitioned to the Privacy and Security team, to work on Firefox Relay. It is a very different project than the ones I’ve been working on for the last three years, but I think tests and testability will continue to be a focus on the new team. I’ll write more about the new team next week.

I’m a bit tired to make my own recommendations, so here’s some stuff my 11-year-old has been into:

  • Root Beer Milk – I saw this at the store and got it on a whim, and the kid likes it and asked for more. It seemed like a bizarre combo at first, but it is basically a blended root beer float.
  • Roblox – He and his friends continue to play this weekly. It feels a bit like the Flash game scene of a decade or two ago, where game developers can experiment with game types and incentives, or copy ideas from other games, TV shows, or movies. There’s a lot of 3D games, but I’ve seenhim play a game that looks a lot like Populous 2, and another like a Civilization clone. It’s making me interested in RBLX.
  • Minecraft – v1.18 includes the second part of the Caves and Cliffs update. The bedrock layer dropped from Y=0 to Y=-64, adding huge caves where it is easier to stumble on diamonds, or get attacked from a distance. There’s also more mountain biomes and structural variety. He’s bought a realm, and is tweaking server settings again.

One More Week: Terrier Programming

February 5, 2022

I spent most of my work week chasing down a bug. I worked late, took short meals, and was generally absorbed in the problem, even when I was in an unrelated meeting or “not at work”. It’s a kind of flow where I’m making progress, and it feels like I’m minutes away from a break though, for hours and days. It’s like I’m a terrier, that has the sniff of vermin, and I’m sure it is around the next corner.

Our rat terrier Link, watching for rodents at a state park in 2020.

The bug was part of the Taskcluster update from Node 14 to 16, started in PR 5095. Everything passed but a single test. The other parts were easy – a test that changed due to Node 15, and taking the next step suggested by a well-commented TODO. Surely this would be similar, maybe a morning of work. However, the string just got longer the more I pulled.

The details are a bit boring (feel free to skim or skip!), and similar to past efforts when I’ve updated platforms:

  • On Monday, I narrowed CI to just test the one broken test, in a failed effort to simplify debugging. I did notice the broken test was the only one using a particular class. This led me to another Taskcluster repository, taskcluster/docker-exec-websocket-client.
  • On Tuesday, I followed that repo to another, taskcluster/docker-exec-websocket-server, which seemed to be the source of the issue. It had a similar test that worked on v14 and failed on v16. I spent the day trying and failing to get Taskcluster’s Community TC deployment to run the tests, so I could demonstrate the fix when I fixed it. This was also the day Taskcluster switched to Dependabot, and by the end of the day Community TC was too busy building other updates to build my next attempt at a CI configuration.
  • On Wednesday, I switched to a different worker type, and got CI running by the afternoon. I showed that minor dependency updates did not fix the issue. I also narrowed the issue to Node v16.0.0, the first v16 release.
  • On Thursday, I tried major dependency updates, in both the client and server libraries. This required reading some release notes and careful updates that slowed progress. Something went very wrong with the last update, requiring a reboot. I called it a night with an OS update.
  • On Friday, I updated the last dependency, but still the bug persisted. This meant the issue was probably in our code and how it interacted with v16. I got into a cycle of adding debug statements, re-running tests, comparing the output under v14 and v16, and starting over. While tests ran, I read copious release notes for v16, looking for clues. After lunch, I gave myself a 4 PM deadline, to give me a chance to catch up on email. At 3:30 PM, I noticed a side-note in the dockerode README, that a “hijacking” feature in Docker Engine v1.22 (February 2016!) allows stdin, stdout, and stderr on the same websocket. I tried the setting, and it fixed the error! I spend the next two hours removing all my debug statements and packaging it up for PR 35.

There’s more to do Monday, but it is just details and sequencing code merges.

This is typical of bug work. When I started last week, I thought it would take most of an afternoon to update to Node v16. Instead of 4 hours, it is going to be closer to 80 hours. If I’d known, I might have spent my last weeks on the SysEng team doing something else. At the same time, it will be a real benefit to be on Node v16, and to have this bug fixed (which is also be broken in Node v14 and earlier, just in a different way that wasn’t tested). If I’d stopped at 3 PM, 30 minutes from a solution, there is a real chance the next developer would need to spend 2 weeks to get to the same point, or the team may have decided to accept a known bug. And maybe that would have been the right choice, but I’m glad to have the win!

Finn catches a snowball while Asha watches

I’m too familiar with terrier programming mode. My focus was on chasing down the bug, and I was confident I could, but I started to neglect other things. Kourosh Dini’s Weekly Wind Down newsletter was on this topic, which he calls the Dark Side of Flow. It is seductive to get focused on a single task to the exclusion of others, justifying it with a deadline, but this can lead to hyper-focus on urgent tasks followed by extended procrastination for the non-urgent tasks of life and work, and a feeling of despair and being out of control.

It was not as bad as last year’s CTMS projects, where I spent day after day plowing through the task backlog without pause to make the deadlines. This time, I read my email, most days. I stopped around 6, most days. I did some laundry, did some dishes, made a few meals, played some games, and watched some TV. We had a February snow that cancelled school for three days, and I took time on Thursday to go sledding with the kid, then shovel the driveway. I got behind on Email and my OmniFocus projects and this blog, but I caught up today (Saturday).

It is fun to be in terrier programming, and fun to catch the vermin and kill it. It is important to know the goal before I start, and to agree that it is worth the unknown time it will take. Even in the middle of a multi-day flow, there’s still time to keep up with the important stuff.

Recommendations and links:

  • Kurosh Dini writes about productivity as a way to live the good life, and I appreciate his tips and his insights. Sign up for his newsletter for a taste.
  • why-is-node-running is a useful package to figure out why Node is waiting, which I found in a Stack Overflow comment. We were using mocha --exit, but this helped me quickly figure out that we needed to shut down some more resources.
  • I’m a fan of integration-level tests as documentation. A test should verify that a feature works, and maybe tell why that feature is important. One hard part of this bug was that it was clear in retrospect that one aspect (the exit code) was working, but not another (the output of the command). That test should have broken years ago, and someone should have been able to read the test to determine what feature just broke. A test should be more useful than a Check Engine light.
  • Splendor is a good game for 2 to 4 players. It is easy enough to learn, the coins have a weight like good poker chips, and a game finishes quickly. By the time someone reaches 15 points, everyone feels like they were two turns from winning. The kids enjoy it, but it still takes some persuasion to get them to play.
  • Billy Preston is a bright presence in the Beatles’ Get Back. One great moment in Part 3 involves him getting fascinated by Lennon’s new Stylophone. My kid got his Stylophone recreation last year, and enjoyed playing it for longer than almost any other instrument.