One More Week: Which debt to pay first?

March 7, 2022

The Firefox Relay sprint is wrapping up, and I can talk about the features soon. I haven’t done much on this one, but I did get to review some PRs and get more familiar with the code. I’m seeing plenty that I’d like to change, but I’m still thinking about the order of work. I’m often way off on my estimates, and if something takes longer than planned, I want to be sure that it is still worth completing. We have a short “innovation sprint” planned next, which should allow me to get one or two things completed.

Luke orders from So Bahn, the pop-up kitchen run by Se Yeon and family

There aren’t many automated tests, but they have been added rapidly in the last few months, reaching 60% coverage. Some additional tests would be useful, and it would also be nice to have metrics showing this progress. Luke added code coverage to the test runs, and I added XML test and coverage output (see PR 1576), but I stopped short of integrating with a tracking service like Coveralls or Codecov. It was not easy to get this working with CircleCI’s remote Docker environment, and once I get all the pieces in place, it will probably be worth a stand-alone blog post.

I’m suspicious of coverage for the sake of coverage, although it is useful, and 100% is achievable for new projects. I do think there are benefits to structuring the code for testing, including easier development environments and a clean separation from services. It will take some work to create seams between the services and the code, which will allow application code to be completely tested, while interface code is dumb and monitored in deployments. As the project grows, the application code will grow larger, and the glue should be a smaller fraction of the code.

In deployments, Sentry is used for capturing exceptions, as well as 50x return codes and missing translations. Se Yeon has been interested in Sentry issues for a while, and has started a weekly meeting to triage the new ones, so we’ve all been staring at Sentry events recently. Some of the data is duplicated – exceptions are logged with tracebacks, and then again as a 500 Server Error. There are also a lot of unactionable warnings from security probes. There’s some work to do to ensure we’re using Sentry effectively. I started by adding the deployment version (PR 1573), and there’s a little more to go.
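For reference, the deployment-version work boils down to passing a release when initializing the SDK. This sketch uses the public sentry-sdk API, but the environment variable names here are assumptions, not Relay’s actual configuration:

```python
import os

import sentry_sdk

sentry_sdk.init(
    dsn=os.environ.get("SENTRY_DSN"),
    # Tag every event with the deployed version so issues can be tied
    # back to a release; the variable name is a made-up placeholder.
    release=os.environ.get("DEPLOY_VERSION", "dev"),
    # Separate production noise from stage/dev noise during triage.
    environment=os.environ.get("DEPLOY_ENV", "local"),
)
```

With `release` set, Sentry can group issues by deployment and show which version first introduced an error.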

We’re sending some statsd-style metrics to our InfluxDB server, and ingesting some from our cloud logging tools. There’s a lot, but I see some opportunities to use tagging for more effective displays, and more features of Grafana that we could use to display the data. There are also stats from other services that may be useful to bring in, but some of those will be cross-team efforts. I don’t plan to tackle those until later in the year at the earliest.

InfluxDB does well with operational data and limited use of tags, but has trouble with high-cardinality data, like most real-time metrics systems. Brian Pitts, a former Mozilla SRE, convinced me to emit operational data as metrics, and longer-term detailed data as a canonical log, one per request or backend transaction. I like to use structlog, coercing it to emit Mozilla’s favored MozLog format, with processes to ingest the data into BigQuery data stores. There’s some work to integrate the tools, document the format, and get to one log per request.
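A sketch of what one canonical log line per request could look like, using only the standard library. The envelope keys follow my reading of the MozLog convention, and the field values are invented; in practice structlog would supply the plumbing:

```python
import json
import os
import socket
import time


def canonical_log_line(logger, event_type, fields):
    """Render one MozLog-style envelope for a request.

    The envelope keys here follow the MozLog convention as I
    understand it; treat the exact names as an assumption.
    """
    return json.dumps({
        "Timestamp": time.time_ns(),       # nanoseconds since the epoch
        "Type": event_type,                # e.g. "request.summary"
        "Logger": logger,                  # the emitting application
        "Hostname": socket.gethostname(),
        "EnvVersion": "2.0",               # MozLog envelope version
        "Severity": 6,                     # syslog "informational"
        "Pid": os.getpid(),
        "Fields": fields,                  # everything known about the request
    })


# One line per request, carrying all the detail for later analysis.
line = canonical_log_line(
    "fx-private-relay",  # hypothetical logger name
    "request.summary",
    {"path": "/emails/sns-inbound", "status": 200, "s3_time_ms": 83},
)
```

Each request emits exactly one such line, which a pipeline can then ship into BigQuery for the high-cardinality questions that InfluxDB handles poorly.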

Relay requires a bunch of cloud services to work, which means that local development is partial or requires provisioning resources. We have a development deployment, which becomes a contested shared resource toward the end of a sprint. There may be ways to emulate the services in development, either by swapping in fake versions via configuration, or mock tools like localstack or moto. Or maybe we should lean into using real services, and automate per-developer provisioning.
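A minimal sketch of the configuration-swap option, with a hypothetical in-memory fake standing in for the S3 client (the class and setting names are mine, not Relay’s):

```python
class FakeS3:
    """In-memory stand-in for the S3 client in local development."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return {"Body": self._objects[(Bucket, Key)]}


def get_s3_client(use_fakes):
    """Pick the fake or the real client based on a settings flag."""
    if use_fakes:
        return FakeS3()
    import boto3  # only needed when talking to real AWS
    return boto3.client("s3")


# Local development: no AWS credentials or provisioning required.
s3 = get_s3_client(use_fakes=True)
s3.put_object(Bucket="relay-emails", Key="msg-1", Body=b"raw email bytes")
assert s3.get_object(Bucket="relay-emails", Key="msg-1")["Body"] == b"raw email bytes"
```

The same seam also helps unit tests, which is one reason I like pushing service clients behind a configuration switch rather than scattering `boto3` calls through the views.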

There are some other possibilities for quick work. Others are preparing to convert the frontend to React, which may allow some automated front-end testing. Black left beta recently, and Django is considering re-formatting in DEP-0008. Relay could use it and other linting tools. Relay is also due for an upgrade to Python 3.9 and Django 3.2.

There’s a lot I could do, but my job is not to polish the code to perfection. Relay is still finding market fit, and features are a big part of that. I’ll need to ensure I’m shipping some of those features, even if my interest is in code polishing.

I’m leaning toward this order:

  • Ensure Sentry is tracking only actionable errors, so we can discover issues before users
  • Document logging and metrics, and then write a short proposal for future changes
  • Update to Python 3.9 and Django 3.2, to avoid lagging behind our toolset
  • Implement canonical logging and an example report, then build it out as we build features
  • Refactor code, moving external services to the edges

I’m looking for ways to minimize the time I spend implementing the first versions, and save the bulk of the work as collaborative efforts or as part of feature work. I also need to get started while no one is expecting much of me!

In other work, Mozilla is getting its GitHub projects in order again. I was the last to touch several projects, so I’ve been asked about a few. I think mozilla/ci-docker-bases should be deprecated by the fall, replaced by CircleCI’s convenience images like cimg/base. The app mozilla/django-dnt should be archived, since the W3 DNT working group shut down in January 2019, and DNT detection should be in the client-side code anyway. The app mozilla/django-tidings should be absorbed into kitsune, the SUMO engine, now that MDN has moved from kuma to yari, and uses GitHub for content notifications. I expect to do a lot of the lifting to retire these projects.

Finn recovers from his traumatic week

On the home front, our dog Finn caught a plastic latch in the lower eyelid, and required some emergency vet work. The latch was part of the “safety net” for our trampoline, and he enjoys chasing his sister Asha around it. It appears to have missed damaging his eye, and after a day of recovery, he’s back to chasing Asha around. He is getting some additional attention, and we’re taking daily close-ups to monitor the swelling.

My kid continues to be interested in PC upgrades, and spent his money on an SSD since he’s used 950 GB of the 1 TB drive. We hooked it up and thought it was a dud, but after a night’s sleep, I remembered how electronics work, and we connected it to the power supply as well. Now he has 2 TB to fill up with Minecraft and Roblox downloads.

Recommendations:

  • So Bahn 82, Se Yeon’s family restaurant, took over the takeover kitchen at Mother Road Market. I enjoyed the So Bahn fried chicken, the rice cakes, and the baked corn cheese. I’m excited to see it come to downtown Tulsa soon!
  • Sarah Bird is implementing https://github.com/mozilla-services/cjms in Rust, and recommends Zero To Production In Rust: An introduction to backend development as a practical guide.
  • Twitter is a waste of time, and I continue to read it every day. I feel like it is surfacing good content about the war in Ukraine, so maybe it has figured out how to be more useful and less of a weapon.
  • I finished Get Back, the lengthy Beatles documentary. I endorse breaking it up into “days” – watch a day of filming, take a break, and watch the next. Most of the drama is in the first part, and they seem to get into the groove of the project by the third part.


One More Week: My first Relay production incident

February 28, 2022

We had a short outage in production this week. Exciting! The system recovered, no data was lost, and we learned a bit about our environment. Maybe.

Se Yeon checked the website, and found that it was unresponsive. The health checkpoints were not responding, and Kubernetes was shutting down app nodes until there wasn’t anything left. Aaron scaled them up, and the system recovered. But why did it go down?

We use several Amazon services for Relay. The Simple Email Service (SES) receives emails from the world and stores them in the Simple Storage Service (S3). The Simple Notification Service (SNS) tells the Relay app about the new message by calling the /emails/sns-inbound endpoint. To process the request, the Relay app then fetches the email from S3, analyzes it, forwards an email using SES, and then tells SNS that it is done. If there’s an error, SNS will retry, over and over, for an hour. It then sticks the email in another bucket, handled by a different process.
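The flow can be sketched in a few lines of hedged pseudocode. The handler shape, message fields, and stub clients here are all hypothetical illustrations, not Relay’s actual code:

```python
import json


class StubS3:
    """Stand-in S3 client so the flow can run without AWS."""

    def get_object(self, Bucket, Key):
        return {"Body": b"raw email"}


class StubSES:
    """Stand-in SES client that records what it would have sent."""

    def __init__(self):
        self.sent = []

    def send_raw_email(self, RawMessage):
        self.sent.append(RawMessage)


def sns_inbound(body, s3, ses):
    """Handle one SNS notification about a stored email."""
    note = json.loads(body)
    # 1. Fetch the raw email that SES wrote to S3 (the slow step in the outage).
    raw = s3.get_object(Bucket=note["bucket"], Key=note["key"])["Body"]
    # 2. Analyze and forward the email via SES (analysis elided here).
    ses.send_raw_email(RawMessage={"Data": raw})
    # 3. A 200 tells SNS the message is handled; anything else triggers retries.
    return 200


ses = StubSES()
status = sns_inbound(json.dumps({"bucket": "relay", "key": "msg-1"}), StubS3(), ses)
```

The key detail for the incident: the handler holds an inbound connection open for the whole S3 fetch, so a slow step 1 ties up capacity for every retry SNS sends.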

A delay in getting data from S3 results in SNS giving up. This is my first Mermaid sequence diagram, I’ll get better.

My best guess is that reading from S3 was slower than usual, taking longer than SNS was willing to wait. While Relay continued to wait on S3, SNS gave up, which is treated like an error, and tried calling /emails/sns-inbound again. This repeat request was also slow, and tied up another Relay inbound connection. Soon, the Relay app was doing nothing but waiting on S3, and there was no more connection capacity, including for the health check. With health checks not responding, Kubernetes did what we told it – terminated the hung apps and started new ones. Maybe it helped, or maybe they proceeded to fill up all the connections and wait on S3.

The good news is that the system did not lose data. SNS churned for a bit, but when the issue resolved itself, it retried the messages and they were successfully processed.

This did expose a few issues. An S3 error handler had an error itself, causing more issues. The default S3 client automatically retries (a “legacy retry mode” with exponential backoff). We switched to the “standard retry mode” and turned off client-side retries, since we want to fail fast and let SNS retry (PR 1567). We also tuned our alerting to tell us things are bad before our users do.
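As I understand it, the retry change maps onto botocore’s client config roughly like this; the timeout values are made-up placeholders, not the numbers from PR 1567:

```python
from botocore.config import Config

# Hedged sketch of the S3 client settings described above.
# Parameter names follow botocore.config.Config; values are assumptions.
s3_config = Config(
    connect_timeout=3,        # placeholder; tune for the deployment
    read_timeout=10,          # bound how long a request can hang on S3
    retries={
        "mode": "standard",   # replaces the default "legacy" retry mode
        "max_attempts": 1,    # no client-side retries: fail fast,
                              # and let SNS drive the retry loop
    },
)
# client = boto3.client("s3", config=s3_config)
```

Bounding the read timeout matters as much as the retry count here, since a hung S3 read is what tied up the inbound connections.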

There are some long-term issues as well. I’m guessing at the cause of the incident, based on timing data for the /emails/sns-inbound endpoint, the S3 client configuration, and some previous experience. However, we don’t have detailed metrics on external API calls, much less trends leading up to the incident, to say for sure what happened. We could use some better telemetry and automated reporting. We’re doing a lot of work in an endpoint, and instead we could capture the data and process it in a different job, but that could just move the issue to somewhere less visible. I prefer to add metrics, measure, and then adjust the infrastructure, but I think implementing more scalable infrastructure first is valid as well.
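As a sketch of the kind of telemetry that would answer this, a small timing wrapper around external calls. The metric name and the statsd hookup are assumptions; a real statsd client would replace the list:

```python
import time
from contextlib import contextmanager

timings = []  # stand-in for a statsd client


@contextmanager
def timed(metric):
    """Record how long an external call took, statsd-style."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        # In production this would be statsd.timing(metric, elapsed_ms).
        timings.append((metric, elapsed_ms))


# Wrap each external API call, so trends are visible before an incident.
with timed("s3.get_object"):
    time.sleep(0.01)  # placeholder for the real S3 fetch

metric, ms = timings[0]
assert metric == "s3.get_object" and ms > 0.0
```

With per-call timings in the metrics store, a claim like “S3 reads were slower than usual” becomes a graph instead of a guess.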

“Leaky Bore, Montecollina” by Photoma*, CC BY-NC-ND 2.0

This is pretty typical for a project that glues together third-party services. It is hard to tell what data you’ll get and how fast it will come. Some issues don’t appear until you’re at production load, or at 2x growth, or 10x growth. I’m pretty happy with the scope of this incident, and that the stream is still a trickle and not a flood.

I had Monday off, but the kids had a whole week without school. We also had a winter storm, starting with hail that resembled Sonic ice cubes, and would have shut down school if it was in session. Everything around me was screaming “Winter Break!”, and it was cold in my basement office. I didn’t take time off, because it was a short work week anyway, but made sure to spend some evenings with the kids and in front of the fireplace.

The dogs didn’t get a lot of walks due to the winter storm. My wife and I took them to a local park, to train the puppy and get the older dogs some exercise. The ground alternated between mud and ice, and the dogs were ungovernable. Very little training was done, Finn cracked the ice and swam a bit, and many squirrels were threatened. February is (hopefully!) the end of winter weather, and very much overstays its welcome.

Some recommendations:

  • Don’t commit to doing something weekly, unless you really enjoy it. Days late with this one!
  • I’m enjoying Babylon 5 on HBO Max. I watched it back in the 90’s, but missed the last seasons in college. I think they remastered the CGI, but I don’t watch it for the effects. The characters are excellent, and they slowly tease out the “long plot” over several episodes. The aliens steal the show – Peter Jurasik’s Londo, Andreas Katsula’s G’Kar, and Bill Mumy’s Lennier are fun characters, and well acted. The action is slow enough that I can fold laundry and still catch what’s happening.

One More Week: First weeks on Firefox Relay

February 22, 2022

I joined the Firefox Relay branch of the Privacy and Security team on February 9th. It was quite a shock to join a new team in the middle of a sprint, one with two to three times as many people working on it as my previous team. It has been nearly two work weeks since I started, and I’m starting to find my place.

Finn the dog looks up from the floor in the back seat of a car
Finn found his own spot for our car trip. Not shown: two other dogs and four humans.

Relay allows users to create an email alias that they can submit to online services instead of their regular email. Emails sent by the service are forwarded to their “real” email, and they can reply to these emails but not send new ones, which cuts off a bunch of possible abuses. Users can create five aliases for free, and can pay for more aliases and for additional features such as a custom domain. If they no longer want email from a service, they can turn off that alias and stop getting email. In addition, the service can’t use their primary email to figure out who they are on other services, or map relationships to other people via email.

Mozilla has a few initiatives where we’re trying to get users to pay for a service directly. The Mozilla VPN is developed by the Privacy and Security team as well, and Pocket is still going strong. The team is growing, with many new hires, and I’m one of the people that have been at Mozilla the longest. There’s also a start-up mentality about announcing new features as part of releases, rather than as they are in development. This is a switch from the “develop in the open” strategy of Firefox and other Mozilla products, and I’m going to err on the side of silence about my feature work as I figure it out. The code remains open source, so the truly curious can see features a few days in advance.

I started preparing to join the new team back in December. I created a team document from our company phonebook, with everyone’s name, profile picture, location, and other information. If they had a GitHub profile, I followed them. I continued working on this after joining the team, adding what they looked like in Zoom as well as their “Video off” images, collecting facts and information, and starting an “interview” list of things I want to know (and can ask, since I’m new). It felt creepy at first, but I believe it has sped up the process of putting names to faces, and has helped me navigate meetings with a dozen or more participants.

I am still exhausted at the end of most days. It will take time for me to build my social muscles up again, and be used to multiple meetings a day with several people.

I read through the code back in January, and haven’t repeated that this month. I’m not thrilled with the development process, which requires setting up external services to test everything. At the moment, the application is some light Django plus a bunch of glue between several cloud services, and you need the configured cloud services to do much of anything. Also, the application requires incoming email, which gets forwarded to another email address. Email is hard to fake, but email addresses are easy to acquire, so most testing is integration testing. Difficult development comes with the territory.

I’m worried bugs are slipping through the cracks, since most weirdness will only happen in production, and it can be challenging to reproduce bugs locally, even when you have details of the data. I’ve tackled issues like this before, but it is worth surveying the current development environment to see if there are better tools and techniques for working with code that relies heavily on external services. I also think we’ll need some intense logging and event processing, to find those sharp corners of the internet as we hit them. I’m forming a presentation framework in my head, which means I should start prototyping soon rather than get stuck polishing my pitch.

An orchid with 4 or 5 blooms, the bottom one fully open
The third bloom for this orchid

Outside of work, the weather has been predictably unpredictable for February. Some days are crisp and pleasant, others like today are below freezing and windy. The kids had a three-day winter weather break, which doubled as a Covid-19 break, and they are now on a “President’s Week” break. The weather cooperated on Sunday, when we drove an hour to Greenleaf State Park with the dogs and took a long hike near the lake. My wife twisted her ankle a little due to Asha pulling, and my teen got a bit worn down, but otherwise it went very well. We stopped at Donna’s Malt Shop in Bragg, OK, so none of us lost any weight.

Some recommendations:

  • Piranesi by Susanna Clarke – This was a pleasant, atmospheric book, with a narrator that figures things out a little slower than the reader, or even comes to the opposite conclusions. It reminded me of The Slow Regard of Silent Things by Patrick Rothfuss.
  • Maintain an orchid – We picked out an orchid for our anniversary a few years ago, and I got a second for an occasion I can’t remember. I’ve thought of orchids as tender hothouse flowers, and learned they instead are hardy as weeds, and will re-bloom if given the occasional drink and time. Our anniversary orchid is on the third bloom, and will be due for a re-potting in a couple of months.
  • sparkmeter/sentry2csv – It exports Sentry issues to a CSV file, which I can then upload to Google sheets for collaborating and sorting. It worked as advertised, even on Mozilla’s self-hosted, sometimes-buggy Sentry instance.
  • Working Effectively with Legacy Code by Michael Feathers – The VPN team is looking at standards for code coverage, a favorite topic of mine. I got to recommend this book, and was thrilled by the table of contents again:
    • Chapter 13: I Need to Make a Change, but I Don’t Know What Tests to Write
    • Chapter 14: Dependencies on Libraries Are Killing Me
    • Chapter 15: My Application is All API Calls
    • Chapter 16: I Don’t Understand the Code Well Enough to Change It
    • Chapter 17: My Application Has No Structure
    • Chapter 18: My Test Code Is in the Way
    • … and it just goes on. It really addresses the stress of testing and development.
  • K Lars Lohn (Two Braids) has posted a new maze, The Ant in the Sunflower, and is starting a Patreon to support his work.

One More Week: Fixing a 7-Year-Old Bug

February 13, 2022

This was my last few days with the System Engineering team, and the first few with the Privacy and Security team. I barely finished the Node 16 update before switching, and was already a bit tired before being overwhelmed by new faces and a ton of documentation.

Our rat terrier Link, attacking his brother under the couch, August 2021

I spent last week working in two neglected libraries, taskcluster/docker-exec-websocket-client and taskcluster/docker-exec-websocket-server, getting them in CI, updating to Node v16, and eventually finding a setting to fix an issue. It felt good to find that bug before my weekend!

The code review of PR 35 and the follow-ups (in PR 36 and 37) took a lot of the day, followed by releasing the 3.0.0 versions of each library. It was late Monday before I bumped the libraries in the original Taskcluster PR 5095. When the updated code ran in CI, I was devastated – the test was still broken. Well, broken in a different way. Previously it timed out; now it completed, but said the input (a random megabyte of data) didn’t match the output (that data piped through cat in the container). I now had one day left to fix it.

I spent a bit of time Tuesday trying to run it locally, but that had been a dead-end a few weeks ago as well. I tried just the library updates with the existing Node v14 version, and the test failed in exactly the same way! I narrowed the CI jobs to just this test, and added some debugging, and then noticed that the test code was:

assert(data.compare(buf), 'buffer mismatch');

A .compare() function shows up in a lot of languages (I first encountered it as strcmp in C), and is often used when sorting. It returns -1 if the first item “sorts” earlier, 1 if the second item sorts earlier, and 0 if they are equal. With the library updates, the buffers were now equal, .compare() was returning 0, and the assert was failing on a falsy value. Reading the code, the author of this test meant to use .equals(), which returns true if equal or false if not. I’ve made a similar mistake, and seen others make it as well. I appreciate that Python 3 dropped the cmp operator and the __cmp__ magic method, eliminating a lot of developer confusion.

Digging in a little, this change was made 7 years ago, around Node 0.10.x or 0.12.x, and close to when .compare() and .equals() were added. Before that, a manual byte-by-byte method was used to compare the buffers, which was probably correct. So, this code has been broken for 7 years, and it took Node v16 to break it in a new way and get some attention.
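The same trap is easy to reproduce in Python with a C-style comparator. This toy example is mine, not the Taskcluster code:

```python
def compare(a, b):
    """C-style comparator: negative, zero, or positive."""
    return (a > b) - (a < b)


data = b"expected output"

# The buggy pattern: asserting on a comparator result.
# Unequal buffers return -1 or 1 (truthy), so this "passes" on a MISMATCH.
assert compare(data, b"something else")

# Equal buffers return 0 (falsy), so `assert compare(data, buf)` fails
# exactly when the data is right.
assert compare(data, b"expected output") == 0

# What the test meant to check all along: equality.
assert data == b"expected output"
```

For seven years the broken timeout masked the inverted assertion; once the data actually matched, the falsy 0 finally surfaced it.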

Luckily, this just took a day to figure out, and we were able to merge Node v16 on my last day. I then transitioned to the Privacy and Security team, to work on Firefox Relay. It is a very different project than the ones I’ve been working on for the last three years, but I think tests and testability will continue to be a focus on the new team. I’ll write more about the new team next week.

I’m a bit tired to make my own recommendations, so here’s some stuff my 11-year-old has been into:

  • Root Beer Milk – I saw this at the store and got it on a whim, and the kid likes it and asked for more. It seemed like a bizarre combo at first, but it is basically a blended root beer float.
  • Roblox – He and his friends continue to play this weekly. It feels a bit like the Flash game scene of a decade or two ago, where game developers can experiment with game types and incentives, or copy ideas from other games, TV shows, or movies. There are a lot of 3D games, but I’ve seen him play a game that looks a lot like Populous 2, and another like a Civilization clone. It’s making me interested in RBLX.
  • Minecraft – v1.18 includes the second part of the Caves and Cliffs update. The bedrock layer dropped from Y=0 to Y=-64, adding huge caves where it is easier to stumble on diamonds, or get attacked from a distance. There’s also more mountain biomes and structural variety. He’s bought a realm, and is tweaking server settings again.

One More Week: Terrier Programming

February 5, 2022

I spent most of my work week chasing down a bug. I worked late, took short meals, and was generally absorbed in the problem, even when I was in an unrelated meeting or “not at work”. It’s a kind of flow where I’m making progress, and it feels like I’m minutes away from a breakthrough, for hours and days. It’s like I’m a terrier that has the scent of vermin and is sure it is around the next corner.

Our rat terrier Link, watching for rodents at a state park in 2020.

The bug was part of the Taskcluster update from Node 14 to 16, started in PR 5095. Everything passed but a single test. The other parts were easy – a test that changed due to Node 15, and taking the next step suggested by a well-commented TODO. Surely this would be similar, maybe a morning of work. However, the string just got longer the more I pulled.

The details are a bit boring (feel free to skim or skip!), and similar to past efforts when I’ve updated platforms:

  • On Monday, I narrowed CI to just test the one broken test, in a failed effort to simplify debugging. I did notice the broken test was the only one using a particular class. This led me to another Taskcluster repository, taskcluster/docker-exec-websocket-client.
  • On Tuesday, I followed that repo to another, taskcluster/docker-exec-websocket-server, which seemed to be the source of the issue. It had a similar test that worked on v14 and failed on v16. I spent the day trying and failing to get Taskcluster’s Community TC deployment to run the tests, so I could demonstrate the fix when I fixed it. This was also the day Taskcluster switched to Dependabot, and by the end of the day Community TC was too busy building other updates to build my next attempt at a CI configuration.
  • On Wednesday, I switched to a different worker type, and got CI running by the afternoon. I showed that minor dependency updates did not fix the issue. I also narrowed the issue to Node v16.0.0, the first v16 release.
  • On Thursday, I tried major dependency updates, in both the client and server libraries. This required reading some release notes and careful updates that slowed progress. Something went very wrong with the last update, requiring a reboot. I called it a night with an OS update.
  • On Friday, I updated the last dependency, but still the bug persisted. This meant the issue was probably in our code and how it interacted with v16. I got into a cycle of adding debug statements, re-running tests, comparing the output under v14 and v16, and starting over. While tests ran, I read copious release notes for v16, looking for clues. After lunch, I gave myself a 4 PM deadline, to give me a chance to catch up on email. At 3:30 PM, I noticed a side-note in the dockerode README, that a “hijacking” feature in Docker Engine v1.22 (February 2016!) allows stdin, stdout, and stderr on the same websocket. I tried the setting, and it fixed the error! I spent the next two hours removing all my debug statements and packaging it up for PR 35.

There’s more to do Monday, but it is just details and sequencing code merges.

This is typical of bug work. When I started last week, I thought it would take most of an afternoon to update to Node v16. Instead of 4 hours, it is going to be closer to 80 hours. If I’d known, I might have spent my last weeks on the SysEng team doing something else. At the same time, it will be a real benefit to be on Node v16, and to have this bug fixed (it was also broken in Node v14 and earlier, just in a different way that wasn’t tested). If I’d stopped at 3 PM, 30 minutes from a solution, there is a real chance the next developer would have needed to spend two weeks to get to the same point, or the team may have decided to accept a known bug. And maybe that would have been the right choice, but I’m glad to have the win!

Finn catches a snowball while Asha watches

I’m too familiar with terrier programming mode. My focus was on chasing down the bug, and I was confident I could, but I started to neglect other things. Kourosh Dini’s Weekly Wind Down newsletter was on this topic, which he calls the Dark Side of Flow. It is seductive to get focused on a single task to the exclusion of others, justifying it with a deadline, but this can lead to hyper-focus on urgent tasks followed by extended procrastination for the non-urgent tasks of life and work, and a feeling of despair and being out of control.

It was not as bad as last year’s CTMS projects, where I spent day after day plowing through the task backlog without pause to make the deadlines. This time, I read my email, most days. I stopped around 6, most days. I did some laundry, did some dishes, made a few meals, played some games, and watched some TV. We had a February snow that cancelled school for three days, and I took time on Thursday to go sledding with the kid, then shovel the driveway. I got behind on email, my OmniFocus projects, and this blog, but I caught up today (Saturday).

It is fun to be in terrier programming, and fun to catch the vermin and kill it. It is important to know the goal before I start, and to agree that it is worth the unknown time it will take. Even in the middle of a multi-day flow, there’s still time to keep up with the important stuff.

Recommendations and links:

  • Kourosh Dini writes about productivity as a way to live the good life, and I appreciate his tips and insights. Sign up for his newsletter for a taste.
  • why-is-node-running is a useful package to figure out why Node is waiting, which I found in a Stack Overflow comment. We were using mocha --exit, but this helped me quickly figure out that we needed to shut down some more resources.
  • I’m a fan of integration-level tests as documentation. A test should verify that a feature works, and maybe tell why that feature is important. One hard part of this bug was that it was clear in retrospect that one aspect (the exit code) was working, but not another (the output of the command). That test should have broken years ago, and someone should have been able to read the test to determine what feature just broke. A test should be more useful than a Check Engine light.
  • Splendor is a good game for 2 to 4 players. It is easy enough to learn, the coins have a weight like good poker chips, and a game finishes quickly. By the time someone reaches 15 points, everyone feels like they were two turns from winning. The kids enjoy it, but it still takes some persuasion to get them to play.
  • Billy Preston is a bright presence in the Beatles’ Get Back. One great moment in Part 3 involves him getting fascinated by Lennon’s new Stylophone. My kid got his Stylophone recreation last year, and enjoyed playing it for longer than almost any other instrument.

One More Week: Update all the things!

January 28, 2022

I’ve re-affirmed my decision to go to the new team. Now that I have a deadline, I’m back in the game, trying to figure out the best way to spend my remaining days in Systems Engineering.

Saint Anthony Tormented by Demons – Martin Schongauer, ca 1470-75

I’ve been in more meetings this week than all year! I was pulled back into the SysEng team Slack channel, the weekly staff meeting, and two project reviews. I missed my new favorite meeting, the Taskcluster Community meeting led by Pete Moore. I thought I was more emotionally level in 2022 than last year. It turns out it may just be that I had fewer Zoom meetings.

We recently released Taskcluster v44.4.0, which dropped Python 2.7 support. This revealed some lingering Python 2 code, updated in PR 5071. Matt Boris has been churning through Dependabot updates, and it is good to see those getting merged. I was inspired to start on the version bumps for Go (1.16.7 to 1.17.6) and Node (14.17.15 to LTS 16.13.2). My first attempt, PR 5084, was sloppier than I liked, so I broke out the Go upgrade in PR 5088. The Node update in PR 5095 is also a bit sloppy, requiring an update of the CI worker from Ubuntu 14.04 to 20.04 – a six-year jump! All the tests pass but one, so that will slip to next week.

For the monitoring project, I’ve been warned away from both Pulse / RabbitMQ as a transport, as well as creating a new Monitor service. I’m looking into ways to build it with existing tools, and hook into libraries like Taskcluster’s Monitor. Alan Alexander still likes Prometheus, which we used on the CTMS project on an API server and a backend processor, and thinks the InfluxDB TICK stack would also work. Christina Harlow confirmed we can get data out of InfluxDB with a token, so it isn’t a one-way transaction. I confirmed that the service APIs are not proxied, but instead available via the Kubernetes Ingress configuration. The Kubernetes configuration answered several questions for me, including how the services are run, and the existence of a handful of CronJob tasks that could also use monitoring.

I do not yet have a plan for how to implement monitoring, at least one that I can hand off to other people. I am still in discovery mode. One question I had after the meeting was whether we have enough monitoring for the major task already – the slow provisioning of Azure workers – and if we should tackle that directly. Next week I plan to pull together my scattered notes on that issue and plan some first steps.

We went to my mother-in-law’s house on Sunday. She is so excited for visitors, she agreed to play Wingspan, which took about 3 hours (the kid won), followed by a speed round of Hearts. I went from no points to catching the black Queen twice and being the big loser.

Recommendations and links:

  • Wordle is a daily word game like Mastermind, and just requires a browser. The author made it easy to share your daily results without spoilers, making it a viral hit on Twitter, Slack, and our family iMessage threads. It really annoys some people, which is also fun, and makes for a new meme format.
  • Alan wants to read Implementing Service Level Objectives by Alex Hidalgo, and now I do too.
  • Wingspan is a fun engine-building board game. It takes so long to get through the first game, but then we just wanted to play again. I can’t say that I’ve figured out a winning strategy – I never know who won until the points are added at the end.
  • The Story of King Solomon and Ashmedai is quite a story, and makes me think about the thin line between what makes it into the religious canon and what doesn’t.

One More Week: Designing the next Taskcluster monitoring system

January 22, 2022

I’ve got a date for the team transition: Wednesday February 9th, 4½ weeks after my original date. I was also told in confidence the reason, and given a chance to continue on my current team. I’m still leaning toward the new team, but I’m taking the weekend to think about it.

Our dogs Asha, Link, and Finn enjoy a light snow

Monday was Martin Luther King Jr. Day and a Mozilla holiday, so it was a short work week. I spent a little time helping with some tests in PRs on the Relay project. The tests are not as mature as I’d like, but things are improving in January, such as PR 1493, Domain recipient tests and coverage, by groovecoder. I look forward to talking to the current team and determining what code quality issues are slowing down progress.

The Taskcluster team is focusing on reliability and observability this quarter. I put together a summary of Taskcluster, what monitoring is available, and a suggestion for a monitoring strategy. Most of the summary was based on the Reference section of the Taskcluster docs, which describes the 3 platform services and the 7 core services, as well as the API, log messages produced, and Pulse / AMQP messages emitted and consumed. Existing monitoring involves an (unused) New Relic hook, watching log messages in structured logs, and smoke tests.

The docs do not cover the breakdown of 10 services into 20 or more micro-services – usually a “front-end” HTTP API and one or more backend processing tasks running in a loop. In my experience, the API services are reliable and well-tested, while the backend processing tasks cause problems and are difficult to (automatically) monitor and debug. My belief may be influenced by the worker-manager scanner, which syncs status with the cloud providers and is prone to slow execution cycles (see bug 1736329, gecko-3/decision workers not taking tasks, for analysis via logs, and bug 1735411, windows 2004 test worker backlog (gecko-t/win10-64-2004), for an analysis of one incident and a historical code review). The provisioner, which provisions new on-demand servers, has its own issues, such as a work strike due to bad recovery from cloud provider issues.

My recommendation is to use the Pulse message bus to periodically broadcast status of backend services, and add a monitoring service that listens to these messages, gathers status from the micro-services with a Web API, and aggregates the data into a unified JSON status report. This can be consumed by Telescope, which can set limits and alert thresholds. The feedback has been that AMQP can be hard to debug and reason about, that there can be issues with message volume (bug 1717128, Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster, may be a data point), and that a new core monitoring service may be overkill. Some counter-proposals have been per-service health endpoints, or using an existing monitoring system like statsd or Prometheus.
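To make the aggregation step concrete, here is a minimal sketch of what a unified JSON status report could look like. The service names, field names, and report shape are all invented for illustration – this is not the actual Taskcluster monitor design:

```python
import json

def aggregate(service_statuses):
    """Collapse per-service health reports into one unified JSON report.

    `service_statuses` maps service name -> {"healthy": bool, ...details};
    this shape is an assumption. The overall status is "ok" only if every
    service reports healthy.
    """
    overall = all(s.get("healthy", False) for s in service_statuses.values())
    return json.dumps(
        {"status": "ok" if overall else "degraded", "services": service_statuses},
        sort_keys=True,
    )

# Example with made-up statuses gathered from Pulse messages or Web APIs:
report = aggregate({
    "worker-manager": {"healthy": True, "lastLoop": "2022-01-22T10:00:00Z"},
    "queue": {"healthy": False, "error": "slow provisioning"},
})
```

A consumer like Telescope could then poll a single endpoint serving this document and apply its own limits and alert thresholds.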

Next week we’ll refine the recommendations and discuss them some more. Once we reach consensus next week or the following, I plan to summarize them as a public Taskcluster RFC. If the team move goes as planned, that may be where my active involvement ends. I’m optimistic that the team (including new Berlin hire Yaraslau Kurmyza) can move Taskcluster to the next level.

We use community-tc to build Taskcluster, and our usage of Taskcluster helps align us with the users of FirefoxCI. Pete Moore recently released v44.4.0, which took a while due to intermittent errors in the automated release process. These failed releases broke one of the CI build steps (issue 5067) due to missing images, and then when the new code fully shipped, the build broke in a different way due to lingering Python 2.7 code (issue 5070). Matt Boris was also tripped by an intermittent error on PR 5069, Fix styling of label on GitHub Quickstart Page, and Taskcluster doesn’t fully support re-runs (issue 4950). Testing PRs and releasing new versions is far too fragile, but is not the focus of the reliability effort in Q1. Hopefully the team will find some time to improve these aspects as well.

The three-day weekend gave me time to finish Learned Optimism by Martin Seligman. My sister recommended this in summer 2021, and it took me until November to start it. I got ⅔ of the way through, and then started over from chapter 1, taking notes as I read – 13,000 words’ worth. I found it interesting and helpful, and I’m trying the flexible optimism exercises to alter my thought patterns. My review on Goodreads has more details.

Asha’s Graduation Ceremony

The kids had an even shorter 3 day week. The COVID numbers dropped (0.82% of staff and kids, down from 1.28% last week), but we won’t know until next week if this is a true downward trend. We’re hearing daily stories of COVID positive tests – a neighbor on Wednesday, my brother-in-law on Friday, and plenty through the work networks. A COVID scare cancelled our training class last week, but our 24-week-old, 70lb puppy graduated on Friday.

Recommendations and links:

  • I’ve been using the comment
    <!-- vim:set wrap linebreak breakat&vim: -->
    in my Markdown notes, as suggested by this stackoverflow answer. It has made editing long lines more pleasant than regular wrapping.
  • OmSave for Safari has been a big improvement over OmniFocus’s default “Sent To” method, with the ability to set a project and tag, place the URL in the note, and even add some of the content.
  • We’re working through Season 3 of Buffy the Vampire Slayer, and it is still good fun. Our Senior finds some of the Senior year discussions too stressful to binge it. The creator Joss Whedon has gone from nerd hero to a problematic figure in the last few years, and the Vulture feature The Undoing of Joss Whedon goes into the details.
  • I found that Whedon profile link through Matt Enloe’s “What’s Good” newsletter. Newsletters in my email are the new RSS feed, and OmniFocus my new “Read Later” service.

One More Week: gpg key refresh

January 14, 2022

I’m going to blog about what happened in a week, instead of only blogging when I think I have something significant and permanent to say and I’m in the mood to blog.

I’m still moving teams within Mozilla, but the transition is now pending without a date. I’m trying to make the best of this in-between time.

The transition continues to be slow.
“Snail” by jamieanne, CC BY-ND 2.0

At Mozilla, we use GPG for signing commits. We sometimes use it for encrypting secrets at rest, but the current favorite secret sharing tools are 1Password and magic-wormhole. When I was setting up my keys, I was convinced that an expiring key was a good idea, because if I lose access to the private key and can’t revoke it, at least it will expire eventually. However, this means that I need to refresh the expiration date. The article How to change the expiration date of a GPG key by George Notaras was published in 2010, but GPG doesn’t change much, so it is still relevant.

Signed commits with “Verified” tag

I schedule expiration for February, and set myself a reminder to refresh in January, along with the steps I take to do it. I publish keys 9ECA 5960 3107 8B1E and 082C 735D 154F B750 to keys.openpgp.org, since gpg.mozilla.org was taken down after a June 2019 attack. I also sync them to Keybase and recreate them in my GitHub account.

I cleaned up the MLS / Ichnaea documentation. PR 1764 includes a refresh of the deployment docs, which is a couple of years overdue. I also updated the Location entry on the wiki and some other internal documents. This is the end of my “quick” list of MLS updates, so I’m moving on to Taskcluster: fixing some build issues, reviewing and merging dependency updates, and thinking about how to monitor deployment health. I got a little stuck with building the generic-worker on my M1 MacBook, and a failing CI test, but both are surmountable.

For my next team, I read through the fx-private-relay codebase. I found this tip for getting the list of files tracked in a repo:

git ls-tree --full-tree --name-only -r HEAD

I then manipulated the output to turn it into a list of Markdown links to the GitHub repository, and checked off each file as I viewed it in SourceTree, or on GitHub if it was more than 50 lines of code. Most of the action is in the emails app, and a lot of that in emails/models.py and emails/views.py. There aren’t as many tests as I would expect, and some integration tests may cover a bunch of functionality.
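The manipulation step could be sketched like this. The repository URL, branch name, and helper name are assumptions for illustration – the real transformation may well have been a one-off in an editor or with sed:

```python
# Hypothetical helper: turn `git ls-tree` output (one path per line) into a
# Markdown checklist of links into a GitHub repository.
REPO_URL = "https://github.com/mozilla/fx-private-relay/blob/main"  # assumed

def to_markdown_checklist(ls_tree_output):
    paths = ls_tree_output.strip().splitlines()
    return "\n".join(f"- [ ] [{p}]({REPO_URL}/{p})" for p in paths)

print(to_markdown_checklist("emails/models.py\nemails/views.py"))
```

Pasting the result into a Markdown notes file gives a clickable checklist that renders with checkboxes on GitHub.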

In my non-work life, the schools are struggling with Covid-19. Isaac was in remote learning one day, and got two notices of exposure, meaning he was in class with a Covid-positive person. Ainsley has two remote learning days. I’m so glad they are both vaccinated, and wear their masks. I’m so disappointed with Oklahoma’s leadership.

I got a promotion for a local pizza place’s 17th anniversary special, and got it for delivery. It was delivered almost 5 hours after ordering. It was a bad combination of a fantastic promotion, an online ordering system that didn’t turn itself off, and using a gig economy service for delivery. I don’t want to shame anyone, so I’m avoiding names. I went from a hero for ordering pizza, to an adequate dad for making spaghetti.

Finally, my grandmother turned 93 this week! I’m so grateful for her, and that she and my grandfather have stayed safe through this pandemic.

Recommendations and links:

  • magic-wormhole is magic for transferring files securely. If you need to resend, change the filename first to avoid conflicts with the original file.
  • I’m taking notes in Markdown, and started using Marked 2 again to view them. The plugin itspriddle/vim-marked adds a :MarkedOpen command, that opens a rendered version and updates when saved.
  • I printed a few photos and a board-backed poster with Walgreens Photo. The photo webapp is OK, but had a couple of hiccups. If you can, crop to the desired size in a desktop photo app first. I got them printed the same day, and the results were decent and competitive with printing myself.
  • You can support the authors of some of the best MDN Web Docs content through Open Collective.

One More Week: Slow Transitions, Multi-Platform Image Creation on CircleCI

January 7, 2022

I’m going to blog about what happened in a week, instead of only blogging when I think I have something significant and permanent to say and I’m in the mood to blog.

I’m moving teams within Mozilla. This was supposed to be my first week on the Privacy and Security team, working on Firefox Relay, an email aliasing service. I spent some time talking with Se Yeon Kim, my on-boarding buddy, about the team, the Slack channels, and the other side projects we continue to maintain (she has shavar, I have ichnaea). The transfer process between teams is slow, so I have unexpected bonus time to work on old projects for the System Engineering team.

turtle
The mascot for team transitions
“turtle” by Jazminator

For Relay, I got the development environment working, mostly following the README. The tricky bit was getting the requirements installed with Python 3.7.12 via pyenv. I was getting errors for cryptography ('openssl/opensslv.h' file not found) and psycopg2 (ld: library not found for -lssl). The error message led to the solution from Building cryptography on macOS:

LDFLAGS="-L$(brew --prefix openssl@1.1)/lib" CFLAGS="-I$(brew --prefix openssl@1.1)/include" pip install -r requirements.txt

I didn’t contribute a change because I feel this is standard for developing on the native system versus a Docker container – you need to figure out stuff like this. However, I did notice a warning, which led to re-opening issue #1201, django.urls.path warning when calling runserver function, and PR #1447, Fix swagger API schema endpoint. I reported a further warning to axnsan12/drf-yasg as issue #765, ruamel.yaml 0.17.0 has deprecated dump(), load().

For my SysEng work, I’m tackling small stuff that I can work to completion, but nothing that risks breaking production when I’m gone. I have plenty of these things – too small to make it to quarterly planning, but too large to work on in the gaps between other projects.

I spent some time on generating multi-platform docker images for Ichnaea, following the guide Building Docker images for multiple operating system architectures. This required moving from the docker executor to a machine executor, to enable buildx and to install different engines for QEMU emulation. This was the easy part, and the guide was very useful. However, building just linux/amd64 and linux/arm64 took 35 minutes, versus 5 minutes for linux/amd64 in a docker executor. I decided the slower builds were not worth it, since I’ve worked out the kinks in M1 MacBook development, and we’re not planning to ship on ARM. I closed the experiment in PR 1762, made some config changes in PR 1763, and added notes to issue 1636. If I try again, I think I will build the images on native machines (i.e., ARM images on ARM machines), and combine the Docker manifests in a final step.

There’s persistent slowness in provisioning Azure workers, and I’ve looked into it between other projects. I used SourceTree to walk through the history of the Taskcluster Azure provider, and added my notes to bug 1735411. It looks like the incremental provisioning and de-provisioning of Azure workers is required, and has been built up over months. The problem is each increment requires a worker scanner loop, and these loops can be as long as 90 minutes, meaning it takes hours to get a worker up and running. The next step is to speed up the worker scanner loop, ideally a minute or less, so that each of those iterations is shorter. That could be optimization, or it could be time-limiting the work done in each loop. It will hopefully be interesting work for the next person.
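One of those options – time-limiting the work done in each loop – might look something like this generic sketch. This is not Taskcluster’s actual scanner code; the function and parameter names are invented:

```python
import time

def scan_with_budget(items, handle, budget_seconds=60.0):
    """Process items until the per-iteration time budget runs out.

    Returns the unprocessed items, which would be carried over to the next
    scanner iteration, keeping every loop short and predictable instead of
    letting one pass stretch to 90 minutes.
    """
    deadline = time.monotonic() + budget_seconds
    remaining = list(items)
    while remaining and time.monotonic() < deadline:
        handle(remaining.pop(0))
    return remaining
```

The trade-off is that a backlog takes several iterations to drain, but each worker’s status gets refreshed on a bounded schedule.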

Ichnaea dependency updates are batched at the start of the month, and I was able to get through them quickly with pmac (PR #1759 and PR #1761). None of the updates were significant. We’re still blocked on the SQLAlchemy update (unknown work to update database code) and Python 3.10 (packaging issue with a dependency).

In my non-work life, the kids were back in school Tuesday for a 4-day week. Masks are now recommended again, but we’re not as bad as NYC (yet). Both kids missed some time for health reasons, but not Covid. Asha, our Great Dane puppy, is up to 60 lbs, and will be back in training class tonight after a winter break.

Asha and Finn hanging out on the couch

Random recommendations from me and others:

  • The family has gotten into Project Zomboid, a top-down zombie survival game where you always eventually die, but you feel like you’re getting better.
  • I continue to play Castlevania: Grimoire of Souls, a free-to-play-but-pay-to-win game that was ported to Apple Arcade, lost all the pay-to-win features, but continues to have an addictive upgrade cycle. It takes 5-10 minutes to play a level, but I find it easy to repeat for hours.
  • I’m slowly watching Get Back, the epic Beatles documentary. It’s 8 hours total, but each day is 15-30 minutes, so I’m watching a day at a time. I like the Beatles, but I am also fascinated by the team dynamics between the band members, and with those that support them and rely on them for their job.
  • My client-side Sentry event filtering code in Socorro silently broke. Will KG fixed it, and launched a new project kent as a fake Sentry server for testing integrations.
  • Will KG recommends CPython Internals, still likes Introduction to Algorithms, but thinks the Dragon book is showing its age.
  • Pete Moore recommends tup as a faster alternative to make, which is capable of building Firefox.

Faster virtualenv workflows with mkvirtualenvhere and workonthis

May 25, 2016

I use virtualenvwrapper to create Python virtual environments. I have a couple of functions in my ~/.bash_profile that make it even easier to use:

if [ -f $(brew --prefix)/bin/virtualenvwrapper.sh ]; then
    . $(brew --prefix)/bin/virtualenvwrapper.sh;
    function mkvirtualenvhere() { mkvirtualenv "$@" "${PWD##*/}" ; setvirtualenvproject; }
    function workonthis() { workon "${PWD##*/}" ; }
fi

This is for OS X and homebrew. You’ll need something slightly different for your flavor of Linux.

mkvirtualenvhere creates a new virtualenv with the same name as the current directory, and I use it like this:

$ cd ~/src
$ git clone https://github.com/jwhitlock/drf-cached-instances.git
$ cd drf-cached-instances
$ mkvirtualenvhere
$ pip install -r requirements.txt

The virtualenv is called “drf-cached-instances”, and I can start work two ways. The standard way is:

$ workon drf-cached-instances

This will activate the virtualenv and change the current directory to ~/src/drf-cached-instances, because of the setvirtualenvproject call.

The second way is:

$ cd ~/src/drf-cached-instances
$ workonthis

This activates the virtualenv that shares the name of the current directory. Often, I think I am just looking at a project, using ack to explore the code, but then switch to running the project so I can experiment in an interactive prompt.

Another tool to help this workflow is GitHub’s hub, which can be faster than git for working with GitHub-hosted projects. I often use it for code review work, but forget to use it when cloning a new project.