Archive for January, 2022

One More Week: Update all the things!

January 28, 2022

I’ve re-affirmed my decision to go to the new team. Now that I have a deadline, I’m back in the game, trying to figure out the best way to spend my remaining days in Systems Engineering.

Saint Anthony Tormented by Demons – Martin Schongauer, ca 1470-75

I’ve been in more meetings this week than all year! I was pulled back into the SysEng team Slack channel, the weekly staff meeting, and two project reviews. I missed my new favorite meeting, the Taskcluster Community meeting led by Pete Moore. I thought I was more emotionally level in 2022 than last year. It turns out it may be that I just had fewer Zoom meetings.

We recently released Taskcluster v44.4.0, which dropped Python 2.7 support. This revealed some lingering Python 2 code, updated in PR 5071. Matt Boris has been churning through Dependabot updates, and it is good to see those getting merged. I was inspired to start on the version bumps for Go (1.16.7 to 1.17.6) and Node (14.17.15 to LTS 16.13.2). My first attempt, PR 5084, was sloppier than I liked, so I broke out the Go upgrade in PR 5088. The Node update in PR 5095 is also a bit sloppy, requiring updating a CI worker from Ubuntu 14.04 to 20.04 – a six-year jump! All the tests pass but one, so that will slip until next week.

For the monitoring project, I’ve been warned away from both using Pulse / RabbitMQ as a transport and creating a new Monitor service. I’m looking into ways to build it with existing tools, and hook into libraries like Taskcluster’s Monitor. Alan Alexander still likes Prometheus, which we used on the CTMS project on an API server and a backend processor, and thinks the InfluxDB TICK stack would also work. Christina Harlow confirmed we can get data out of InfluxDB with a token, so it isn’t a one-way transaction. I confirmed that the service APIs are not proxied, but instead available via the Kubernetes Ingress configuration. The Kubernetes configuration answered several questions for me, including how the services are run, and the existence of a handful of CronJob tasks that could also use monitoring.

I do not yet have a plan for how to implement monitoring, at least one that I can hand off to other people. I am still in discovery mode. One question I had after the meeting was if we have enough monitoring for the major task already – the slow provisioning of Azure workers – and if we should tackle that directly. Next week I plan to put together my scattered notes on that issue, and plan some first steps.

We went to my mother-in-law’s house on Sunday. She is so excited for visitors, she agreed to play Wingspan, which took about 3 hours (the kid won), followed by a speed round of Hearts. I went from no points to catching the black Queen twice and being the big loser.

Recommendations and links:

  • Wordle is a daily word game like Mastermind, and just requires a browser. The author made it easy to share your daily results without spoilers, making it a viral hit on Twitter, Slack, and our family iMessage threads. It really annoys some people, which is also fun, and makes for a new meme format.
  • Alan wants to read Implementing Service Level Objectives by Alex Hidalgo, and now I do too.
  • Wingspan is a fun engine-building board game. It takes so long to get through the first game, but then we just wanted to play again. I can’t say that I’ve figured out a winning strategy – I never know who won until the points are added at the end.
  • The Story of King Solomon and Ashmedai is quite a story, and makes me think about the thin line between what makes it into the religious canon and what doesn’t.

One More Week: Designing the next Taskcluster monitoring system

January 22, 2022

I’ve got a date for the team transition: Wednesday February 9th, 4½ weeks after my original date. I was also told in confidence the reason, and given a chance to continue on my current team. I’m still leaning toward the new team, but I’m taking the weekend to think about it.

Our dogs Asha, Link, and Finn enjoy a light snow

Monday was Martin Luther King Jr. Day and a Mozilla holiday, so it was a short work week. I spent a little time helping with some tests in PRs on the Relay project. The tests are not as mature as I’d like, but things are improving in January, such as PR 1493, Domain recipient tests and coverage, by groovecoder. I look forward to talking to the current team and determining what code quality issues are slowing down progress.

The Taskcluster team is focusing on reliability and observability this quarter. I put together a summary of Taskcluster, what monitoring is available, and a suggestion for a monitoring strategy. Most of the summary was based on the Reference section of the Taskcluster docs, which describes the 3 platform services and the 7 core services, as well as the API, log messages produced, and Pulse / AMQP messages emitted and consumed. Existing monitoring involves an (unused) New Relic hook, watching log messages in structured logs, and smoke tests.

The docs do not cover the breakdown of 10 services to 20 or more micro-services – usually a “front-end” HTTP API and one or more backend processing tasks running in a loop. In my experience, the API services are reliable and well-tested, while the backend processing tasks cause problems and are difficult to (automatically) monitor and debug. My belief may be influenced by the worker-manager scanner, which syncs status with the cloud providers and is prone to slow execution cycles (see bug 1736329, gecko-3/decision workers not taking tasks, for analysis via logs, and bug 1735411, windows 2004 test worker backlog (gecko-t/win10-64-2004), for an analysis of one incident and a historical code review). The provisioner, which provisions new on-demand servers, has its own issues, such as a work strike due to bad recovery from cloud provider issues.

My recommendation is to use the Pulse message bus to periodically broadcast status of backend services, and add a monitoring service that listens to these messages, gathers status from the micro-services with a Web API, and aggregates the data into a unified JSON status report. This can be consumed by Telescope, which can set limits and alert thresholds. The feedback has been that AMQP can be hard to debug and reason about, that there can be issues with message volume (bug 1717128, Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster, may be a data point), and that a new core monitoring service may be overkill. Some counter-proposals have been per-service health endpoints, or using an existing monitoring system like statsd or Prometheus.
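To make the aggregation step of that recommendation concrete, here is a minimal Python sketch. It is not Taskcluster code. The message fields (`service`, `status`, `timestamp`), the service names, and the staleness threshold are all assumptions made for illustration, and the Pulse transport itself is left out: the function only takes a list of already-received status messages and folds them into one JSON-ready report.

```python
import json
import time


def aggregate_status(messages, now=None, stale_after=300):
    """Fold periodic status messages into a single report.

    Each message is a dict with hypothetical keys: "service", "status",
    and "timestamp" (seconds since the epoch). Only the newest message
    per service is kept, and a service that has not reported within
    `stale_after` seconds is marked "stale".
    """
    now = time.time() if now is None else now

    # Keep the most recent message for each service.
    latest = {}
    for msg in messages:
        name = msg["service"]
        if name not in latest or msg["timestamp"] > latest[name]["timestamp"]:
            latest[name] = msg

    services = {}
    for name, msg in sorted(latest.items()):
        age = now - msg["timestamp"]
        services[name] = {
            "status": "stale" if age > stale_after else msg["status"],
            "age_seconds": age,
        }

    # The overall flag is true only if every service reported "ok" recently.
    healthy = all(entry["status"] == "ok" for entry in services.values())
    return {"healthy": healthy, "services": services}


def report_as_json(report):
    """Serialize the report for a consumer that polls a JSON document."""
    return json.dumps(report, indent=2, sort_keys=True)
```

A backend loop that stops reporting shows up as "stale" rather than silently disappearing, which is the failure mode that is hardest to spot from the HTTP API alone.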

Next week we’ll refine the recommendations and discuss them some more. Once we get consensus next week or the following, I plan to summarize as a public Taskcluster RFC. If the team move goes as planned, that may be where my active involvement ends. I’m optimistic that the team (including new Berlin hire Yaraslau Kurmyza) can move Taskcluster to the next level.

We use community-tc to build Taskcluster, and our usage of Taskcluster helps align us with the users of FirefoxCI. Pete Moore recently released v44.4.0, which took a while due to intermittent errors in the automated release process. These broken releases broke one of the CI build steps (issue 5067) due to missing images, and then when the new code fully shipped, the build broke in a different way due to lingering Python 2.7 code (issue 5070). Matt Boris was also tripped by an intermittent error on PR 5069, Fix styling of label on GitHub Quickstart Page, and Taskcluster doesn’t fully support re-runs (issue 4950). Testing PRs and releasing new versions is far too fragile, but is not the focus of the reliability effort in Q1. Hopefully the team will find some time to improve these aspects as well.

The three-day weekend gave me time to finish Learned Optimism by Martin Seligman. My sister recommended this in summer 2021, and it took me to November to start it. I got ⅔ of the way through, and then started over from chapter 1, taking notes as I read, 13,000 words worth. I found it interesting and helpful, and I’m trying the flexible optimism exercises to alter my thought patterns. My review on Goodreads has more details.

Asha’s Graduation Ceremony

The kids had an even shorter 3 day week. The COVID numbers dropped (0.82% of staff and kids, down from 1.28% last week), but we won’t know until next week if this is a true downward trend. We’re hearing daily stories of COVID positive tests – a neighbor on Wednesday, my brother-in-law on Friday, and plenty through the work networks. A COVID scare cancelled our training class last week, but our 24-week-old, 70lb puppy graduated on Friday.

Recommendations and links:

  • I’ve been using the comment
    <!-- vim:set wrap linebreak breakat&vim: -->
    in my Markdown notes, as suggested by this Stack Overflow answer. It has made editing long lines more pleasant than regular wrapping.
  • OmSave for Safari has been a big improvement over OmniFocus’s default “Sent To” method, with the ability to set a project and tag, place the URL in the note, and even add some of the content.
  • We’re working through Season 3 of Buffy the Vampire Slayer, and it is still good fun. Our Senior finds some of the Senior year discussions too stressful to binge it. The creator Joss Whedon has gone from nerd hero to a problematic figure in the last few years, and the Vulture feature The Undoing of Joss Whedon goes into the details.
  • I found that Whedon profile link through Matt Enloe’s “What’s Good” newsletter. Newsletters in my email are the new RSS feed, and OmniFocus my new “Read Later” service.

One More Week: gpg key refresh

January 14, 2022

I’m going to blog about what happened in a week, instead of only blogging when I think I have something significant and permanent to say and I’m in the mood to blog.

I’m still moving teams within Mozilla, but the transition is now pending without a date. I’m trying to make the best of this in-between time.

The transition continues to be slow.
“Snail” by jamieanne, CC BY-ND 2.0

At Mozilla, we use GPG for signing commits. We sometimes use it for encrypting secrets at rest, but the current favorite secret sharing tools are 1Password and magic-wormhole. When I was setting up my keys, I was convinced that an expiring key was a good idea, because if I lose access to the private key and can’t revoke it, at least it will expire eventually. However, this means that I need to refresh the expiration date. The article How to change the expiration date of a GPG key by George Notaras was published in 2010, but GPG doesn’t change much, so it is still relevant.

Signed commits with “Verified” tag

I schedule expiration for February, and set myself a reminder to refresh in January, along with the steps I take to do it. I publish keys 9ECA 5960 3107 8B1E and 082C 735D 154F B750 to keys.openpgp.org, since gpg.mozilla.org was taken down after a June 2019 attack. I also sync them to Keybase and recreate them in my GitHub account.

I cleaned up the MLS / Ichnaea documentation. PR 1764 includes a refresh of the deployment docs which is a couple of years overdue. I also updated the Location entry on the wiki and some other internal documents. This is the end of my “quick” list of MLS updates, so I’m moving on to Taskcluster: fixing some build issues, reviewing and merging dependency updates, and thinking about how to monitor deployment health. I got a little stuck with building the generic-worker on my M1 MacBook, and a failing CI test, but both are surmountable.

For my next team, I read through the fx-private-relay codebase. I found this tip for getting the list of files tracked in a repo:

git ls-tree --full-tree --name-only -r HEAD

I then manipulated the output to turn it into a list of Markdown links to the GitHub repository, and checked off each file as I viewed it in SourceTree, or on GitHub if it was more than 50 lines of code. Most of the action is in the email app, and a lot of that in emails/models.py and emails/views.py. There aren’t as many tests as I would expect, and some integration tests may cover a bunch of functionality.
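The post does not say how the output was turned into links, so the following is only one possible approach, sketched in Python. The repository URL, the `main` branch name, and the `- [ ]` task-list format are assumptions. The only part taken from the post is the `git ls-tree` command.

```python
import subprocess
from urllib.parse import quote

# Assumptions for illustration: adjust to the repository and branch being read.
REPO_URL = "https://github.com/mozilla/fx-private-relay"
BRANCH = "main"


def tracked_files():
    """Return tracked file paths, using the git command from the post."""
    result = subprocess.run(
        ["git", "ls-tree", "--full-tree", "--name-only", "-r", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def to_markdown_checklist(paths, repo_url=REPO_URL, branch=BRANCH):
    """Turn file paths into Markdown task-list items linking to GitHub."""
    lines = []
    for path in paths:
        path = path.strip()
        if not path:
            continue  # skip blank lines in the input
        # quote() leaves "/" alone but escapes spaces and other unsafe characters.
        url = f"{repo_url}/blob/{branch}/{quote(path)}"
        lines.append(f"- [ ] [{path}]({url})")
    return lines
```

Run inside a checkout, `print("\n".join(to_markdown_checklist(tracked_files())))` prints one checkbox line per tracked file, ready to paste into a notes file.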

In my non-work life, the schools are struggling with Covid-19. Isaac was in remote learning one day, and got two notices of exposure, meaning he was in class with a Covid-positive person. Ainsley has two remote learning days. I’m so glad they are both vaccinated, and wear their masks. I’m so disappointed with Oklahoma’s leadership.

I got a promotion for a local pizza place’s 17th anniversary special, and got it for delivery. It was delivered almost 5 hours after ordering. It was a bad combination of a fantastic promotion, an online ordering system that didn’t turn itself off, and using a gig economy service for delivery. I don’t want to shame anyone, so I’m avoiding names. I went from a hero for ordering pizza, to an adequate dad for making spaghetti.

Finally, my grandmother turned 93 this week! I’m so grateful for her, and that she and my grandfather have stayed safe through this pandemic.

Recommendations and links:

  • magic-wormhole is magic for transferring files securely. If you need to resend, change the filename first, to avoid problems conflicting with the original file.
  • I’m taking notes in Markdown, and started using Marked 2 again to view them. The plugin itspriddle/vim-marked adds a :MarkedOpen command that opens a rendered version and updates when saved.
  • I printed a few photos and a board-backed poster with Walgreens Photo. The photo webapp is OK, but had a couple of hiccups. If you can, crop to the desired size in a desktop photo app first. I got them printed the same day, and the results were decent and competitive with printing myself.
  • You can support the authors of some of the best MDN Web Docs content through Open Collective.

One More Week: Slow Transitions, Multi-Platform Image Creation on CircleCI

January 7, 2022

I’m going to blog about what happened in a week, instead of only blogging when I think I have something significant and permanent to say and I’m in the mood to blog.

I’m moving teams within Mozilla. This was supposed to be my first week on the Privacy and Security team, working on Firefox Relay, an email aliasing service. I spent some time talking with Se Yeon Kim, my on-boarding buddy, about the team, the Slack channels, and the other side projects we continue to maintain (she has shavar, I have ichnaea). The transfer process between teams is slow, so I have unexpected bonus time to work on old projects for the Systems Engineering team.

The mascot for team transitions
“turtle” by Jazminator

For Relay, I got the development environment working, mostly following the README. The tricky bit was getting the requirements installed with Python 3.7.12 via pyenv. I was getting errors for cryptography ('openssl/opensslv.h' file not found) and psycopg2 (ld: library not found for -lssl). The error message led to the solution from Building cryptography on macOS:

LDFLAGS="-L$(brew --prefix openssl@1.1)/lib" CFLAGS="-I$(brew --prefix openssl@1.1)/include" pip install -r requirements.txt

I didn’t contribute a change because I feel this is standard for developing on the native system versus a Docker container – you need to figure out stuff like this. However, I did notice a warning, which led to re-opening issue #1201, django.urls.path warning when calling runserver function, and PR #1447, Fix swagger API schema endpoint. I reported a further warning to axnsan12/drf-yasg as issue #765, ruamel.yaml 0.17.0 has deprecated dump(), load().

For my SysEng work, I’m tackling small stuff that I can work to completion, but not anything that can risk breaking production when I’m gone. I have plenty of these things, too small to make it to quarterly planning but too large to work on in the gaps between other projects.

I spent some time on generating multi-platform docker images for Ichnaea, following the guide Building Docker images for multiple operating system architectures. This required moving from the docker executor to a machine executor, to enable buildx and installing different engines for QEMU emulation. This was the easy part, and the guide was very useful. However, building just linux/amd64 and linux/arm64 took 35 minutes, versus 5 minutes for linux/amd64 in a docker executor. I decided slower builds were not worth it, since I’ve worked out the kinks in M1 MacBook development, and we’re not planning to ship on ARM. I closed the experiment in PR 1762, made some config changes in PR 1763, and added notes to issue 1636. If I try again, I think I will build the images on native machines (i.e., ARM images on ARM machines), and combine the Docker manifests in a final step.

There’s persistent slowness in provisioning Azure workers, and I’ve looked into it between other projects. I used SourceTree to walk through the history of the Taskcluster Azure provider, and added my notes to bug 1735411. It looks like the incremental provisioning and de-provisioning of Azure workers is required, and has been built up over months. The problem is each increment requires a worker scanner loop, and these loops can be as long as 90 minutes, meaning it takes hours to get a worker up and running. The next step is to speed up the worker scanner loop, ideally to a minute or less, so that each of those iterations is shorter. That could be optimization, or it could be time-limiting the work done in each loop. It will hopefully be interesting work for the next person.
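The "time-limiting" option can be illustrated with a small Python sketch. This is not the worker scanner's code. The names `scan_with_budget` and `check_worker` are hypothetical, and the sketch assumes that checking one worker is an independent step that can be deferred to a later pass.

```python
import time
from collections import deque


def scan_with_budget(workers, check_worker, budget_seconds, clock=time.monotonic):
    """Check workers until all were visited once or the time budget runs out.

    `workers` is a deque. Each checked worker is moved to the back, so the
    next call resumes with the workers that were skipped this time.
    Returns the number of workers checked in this pass.
    """
    deadline = clock() + budget_seconds
    checked = 0
    # Visit each worker at most once per pass.
    for _ in range(len(workers)):
        if clock() >= deadline:
            break  # out of time; the remaining workers wait for the next pass
        worker = workers.popleft()
        check_worker(worker)
        workers.append(worker)
        checked += 1
    return checked
```

With a budget of, say, 60 seconds, each pass ends on schedule regardless of how many workers exist. The trade-off is that a single worker may wait several passes before it is rechecked, so this bounds loop time without reducing the total work.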

Ichnaea dependency updates are batched at the start of the month, and I was able to get through them quickly with pmac (PR #1759 and PR #1761). None of the updates were significant. We’re still blocked on the SQLAlchemy update (unknown work to update database code) and Python 3.10 (packaging issue with a dependency).

In my non-work life, the kids were back in school Tuesday for a 4-day week. Masks are now recommended again, but we’re not as bad as NYC (yet). Both kids missed some time for health reasons, but not Covid. Asha, our Great Dane puppy, is up to 60 lbs, and will be back in training class tonight after a winter break.

Asha and Finn hanging out on the couch

Random recommendations from me and others:

  • The family has gotten into Project Zomboid, a top-down zombie survival game where you always eventually die, but you feel like you’re getting better.
  • I continue to play Castlevania: Grimoire of Souls, a free-to-play-but-pay-to-win game that was ported to Apple Arcade, lost all the pay-to-win features, but continues to have an addictive upgrade cycle. It takes 5-10 minutes to play a level, but I find it easy to repeat for hours.
  • I’m slowly watching Get Back, the epic Beatles documentary. It’s 8 hours total, but each day is 15-30 minutes, so I’m watching a day at a time. I like the Beatles, but I am also fascinated by the team dynamics between the band members, and with those that support them and rely on them for their job.
  • My client-side Sentry event filtering code in Socorro silently broke. Will KG fixed it, and launched a new project kent as a fake Sentry server for testing integrations.
  • Will KG recommends CPython Internals, still likes Introduction to Algorithms, but thinks the Dragon book is showing its age.
  • Pete Moore recommends tup as a faster alternative to make, which is capable of building Firefox.