One More Week: Designing the next Taskcluster monitoring system

I’ve got a date for the team transition: Wednesday February 9th, 4½ weeks after my original date. I was also told in confidence the reason, and given a chance to continue on my current team. I’m still leaning toward the new team, but I’m taking the weekend to think about it.

Our dogs Asha, Link, and Finn enjoy a light snow

Monday was Martin Luther King Jr. Day and a Mozilla holiday, so it was a short work week. I spent a little time helping with some tests in PRs on the Relay project. The tests are not as mature as I’d like, but things are improving in January, such as PR 1493, Domain recipient tests and coverage, by groovecoder. I look forward to talking to the current team and determining what code quality issues are slowing down progress.

The Taskcluster team is focusing on reliability and observability this quarter. I put together a summary of Taskcluster, what monitoring is available, and a suggestion for a monitoring strategy. Most of the summary was based on the Reference section of the Taskcluster docs, which describes the 3 platform services and the 7 core services, as well as the API, log messages produced, and Pulse / AMQP messages emitted and consumed. Existing monitoring involves an (unused) New Relic hook, watching log messages in structured logs, and smoke tests.

The docs do not cover the breakdown of 10 services to 20 or more micro-services – usually a “front-end” HTTP API and one or more backend processing tasks running in a loop. In my experience, the API services are reliable and well-tested, while the backend processing tasks cause problems and are difficult to (automatically) monitor and debug. My belief may be influenced by the worker-manager scanner, which syncs status with the cloud providers and is prone to slow execution cycles (see bug 1736329, gecko-3/decision workers not taking tasks, for analysis via logs, and bug 1735411, windows 2004 test worker backlog (gecko-t/win10-64-2004), for an analysis of one incident and a historical code review). The provisioner, which provisions new on-demand servers, has its own issues, such as a work strike due to bad recovery from cloud provider issues.

My recommendation is to use the Pulse message bus to periodically broadcast status of backend services, and add a monitoring service that listens to these messages, gathers status from the micro-services with a Web API, and aggregates the data into a unified JSON status report. This can be consumed by Telescope, which can set limits and alert thresholds. The feedback has been that AMQP can be hard to debug and reason about, that there can be issues with message volume (bug 1717128, Treeherder repeatedly falling behind with ingestion of data from Pulse/Taskcluster, may be a data point), and that a new core monitoring service may be overkill. Some counter-proposals have been per-service health endpoints, or using an existing monitoring system like statsd or Prometheus.
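To make the aggregation idea concrete, here is a minimal sketch of the listening side, written in Python for brevity. It is not Taskcluster code and it does not connect to Pulse: the message shape (`service`, `status`, `details`), the `StatusAggregator` class, and the 120-second staleness limit are all assumptions made up for illustration. It only shows how periodic status broadcasts could be folded into one JSON report, with a silent backend flagged as stale.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StatusAggregator:
    """Keeps the latest status broadcast per service (hypothetical message shape)."""

    # Assumed limit: a service that has been silent this long is reported as stale.
    stale_after: float = 120.0
    latest: dict = field(default_factory=dict)

    def on_message(self, message: dict, received_at: float) -> None:
        # In a real system this would be the callback for a message-bus consumer.
        self.latest[message["service"]] = {
            "received_at": received_at,
            "status": message["status"],
            "details": message.get("details", {}),
        }

    def report(self, now: float) -> dict:
        services = {}
        for name, entry in sorted(self.latest.items()):
            age = now - entry["received_at"]
            services[name] = {
                # A missing heartbeat overrides whatever the service last said.
                "status": "stale" if age > self.stale_after else entry["status"],
                "age_seconds": age,
                "details": entry["details"],
            }
        all_ok = all(s["status"] == "ok" for s in services.values())
        return {
            "generated_at": now,
            "overall": "ok" if all_ok else "degraded",
            "services": services,
        }


if __name__ == "__main__":
    agg = StatusAggregator()
    agg.on_message(
        {"service": "worker-manager-scanner", "status": "ok",
         "details": {"last_loop_seconds": 42}},
        received_at=1000.0,
    )
    agg.on_message({"service": "provisioner", "status": "ok"}, received_at=900.0)
    print(json.dumps(agg.report(now=1060.0), indent=2))
```

The useful property is that a backend loop that hangs stops broadcasting, so it shows up as stale without having to report its own failure. A consumer such as Telescope would then only need to poll the single report and apply its own thresholds.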

Next week we’ll refine the recommendations and discuss them some more. Once we get consensus next week or the following, I plan to summarize as a public Taskcluster RFC. If the team move goes as planned, that may be where my active involvement ends. I’m optimistic that the team (including new Berlin hire Yaraslau Kurmyza) can move Taskcluster to the next level.

We use community-tc to build Taskcluster, and our usage of Taskcluster helps align us with the users of FirefoxCI. Pete Moore recently released v44.4.0, which took a while due to intermittent errors in the automated release process. These broken releases broke one of the CI build steps (issue 5067) due to missing images, and then when the new code fully shipped, the build broke in a different way due to lingering Python 2.7 code (issue 5070). Matt Boris was also tripped by an intermittent error on PR 5069, Fix styling of label on GitHub Quickstart Page, and Taskcluster doesn’t fully support re-runs (issue 4950). Testing PRs and releasing new versions is far too fragile, but is not the focus of the reliability effort in Q1. Hopefully the team will find some time to improve these aspects as well.

The three-day weekend gave me time to finish Learned Optimism by Martin Seligman. My sister recommended this in summer 2021, and it took me until November to start it. I got ⅔ of the way through, and then started over from chapter 1, taking notes as I read, 13,000 words’ worth. I found it interesting and helpful, and I’m trying the flexible optimism exercises to alter my thought patterns. My review on Goodreads has more details.

Asha’s Graduation Ceremony

The kids had an even shorter 3-day week. The COVID numbers dropped (0.82% of staff and kids, down from 1.28% last week), but we won’t know until next week if this is a true downward trend. We’re hearing daily stories of COVID positive tests – a neighbor on Wednesday, my brother-in-law on Friday, and plenty through the work networks. A COVID scare cancelled our training class last week, but our 24-week-old, 70lb puppy graduated on Friday.

Recommendations and links:

  • I’ve been using the comment
    <!-- vim:set wrap linebreak breakat&vim: -->
    in my Markdown notes, as suggested by this Stack Overflow answer. It has made editing long lines more pleasant than regular wrapping.
  • OmSave for Safari has been a big improvement over OmniFocus’s default “Sent To” method, with the ability to set a project and tag, place the URL in the note, and even add some of the content.
  • We’re working through Season 3 of Buffy the Vampire Slayer, and it is still good fun. Our high school senior finds some of the senior-year discussions too stressful to binge it. The creator Joss Whedon has gone from nerd hero to a problematic figure in the last few years, and the Vulture feature The Undoing of Joss Whedon goes into the details.
  • I found that Whedon profile link through Matt Enloe’s “What’s Good” newsletter. Newsletters in my email are the new RSS feed, and OmniFocus my new “Read Later” service.


