[tor-project] minutes from the sysadmin meeting

Hi everyone,

TPA held its first meeting of the year, and these are the minutes. I’ll
take the opportunity to wish everyone a happy new year, if you’re into
that kind of calendar. I know it’s not the most obvious thing to do
right now, but I hope you find hope this year.

Roll call: who’s there and emergencies

  • anarcat
  • kez
  • lavamind

No emergencies.

Holidays debrief

The holidays went fine: some minor issues, but nothing that needed to
be dealt with urgently (e.g. 40569, 40567, commit, runner bug). The
rotation worked well.

anarcat went cowboy and set up two new nodes before the holidays, which
is not great because it’s against our general “don’t launch on a
Friday” rule. (It wasn’t on a Friday, but it was close enough to the
holidays to be a significant risk.) Thankfully things worked out fine:
one of the runners ended up failing, but only just as lavamind was
starting work again last week. (!)

2021 roadmap review

sysadmin

We did a review directly in the wiki page. Notable changes:

  • jenkins is marked as completed, as rouyi will be retired this week
    (!)
  • the blog migration was completed!
  • we consider that we managed to deal with the day-to-day while still
    reserving time for the unexpected (e.g. the rushed web migration
    from Jenkins to GitLab CI)
  • we loved that teamwork and should plan to do it again
  • we were mostly on budget: we had an extra 100EUR/mth at Hetzner for
    a new Ganeti node in the gnt-fsn cluster, extra costs (54EUR/mth!)
    from the Hetzner IPv4 billing changes, and more for extra bandwidth
    use

web

Did a review of the 2021 web roadmap (from the wiki homepage), copied below:

Sysadmin+web OKRs for 2022 Q1

We want to take more time to plan for the web team in particular, and we especially focused on this in the meeting.

web team

We did the following brainstorm. Anarcat will come up with a proposal for a better-formatted OKR set for next week, at which point we’ll prioritize this and the sysadmin OKRs for Q1.

  • OKR: rewrite of the donate page (milestone 22)
  • OKR: make it easier for translators to contribute
    • help the translation team to switch to Weblate
    • it is easier for translators to find their built copy of the website
    • bring build time to 15 minutes to accelerate feedback to translators
    • allow the web team to trigger manual builds for reviews (see the sketch after this list)
  • OKR: documentation overhaul:
    • launch dev.tpo
    • “Remove outdated documentation from the header”, stop pointing to dead docs
    • come up with ideas on how to manage the wiki situation
    • cleanup the queues and workflow
  • OKR: resurrect bridge port scan
    • do not scan private IP blocks
    • make it pretty
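
As a concrete illustration of the manual-builds key result above, here
is a minimal GitLab CI sketch; the job name, build command, and
artifact settings are assumptions for illustration, not our actual
pipeline:

    # hypothetical .gitlab-ci.yml fragment: a job the web team can start
    # by hand from the pipeline view to get a browsable copy of the site
    build-review:
      stage: build
      when: manual
      script:
        - lektor build --output-path public  # placeholder build command
      artifacts:
        paths:
          - public        # expose the built site for translators to browse
        expire_in: 1 week

With artifacts enabled like this, translators could browse their built
copy of the website directly from the job page.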

Missed from the last meeting:

  • sponsor 9 stuff: we collected UX feedback for the portals, which involves web work to fix the issues we found; we need to prioritise this

We also need to organise with the new people:

  • onion SRE: new OTF project USAGM, starting in February
  • new community person

Other discussions

Next meeting

We’re going to hold another meeting next week, same time, to review the web OKRs and prioritize Q1.

Metrics of the month

  • hosts in Puppet: 89, LDAP: 91, Prometheus exporters: 139
  • number of Apache servers monitored: 27, hits per second: 185
  • number of Nginx servers: 0, hits per second: 0, hit ratio: 0.00
  • number of self-hosted nameservers: 6, mail servers: 8
  • pending upgrades: 7, reboots: 0
  • average load: 0.35, memory available: 4.01 TiB/5.13 TiB, running processes: 643
  • disk free/total: 84.95 TiB/39.99 TiB
  • bytes sent: 325.45 MB/s, received: 190.66 MB/s
  • planned bullseye upgrades completion date: 2024-09-07
  • GitLab tickets: 159 tickets including…
    • open: 2
    • icebox: 143
    • backlog: 8
    • next: 2
    • doing: 2
    • needs information: 2
    • (closed: 2573)

Upgrade prediction graph now lives at:

… with somewhat accurate values, although the 2024 estimate above
should be taken with a grain of salt, as we haven’t really started the
upgrade at all.

Number of the month

  1. We just hit 5 TiB of deployed memory, kind of neat.

Another number of the month

  1. We have zero Nginx servers left, as we turned off our two Nginx
    servers (ignoring the Nginx server inside the GitLab instance, which
    is not really monitored correctly) when we migrated the blog to a
    static site. Those two servers were the caching servers sitting in
    front of the Drupal blog for cost savings. They served us well, but
    they are now retired since they are not necessary for the static
    version.

An error crept into the Metrics of the month, this month and last; see
if you can spot it:

# Metrics of the month

* hosts in Puppet: 89, LDAP: 91, Prometheus exporters: 139
* number of Apache servers monitored: 27, hits per second: 185
* number of Nginx servers: 0, hits per second: 0, hit ratio: 0.00
* number of self-hosted nameservers: 6, mail servers: 8
* pending upgrades: 7, reboots: 0
* average load: 0.35, memory available: 4.01 TiB/5.13 TiB, running processes: 643
* disk free/total: 84.95 TiB/39.99 TiB
* bytes sent: 325.45 MB/s, received: 190.66 MB/s
* planned bullseye upgrades completion date: 2024-09-07
* [GitLab tickets][]: 159 tickets including...
   * open: 2
   * icebox: 143
   * backlog: 8
   * next: 2
   * doing: 2
   * needs information: 2
   * (closed: 2573)

[Gitlab tickets]: Development · Boards · The Tor Project / TPA / TPA team · GitLab

hint: it's about disk space...

anyone?

Credits to roger, who figured it out: the disk free/total was
backwards. The correct figure should have read:

* disk free/total: 38.28 TiB/84.95 TiB

... in this report. Future reports shouldn't have this error. It should
also be noted that those metrics should generally be taken with a grain
of salt. The disk query was introduced recently and, in particular,
counts disk usage of the (huge) backup server (60TiB), which itself
keeps a copy of everything by definition.
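
For illustration only, one way to make the disk numbers more
representative could be to exclude that backup host from the sums; the
instance pattern below is a made-up placeholder, not the actual server
name:

    # hypothetical: leave out the backup server and pseudo-filesystems
    sum(node_filesystem_avail_bytes{instance!~"backup.*", fstype!~"tmpfs|ramfs"})

The same label filter would apply to node_filesystem_size_bytes for the
"total" side of the figure.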

The network metrics also probably overcount things as we simply do this:

    sum(rate(node_network_transmit_bytes_total[30d]))

... which, in case you are unfamiliar with Prometheus and our network
infrastructure, may count traffic twice: it will count internal traffic
between our network mirrors, for example.

I haven't yet figured out a good (AKA simple) way to fix those queries...
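
One rough idea (a sketch only, not something we actually run) would be
to at least exclude loopback and virtual interfaces, although that
still double-counts traffic between our own hosts:

    # sketch: drop loopback/virtual devices; the device names are assumptions
    sum(rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*|br.*"}[30d]))

Properly excluding mirror-to-mirror traffic would need something
smarter, like a per-host list of external-facing interfaces.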

Cheers!

A.

--
Antoine Beaupré
torproject.org system administration
