[tor-project] TPA-RFC-20: bullseye upgrade schedule

anarcat · March 24, 2022, 8:35pm

Note: this proposal is also visible in the TPA team wiki, on the
"tpa rfc 20 bullseye upgrades" page.

Summary: bullseye upgrades will roll out starting the first weeks of
April and May, and should complete before the end of August 2022. Let
us know if your service requires special handling.

Background

Debian 11 bullseye was released on August 14 2021. Tor
started the upgrade to bullseye shortly after and hopes to complete
the process before the buster EOL, one year after the stable
release, so normally around August 2022.

In other words, we have until this summer to upgrade all of TPA’s
machines to the new release.

New machines that were set up recently have already been installed with
bullseye, as the installers were changed shortly after the release. A
few machines were upgraded manually without any ill effects and we do
not consider this upgrade to be risky or dangerous, in general.

This work is part of the %Debian 11 bullseye upgrade milestone,
itself part of the OKR 2022 Q1/Q2 plan.

Proposal

The proposal, broadly speaking, is to upgrade all servers in three
batches. The first two are somewhat equally sized and spread over
April and May, and the rest will happen at some time that will be
announced later, individually, per server.

Affected users

All service admins are affected by this change. If you have shell
access on any TPA server, you want to read this announcement.

Upgrade schedule

The upgrade is split in multiple batches:

* low complexity (mostly TPA): April
* moderate complexity (service admins): May
* high complexity (hard stuff): to be announced separately
* to be retired or rebuilt servers: not upgraded
* already completed upgrades

The free time between the first two will also allow us to cover for
unplanned contingencies: upgrades that could drag on and other work
that will inevitably need to be performed.

The objective is to do the batches in collective “upgrade parties”
that should be “fun” for the team (and work parties have generally
been fun in the past).

Low complexity, batch 1: April

A first batch of servers will be upgraded in the first week of April.

Those machines are considered to be somewhat trivial to upgrade,
either because they are mostly managed by TPA or because we expect the
upgrade to have minimal impact on the service’s users.

archive-01
build-x86-05
build-x86-06
chi-node-12
chi-node-13
chives
ci-runner-01
ci-runner-arm64-02
dangerzone-01
hetzner-hel1-02
hetzner-hel1-03
hetzner-nbg1-01
hetzner-nbg1-02
loghost01
media-01
metrics-store-01
perdulce
static-master-fsn
submit-01
tb-build-01
tb-build-03
tb-tester-01
tbb-nightlies-master
web-chi-03
web-cymru-01
web-fsn-01
web-fsn-02

27 machines. At a worst case of 45 minutes per machine, that is about
20 hours of work. With three people, this might be doable in a day.
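
As a rough sanity check on that estimate, here is a minimal sketch of the arithmetic. The function and parameter names are illustrative only; the 45-minute worst case and the team of three are the figures quoted above.

```python
def estimate_batch(machines, minutes_per_machine=45, people=3):
    """Return (total_hours, hours_per_person) for one upgrade batch."""
    total_hours = machines * minutes_per_machine / 60
    return total_hours, total_hours / people


total, per_person = estimate_batch(27)
print(f"{total:.2f} hours total, {per_person:.2f} hours per person")
# prints "20.25 hours total, 6.75 hours per person"
```

At just under seven hours each, the batch fits in a single working day only if nothing drags on past the worst-case figure.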

Feedback and coordination of this batch happens in issue
tpo/tpa/team#40690.

Moderate complexity, batch 2: May

The second batch of “moderate complexity servers” happens in the first
week of May. The main difference with the first batch is that the second
batch regroups services mostly managed by service admins, who are given
a longer heads up before the upgrades are done.

bacula-director-01
bungei
carinatum
check-01
crm-ext-01
crm-int-01
fallax
gettor-01
gitlab-02
henryi
majus
mandos-01
materculae
meronense
neriniflorum
nevii
onionbalance-01
onionbalance-02
onionoo-backend-01
onionoo-backend-02
onionoo-frontend-01
onionoo-frontend-02
polyanthum
rude
staticiforme
subnotabile

26 machines. If the worst case scenario holds, this is another day of
work, at three people.

Not mentioned here is the gnt-fsn Ganeti cluster upgrade, which is
covered by ticket tpo/tpa/team#40689. That alone could be a few
person-days of work.

Feedback and coordination of this batch happens in issue tpo/tpa/team#40692.

High complexity, individually done

Those machines are harder to upgrade, due to some major upgrades of
their core components, and will require individual attention, if not
major work to upgrade.

alberti
eugeni
hetzner-hel1-01
pauli

Each machine could take a week or two to upgrade, depending on the
situation and severity. To detail each server:

* alberti: userdir-ldap is, in general, risky and needs special
  attention, but should be moderately safe to upgrade, see ticket
  tpo/tpa/team#40693
* eugeni: messy server, with lots of moving parts (e.g. Schleuder,
  Mailman), Mailman 2 EOL, needs to decide whether to migrate to
  Mailman 3 or replace with Discourse (and self-host), see
  tpo/tpa/team#40471, followup in tpo/tpa/team#40694
* hetzner-hel1-01: Nagios AKA Icinga 1 is end-of-life and needs to
  be migrated to Icinga 2, which involves fixing our git hooks to
  generate Icinga 2 configuration (unlikely), or rebuilding an Icinga
  2 server, or replacing with Prometheus (see
  tpo/tpa/team#29864), followup in tpo/tpa/team#40695
* pauli: Puppet packages are severely out of date in Debian, and
  Puppet 5 is EOL (with Puppet 6 soon to be). This doesn’t necessarily
  block the upgrade, but we should deal with this problem sooner than
  later, see tpo/tpa/team#33588, followup in tpo/tpa/team#40696

All of those require individual decision and design, and specific
announcements will be made for upgrades once a decision has been made
for each service.

To retire

Those servers are possibly scheduled for removal and may not be
upgraded to bullseye at all. If we miss the summer deadline, they
might be upgraded as a last resort.

cupani
gayi
moly
peninsulare
vineale

Specifically:

* cupani/vineale is covered by tpo/tpa/team#40472
* gayi is TPA-RFC-11: SVN retirement, tpo/tpa/team#17202
* moly/peninsulare is tpo/tpa/team#29974

To rebuild

Those machines are planned to be rebuilt and should therefore not be
upgraded either:

cdn-backend-sunet-01
colchicifolium
corsicum
nutans

Some of those machines are hosted at Sunet and need to be migrated
elsewhere, see tpo/tpa/team#40684 for details. colchicifolium
is planned to be rebuilt in the gnt-chi cluster, no ticket created
yet.

They will be rebuilt in new bullseye machines which should allow for a
safer transition that shouldn’t require specific coordination or
planning.

Completed upgrades

Those machines have already been upgraded to (or installed as) Debian
11 bullseye:

btcpayserver-02
chi-node-01
chi-node-02
chi-node-03
chi-node-04
chi-node-05
chi-node-06
chi-node-07
chi-node-08
chi-node-09
chi-node-10
chi-node-11
chi-node-14
ci-runner-x86-05
palmeri
relay-01
static-gitlab-shim
tb-pkgstage-01

Other related work

There is other work related to the bullseye upgrade that is mentioned
in the %Debian 11 bullseye upgrade milestone.

Alternatives considered

We have not set aside time to automate the upgrade procedure any
further at this stage, as this is considered to be too risky a
development project, and the current procedure is fast enough for
now.

We could also move to the cloud, Kubernetes, serverless, and Ethereum
and pretend none of those things exist, but so far we stay in the real
world of operating systems.

Also note that this doesn’t cover Docker container image
upgrades. Each team is responsible for upgrading their image tags in
GitLab CI appropriately and is strongly encouraged to keep a close
eye on those in general. We may eventually consider enforcing stricter
control over container images if this proves to be too chaotic to
self-manage.

Costs

It is estimated that this will take one or two person-months to
complete, full time.

Approvals required

This proposal needs approval from TPA team members, but service admins
can request additional delay if they are worried about their service
being affected by the upgrade.

Comments or feedback can be provided in issues linked above.

Deadline

Upgrades will start in the first week of April 2022 (2022-04-04)
unless an objection is raised.

This proposal will be considered adopted by then unless an objection
is raised within TPA.

Status

This proposal is currently in the proposed state.

--
Antoine Beaupré
torproject.org system administration

anarcat · April 8, 2022, 2:03am

Hi,

We have *almost* completed our objective of upgrading everything in the
first batch of bullseye upgrades this week. Only three servers are left:

... and they are three web mirrors which we have held off of upgrading
because issues came up in the static mirror sync procedure. TB *and*
core team had uploads to do, so we quickly fixed that regression and
left those machines alone for now.

They will be upgraded next week though, because the one machine that was
upgraded seems to have recovered.

The procedure took slightly longer than estimated because I spent some
time automating things that will, hopefully, pay off in the next batch
(and in future major upgrades as well, of course).

Next up is the second batch of servers, in early May:

We might also start working on the migration of the sunet cluster:

... which involves retiring some old services like build-sunet-a:

and ipv6only.torproject.net (does anyone know *what* that thing is
anyway?).

Again, feedback on this procedure is welcome here or in the TPA-RFC-20
issue:

People interested in the long term plan here can look at TPA-RFC-20
again:

Thank you for your attention.

--
Antoine Beaupré
torproject.org system administration

anarcat · April 27, 2022, 3:25pm

Reminder: the bullseye upgrade run is continuing in May.

We therefore are probably going to resume upgrades of the rest of the
cluster *next week*. The machines in this batch are:

bacula-director-01
bungei
carinatum
check-01
crm-ext-01
crm-int-01
fallax
gettor-01
gitlab-02
henryi
majus
mandos-01
materculae
meronense
neriniflorum
nevii
onionbalance-01
onionbalance-02
onionoo-backend-01
onionoo-backend-02
onionoo-frontend-01
onionoo-frontend-02
polyanthum
rude
staticiforme
subnotabile

If you have any concern about those servers being upgraded, do let us
know.

A copy of the original RFC follows.

Thanks!

a.

--
Antoine Beaupré
torproject.org system administration

On 2022-03-24 16:35:34, Antoine Beaupré wrote:

Note: this proposal is also visible in:

tpa rfc 20 bullseye upgrades · Wiki · The Tor Project / TPA / TPA team · GitLab

Summary: bullseye upgrades will roll out starting the first weeks of
April and May, and should complete before the end of August 2022. Let
us know if your service requires special handling.

# Background

Debian 11 [bullseye] was [released on August 14 2021]. Tor
started the upgrade to bullseye shortly after and hopes to complete
the process before the [buster] EOL, [one year after the stable
release], so normally around August 2022.

In other words, we have until this summer to upgrade *all* of TPA's
machines to the new release.

New machines that were set up recently have already been installed with
bullseye, as the installers were changed shortly after the release. A
few machines were upgraded manually without any ill effects and we do
not consider this upgrade to be risky or dangerous, in general.

This work is part of the [%Debian 11 bullseye upgrade milestone],
itself part of the [OKR 2022 Q1/Q2 plan].

# Proposal

The proposal, broadly speaking, is to upgrade all servers in three
batches. The first two are somewhat equally sized and spread over
April and May, and the rest will happen at some time that will be
announced later, individually, per server.

## Affected users

All service admins are affected by this change. If you have shell
access on any TPA server, you want to read this announcement.

## Upgrade schedule

The upgrade is split in multiple batches:

* low complexity (mostly TPA): April
* moderate complexity (service admins): May
* high complexity (hard stuff): to be announced separately
* to be retired or rebuilt servers: not upgraded
* already completed upgrades

The free time between the first two will also allow us to cover for
unplanned contingencies: upgrades that could drag on and other work
that will inevitably need to be performed.

The objective is to do the batches in collective "upgrade parties"
that should be "fun" for the team (and work parties *have* generally
been fun in the past).

### Low complexity, batch 1: April

A first batch of servers will be upgraded in the first week of April.

Those machines are considered to be somewhat trivial to upgrade,
either because they are mostly managed by TPA or because we expect the
upgrade to have minimal impact on the service's users.

archive-01
build-x86-05
build-x86-06
chi-node-12
chi-node-13
chives
ci-runner-01
ci-runner-arm64-02
dangerzone-01
hetzner-hel1-02
hetzner-hel1-03
hetzner-nbg1-01
hetzner-nbg1-02
loghost01
media-01
metrics-store-01
perdulce
static-master-fsn
submit-01
tb-build-01
tb-build-03
tb-tester-01
tbb-nightlies-master
web-chi-03
web-cymru-01
web-fsn-01
web-fsn-02

27 machines. At a worst case of 45 minutes per machine, that is about
20 hours of work. With three people, this might be doable in a day.

Feedback and coordination of this batch happens in issue
[tpo/tpa/team#40690].

### Moderate complexity, batch 2: May

The second batch of "moderate complexity servers" happens in the first
week of May. The main difference with the first batch is that the second
batch regroups services mostly managed by service admins, who are given
a longer heads up before the upgrades are done.

bacula-director-01
bungei
carinatum
check-01
crm-ext-01
crm-int-01
fallax
gettor-01
gitlab-02
henryi
majus
mandos-01
materculae
meronense
neriniflorum
nevii
onionbalance-01
onionbalance-02
onionoo-backend-01
onionoo-backend-02
onionoo-frontend-01
onionoo-frontend-02
polyanthum
rude
staticiforme
subnotabile

26 machines. If the worst case scenario holds, this is another day of
work, at three people.

Not mentioned here is the `gnt-fsn` Ganeti cluster upgrade, which is
covered by ticket [tpo/tpa/team#40689]. That alone could be a few
person-days of work.

Feedback and coordination of this batch happens in issue [tpo/tpa/team#40692].

### High complexity, individually done

Those machines are harder to upgrade, due to some major upgrades of
their core components, and will require individual attention, if not
major work to upgrade.

alberti
eugeni
hetzner-hel1-01
pauli

Each machine could take a week or two to upgrade, depending on the
situation and severity. To detail each server:

* `alberti`: `userdir-ldap` is, in general, risky and needs special
   attention, but should be moderately safe to upgrade, see ticket
   [tpo/tpa/team#40693]
* `eugeni`: messy server, with lots of moving parts (e.g. Schleuder,
   Mailman), Mailman 2 EOL, needs to decide whether to migrate to
   Mailman 3 or replace with Discourse (and self-host), see
   [tpo/tpa/team#40471], followup in [tpo/tpa/team#40694]
* `hetzner-hel1-01`: Nagios AKA Icinga 1 is end-of-life and needs to
   be migrated to Icinga 2, which involves fixing our git hooks to
   generate Icinga 2 configuration (unlikely), or rebuilding an Icinga
   2 server, or replacing with Prometheus (see
   [tpo/tpa/team#29864]), followup in [tpo/tpa/team#40695]
* `pauli`: Puppet packages are severely out of date in Debian, and
   Puppet 5 is EOL (with Puppet 6 soon to be). This doesn't necessarily
   block the upgrade, but we should deal with this problem sooner than
   later, see [tpo/tpa/team#33588], followup in [tpo/tpa/team#40696]

All of those require individual decision and design, and specific
announcements will be made for upgrades once a decision has been made
for each service.

### To retire

Those servers are possibly scheduled for removal and may not be
upgraded to bullseye at all. If we miss the summer deadline, they
might be upgraded as a last resort.

cupani
gayi
moly
peninsulare
vineale

Specifically:

* cupani/vineale is covered by [tpo/tpa/team#40472]
* gayi is [TPA-RFC-11: SVN retirement], [tpo/tpa/team#17202]
* moly/peninsulare is [tpo/tpa/team#29974]

### To rebuild

Those machines are planned to be rebuilt and should therefore not be
upgraded either:

cdn-backend-sunet-01
colchicifolium
corsicum
nutans

Some of those machines are hosted at Sunet and need to be migrated
elsewhere, see [tpo/tpa/team#40684] for details. `colchicifolium`
is planned to be rebuilt in the `gnt-chi` cluster, no ticket created
yet.

They will be rebuilt in new bullseye machines which should allow for a
safer transition that shouldn't require specific coordination or
planning.

### Completed upgrades

Those machines have already been upgraded to (or installed as) Debian
11 bullseye:

btcpayserver-02
chi-node-01
chi-node-02
chi-node-03
chi-node-04
chi-node-05
chi-node-06
chi-node-07
chi-node-08
chi-node-09
chi-node-10
chi-node-11
chi-node-14
ci-runner-x86-05
palmeri
relay-01
static-gitlab-shim
tb-pkgstage-01

### Other related work

There is other work related to the bullseye upgrade that is mentioned
in the [%Debian 11 bullseye upgrade milestone].

# Alternatives considered

We have not set aside time to automate the upgrade procedure any
further at this stage, as this is considered to be too risky a
development project, and the current procedure is fast enough for
now.

We could also move to the cloud, Kubernetes, serverless, and Ethereum
and pretend none of those things exist, but so far we stay in the real
world of operating systems.

Also note that this doesn't cover Docker container image
upgrades. Each team is responsible for upgrading their image tags in
GitLab CI appropriately and is *strongly* encouraged to keep a close
eye on those in general. We may eventually consider enforcing stricter
control over container images if this proves to be too chaotic to
self-manage.

# Costs

It is estimated that this will take one or two person-months to
complete, full time.

# Approvals required

This proposal needs approval from TPA team members, but service admins
can request additional delay if they are worried about their service
being affected by the upgrade.

Comments or feedback can be provided in issues linked above.

# Deadline

Upgrades will start in the first week of April 2022 (2022-04-04)
unless an objection is raised.

This proposal will be considered adopted by then unless an objection
is raised within TPA.

# Status

This proposal is currently in the `proposed` state.

# References

* [TPA bullseye upgrade procedure]
* [%Debian 11 bullseye upgrade milestone]

[TPA bullseye upgrade procedure]: bullseye · Wiki · The Tor Project / TPA / TPA team · GitLab
[%Debian 11 bullseye upgrade milestone]: Debian 11 bullseye upgrade · TPA · GitLab
[bullseye]: DebianBullseye - Debian Wiki
[released on August 14 2021]: Debian -- News -- Debian 11 "bullseye" released
[buster]: howto/upgrades/buster
[one year after the stable release]: Debian -- Debian security FAQ
[OKR 2022 Q1/Q2 plan]: 2022 · Wiki · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40690]: bullseye upgrades, first batch (#40690) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40692]: bullseye upgrades, second batch (#40692) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40693]: upgrade alberti to bullseye ... er bookworm! (#40693) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40471]: upgrade mailman to mailman 3 (#40471) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#29864]: TPA-RFC-33: consider replacing nagios with prometheus (#29864) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#33588]: migrate to puppetserver and Puppet agent 7 before EOL (#33588) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40684]: Move Sunet/Safespring VM's to new site (#40684) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40694]: upgrade eugeni to bullseye (#40694) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40695]: upgrade or rebuild hetzner-hel1-01 (nagios/icinga) (#40695) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40696]: upgrade or rebuild pauli / puppet (#40696) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40472]: draft TPA-RFC-36: establish policy on git repository mirroring, hosting and, ultimately migration from gitolite (#40472) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#17202]: Shut down SVN and decomission the host (gayi) (#17202) · Issues · The Tor Project / TPA / TPA team · GitLab
[TPA-RFC-11: SVN retirement]: policy/tpa-rfc-11-svn-retirement
[tpo/tpa/team#29974]: move critical services off, and then replace, moly (#29974) · Issues · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#40689]: upgrade the gnt-fsn cluster to bullseye (#40689) · Issues · The Tor Project / TPA / TPA team · GitLab

--
Antoine Beaupré
torproject.org system administration
_______________________________________________
tor-project mailing list
tor-project@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project

anarcat · May 5, 2022, 4:12pm

TL;DR: the upgrade is a little slower, we hit a snag with PostgreSQL,
and you may need to upgrade to Python 3 earlier than expected.

Reminder: the bullseye upgrade run is continuing in May.

A little update on the progress of the bullseye upgrades... As you can
see here, progress has been a little slower than the first batch:

During the first week of the first batch, we had most of the servers
done ("only three left!"). It was pretty impressive. But in the first
week of this second batch, we were a little slower: we only did a third.

That was partly due to people's availability: I was away on Monday, and
kez was also less available. Plus, we had kernel reboots to handle,
which took a day of lavamind's time.

But it was also due to the complexity of this second batch: most of the
servers are managed by service admins and have more moving parts and
legacy.

In particular, it seems we might be having unexpected trouble with the
PostgreSQL 13 upgrade, which is a little disappointing. Materculae is
showing signs of increased memory usage, including an OOM last
night. That issue is tracked here and we welcome any input from
PostgreSQL nerds:

We also hit a few problems with the Python 2 deprecation. Technically,
we haven't announced removing support for Python 2 just yet, and Debian
bullseye did ship with Python 2.7, even though it's been dead since
April 2020. But bullseye *does* ship with a bunch of Python 2 *modules*
removed. So far, we have had to deal with the removal of the following
packages:

* python-dateutil
* python-dnspython
* python-psycopg2
* python-stem

In general, buster shipped with 3470 Python 2 packages, and bullseye
brought that list down to *only* 766! So there are a *lot* more packages
like this that may cause problems on our servers. The details of which
packages those are can be found here:

https://people.debian.org/~anarcat/python2-in-debian/

We don't actually *know* of any such packages left: all the ones that we
had specified in Puppet are noted above and have been replaced with
their Python 3 equivalent, and the service admins have fixed their
service.

But we don't actually manage everything through Puppet, so it is
perfectly possible that you are relying on a dependency that will be
removed in the pending Python upgrade.

So if you rely on any Python script which relies on Debian packages, now
is a good time to make sure it works with the following header:

#!/usr/bin/python3

... and you can do this right now, even before we upgrade your service
to Python 3.
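
One way to find candidates is to look for scripts whose first line still points at a Python 2 interpreter. The following is a minimal sketch, not part of the TPA tooling: the function names are hypothetical, and it assumes that an unversioned `python` shebang should be treated as Python 2, as was the case on buster.

```python
#!/usr/bin/python3
"""List scripts under a directory whose shebang still points at Python 2."""
import os
import re

# Matches "#!/usr/bin/python", "#!/usr/bin/python2.7", "#!/usr/bin/env python2"
# (optionally followed by arguments), but not "python3".
# Assumption: an unversioned "python" means Python 2.
PY2_SHEBANG = re.compile(rb"^#!.*\bpython(2(\.\d+)?)?(\s.*)?$")


def is_python2_shebang(first_line):
    """Return True if the given first line (bytes) is a Python 2 shebang."""
    return PY2_SHEBANG.match(first_line.rstrip(b"\r\n")) is not None


def find_python2_scripts(root):
    """Walk root and return the sorted paths of files with a Python 2 shebang."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip FIFOs, sockets and broken symlinks
            try:
                with open(path, "rb") as handle:
                    first_line = handle.readline(256)
            except OSError:
                continue  # unreadable file, nothing to report
            if is_python2_shebang(first_line):
                hits.append(path)
    return sorted(hits)
```

Calling `find_python2_scripts("/srv")` (the path is only an example) returns the list of files to review. A clean result is not a guarantee: scripts started as `python script.py`, or Python 3 scripts importing a module that is not installed, will not show up.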

I plan on making a formal RFC to clarify this situation as well, today,
so that this gets to a broader audience.

We plan to resume the bullseye upgrades next week, as we don't do major
changes like this before the weekend.

Thank you for your attention!

--
Antoine Beaupré
torproject.org system administration

anarcat · May 5, 2022, 7:21pm

this is now done, in [tor-project] TPA-RFC-27: Python 2 end of life

anarcat · June 23, 2022, 4:02pm

Hi everyone,

So the second batch of Debian upgrades, as expected, took longer than
expected but I'm happy to announce that, as of today, we have completed
the upgrade of the second batch of servers to Debian 11 "bullseye".

That was almost 30 machines to upgrade, some of which required service
admins or TPA to port things to Python 3!

We have also completed the upgrade of the main Ganeti cluster:

... which means we only have three batches of servers left to do:

* Sunet cluster, 4 machines to rebuild (see #40684):
   * cdn-backend-sunet-01
   * colchicifolium
   * corsicum
   * nutans
* retirements, 5 machines to retire:
   * cupani/vineale (gitolite/gitweb, RFC to come, see #40472 for now)
   * moly/peninsulare (old virtual machine hosting, a bunch of VMs to
     retire, migrate, or rebuild as well, #29974)
   * subnotabile (survey.tpo, see TPA-RFC-26 and #40810)
* "hard servers" batch, 4 machines to upgrade or rebuild:
   * alberti (LDAP, to upgrade, #40693)
   * eugeni (email, unsure, maybe rebuild, depends on TPA-RFC-31, #40694)
   * hetzner-hel1-01 (Nagios, undecided, rebuild or retire, #40695)
   * pauli (Puppet, unsure, upgrade or rebuild, #40696)

That is 13 machines left to deal with, out of 97. It's great progress,
but the numbers are a bit deceptive: many of those upgrades are "hard"
in that they require either migrating machines, retiring services, or
rebuilding services from scratch. Some upgrades, particularly Eugeni
(Mailman 3! complex server) and Pauli (major Puppet upgrade, no Debian
support, complex server) are particularly tricky.
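
The tally above can be double-checked with a few lines. This is only a sketch of the arithmetic: the dictionary keys are illustrative labels, while the counts and the total of 97 are the ones given in this message.

```python
# Remaining machines per group, as listed in this update.
remaining = {
    "sunet rebuilds": 4,
    "retirements": 5,
    "hard servers": 4,
}
total_machines = 97

left = sum(remaining.values())
done = total_machines - left
print(f"{left} left, {done} done ({done / total_machines:.1%} complete)")
# prints "13 left, 84 done (86.6% complete)"
```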

It's unlikely this will all be completed by the buster EOL date, which
is supposed to be in late July. But we can dream! I still hope to
realize that goal, but as a fallback, I hope to be done before the
bookworm freeze, planned in early 2023, at which point we plan to start
all of this over again, fresh with the knowledge we gained, and do a
*lot* of upgrades again!

(There's also some cleanup work that needs to happen in various places,
but that can be done after the EOL.)

So stay tuned for the rest of this exciting adventure. Details in the
TPA milestone here:

And let me know if this is too noisy for tor-project.

A.

--
Antoine Beaupré
torproject.org system administration