Summary: start upgrading servers during the Debian 13 ("trixie")
freeze. If that goes well, complete most of the fleet upgrade around
June 2025 and the rest by the end of 2025, leaving 2026 entirely free
of major upgrades. Improve automation, retire old container images.
Deadline: 2 weeks, 2025-04-01
# Background
Debian 13 ("trixie"), currently "testing", is going into freeze soon, which
means we should have a new Debian stable release in 2025. It has been
a long-standing tradition at TPA to collaborate in the Debian
development process and part of that process is to upgrade our servers
during the freeze. Upgrading during the freeze makes it easier for us
to fix bugs as we find them and contribute them to the community.
The [freeze dates announced by the debian.org release team] are:
- 2025-03-15 - Milestone 1 - Transition and toolchain freeze
- 2025-04-15 - Milestone 2 - Soft Freeze
- 2025-05-15 - Milestone 3 - Hard Freeze - for key packages and
  packages without autopkgtests
- To be announced - Milestone 4 - Full Freeze
We have entered the "transition and toolchain freeze", which locks
changes to packages like compilers and interpreters unless an
exception is granted. See the [Debian freeze policy] for an
explanation of each step.
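
As a rough illustration of how the milestones above map onto calendar dates, here is a minimal Python sketch that returns the latest freeze milestone reached on a given day. The `freeze_phase` helper and the sample dates are hypothetical and only serve the example. The milestone dates are copied from the list above, and the full freeze is left out because its date had not been announced at the time of writing.

```python
from datetime import date

# Milestone dates copied from the release team announcement quoted above.
# The full freeze is omitted: its date was still "to be announced".
MILESTONES = [
    (date(2025, 3, 15), "transition and toolchain freeze"),
    (date(2025, 4, 15), "soft freeze"),
    (date(2025, 5, 15), "hard freeze"),
]


def freeze_phase(day):
    """Return the name of the latest milestone reached on `day`."""
    phase = "not frozen"
    for start, name in MILESTONES:
        if day >= start:
            phase = name
    return phase


# Hypothetical sample dates, picked only for illustration.
print(freeze_phase(date(2025, 4, 28)))  # soft freeze
print(freeze_phase(date(2025, 5, 26)))  # hard freeze
```
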
Even though we've just completed the Debian 11 ("bullseye") and 12
("bookworm") upgrades in late 2024, we feel it's a good idea to start
*and* complete the Debian 13 upgrades in 2025. That way, we can hope to
have a year or two (2026-2027?) *without* any major upgrades.
This proposal is part of the [Debian 13 trixie upgrade milestone],
itself part of the [2025 TPA roadmap].
[freeze dates announced by the debian.org release team]: Bits from the Release Team: trixie freeze dates
[Debian freeze policy]: trixie Freeze Timeline and Policy
[Debian 13 trixie upgrade milestone]: Debian 13 trixie upgrade · TPA · GitLab
[2025 TPA roadmap]: 2025 · Wiki · The Tor Project / TPA / TPA team · GitLab
# Proposal
As usual, we perform the upgrades in three batches, in increasing
order of complexity, starting in 2025Q2, hoping to finish by the end
of 2025.
Note that, this year, this proposal also includes upgrading the Tails
infrastructure. To help with merging rotations in the two teams, TPA
staff will upgrade Tails machines, with the Tails folks' assistance,
and vice-versa.
## Affected users
All service admins are affected by this change. If you have shell
access on any TPA server, you will want to read this announcement.
In the past, TPA has typically kept a page detailing notable changes
and a proposal like this one would link to the upstream release
notes. Unfortunately, at the time of writing, upstream hasn't yet
produced release notes (as trixie is still in testing).
We're hoping the documentation will be refined by the time we're ready
to coordinate the second batch of updates, around May 2025, when we
will send reminders to affected teams.
We do expect the Debian 13 upgrade to be less disruptive than bookworm,
mainly because Python 2 is already retired.
## Notable changes
For now, here are some known changes that are already in Debian 13:
Package | 12 (bookworm) | 13 (trixie) |
--------------------|---------------|-------------|
Ansible | 7.7 | 11.2 |
Apache | 2.4.62 | 2.4.63 |
Bash | 5.2.15 | 5.2.37 |
Emacs | 28.2 | 30.1 |
Fish | 3.6 | 4.0 |
Git | 2.39 | 2.45 |
GCC | 12.2 | 14.2 |
Golang | 1.19 | 1.24 |
Linux kernel image | 6.1 series | 6.12 series |
LLVM | 14 | 19 |
MariaDB | 10.11 | 11.4 |
Nginx | 1.22 | 1.26 |
OpenJDK | 17 | 21 |
OpenLDAP | 2.5.13 | 2.6.9 |
OpenSSL | 3.0 | 3.4 |
PHP | 8.2 | 8.4 |
Podman | 4.3 | 5.4 |
PostgreSQL | 15 | 17 |
Prometheus | 2.42 | 2.53 |
Puppet | 7 | 8 |
Python | 3.11 | 3.13 |
Rustc | 1.63 | 1.85 |
Vim | 9.0 | 9.1 |
Most of those, except toolchains (e.g. LLVM/GCC), can still change,
as we're not in the full freeze yet.
## Upgrade schedule
The upgrade is split into multiple batches:
- automation and installer changes
- low complexity: mostly TPA services and less critical Tails servers
- moderate complexity: TPA "service admins" machines and remaining
Tails physical servers and VMs running services from the official
Debian repositories only
- high complexity: Tails VMs running services not from the official
Debian repositories
- cleanup
The free time between the first two batches will also allow us to
cover for unplanned contingencies: upgrades that could drag on and
other work that will inevitably need to be performed.
The objective is to do the batches in collective "upgrade parties"
that should be "fun" for the team. This policy has proven to be
effective in the previous upgrades and we are eager to repeat it.
### Upgrade automation and installer changes
First, we tweak the installers to deploy Debian 13 by default to avoid
installing further "old" systems. This includes the bare-metal
installers but also and especially the virtual machine installers and
container images.
Concretely, we're planning on changing the `latest` container image
tag to point to `trixie` in early April. A full *year* later, the
`bookworm` container images will be retired. Note that we are already
planning the retirement of the "old stable" (`bullseye`) container
images, see [tpo/tpa/base-images#19], for which you may have
already been contacted.
New `idle` canary servers will be set up in Debian 13 to test
integration with the rest of the infrastructure, and future new
machine installs will be done in Debian 13.
We also want to work on automating the upgrade procedure
further. We've had catastrophic errors in the PostgreSQL upgrade
procedure in the past, in particular, but the whole procedure is now
considered ripe for automation, see [tpo/tpa/team#41485] for
details.
[tpo/tpa/base-images#19]: retire bullseye images (#19) · Issues · The Tor Project / TPA / base-images · GitLab
[tpo/tpa/team#41485]: automate major upgrades (#41485) · Issues · The Tor Project / TPA / TPA team · GitLab
### Batch 1: low complexity
This is scheduled over two weeks: TPA boxes will be upgraded in
the last week of April, and Tails in the first week of May.
The idea is to start the upgrade long enough before the vacations to
give us plenty of time to recover, and some room to start the second
batch.
In April, Debian should also be in "soft freeze", not quite a fully
"stable" environment, but that should be good enough for simple
setups.
35 TPA machines:
archive-01.torproject.org
cdn-backend-sunet-02.torproject.org
chives.torproject.org
dal-rescue-01.torproject.org
dal-rescue-02.torproject.org
gayi.torproject.org
hetzner-hel1-02.torproject.org
hetzner-hel1-03.torproject.org
hetzner-nbg1-01.torproject.org
hetzner-nbg1-02.torproject.org
idle-dal-02.torproject.org
idle-fsn-01.torproject.org
lists-01.torproject.org
loghost01.torproject.org
mandos-01.torproject.org
media-01.torproject.org
minio-01.torproject.org
mta-dal-01.torproject.org
mx-dal-01.torproject.org
neriniflorum.torproject.org
ns3.torproject.org
ns5.torproject.org
palmeri.torproject.org
perdulce.torproject.org
srs-dal-01.torproject.org
ssh-dal-01.torproject.org
static-gitlab-shim.torproject.org
staticiforme.torproject.org
static-master-fsn.torproject.org
submit-01.torproject.org
vault-01.torproject.org
web-dal-07.torproject.org
web-dal-08.torproject.org
web-fsn-01.torproject.org
web-fsn-02.torproject.org
4 Tails machines:
ecours.tails.net
puppet.lizard
skink.tails.net
stone.tails.net
In the [first batch of bookworm machines], we ended up taking 20
minutes per machine and finished in a single day, but we warned at the
time that the second batch took longer.
It's probably safe to estimate 20 hours (30 minutes per machine) for
the 39 machines here, in a single week.
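
The arithmetic behind that estimate can be sketched as follows. This is only an illustration in Python: the variable names are ours, and the machine counts and per-machine durations are the ones quoted above.

```python
# Machine counts from the lists above.
tpa_machines = 35
tails_machines = 4
machines = tpa_machines + tails_machines

# Pace observed in the first bookworm batch, and the padded pace used
# for this estimate (both in minutes per machine).
bookworm_minutes = 20
padded_minutes = 30

bookworm_pace_hours = machines * bookworm_minutes / 60
padded_hours = machines * padded_minutes / 60

print(f"{machines} machines: {bookworm_pace_hours:.1f}h at the bookworm pace, "
      f"{padded_hours:.1f}h with padding")
```

The padded figure (19.5 hours) rounds up to the 20 hours quoted above.
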
Feedback and coordination of this batch happens in [issue batch 1].
[first batch of bookworm machines]: bookworm upgrades, first batch (#41251) · Issues · The Tor Project / TPA / TPA team · GitLab
[issue batch 1]: Sign in · GitLab
### Batch 2: moderate complexity
This is scheduled for the last week of May for TPA machines, and the
first week of June for Tails.
At this point, Debian testing should be in "hard freeze", which should
be more stable.
40 TPA machines:
anonticket-01.torproject.org
backup-storage-01.torproject.org
bacula-director-01.torproject.org
btcpayserver-02.torproject.org
bungei.torproject.org
carinatum.torproject.org
check-01.torproject.org
ci-runner-x86-02.torproject.org
ci-runner-x86-03.torproject.org
colchicifolium.torproject.org
collector-02.torproject.org
crm-int-01.torproject.org
dangerzone-01.torproject.org
donate-01.torproject.org
donate-review-01.torproject.org
forum-01.torproject.org
gitlab-02.torproject.org
henryi.torproject.org
materculae.torproject.org
meronense.torproject.org
metricsdb-01.torproject.org
metricsdb-02.torproject.org
metrics-store-01.torproject.org
onionbalance-02.torproject.org
onionoo-backend-03.torproject.org
polyanthum.torproject.org
probetelemetry-01.torproject.org
rdsys-frontend-01.torproject.org
rdsys-test-01.torproject.org
relay-01.torproject.org
rude.torproject.org
survey-01.torproject.org
tbb-nightlies-master.torproject.org
tb-build-02.torproject.org
tb-build-03.torproject.org
tb-build-06.torproject.org
tb-pkgstage-01.torproject.org
tb-tester-01.torproject.org
telegram-bot-01.torproject.org
weather-01.torproject.org
17 Tails machines:
apt-proxy.lizard
apt.lizard
bitcoin.lizard
bittorrent.lizard
bridge.lizard
dns.lizard
dragon.tails.net
gitlab-runner.iguana
iguana.tails.net
lizard.tails.net
mail.lizard
misc.lizard
puppet-git.lizard
rsync.lizard
teels.tails.net
whisperback.lizard
www.lizard
The [second batch of bookworm upgrades] took 33 hours for 31
machines, so about one hour per box. Here we have 57 machines, so it
will likely take us 60 hours (or two weeks) to complete the upgrade.
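
For reference, here is a minimal sketch of that extrapolation, assuming the time per machine stays the same as in the second bookworm batch. The `extrapolate_hours` helper is hypothetical and linear scaling is a simplification; the figures are the ones quoted above.

```python
def extrapolate_hours(past_hours, past_machines, machines):
    """Scale a past batch's total time to a different machine count."""
    hours_per_machine = past_hours / past_machines
    return hours_per_machine * machines


# Second bookworm batch: 33 hours for 31 machines.
# This batch: 40 TPA machines plus 17 Tails machines.
machines = 40 + 17
estimate = extrapolate_hours(33, 31, machines)

print(f"{machines} machines, about {estimate:.0f} hours")
```

Linear scaling lands just under 61 hours, which is consistent with the rounded 60 hours above.
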
Feedback and coordination of this batch happens in [issue batch 2].
[second batch of bookworm upgrades]: bookworm upgrades, second batch (#41252) · Issues · The Tor Project / TPA / TPA team · GitLab
[issue batch 2]: Debian trixie upgrades, second batch (#42070) · Issues · The Tor Project / TPA / TPA team · GitLab
### Batch 3: high complexity
Those machines are harder to upgrade, or more critical. In the case of
TPA machines, we typically group here the Ganeti servers and all the
"snowflake" servers that are not properly Puppetized and are full of
legacy, namely the LDAP, DNS, and Puppet servers.
That said, we waited a long time to upgrade the Ganeti cluster for
bookworm, and it turned out to be trivial, so perhaps those could
eventually be made part of the second batch.
15 TPA machines:
- [ ] alberti.torproject.org
- [ ] dal-node-01.torproject.org
- [ ] dal-node-02.torproject.org
- [ ] dal-node-03.torproject.org
- [ ] fsn-node-01.torproject.org
- [ ] fsn-node-02.torproject.org
- [ ] fsn-node-03.torproject.org
- [ ] fsn-node-04.torproject.org
- [ ] fsn-node-05.torproject.org
- [ ] fsn-node-06.torproject.org
- [ ] fsn-node-07.torproject.org
- [ ] fsn-node-08.torproject.org
- [ ] nevii.torproject.org
- [ ] pauli.torproject.org
- [ ] puppetdb-01.torproject.org
It seems like the [bookworm Ganeti upgrade] took roughly 10h of
work. We ballpark the rest of the upgrade to another 10h of work, so
possibly 20h.
11 Tails machines:
- [ ] isoworker1.dragon
- [ ] isoworker2.dragon
- [ ] isoworker3.dragon
- [ ] isoworker4.dragon
- [ ] isoworker5.dragon
- [ ] isoworker6.iguana
- [ ] isoworker7.iguana
- [ ] isoworker8.iguana
- [ ] jenkins.dragon
- [ ] survey.lizard
- [ ] translate.lizard
The challenge with Tails upgrades is the coordination with the Tails
team, in particular for the Jenkins upgrades.
Feedback and coordination of this batch happens in [issue batch 3].
[bookworm Ganeti upgrade]: upgrade gnt-fsn Ganeti cluster to bookworm (#41254) · Issues · The Tor Project / TPA / TPA team · GitLab
[issue batch 3]: Debian trixie upgrades, third batch (#42069) · Issues · The Tor Project / TPA / TPA team · GitLab
### Cleanup work
Once the upgrade is completed and the entire fleet is again running a
single OS, it's time for cleanup. This involves updating configuration
files to the new versions and removing old compatibility code in
Puppet, removing old container images, and generally wrapping things
up.
This process has historically been neglected, but we're hoping to
finish it in 2026 at the latest.
## Timeline
- 2025-Q2
- W14 (first week of April): default container image changed to
`trixie`, installer defaults changed and first tests in
production
- W18 (last week of April): Batch 1 upgrades, TPA machines
- W19 (first week of May): Batch 1 upgrades, Tails machines
- W22 (last week of May): Batch 2 upgrades, TPA machines
- W23 (first week of June): Batch 2 upgrades, Tails machines
- 2025-Q3 to Q4: Batch 3 upgrades
- 2026-Q2: bookworm container image retired
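
To turn the week numbers above into calendar dates, here is a small Python sketch. It assumes the "W" numbers are ISO 8601 week numbers for 2025, which the timeline does not state explicitly. The `week_bounds` helper and the `BATCH_WEEKS` mapping are hypothetical names used only for this example.

```python
from datetime import date, timedelta

# Week numbers from the timeline above, assumed to be ISO 8601 weeks of 2025.
BATCH_WEEKS = {
    14: "container image and installer defaults",
    18: "Batch 1, TPA machines",
    19: "Batch 1, Tails machines",
    22: "Batch 2, TPA machines",
    23: "Batch 2, Tails machines",
}


def week_bounds(year, week):
    """Return the Monday and Sunday of an ISO week."""
    monday = date.fromisocalendar(year, week, 1)
    return monday, monday + timedelta(days=6)


for week, label in BATCH_WEEKS.items():
    monday, sunday = week_bounds(2025, week)
    print(f"W{week}: {monday} to {sunday} ({label})")
```

Under that assumption, W14 starts on Monday 2025-03-31 and W18 on Monday 2025-04-28.
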
## Deadline
The community has until the beginning of the above timeline to
raise concerns or objections.
Two weeks before performing the upgrades of each batch, a new
announcement will be sent with details of the changes and impacted
services.
# Alternatives considered
## Retirements or rebuilds
We do not plan any major upgrades or retirements in the third phase
this time.
In the future, we hope to decouple those as much as possible, as the
Icinga retirement and Mailman 3 became blockers that slowed down the
upgrade significantly for bookworm. In both cases, however, the
upgrades *were* challenging and had to be performed one way or
another, so it's unclear if we can optimize this any further.
We are clear, however, that we will not postpone an upgrade for a
server retirement. Dangerzone, for example, is scheduled for
retirement ([TPA-RFC-78]) but is still planned as normal above.
[TPA-RFC-78]: tpa rfc 78 dangerzone retirement · Wiki · The Tor Project / TPA / TPA team · GitLab
# Costs
Task               | Estimate | Uncertainty | Worst case |
-------------------|----------|-------------|------------|
Automation         | 20h      | extreme     | 100h       |
Installer changes  | 4h       | low         | 4.4h       |
Batch 1            | 20h      | low         | 22h        |
Batch 2            | 60h      | medium      | 90h        |
Batch 3            | 20h      | high        | 40h        |
Cleanup            | 20h      | medium      | 30h        |
**Total**          | 144h     | ~high       | ~286h      |
All of this should add up to a bit over 140 hours of work, or 18
days, or about 4 weeks full time. The worst case doubles that.
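
The totals in the hour-based table can be checked with a short Python sketch. Each worst-case figure appears to be the estimate times a fixed multiplier per label (1.1 for low, 1.5 for medium, 2 for high, 5 for extreme). Those multipliers are inferred from the rows of the table rather than stated in this proposal, so treat them as an assumption. The names below are hypothetical.

```python
# Multipliers inferred from the table rows (an assumption, see above).
MULTIPLIERS = {"low": 1.1, "medium": 1.5, "high": 2.0, "extreme": 5.0}

# (task, estimate in hours, label) copied from the table above.
TASKS = [
    ("Automation", 20, "extreme"),
    ("Installer changes", 4, "low"),
    ("Batch 1", 20, "low"),
    ("Batch 2", 60, "medium"),
    ("Batch 3", 20, "high"),
    ("Cleanup", 20, "medium"),
]


def worst_case(estimate, label):
    """Scale an estimate by the multiplier attached to its label."""
    return estimate * MULTIPLIERS[label]


total_estimate = sum(hours for _, hours, _ in TASKS)
total_worst = sum(worst_case(hours, label) for _, hours, label in TASKS)

print(f"estimate: {total_estimate}h, worst case: {total_worst:.1f}h")
print(f"ratio: {total_worst / total_estimate:.2f}")
```

The ratio comes out just under 2, which matches the "~high" label on the total row.
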
The above is done in "hours" because that's how we estimated batches
in the past, but here's an estimate that's based on the [Kaplan-Moss
estimation technique].
[Kaplan-Moss estimation technique]: My Software Estimation Technique - Jacob Kaplan-Moss
Task               | Estimate | Uncertainty | Worst case |
-------------------|----------|-------------|------------|
Automation         | 3d       | extreme     | 15d        |
Installer changes  | 1d       | low         | 1.1d       |
Batch 1            | 3d       | low         | 3.3d       |
Batch 2            | 10d      | medium      | 20d        |
Batch 3            | 3d       | high        | 6d         |
Cleanup            | 3d       | medium      | 4.5d       |
**Total**          | 23d      | ~high       | ~50d       |
This is *roughly* equivalent to the hour-based estimate, if a little
higher (23 days instead of 18).
It should be noted that automation is not expected to drastically
reduce the total time spent in batches (currently 16 days or 100
hours). The main goal of automation is rather to reduce the likelihood
of catastrophic errors, and to make it easier to share our upgrade
procedure with the world. We're still hoping to reduce the time spent
in batches by 10-20%, which would bring the total across batches from
16 days to 14 days, or from 100 hours to 80 hours.
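
As a quick check of that range, the following Python sketch applies a 10% and a 20% saving to the batch totals taken from the two tables above. The `after_saving` helper is a hypothetical name used only for illustration.

```python
# Batches 1 to 3, from the hour-based and day-based tables above.
batch_hours = 20 + 60 + 20
batch_days = 3 + 10 + 3


def after_saving(amount, percent):
    """Return what is left of `amount` once `percent` percent is saved."""
    return amount * (100 - percent) / 100


for percent in (10, 20):
    hours = after_saving(batch_hours, percent)
    days = after_saving(batch_days, percent)
    print(f"{percent}% saved: {hours:.0f} hours, {days:.1f} days")
```

The figures quoted above sit inside that range: 80 hours is the 20% case, while 14 days is closer to a 12.5% saving.
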
# Approvals required
This proposal needs approval from TPA team members, but service admins
can request additional delay if they are worried about their service
being affected by the upgrade.
Comments or feedback can be provided in issues linked above, or the
general process can be commented on in issue [tpo/tpa/team#41990].
# References
* [Debian 13 trixie upgrade milestone]
* [discussion ticket][tpo/tpa/team#41990]
[TPA bookworm upgrade procedure]: bookworm · Wiki · The Tor Project / TPA / TPA team · GitLab
[tpo/tpa/team#41990]: TPA-RFC-80: (make a) Debian trixie upgrade plan (#41990) · Issues · The Tor Project / TPA / TPA team · GitLab
--
Antoine Beaupré
torproject.org system administration