One of my main goals as a relay operator is to diversify the tor network. As this can be done in many different ways (different providers, countries, tor versions etc.), I thought I’d give hosting a high-speed relay on Windows a try, as they are super rare (only ~0.1% of all current relays run Windows). Here is my experience going down that path.
Hardware:
Intel Core i7-2600k (4C/8T)
16GB RAM
240 GB SATA SSD
onboard Gbit Realtek NIC
Gigabit symmetrical internet connection (hosted at a DC in the US)
Software:
Windows Server 2016 Standard running “bare metal”
Tor 0.4.6.9 (git-ea2ada6d1459f829)
Installation: Super easy! (If you don’t forget about Windows Firewall)
The setup process for Windows is super fast and painless. You basically create a non-admin user account, download the latest tor binary, customize your torrc configuration file, configure your Windows Firewall and you are good to go. This guide on the Tor project’s site can be helpful for beginners, but it sadly does not mention Windows’ integrated Firewall at all. Windows Firewall is enabled by default in all more recent versions of Windows (for good reason!) and you will need to create new inbound rules for your tor relay(s) or bridge(s) to be reachable from the outside. Maybe @gus can pick it up to improve the guide in the future?
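To check whether the firewall rule actually took effect, a quick TCP connection test from a different machine can help. The following is only a minimal sketch in Python: the function name `port_is_reachable` is made up for illustration, and the address and port in the usage comment are placeholders, not values from my setup.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, which is what a
        # missing inbound firewall rule would prevent.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example call (hypothetical address and ORPort, replace with your own):
#   port_is_reachable("203.0.113.10", 9001)
```

Run it from outside the server's own network. A test from the relay itself usually does not pass through the same inbound rules, so it says little about reachability.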
Stability: No issues
I have been running the relays for a couple of days and can report no issues. I copied over some known fingerprints to start with, so I had a fair bit of traffic right from the get-go.
Performance: Pretty poor
There is no way around it, the actual performance on Windows is pretty poor. Running a total of 4 relays on the server (across multiple IPs), I seem to be stuck with a total throughput between 250 Mbit/s and 300 Mbit/s (combining both upload and download). All 4 instances of tor pretty much max out a CPU thread each all the time, and I don’t see the server scaling much even if I run 2 additional instances to make better use of all 8 threads this CPU has to offer.
For comparison, I ran the same system previously with Ubuntu and could easily achieve double the throughput with some CPU capacity to spare.
For running high-speed nodes, Windows does not appear to be a great choice then. It might do very well for bridges, though I have not tried this (yet).
My next steps: Keep it running
Even though the performance is overall disappointing, I will keep the server running on Windows for at least a couple of months (if it continues to be stable). I encourage others to also give it a go, especially if you don’t plan to max out a gigabit connection with your relays. Let’s get that 0.1% Windows share up!
Neat! Three more things for you to learn about / think about, for Windows relays:
(A) It is probably wise that you picked a version of Windows with the word ‘Server’ in the name, because of the ancient but maybe still applicable bug #98, “WSAENOBUFS: Running out of buffer space on Windows” (in the Tor Project’s core Tor issue tracker on GitLab). There, Windows reserves space in the non-paged buffer pool for every outstanding network operation, and if that pool goes empty, things throughout Windows start failing at random. Windows versions with the word ‘server’ in their name start with a larger non-paged pool, so they don’t run out of space as quickly. It’s possible this bug is entirely solved by Windows now (it’s been some years), or it’s possible it still happens and people have just learned not to try to run big Tor relays on small Windowses. Keep an eye on your relay’s logs in case it gives you hints about problems.
(B) On Linux/BSD/etc, the libevent library picks a smart scalable version of select(), such as kqueue or epoll or the like. That is, Tor is good at handling many connections at once, because the underlying operating system is good at it too. On Windows, I wonder what operation libevent picks. Is it select()? I worry it is one of the ones that scales poorly, meaning Tor would spend a lot of its cpu time inefficiently deciding which conn to handle next.
(C) Check out the “Kernel-informed socket transport” (KIST) design, listed in Free Haven's Selected Papers in Anonymity. This is a feature we put into Tor 0.3.2 which makes use of a feature in the Linux kernel that lets you learn how full a given outbound TCP connection actually is, at the network layer, which lets you make smarter scheduling choices about which cells to try sending next. In some network simulations, KIST makes a big performance difference for fast Tor relays, and the feature doesn’t exist on Windows.
Those three might not be the reason, or might not be all of the reason, but they are definitely things to learn about. Thanks for running relays!
Thank you for writing about your experiences with running a Tor relay on Windows. It is something we have minimal experience with internally as well, unfortunately.
Regarding the performance, I believe we have never done proper profiling of the relay paths on Windows to discover where we spend most of our time, but there are certain things that I think we can assume here:
Libevent falls back to using select() in Tor on Windows. This fallback is usually not terrible, but select() isn’t the preferred main loop primitive on the platform. Windows’ IOCP would be ideal, as it is much more performant in general. But moving to IOCP would also require a substantial change to Tor: replacing the event loop library with one that provides APIs for TCP/UDP directly, rather than the more low-level approach of just handling “events” that libevent takes. We already do some naughty mixture of the two primitives in Tor for how we handle Pluggable Transports, as select() on Windows doesn’t let us add Unix domain sockets or fifos. Unfortunately, none of that is related to performance.
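As an analogy for how an event library picks its backend per platform, Python's standard `selectors` module does something similar to what is described above. This is only an illustrative sketch and not libevent or Tor code; the helper name `default_selector_name` is made up here.

```python
import selectors


def default_selector_name() -> str:
    """Name of the I/O multiplexing backend Python would pick on this OS."""
    # DefaultSelector is an alias for the most efficient implementation
    # available on the current platform.
    return selectors.DefaultSelector.__name__


print(default_selector_name())
```

On Linux this typically reports an epoll-based selector and on the BSDs and macOS a kqueue-based one, while on Windows only the select()-based implementation is available in that module. That mirrors the situation with libevent described above.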
I doubt we will be spending a lot of time internally on investigating performance here since the plan in Tor right now is to let our Rust implementation of Tor, Arti, take over for the different features that the current C codebase provides. Arti will eventually include relay support, and Arti is already using an event model that will allow us to exploit the benefits of platform interfaces such as IOCP (and also io_uring in the Linux kernel).
If you want to dive further into this, we would be interested in seeing some profiling happen on your relay if you have experience with doing that. It usually requires a build of the binary where the symbols are available so that the profiling tool can read them. I know MinGW on Windows uses the usual Unix style symbol handling, which is unsupported in many Windows tools that work with binaries for analysis. If you can build Tor on Windows with Clang, it should be possible to do profiling in production, but I have yet to try that myself.
Could this be interesting to you? The next step would be to generate flame graphs that would allow us to see if there is any prominent low-hanging fruit to optimize.
I would love to help, but I really don’t have any recent or relevant experience. Last time I compiled something on Windows and did some basic software performance analysis was when I was still in university… well over a decade ago.
What I can offer however is to spin up a similar server running Windows Server 2016, transfer some of my fingerprints over and hand over the access information to the tor project to do this investigating / testing. Let me know if you are interested, as it will take me a day or two to set this up.
Thank you! I won’t have time to dive into this in the near future, but I do know our Network Health team have an interest in having access to stable, performant Windows relays. I will let @GeKo join in here to see if this is useful for his team. This would allow us to inspect whether there are obvious issues affecting either the network or the individual relay in certain situations.
Yes, relay support is probably only realistic in a timeframe like that. We do hope it will be closer to 2023 than later though, but software engineering is a surprising craft at times.
I think we won’t have the time to investigate the performance issues of Windows relays, alas. So, I think for that purpose there is no need to spin up Windows relays for us. ahf is right, though, that I still have on my radar running a Windows relay the other day for network health purposes, figuring out potential Windows-only issues in Tor etc. I need to think more what I want to get out of such a relay in particular and then reserve some time to actually work with it. I am not there yet…
So, how about I keep your generous offer in mind and get back to you once I am actually set to make use of this donation of yours, and then we take it from there?
I wanted to share my observations and follow-up with two questions.
I have rotated a relay between BuyVM’s 8GB 2-core “slice” running Windows Server 2016, a 16GB i5 machine at the house running Windows 2016, and presently Windows 10 on an 8-core i7-9700K.
In all cases, I’m seeing a connection load of around 3000 connections to other systems.
How normal a load is this after three weeks?
The BuyVM solution is up and marked as stable and has picked up guard a few times. It showed overloaded last night with CPU for the process at roughly 30%.
Currently, there are 2800 connections with Tor taking around 20% CPU. That’s with two cores.
The Windows 10 machine has a dedicated NIC and is also a Plex server so it gets a workout.
Tor is at 2.7% CPU and 257MB of RAM. At the moment it is pushing ~600 B/sec each way but I’ve seen it well into the 700s.
The gateway/router from the ISP is AT&T’s standard Nokia BGW320-505 running at around 2450 Mbps, with 4156 of 8192 possible entries used in the gateway’s NAT table.
I don’t want to climb to the top with connections, so that I keep some headroom for other devices.
I’m just wondering how normal my situation looks compared to other platforms and the OP after three weeks.
My second question is about the firewall documentation that is open and refers to @gus here in the forum.
Is that assigned or is it up for grabs?
I can knock it out if needed after I figure out how that system works.
I just ran a quick “netstat -a -n” and it lists around 24,000 “ESTABLISHED” connections for my server running four tor relays in parallel.
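For anyone who wants to compare numbers, the count can be scripted instead of read off the screen. This is a rough Python sketch that assumes the usual netstat layout where the TCP state is the last column; the function name and the sample rows are invented for illustration and are not output from my server.

```python
def count_established(netstat_output: str) -> int:
    """Count TCP rows whose state column reads ESTABLISHED."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # A TCP row has at least: protocol, local address, foreign address, state.
        if (
            len(fields) >= 4
            and fields[0].upper().startswith("TCP")
            and fields[-1] == "ESTABLISHED"
        ):
            count += 1
    return count


# Made-up sample rows in the style of "netstat -a -n" on Windows.
SAMPLE = """
  Proto  Local Address          Foreign Address        State
  TCP    0.0.0.0:9001           0.0.0.0:0              LISTENING
  TCP    198.51.100.7:9001      203.0.113.5:44321      ESTABLISHED
  TCP    198.51.100.7:9001      203.0.113.9:51870      ESTABLISHED
  UDP    0.0.0.0:123            *:*
"""

print(count_established(SAMPLE))
```

To feed it live data, the text of `netstat -a -n` can be captured with `subprocess.run(["netstat", "-a", "-n"], capture_output=True, text=True).stdout`. With four relays on one machine the total covers all instances together.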
As for the general experience, I had one crash of all tor instances running on that server around a week ago for no apparent reason, but I could just restart them and keep going. If it happens again in around two weeks, I will start investigating.
I have never really seen the number of established connections as a challenge of any kind, but then again I am running all of my Tor nodes in DC environments.
My biggest limitation running Tor on Windows is the CPU load. While Windows manages to push 250-300 Mbit/s at ~70-80% CPU usage, I have another server with the same specs in the same DC currently pushing around 400 Mbit/s combined with less than 15% total CPU usage.
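To put those figures side by side, here is a back-of-the-envelope calculation of throughput per percent of CPU. It is only a rough sketch using the numbers quoted above; the helper name is made up, and since the other server sits at "less than 15%", its value is a lower bound.

```python
def mbit_per_cpu_percent(throughput_mbit: float, cpu_percent: float) -> float:
    """Throughput delivered per percentage point of total CPU usage."""
    return throughput_mbit / cpu_percent


# Windows server: 250-300 Mbit/s at roughly 70-80% CPU.
windows_best = mbit_per_cpu_percent(300, 70)   # most favourable reading
windows_worst = mbit_per_cpu_percent(250, 80)  # least favourable reading

# Other server: around 400 Mbit/s at under 15% CPU, so this is a minimum.
other_min = mbit_per_cpu_percent(400, 15)

print(f"Windows: {windows_worst:.2f} to {windows_best:.2f} Mbit/s per CPU %")
print(f"Other server: at least {other_min:.2f} Mbit/s per CPU %")
print(f"Gap: at least {other_min / windows_best:.1f}x")
```

Even with the most favourable reading for Windows, the other server moves several times more traffic per unit of CPU.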
I ran similar tests 4 years ago, and nothing has changed since then.
If I remember correctly, I did not figure out how to properly profile kernel mode calls.
But it was not needed, since, almost for sure, the problem was in the select() call.
So yes, it looks like moving to IOCP is the only solution.