Hi. I wonder if anyone can offer some advice. I have a few VPS middle/guard relays running which are experiencing intermittent issues. They each have 1 vCPU and 1 GB of memory, an advertised bandwidth of around 18 MB/s and a decent consensus weight of around 25K+. The issue is that they keep losing flags. They slowly gain the Guard, Stable and HSDir flags and then, when I next check, they are back to the core four: Fast, Running, V2Dir and Valid. They also occasionally pick up the StaleDesc flag. I don’t, however, see the ‘overloaded’ indicator.
I have Uptime Robot set up to monitor them, and the dropping of flags appears to coincide with periods of unavailability where Uptime Robot sends an alert and I lose access via SSH. The hosting provider does offer console access, and I can still log into the relay that way, just not over SSH. When I log in via the remote console I can also see that tor.service is running and traffic is still flowing. Rebooting the relay, or just waiting, tends to fix the issue.
My gut feeling is that my relays are being subjected to DDoS attacks and can’t cope due to the single core and 1 GB of memory, but I can’t really pin it down to that. My hosting provider frustratingly won’t allow me to add more cores dynamically.
Things I’ve tried:
- Undertaken all the steps on the ‘My relay or bridge is overloaded, what does this mean?’ support page.
- Made the Enkidu-6 changes to iptables.
- Closed all ports apart from 22 and 443 on the VPS-level firewall and switched SSH to key-pair authentication.
- Enabled the metrics port (torrc sketch below).
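For reference, the relevant torrc lines look something like this (the bind address and port are just the example values from the Tor manual; use whatever you actually configured):

```
# torrc: expose relay metrics in Prometheus format, on localhost only
MetricsPort 127.0.0.1:9035 prometheus
# allow only the local host to scrape it
MetricsPortPolicy accept 127.0.0.1
```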
When I look at the output from the tor metrics port, I have to admit I’m a little overwhelmed by the volume of information. There are a few ‘tor_relay_congestion_control’ line items which tend to have non-zero values next to them, but there don’t appear to be any OOM issues. The tor_relay_dos_total figures from a recent query are below.
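For context, the figures come from a plain grep over the Prometheus-format dump, roughly like this (assuming the MetricsPort settings sketched above):

```
# pull only the DoS and congestion-control counters out of the full metrics dump
curl -s http://127.0.0.1:9035/metrics | grep -E 'tor_relay_dos|congestion'
```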
I think the high number of introduce2 rejections is indicative of a DDoS attack? I’ve also seen that figure significantly higher at other times. So, my questions:
- Is there any way to diagnose the root cause of my issue more definitively?
- Is there anything else I can do to defend myself against DDoS (if that is the issue)?
- Is there anyone who would be kind enough to look at my metrics file output and verify if there’s anything else going on?
How do you monitor your servers? DDoS attacks happen, but it has been quiet for a while now for me. Unless you have proper monitoring, you won’t be able to say what is happening; it might just be the Tor network’s bandwidth scanners testing your capacity. I personally use Netdata.
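If you want to try it, the usual install is a one-liner along these lines (assuming the kickstart URL is still the current one; check the Netdata docs before running it):

```
# download and run the official Netdata kickstart installer
wget -O /tmp/netdata-kickstart.sh https://get.netdata.cloud/kickstart.sh && sh /tmp/netdata-kickstart.sh
```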
I’d keep it as simple as possible. You’ve now installed iptables rules without knowing whether it’s actually DDoS, and you’ve applied the ‘overloaded’ steps even though Tor doesn’t report being overloaded. You’re introducing variables that have unknown effects on your relay’s behaviour while you’re still blind to the root cause of your issue.
If you’re having stability issues, I’d first increase your RAM or limit the advertised bandwidth. Again, these decisions are best made with proper monitoring in place, so you can see what the actual behaviour and bottleneck are. It might just be the connection in the datacenter itself; you won’t know unless you have proper monitoring. When your relay obtains the Guard flag, the traffic pattern changes, and that may stress the connection so much that you quickly lose the flag again.
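Limiting it is just a couple of torrc lines; something like this (the numbers are only illustrative starting points for a 1 vCPU / 1 GB box, not a recommendation):

```
# torrc: cap relay traffic below what the VPS can comfortably sustain
RelayBandwidthRate 10 MBytes
RelayBandwidthBurst 12 MBytes
# keep tor's internal queues well inside the 1 GB of RAM
MaxMemInQueues 512 MB
```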
My relay, with an Advertised Bandwidth of 6 MiB/s, uses ~800 MiB of RAM.
A faster relay probably needs more.
I suggest you chart RAM usage over time somehow.
However, I have little experience with VPSes and Linux, so I can’t say much more.
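One low-overhead way to get such a chart on a small VPS is to log the tor process’s memory use to a CSV and plot it later; a minimal sketch, assuming the process is simply named tor:

```
#!/bin/sh
# append one timestamped row per minute with tor's resident memory (KiB)
while true; do
    printf '%s,%s\n' "$(date -Is)" "$(ps -C tor -o rss= | awk '{s+=$1} END {print s}')" >> tor-rss.csv
    sleep 60
done
```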
Thanks both for your advice. I had wrongly assumed that the Tor advice for ‘tuning sysctl for your network, memory and CPU load’ and the Enkidu-6 steps were considered good practice. I imagine it would be hard for me to unpick what I’ve done there.
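For anyone trying to unpick the same thing later, listing the persistent overrides is at least a starting point (paths are the usual Debian/Ubuntu ones):

```
# show every persistent sysctl override currently on disk
grep -r . /etc/sysctl.conf /etc/sysctl.d/ 2>/dev/null
# compare any key of interest against the live value
sysctl net.core.somaxconn
```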
I tried to install Netdata and my VPS completely sh*t the bed, perhaps because it is already running pretty close to the wire with regards to resources. It was showing 100% CPU, but I can’t rule out that being due to the strain of running Netdata itself, as it became almost completely unresponsive. It took me an hour to uninstall it.
I think I will take a simpler approach and bring the max bandwidth right down to see if that has an effect. If anyone out there has a particular skill for reading tor metrics output, however, I’d still appreciate it if someone would be kind enough to take a look for me. I don’t seem to be getting OOM warnings, which is what I imagine I’d see if it were a resource issue? And I would still like to know what the huge number of INTRODUCE2 rejections I often see relates to. Cheers
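For reference, tor’s out-of-memory handling shows up in two places, and both are easy to check; a rough sketch (the log path and the exact metric and message wording are assumptions and vary by distro and tor version):

```
# any OOM-related counters exposed on the MetricsPort?
curl -s http://127.0.0.1:9035/metrics | grep -i oom
# has tor's internal OOM handler ever complained in the log?
grep -i 'low on memory' /var/log/tor/notices.log
```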
Just as an update, I took the advice on board and had a closer look at the performance of my relays using Glances, and I think I got to the bottom of it. When I enabled the connections plugin, I could see that I was maxing out. I also checked syslog and it was full of ‘nf_conntrack: table full, dropping packet’ errors. My net.netfilter.nf_conntrack_max value was set very low, at just over 7,000. This seemed to be causing the relay to become unresponsive, and since I increased the figure everything seems a lot happier: no more Uptime Robot alerts.
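For anyone hitting the same wall, checking and raising the limit looks roughly like this (262144 is only an example value; size it to the RAM you have):

```
# how close is the connection-tracking table to its ceiling right now?
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# raise the ceiling persistently and reload
echo 'net.netfilter.nf_conntrack_max = 262144' | sudo tee /etc/sysctl.d/99-conntrack.conf
sudo sysctl --system
```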
One of the relays did, however, lose flags again, and when I checked syslog it was flooded with [UFW BLOCK] entries which, as I understand it, may indicate a DDoS attempt, right? I had seen those before, and I think I rather blindly assumed that all my woes were down to DDoS. I’ll try to act more like Columbo in the future and make assumption my enemy.
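If it does turn out to be hostile traffic, counting the source addresses behind the [UFW BLOCK] lines at least shows whether it’s one noisy host or a genuinely distributed flood (this assumes rsyslog is writing to /var/log/syslog):

```
# tally the top source IPs among the blocked packets
grep 'UFW BLOCK' /var/log/syslog | grep -o 'SRC=[0-9a-fA-F.:]*' | sort | uniq -c | sort -rn | head
# optionally turn the log noise down without changing any rules
sudo ufw logging low
```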