Seeking Insights on Enhancing Privacy in Tor Bridge Distribution Mechanism

Hello everyone,

I’m a member of a cybersecurity research group, and we are currently diving deep into the security issues related to the Tor bridge distribution mechanism. It has come to our attention that the current bridge distribution system might be exposing the bridge choices of users to BridgeDB. This could potentially lead to privacy concerns on the user end.

To address this issue, we’ve conceptualized a “condition-based PIR scheme.” This would allow users to secretly obtain their expected bridges from BridgeDB without revealing their query conditions (i.e., bandwidth/uptime/country). Simultaneously, BridgeDB wouldn’t be aware of the specific bridges returned to the user. I’d like to point out, however, that the finer details of this scheme are still under wraps as our research is ongoing.

However, since BridgeDB remains unaware of the user’s query conditions and the bridges they obtain, it is difficult to implement a load-balancing strategy in our scheme. The problem is aggravated because most users will likely request the same or similar bridges (high bandwidth and long uptime), which overcrowds those bridges. I’d appreciate it if anyone could share their thoughts or potential solutions to this challenge.

Looking forward to hearing your valuable insights.

Efficiency often comes at the expense of privacy. For instance, if query conditions are obscured using your method, the BridgeDB system would be unable to maintain a comprehensive view of how Tor bridges are being used, let alone perform load balancing.

The central challenge, then, is to reconstruct this comprehensive view while still keeping the relevant data concealed. Since you’re already employing Private Information Retrieval (PIR) techniques to obscure this data, one approach could be to reverse the roles of the two parties involved. Specifically, the bridge could act as the client and periodically query each user to determine which bridges they are connecting to.

Alternatively, this issue could be framed as identifying “heavy hitters,” as described in the paper “Lightweight Techniques for Private Heavy Hitters” by Dan Boneh, Elette Boyle, Henry Corrigan-Gibbs, Niv Gilboa, and Yuval Ishai, presented at the IEEE Symposium on Security and Privacy in 2021.
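To make the "private usage statistics" idea concrete, here is a minimal sketch of a privately aggregated histogram. It is not the incremental DPF construction from the heavy-hitters paper; it swaps in plain additive secret sharing, which is simpler but sends one value per bridge. It assumes two non-colluding aggregation servers and honest clients, and every name in it (`share_one_hot`, `MODULUS`, the threshold) is hypothetical.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative modulus; all share arithmetic is done mod this value


def share_one_hot(bridge_index, num_bridges):
    """Client: split a one-hot vector marking `bridge_index` into two additive shares."""
    share_a = [secrets.randbelow(MODULUS) for _ in range(num_bridges)]
    share_b = [
        ((1 if i == bridge_index else 0) - share_a[i]) % MODULUS
        for i in range(num_bridges)
    ]
    return share_a, share_b  # each share alone looks uniformly random


def accumulate(total, share):
    """Server: add one client's share into the running total."""
    return [(t + s) % MODULUS for t, s in zip(total, share)]


def combine(total_a, total_b):
    """Adding both servers' totals reveals only the aggregate counts."""
    return [(a + b) % MODULUS for a, b in zip(total_a, total_b)]


def heavy_hitters(counts, threshold):
    """Indices of bridges used by at least `threshold` clients."""
    return [i for i, count in enumerate(counts) if count >= threshold]


def demo():
    num_bridges = 5
    client_choices = [2, 2, 4, 2, 0, 4]  # made-up data: which bridge each client uses
    total_a = [0] * num_bridges
    total_b = [0] * num_bridges
    for choice in client_choices:
        share_a, share_b = share_one_hot(choice, num_bridges)
        total_a = accumulate(total_a, share_a)
        total_b = accumulate(total_b, share_b)
    return combine(total_a, total_b)
```

With the made-up data above, `demo()` yields per-bridge counts from which `heavy_hitters(counts, 2)` picks out the crowded bridges, while neither server on its own learns any individual client's bridge. A real deployment would also need to reject malformed client inputs, which this sketch does not attempt.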

It’s worth noting that this is a preliminary proposal and requires further elaboration. Additionally, implementing such a system could significantly impact server performance. Therefore, practical considerations like query frequency and data size must be carefully evaluated before deployment.

Maybe we could split BridgeDB into several sub-DBs and distribute the query requests over these sub-DBs. However, this may result in a smaller anonymity set, which may have negative effects on protecting users’ privacy.

Another perspective to consider is to adopt a dynamic load-balancing technique that isn’t solely dependent on specific query conditions but instead relies on real-time metrics. By using metrics like connection requests, traffic density, and bridge occupancy, we could build a decentralized mechanism where each bridge, upon reaching a threshold, can signal to the BridgeDB (or the multiple sub-DBs, as suggested in Reply 2) to adjust its availability status. This way, new users querying the BridgeDB would be less likely to receive overloaded bridges in their results.
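A rough sketch of the signaling idea, assuming each bridge knows its own load and the distributor only tracks an availability flag per bridge. The class names, the capacity field, and the 0.8 threshold are illustrative assumptions, not part of BridgeDB or rdsys.

```python
from dataclasses import dataclass, field


@dataclass
class Bridge:
    name: str
    capacity: int            # hypothetical maximum number of concurrent clients
    active_clients: int = 0

    def is_overloaded(self, threshold: float = 0.8) -> bool:
        # Overloaded once the load reaches the given fraction of capacity.
        return self.active_clients >= threshold * self.capacity


@dataclass
class Distributor:
    # Names of bridges currently eligible to be handed out.
    available: set = field(default_factory=set)

    def handle_signal(self, bridge_name: str, overloaded: bool) -> None:
        # The signal carries only a bridge name and one bit, no client data.
        if overloaded:
            self.available.discard(bridge_name)
        else:
            self.available.add(bridge_name)


def report_all(bridges, distributor, threshold=0.8):
    """Each bridge reports its own status to the distributor."""
    for bridge in bridges:
        distributor.handle_signal(bridge.name, bridge.is_overloaded(threshold))


bridges = [Bridge("alpha", 100, 95), Bridge("beta", 100, 40), Bridge("gamma", 50, 45)]
distributor = Distributor()
report_all(bridges, distributor)
print(sorted(distributor.available))  # ['beta']
```

The point of the design is that the distributor never needs to see queries or assignments: load information flows from the bridges themselves, so it stays compatible with a scheme that hides both from the distributor.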

Moreover, another angle could be to introduce a stochastic element into the bridge return mechanism. While a user might express preference for certain bridges based on criteria like bandwidth and uptime, the system could still introduce some randomness, returning a mix of high-demand and less-demanded bridges. This can achieve two things: it reduces the predictability of the bridge return mechanism, thus enhancing privacy, and it helps distribute the load more evenly across the network.
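As a small illustration of the mixing idea, the sketch below returns some top-ranked bridges plus a random sample of the remaining ones. It assumes that whoever runs the selection can see a ranked candidate list, which may not hold if the distributor is blind to the query; the function name and the `explore_fraction` parameter are made up for this example.

```python
import random


def pick_bridges(ranked, k, explore_fraction=0.5, rng=None):
    """Return k bridges: the best-ranked ones plus a random sample of the rest.

    `ranked` is ordered best first; `explore_fraction` (between 0 and 1) is the
    share of the k slots filled at random from the bridges outside the top picks.
    """
    rng = rng or random.Random()
    k = min(k, len(ranked))
    num_random = int(k * explore_fraction)
    num_top = k - num_random
    top = ranked[:num_top]
    rest = ranked[num_top:]
    return top + rng.sample(rest, num_random)


candidates = ["a", "b", "c", "d", "e", "f"]  # hypothetical ranking, best first
print(pick_bridges(candidates, k=3))
```

With `k=3` and the default fraction, two slots go to the top-ranked bridges and one slot is drawn uniformly from the other four, so less popular bridges still receive some traffic.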

It’s important to recognize the inherent trade-offs between ensuring user privacy and maintaining an efficient load-balancing mechanism. Hence, ongoing iterations, simulations, and user feedback will be crucial to refining such an approach.

Thank you for your insightful feedback. Your point about the balance between efficiency and privacy is spot on, and it’s one of the dilemmas we’re grappling with.

Reversing the roles of the two parties, as you’ve mentioned, is an interesting approach. Having the bridge act as the client to query each user may mitigate some of our load-balancing concerns. However, I wonder if that would introduce new vulnerabilities, given that the bridge would now initiate contact with users, potentially revealing their use of the service.

Your reference to the “heavy hitters” approach is highly relevant. I can see how the methodologies described therein could be adapted to our context. We’ll delve deeper into this and explore how it might be integrated into our existing scheme.

I appreciate your caution on the practical considerations, especially regarding server performance. We will keep those parameters in mind as we advance in our research.

Thanks again for your insights.

Thanks for the idea. We will think about it.

Hi!

Great, which group and what’s your research about? Does your research group have a website where we can learn about your other projects?

Sorry gus, I can’t share that right now because the paper is under review and the venue enforces double-blind review. I will post updates here as soon as the review cycle is finished.

I appreciate this insight in enhancing Tor.

Load balancing is clearly something to consider when putting this idea into practice. Based on your description, I recommend searching ePrint for related work using “oblivious load balancing” as a keyword.

I hope this tiny suggestion will be helpful to you.

Still, thank you for your interest in our project.

Letting the bridge signal the BridgeDB may be a good idea. We will investigate it in our work.

Introducing randomness is hard to implement since the BridgeDB is unaware of which bridges users choose. Still, we will investigate it.

Thank you for your advice :+1:

Thank you for researching how to improve BridgeDB. We are already working on some of the ideas you mention by implementing Lox: https://www.petsymposium.org/2023/files/papers/issue1/popets-2023-0029.pdf

BTW, note that BridgeDB as software is in the process of being deprecated in favor of rdsys: The Tor Project / Anti-censorship / rdsys · GitLab

Thanks for mentioning your ideas on this topic!

We’d like to study the feasibility of integrating our work with LOX and rdsys. Hopefully we can contribute something to Tor’s improvement as well.

Double-blind review does not mean you are not allowed to talk about your project at all before it is accepted and published. Not talking about your work with knowledgeable experts ahead of time can lead to poor work and mistaken conclusions. I hope you will reconsider—asking questions on a forum like this (and even sharing preprints) is absolutely in bounds and does not violate any peer review requirements.

I’m confused by what you said about “query conditions (i.e., bandwidth/uptime/country).” As far as I know, BridgeDB does not have such query conditions. https://bridges.torproject.org/options/ offers the options of transport and IPv6, that’s all. Is it some other interface you are dealing with?

Dear dcf,

Thanks for the comments. The reason why I mentioned the double-blind review policy is that currently I can’t answer gus’s question about introducing my research group. Of course, I can provide more details about our design.

Currently, Tor users retrieve bridges from BridgeDB via web/email/etc. This could potentially lead to privacy concerns on the user end since the choices of bridges are revealed to BridgeDB. Our motivation is to design a PIR-like system for Tor users to secretly retrieve the bridges from BridgeDB without leaking their choices.

Specifically, PIR systems allow users to privately retrieve records from the DB with indices. However, the contents in BridgeDB are concealed due to censorship-evading reasons, meaning that Tor users will not be able to know the indices of the bridges they want.
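For readers unfamiliar with index-based PIR, here is a toy version of the classic two-server XOR scheme. It is a sketch under assumptions that are not part of our design: two non-colluding servers hold identical copies of the database, all records have the same length, and the client already knows the index it wants. Distributed Point Functions can be seen as a way to compress the long query vectors used here into short keys.

```python
import secrets


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_queries(index, num_records):
    """Client: build two bit vectors whose XOR is one-hot at `index`.

    Each vector on its own is uniformly random, so neither server learns the index.
    """
    query_a = [secrets.randbelow(2) for _ in range(num_records)]
    query_b = list(query_a)
    query_b[index] ^= 1
    return query_a, query_b


def answer(database, query):
    """Server: XOR together the records selected by the query bits."""
    result = bytes(len(database[0]))
    for record, bit in zip(database, query):
        if bit:
            result = xor_bytes(result, record)
    return result


def reconstruct(answer_a, answer_b):
    """Client: every record except the wanted one cancels out."""
    return xor_bytes(answer_a, answer_b)


database = [b"bridge-0", b"bridge-1", b"bridge-2", b"bridge-3"]  # placeholder records
query_a, query_b = make_queries(2, len(database))
print(reconstruct(answer(database, query_a), answer(database, query_b)))  # b'bridge-2'
```

The call to `make_queries(2, ...)` is exactly the step a Tor user cannot perform against a concealed database, since they have no way of knowing that the bridge they want sits at index 2.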

To solve this problem, we design a PIR-like query system with cryptographic tools (like Distributed Point Functions). Instead of retrieving the bridges with indices, we provide users with an interface that allows them to query for bridges with some customized conditions: bandwidth/uptime/country of the bridge/IPv6/transports/etc. These are the metadata of bridges stored in the BridgeDB and can be found at Onionoo. We wish to provide Tor users with a more fine-grained query protocol with these query parameters – beyond IPv6 and pluggable transports.

According to our design, both the query parameters and the bridges returned to users are hidden from BridgeDB. Although the privacy of users is guaranteed in terms of the choices of bridges, this design may introduce new problems (like load balancing) as I mentioned in the original post. Therefore, I would like to seek some insights from the Tor community.

The paper is currently under review and subject to subsequent revisions. We will share more details here as soon as the review cycle is finished.

Best regards

Thanks, that explanation has helped me understand. While there is currently no option for querying based on country/bandwidth/uptime, you are positing that such an interface might exist in the future, and when it does, you want to make it hard for a malicious or compromised BridgeDB server to associate user identifiers (email addresses, IP addresses) and assigned bridges or query preferences. The same query privacy could protect the current limited options of transport and IPv4/IPv6.

I find your phrasing “the bridge choice of users” and “the choices of bridges are revealed to BridgeDB” strange, because it is not the user that chooses the bridge. BridgeDB/rdsys chooses the bridge (according to its own logic, which includes compartmentalizing bridge pools according to access method) and assigns it to the user. Maybe that’s what you mean. It’s true that currently BridgeDB/rdsys could record what bridges were assigned to what query identities (email addresses, source IP addresses), which are additionally used to try to rate-limit queries.

One question that comes to mind is whether a more fine-grained query protocol is a useful thing to provide. Why would I, as a bridge user, not always ask for the maximum bandwidth and maximum uptime? Why would I care about the country the bridge is in, as long as it works, except for minimizing geographical distance, which is another way of saying I want to maximize performance? Is a bridge query protocol a problem that needs solving, or is it just a problem that admits of a novel cryptographic solution?

Another question, more important than load balancing IMO, is how does the proposed query protocol interact with anti-enumeration defenses? What stops an attacker from querying for country:HR bandwidth:0-100k, then country:HR bandwidth:100-200k, and so on, thereby discovering all the bridges in Croatia, and then repeating the process for all the other country codes? What if an honest user’s query is too specific, and the result contains 0 bridges? If there is still some kind of anti-enumeration defense, does that failed query “burn” one of their chances to ask for bridges?
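The sweep described above can be simulated in a few lines. This is a toy model only: it assumes a hypothetical `query` interface that returns every match in the clear with no rate limit, and the bridge records use made-up documentation addresses.

```python
def query(bridges, country, bw_low, bw_high):
    """Hypothetical fine-grained query: bridges in `country` with bandwidth in [bw_low, bw_high)."""
    return [
        b["address"]
        for b in bridges
        if b["country"] == country and bw_low <= b["bandwidth"] < bw_high
    ]


def enumerate_country(bridges, country, max_bw, step):
    """Sweep adjacent bandwidth ranges and collect everything that comes back."""
    discovered = []
    for low in range(0, max_bw, step):
        discovered.extend(query(bridges, country, low, low + step))
    return discovered


BRIDGES = [  # made-up records for illustration
    {"address": "192.0.2.10:443", "country": "HR", "bandwidth": 50_000},
    {"address": "192.0.2.11:443", "country": "HR", "bandwidth": 150_000},
    {"address": "192.0.2.12:443", "country": "HR", "bandwidth": 420_000},
    {"address": "198.51.100.7:443", "country": "DE", "bandwidth": 90_000},
]

print(enumerate_country(BRIDGES, "HR", max_bw=500_000, step=100_000))
```

Five narrow queries are enough to recover all three HR bridges in this model, which is why the number and granularity of allowed queries matters as much as what each individual answer reveals.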

I’m not trying to be challenging or provocative. You’ve done a good thing by starting a public discussion—it’s an act of bravery and honesty. I’m asking direct questions to try to get to the substance quickly. I’m ready to believe that you have good answers.

I can reiterate the recommendation to read the Lox paper from this year’s PETS. You can read the anti-censorship team’s reading group discussion of the paper here.

Regarding peer review, it is not true that blind review prevents you from identifying yourself in a discussion of your work before or during submission. You’re not meant to have to act like a secret agent. Your responsibility is to anonymize your submission; the burden falls on the reviewers not to snoop around to try to find out who the authors are. You can reassure yourself with the norms expressed by ePrint:

https://eprint.iacr.org/operations.html

… authors are allowed to announce their results in public when they are in an anonymous refereeing process … Authors are allowed to give talks on their papers and submit them to existing preprint servers, which will usually be announced widely. … Anonymous submission just means that papers are submitted without author’s names and too obvious references.

I bring this up because it’s a minor problem in censorship circumvention research that research groups misunderstand details of each other’s work, or make unjustified assumptions about the problem space, in simple ways that could be alleviated by more open discussion, and I think part of the cause is unjustified fears regarding peer review. You’ve done a good thing by posting some of your research questions here, and I think you will find it improves the quality of your work.

Dear dcf,

I am glad to see that you’re interested in our work. Let me answer the questions for you.

This is because, in our design, we wish to act in a user-centric manner, meaning that bridges retrieved are searched by users based on their query conditions rather than determined by BridgeDB. Nevertheless, this design may bring new issues to the current bridge distribution system, as discussed above.

We agree that many people would like to request bridges with maximum bandwidth/uptime, but these bridges may not be the most suitable for them. For example, a user in Africa may suffer from poor network connection to a large bandwidth bridge located in South America. Moreover, to avoid potential censorship, some people may avoid bridges that look premium but choose less popular ones instead. We designed this query scheme aiming to meet the personalized needs of the users.

We have considered this problem when designing the query protocol. In a nutshell, our trick is to limit the amount of bridge data returned to the user. Specifically, if a malicious user tries to dump the bridges with loose query conditions, they will only receive invalid answers from BridgeDB because the answer is a mixture of many matched bridges. Conversely, if the query conditions are too strict, the user will receive an empty result containing 0 bridges and should then resubmit the query with more relaxed conditions.
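As a rough illustration only (the actual construction is unpublished, so this is an assumed reading of "mixture"), suppose the answer is the XOR of all matching records and each record carries a short checksum. Then one match decodes cleanly, several matches produce a blob that fails the check, and zero matches produce an all-zero blob. The encoding, checksum length, and function names are hypothetical.

```python
import hashlib

CHECKSUM_LEN = 4  # illustrative; a real scheme would choose this carefully


def encode(bridge_line, width=32):
    """Pad a bridge line to a fixed width and append a truncated SHA-256 checksum."""
    body = bridge_line.encode().ljust(width, b"\0")
    return body + hashlib.sha256(body).digest()[:CHECKSUM_LEN]


def xor_all(records, length):
    """XOR all given records together; with no records the result is all zeros."""
    out = bytes(length)
    for record in records:
        out = bytes(x ^ y for x, y in zip(out, record))
    return out


def classify(blob):
    """Client: interpret an answer as empty, a single valid bridge, or an invalid mixture."""
    if not any(blob):
        return ("empty", None)
    body, tag = blob[:-CHECKSUM_LEN], blob[-CHECKSUM_LEN:]
    if hashlib.sha256(body).digest()[:CHECKSUM_LEN] != tag:
        return ("invalid", None)
    return ("ok", body.rstrip(b"\0").decode())


records = [encode("192.0.2.1:443"), encode("192.0.2.2:443"), encode("192.0.2.3:443")]
length = len(records[0])
print(classify(xor_all(records[:1], length)))  # exactly one match
print(classify(xor_all(records, length)))      # loose query, three matches
print(classify(xor_all([], length)))           # overly strict query, no match
```

In this toy model the three-record mixture happens to XOR into a plausible-looking address, and only the checksum exposes it as invalid, which shows why some integrity tag would be needed for the client to tell a real bridge from a mixture.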

Other orthogonal techniques like rate limiting and CAPTCHAs can also be adopted to defend against enumeration attacks. Moreover, we can integrate a reputation system (like Lox) into our system to further resist enumeration attacks from censors.

(Thank you for providing the Lox paper and the group discussion link.)

Actually, the reviewers of our submission have read this thread as we have mentioned it in the rebuttal response. So we can’t identify ourselves here :joy:. Still, thank you for the clarification and encouragement.

We highly value the comments and suggestions in the thread, together with those from the reviewers of the paper. Therefore, necessary revisions will be made to our work before we post it here. Thank you for your understanding about this.

Best regards
