A draft research paper about Snowflake – comments welcome

Cecylia Bocovich (@cecylia), Arlo Breault, Serene, Xiaokang Wang (@Shelikhoo), and I (@dcf) are writing a research paper about Snowflake. We have an almost completed draft that we are asking for input on. This is not a published paper yet—we’re asking for comments in advance of submitting it for peer review.

Comments are welcome. If you are a Snowflake user or proxy operator, and something we wrote does not match your experience, please let us know.

Some problems and strange places:

The client’s rendezvous message is a bundle of data that the broker users to match the client with a proxy, and that the proxy will need in order to make a connection with the client.

Probably “uses” should be there instead of “users”.

The cache fetches origin web pages on demand, which means that it is effectively as a restricted sort of HTTP proxy.

Something is missing here. Maybe “it is effectively as” should be “it effectively acts as”?

Initial growth in the number of proxies depended on our developing new and easier ways to run one,

Probably “of” is missing: “developing of new”.

The increase in the number users from May to August 2022

Same as above: “number of users”.


The next one is not a problem, but a question:

we cannot, say, centrally assign 20% of clients to one and 80% to another according to their relative capacity.

If capacity is known and fixed, then why can’t the client choose a bridge probabilistically?

Looking nice! I’ll post my thoughts here later today.

Here’s my fairly low-level (i.e. not questioning the existence/arrangements of chapters/topics) feedback. Most of it is just about wording and not the factual statements.
Append “I think” to almost every one of the points below:

Factual statements

the nationwide protests in Iran that started on 2022-09-21

  1. To be precise, the Mahsa Amini protests actually started a few days earlier, with internet censorship to follow.

Higher-level suggestions

  1. Maybe it makes sense to say that the paper describes the current state of the Snowflake software, and some things might change in the future without affecting the general concept of Snowflake. For example the “Rendezvous” section does not account for the potential addition of ICE trickling (ok, there could be a better example).
  2. I don’t know if this is practiced, but can’t links to the relevant issues on GitLab be added, especially in the “blocking attempts” section?

Minor semantic additions/changes

  1. The abstract, or the beginning of Introduction needs more emphasis on NAT traversal, at least as much emphasis as “it can run in a browser” gets. I believe NAT traversal is the most important attribute of Snowflake as it is what gives Snowflake so many working proxies. I think installing a browser extension is not that much different from installing an app for end users.

  2. An explanation of why it’s important that it’s possible to run a proxy in a browser is lacking in the second paragraph of Introduction

  3. Something feels off about the term “lightweight”/“light” in the Abstract. It feels abstract ba dum tss.

  4. First, there is rendezvous, in which a client indicates its need for circumvention service and is matched with a temporary proxy

    “Temporary proxy” here sounds like the proxy is determined to get replaced. The ephemeral nature of proxies is explained at a later point, so maybe it makes sense to remove the word “temporary” here?

  5. In general the introduction of the “How it works” section makes it sound like proxies change very often, whereas I think proxies usually outlast clients instead. Idk if something can be done about this.

  6. The essential element is a Session Description Protocol (SDP) offer [28], which contains the information necessary for a WebRTC connection

    It makes sense to emphasize here that this is not something specific to Snowflake but is just a regular part of WebRTC.

  7. The “Domain fronting” section needs to say that the front domain still needs to be uncensored. This is already said in the “AMP cache” section.

  8. Anything that can be persuaded to convey a rendezvous mes-
    sage of about 1500 bytes indirectly to the broker, and return a
    response of about the same size, might work as a rendezvous
    module

    I think a good/better example of such an “anything” would be a chat bot (see this comment)

  9. Snowflake is inherently tied to WebRTC

    It needs to be said why it is (I think it’s because of browsers), because otherwise I’d say that Snowflake is tied to ICE, not WebRTC (see e.g. “WebRTC, but obfs4 instead of DTLS”).
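
To make item 8 a bit more concrete: the “anything” in the quoted sentence can be pictured as a single request/response exchange of bounded size. Here is a rough Python sketch of that idea. The names `RendezvousChannel`, `RelayChannel`, `fake_broker`, and the constant `MAX_MESSAGE_BYTES` are invented for illustration and are not Snowflake’s actual interfaces; the limit simply reuses the “about 1500 bytes” figure from the quote.

```python
from typing import Callable, Protocol

# Assumed limit, taken from the "about 1500 bytes" figure in the quoted text.
# This is not an actual constant in the Snowflake code.
MAX_MESSAGE_BYTES = 1500


class RendezvousChannel(Protocol):
    """Anything that can carry one message to the broker and one response back."""

    def exchange(self, message: bytes) -> bytes:
        ...


class RelayChannel:
    """Hypothetical stand-in for an indirect channel (e.g. a chat bot)."""

    def __init__(self, broker: Callable[[bytes], bytes]) -> None:
        self._broker = broker

    def exchange(self, message: bytes) -> bytes:
        if len(message) > MAX_MESSAGE_BYTES:
            raise ValueError("rendezvous message too large for this channel")
        response = self._broker(message)
        if len(response) > MAX_MESSAGE_BYTES:
            raise ValueError("broker response too large for this channel")
        return response


def fake_broker(offer: bytes) -> bytes:
    # Placeholder only: a real broker would match the client with a proxy.
    return b"answer-for:" + offer[:16]


if __name__ == "__main__":
    channel: RendezvousChannel = RelayChannel(fake_broker)
    print(channel.exchange(b"offer"))
```

Under this view, domain fronting, AMP cache, encrypted DNS, or a chat bot would each be just another implementation of the same one-method interface.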

Grammar

  1. a person must to take a positive action

    Just “must take” I think?

  2. Despite our being able

    “Despite the fact that we were able”, or something shorter?

BTW

Nice to see the “non-Tor applications of Snowflake” getting a touch (in section “Future work”), which I’m a big fan of!

Thanks @Vort, made the suggested changes here: Fixes suggested by Vort. · turfed/snowflake-paper@9783158 · GitHub.

Initial growth in the number of proxies depended on our developing new and easier ways to run one,

Probably “of” is missing: “developing of new”.

There was actually nothing ungrammatical about the sentence as it stood; but I changed it anyway to maybe be less awkward.

we cannot, say, centrally assign 20% of clients to one and 80% to another according to their relative capacity.

If capacity is known and fixed, then why can’t the client choose a bridge probabilistically?

Conceptually it might be possible to do something like that, but practically there are difficulties because of the coupling with Tor. Basically, it would require a new feature in Tor itself (not in the pluggable transport) to support weighted bridge selection. It’s not something we can do in snowflake-client. It is Tor that makes the decision of what bridge to use, and the pluggable transport can only connect to the requested bridge, or else the connection will fail because of an incorrect bridge fingerprint.

The only way it would be possible for snowflake-client to make the decision, rather than Tor, is if either (1) all the bridge sites share the same identity keys, or (2) we don’t put a fingerprint in the client bridge line. We don’t want to do (1) because it increases the likelihood and the impact of losing bridge identity keys, and we don’t want to do (2) because it enables certain circuit tagging attacks (don’t remember the details). The current situation is far from ideal, though, and I wish we had something better.
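
For readers wondering what “choose a bridge probabilistically” would even mean, here is a minimal Python sketch of capacity-weighted selection. It illustrates the idea in the question only; as explained above, neither Tor nor snowflake-client does this today. The bridge names and the `BRIDGE_WEIGHTS` table are hypothetical, with the 20%/80% split taken from the quoted sentence.

```python
import random
from collections import Counter

# Hypothetical bridge names with relative capacities (20% / 80%).
BRIDGE_WEIGHTS = {"bridge-a": 20, "bridge-b": 80}


def choose_bridge(weights, rng=random):
    """Pick one bridge name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # Simulate many independent clients each making one choice.
    rng = random.Random(0)
    counts = Counter(choose_bridge(BRIDGE_WEIGHTS, rng) for _ in range(10_000))
    print(counts)
```

The selection itself is trivial. The hard part is where it would have to live: the bridge fingerprint check means the decision has to be made by Tor, not by the pluggable transport.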

Thanks for taking the time to read it and write comments.

Higher-level suggestions

the nationwide protests in Iran that started on 2022-09-21

To be precise, the Mahsa Amini protests actually started a few days earlier, with internet censorship to follow.

That’s fair. It makes sense to distinguish the start of protests from the start of increased Snowflake use. I’ve made the change.

Maybe it makes sense to say that the paper describes the current state of the Snowflake software, and some things might change in the future without affecting the general concept of Snowflake.

Maybe, though I think that point is understood in a systems paper like this. Maybe more constructively, it would help to delineate or at least think about what are the essential elements of Snowflake. For me, that’s WebRTC and the ability to run a proxy in a browser. Any circumvention system could be transformed into any other by a series of incremental changes; but such an extreme level of abstraction isn’t helpful for modeling. It’s more helpful, I think, to draw a line and explain the advantages and disadvantages of a bundle of design decisions.

Long ago, what would later become Snowflake was envisioned as an extension to flash proxy, swapping WebRTC for WebSocket. In a sense, that’s not incorrect. But it makes more sense to think of flash proxy as one system, and Snowflake as a distinct but related system. Snowflake occupies a certain place in the ontological hierarchy; it’s not trying to be a vessel for all possible circumvention ideas.

I don’t know if this is practiced, but can’t links to the relevant issues on GitLab be added, especially in the “blocking attempts” section?

I think this is probably a good idea. We (the authors) have been talking about it a bit. In the source code, there are abundant hyperlinks to references which we have relied on in making our claims. A lot of these can/should be surfaced. The best way to do it in the PDF/paper version is unfortunately probably footnotes with bare URLs (clickable in the PDF version, at least). We’re planning to also prepare an HTML web page version of the paper, where we can make such references more usefully and unobtrusively in the form of sidenotes or ordinary hyperlinks. Compare the footnotes in the PDF version and the sidenotes in the HTML version of my recent FEP paper, for example.

Minor semantic additions/changes

I appreciate these suggestions, though I myself disagree or at least quibble with all of them, I think. Maybe the other authors will have different opinions.

The point about running in a browser is to make it possible to run a proxy with little to no friction. There may be other ways to do that, and you have a good point about apps (Orbot is the #2 source of proxies, as Figure 5 shows), but the important distinction is not browser vs. app; it’s browser vs. the status quo of Tor bridges, Shadowsocks servers, SOCKS proxies, etc.: running some server software on a long-term VPS.

I think it’s important to lead with the idea that the lifetime of a client does not have to be a subset of the lifetime of a proxy. That’s a central idea, and it goes hand in hand with proxies being low friction to run. There’s no stable population of a small number of long-term proxies; there’s a large and constantly changing population of unreliable proxies. This is the main argument for why the proxies are difficult to enumerate and block by their addresses. The fact that proxies can change throughout a client’s session is one of the features that distinguishes Snowflake from uProxy and MassBrowser. It may actually be beneficial to include some measurements of how often proxies actually change in practice (I don’t think we’ve done an experiment to measure that), but more important than that is for the reader to understand that they can change.

Grammar

a person must to take a positive action

Thanks, fixed.

Despite our being able

This one is grammatical as it stands. “Our being able to confirm…” is a noun phrase.

@WofWca I came around to your way of thinking on a couple of points while editing. In the part about domain fronting rendezvous, I added that the front domain should be chosen to have value to the censor. And I added the chat bot idea as an example of another possible rendezvous method, alongside encrypted DNS.

@WofWca, @Vort, are you okay with being acknowledged in the paper? Do you prefer to use your username or something else?

@WofWca, @Vort, are you okay with being acknowledged in the paper? Do you prefer to use your username or something else?

Of course, you can mention me.
Using my nickname (Vort) is the preferred choice. With or without @, it does not matter.

Yes! “WofWca” is fine.
I appreciate it!

The Snowflake paper has been conditionally accepted to USENIX Security 2024, which means that it will be part of the conference as long as we satisfy the reviewers with some revisions. We are working on final revisions now. Here is a current snapshot. If you have any more comments, we can try to take them into account up until about 2024-02-26.

For people who aren’t familiar with what I referred to in my post above, here are the graphics from Snowflake which I’ve seen so far, and a common one from EFF.org.

I’m glad to see the Snowflake diagram evolve, but the most recent iteration dropped the numbers detailing sequence; in a technical paper I definitely suggest they go back in. Seeing the order of operations in the diagram is a good thing.

But also, I really want to know how the EFF diagram interfaces to the others as I asked above.

Sorry for not seeing this until now.

Unfortunately I don’t have time to read the paper; however, years ago I read the Snowflake page and the technical overview.

Currently my biggest questions revolve around how it integrates with all the other diagrams we’ve seen of the traditional 3-hop route. (Putting aside onion services to keep it simple.)

In this paper, there’s a new (to me) diagram but it’s still not conveying to me how it fits in to tor’s 3-hop strategy.

I’m sure you have much deeper knowledge of tor than I do, so please excuse me if I goof up my words.

This is the only thing I see as it relates: snowflake proxy, bridge, destination.

The snowflake proxy is, let’s say me, running (in my case) a stand alone snowflake proxy. This made me think I was akin to a guard, or a bridge; i.e. the first computer on their path to a middle node, etc.

But confusingly the next hop (in all the snowflake diagrams I’ve seen) is called a “bridge,” which has a specific meaning in tor land. My understanding is that tor-bridges are (mainly*) unpublished guard nodes, i.e. the first hop of the 3-hop route. Is this snowflake-bridge an actual tor-bridge (hop 1) or is it, as it looks in the diagrams, hop 2? And is it a published tor relay, or an actual unpublished bridge?

Then there’s the “destination” as the third item in the diagram. Is this “destination” the exit node (hop 3) or is it truly the destination the exit node contacts on behalf of the tor user?

Can the traditional 3-hop tor diagram be laid on top of the Snowflake architecture diagram for those of us who like to think visually? I think this would be a boon. There are scads of 3-hop diagrams online, lots of people are excited to explain tor. But where does SP end and tor begin?

My only other comment is perhaps nitpicking, and it didn’t strike me as much when I read the technical overview, but I admit I read a little past the diagram on this most recent paper and wished for a synonym for “rendezvous” to be used, given its specific meaning to onion site connections. Rendezvous without a word prefacing it, i.e. onion rendezvous, snowflake rendezvous, could lead to confusion in excerpts/conversation/presentations/non-technical writings, and make it more difficult to locate specifics during text searches.

I wish you luck and hope to see the video of the presentation!

* It’s been even longer since I studied Tor’s man page, but I think I recall one can enter a bridge key to be used as an exit.

With Snowflake, as with all pluggable transports, the bridge becomes the first hop in the Tor circuit. In a normal relay connection, it goes (guard, middle, exit). In a bridge connection (with or without pluggable transports), it goes (bridge, middle, exit). The bridge takes the place of the guard.

In the diagrams, the Tor relay nodes would come right after the Snowflake bridge node. The host that runs the Snowflake bridge server software also runs a Tor bridge. It’s the same host. Then after that, there are two more hops, then whatever the user’s destination is.
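
For those who prefer something closer to a diagram, the hop order described above can be written out as a small Python sketch. The function name `circuit_path` and the mode labels are made up for this illustration. It only encodes what is said in this post: the bridge takes the place of the guard, and the Snowflake proxy sits in front of the bridge without being a hop in the Tor circuit.

```python
from typing import List


def circuit_path(mode: str) -> List[str]:
    """Return the hosts between the client and the destination.

    mode is one of "relay", "bridge", or "snowflake" (labels invented here).
    """
    if mode not in ("relay", "bridge", "snowflake"):
        raise ValueError(f"unknown mode: {mode}")
    path = ["client"]
    if mode == "snowflake":
        # The temporary proxy carries traffic to the bridge.
        # It is not one of the hops in the Tor circuit.
        path.append("snowflake proxy")
    # With a bridge (with or without a pluggable transport),
    # the bridge takes the place of the guard.
    first_hop = "guard" if mode == "relay" else "bridge"
    path += [first_hop, "middle", "exit", "destination"]
    return path


if __name__ == "__main__":
    for mode in ("relay", "bridge", "snowflake"):
        print(f"{mode:>9}: " + " -> ".join(circuit_path(mode)))
```

In the Snowflake case, the “bridge” entry is both the Snowflake bridge server and a Tor bridge, since they run on the same host.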

The reason we don’t show the Tor hops in the diagram is it’s not a Tor paper, it’s a circumvention paper. We do use Tor as the backend implementation for the Snowflake bridge, but Tor and Snowflake are only loosely coupled. You could use Snowflake with any other system, the Snowflake bridge could even itself act as an exit node and contact the destination directly. We primarily want readers to think about how Snowflake itself works, not the technical details of how it interfaces with adjacent systems (though we do talk about Tor specifically in a few parts, when talking about practical engineering concerns).

About the diagram: would it be possible to license it under a Wikipedia-compatible license and upload it to Wikimedia Commons?

Discussed here.

I know that I won’t be a Snowflake, I mean I won’t be using that extension any longer, because who would want to use that extension? It needs to be stripped down: salvage what can be salvaged, toss the rest, and rebuild… For me, I had no clue how this extension was supposed to work. I didn’t know to come here, and I’m glad that I will learn the things that are a must if you want to engage in these projects.

The diagram was drawn by me, and you may reuse it; I am placing it in the public domain.

BTW File:Snowflake-(Tor)-schematic.png - Wikimedia Commons is also by me. It’s from my thesis which is also in the public domain.

The paper was accepted to USENIX Security 2024. Here are HTML and PDF versions with all the final revisions, which supersede the drafts posted above.

Snowflake, a censorship circumvention system using temporary WebRTC proxies (online HTML)
PDF version
Paper source code and data

It will also eventually be up somewhere at USENIX Security '24, along with a presentation video and slides.
