A draft research paper about Snowflake – comments welcome

Cecylia Bocovich (@cecylia), Arlo Breault, Serene, Xiaokang Wang (@Shelikhoo), and I (@dcf) are writing a research paper about Snowflake. We have an almost completed draft that we are asking for input on. This is not a published paper yet—we’re asking for comments in advance of submitting it for peer review.

Comments are welcome. If you are a Snowflake user or proxy operator, and something we wrote does not match your experience, please let us know.

Some problems and strange places:

The client’s rendezvous message is a bundle of data that the broker users to match the client with a proxy, and that the proxy will need in order to make a connection with the client.

Probably, “uses” instead of “users” should be there.

The cache fetches origin web pages on demand, which means that it is effectively as a restricted sort of HTTP proxy.

Something is missing here. Maybe “it is effectively as” should be “it effectively acts as”?

Initial growth in the number of proxies depended on our developing new and easier ways to run one,

Probably, “of” is missing: “developing of new”.

The increase in the number users from May to August 2022

Same as above: “number of users”.


The next one is not a problem, but a question:

we cannot, say, centrally assign 20% of clients to one and 80% to another according to their relative capacity.

If capacity is known and fixed, then why can’t the client choose a bridge probabilistically?

Looking nice! I’ll post my thoughts here later today.

Here’s my fairly low-level (i.e. not questioning the existence/arrangements of chapters/topics) feedback. Most of it is just about wording and not the factual statements.
Append “I think” to almost every point below:

Factual statements

the nationwide protests in Iran that started on 2022-09-21

  1. To be precise, the Mahsa Amini protests actually started a few days earlier, with internet censorship to follow.

Higher-level suggestions

  1. Maybe it makes sense to say that the paper describes the current state of the Snowflake software, and some things might change in the future without affecting the general concept of Snowflake. For example the “Rendezvous” section does not account for the potential addition of ICE trickling (ok, there could be a better example).
  2. I don’t know if this is practiced, but can’t links to the relevant issues on GitLab be added, especially in the “blocking attempts” section?

Minor semantic additions/changes

  1. The abstract, or the beginning of Introduction needs more emphasis on NAT traversal, at least as much emphasis as “it can run in a browser” gets. I believe NAT traversal is the most important attribute of Snowflake as it is what gives Snowflake so many working proxies. I think installing a browser extension is not that much different from installing an app for end users.

  2. An explanation of why it’s important that it’s possible to run a proxy in a browser is lacking in the second paragraph of Introduction

  3. Something feels off about the term “lightweight”/“light” in the Abstract. It feels abstract ba dum tss.

  4. First, there is rendezvous, in which a client indicates its need for circumvention service and is matched with a temporary proxy

    “Temporary proxy” here sounds like the proxy is determined to get replaced. The ephemeral nature of proxies is explained at a later point, so maybe it makes sense to remove the word “temporary” here?

  5. In general the introduction of the “How it works” section makes it sound like proxies change very often, whereas I think proxies usually outlast clients instead. Idk if something can be done about this.

  6. The essential element is a Session Description Protocol (SDP) offer [28], which contains the information necessary for a WebRTC connection

    It makes sense to emphasize here that this is not something specific to Snowflake but is just a regular part of WebRTC.

  7. The “Domain fronting” section needs to say that the front domain still needs to be uncensored. This is already said in the “AMP cache” section.

  8. Anything that can be persuaded to convey a rendezvous message of about 1500 bytes indirectly to the broker, and return a response of about the same size, might work as a rendezvous module

    I think a good/better example of such an “anything” would be a chat bot (see this comment)

  9. Snowflake is inherently tied to WebRTC

    It needs to be said why it is (I think it’s because of browsers), because otherwise I’d say that Snowflake is tied to ICE, not WebRTC (see e.g. “WebRTC, but obfs4 instead of DTLS”).

Grammar

  1. a person must to take a positive action

    Just “must take” I think?

  2. Despite our being able

    “Despite the fact that we were able”, or something shorter?

BTW

Nice to see the “non-Tor applications of Snowflake” getting a touch (in section “Future work”), which I’m a big fan of!

Thanks @Vort, made the suggested changes here: Fixes suggested by Vort. · turfed/snowflake-paper@9783158 · GitHub.

Initial growth in the number of proxies depended on our developing new and easier ways to run one,

Probably, “of” is missing: “developing of new”.

There was actually nothing ungrammatical about the sentence as it stood; but I changed it anyway to maybe be less awkward.

we cannot, say, centrally assign 20% of clients to one and 80% to another according to their relative capacity.

If capacity is known and fixed, then why can’t the client choose a bridge probabilistically?

Conceptually it might be possible to do something like that, but practically there are difficulties because of the coupling with Tor. Basically, it would require a new feature in Tor itself (not in the pluggable transport) to support weighted bridge selection. It’s not something we can do in snowflake-client. It is Tor that makes the decision of what bridge to use, and the pluggable transport can only connect to the requested bridge, or else the connection will fail because of an incorrect bridge fingerprint.

The only way it would be possible for snowflake-client to make the decision, rather than Tor, is if either (1) all the bridge sites share the same identity keys, or (2) we don’t put a fingerprint in the client bridge line. We don’t want to do (1) because it increases the likelihood and the impact of losing bridge identity keys, and we don’t want to do (2) because it enables certain circuit tagging attacks (don’t remember the details). The current situation is far from ideal, though, and I wish we had something better.
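
To make the idea under discussion concrete, here is a minimal sketch of what a purely client-side weighted choice could look like in isolation. It is not how snowflake-client or Tor work today: as explained above, Tor picks the bridge and the fingerprint must match, so this logic would have to live in Tor itself. The bridge names, the weights, and the `choose_bridge` and `simulate` helpers are all hypothetical, made up for illustration.

```python
import random
from collections import Counter

# Hypothetical bridge names and relative capacities (illustration only,
# not real Snowflake bridges or real capacity figures).
BRIDGE_WEIGHTS = {"bridge-a": 20, "bridge-b": 80}


def choose_bridge(weights, rng):
    """Pick one bridge name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(weights, clients, seed=0):
    """Count how many of `clients` independent choices land on each bridge."""
    rng = random.Random(seed)
    return Counter(choose_bridge(weights, rng) for _ in range(clients))


# Each simulated client chooses independently; no central assignment is needed.
counts = simulate(BRIDGE_WEIGHTS, 10_000)
for name, count in sorted(counts.items()):
    print(f"{name}: {count / 10_000:.1%} of clients")
```

With enough clients, the observed split should land close to the 20%/80% target even though no one assigns clients centrally. The hard part is not this arithmetic but where the decision is allowed to be made.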

Thanks for taking the time to read it and write comments.

Higher-level suggestions

the nationwide protests in Iran that started on 2022-09-21

To be precise, the Mahsa Amini protests actually started a few days earlier, with internet censorship to follow.

That’s fair. It makes sense to distinguish the start of protests from the start of increased Snowflake use. I’ve made the change.

Maybe it makes sense to say that the paper describes the current state of the Snowflake software, and some things might change in the future without affecting the general concept of Snowflake.

Maybe, though I think that point is understood in a systems paper like this. Maybe more constructively, it would help to delineate or at least think about what are the essential elements of Snowflake. For me, that’s WebRTC and the ability to run a proxy in a browser. Any circumvention system could be transformed into any other by a series of incremental changes; but such an extreme level of abstraction isn’t helpful for modeling. It’s more helpful, I think, to draw a line and explain the advantages and disadvantages of a bundle of design decisions.

Long ago, what would later become Snowflake was envisioned as an extension to flash proxy, swapping WebRTC for WebSocket. In a sense, that’s not incorrect. But it makes more sense to think of flash proxy as one system, and Snowflake as a distinct but related system. Snowflake occupies a certain place in the ontological hierarchy; it’s not trying to be a vessel for all possible circumvention ideas.

I don’t know if this is practiced, but can’t links to the relevant issues on GitLab be added, especially in the “blocking attempts” section?

I think this is probably a good idea. We (the authors) have been talking about it a bit. In the source code, there are abundant hyperlinks to references which we have relied on in making our claims. A lot of these can/should be surfaced. The best way to do it in the PDF/paper version is unfortunately probably footnotes with bare URLs (clickable in the PDF version, at least). We’re planning to also prepare an HTML web page version of the paper, where we can make such references more usefully and unobtrusively in the form of sidenotes or ordinary hyperlinks. Compare the footnotes in the PDF version and the sidenotes in the HTML version of my recent FEP paper, for example.

Minor semantic additions/changes

I appreciate these suggestions, though I myself disagree or at least quibble with all of them, I think. Maybe the other authors will have different opinions.

The point about running in a browser is to make it possible to run a proxy with little to no friction. There may be other ways to do that (you have a good point about apps, and Orbot is the #2 source of proxies, as Figure 5 shows), but the important distinction is not browser vs. app, it’s browser vs. the status quo of Tor bridges, Shadowsocks servers, SOCKS proxies, etc.: running some server software on a long-term VPS.

I think it’s important to lead with the idea that the lifetime of a client does not have to be a subset of the lifetime of a proxy. That’s a central idea, and it goes hand in hand with proxies being low friction to run. There’s no stable population of a small number of long-term proxies; there’s a large and constantly changing population of unreliable proxies. This is the main argument for why the proxies are difficult to enumerate and block by their addresses. The fact that proxies can change throughout a client’s session is one of the features that distinguishes Snowflake from uProxy and MassBrowser. It may actually be beneficial to include some measurements of how often proxies actually change in practice (I don’t think we’ve done an experiment to measure that), but more important than that is for the reader to understand that they can change.

Grammar

a person must to take a positive action

Thanks, fixed.

Despite our being able

This one is grammatical as it stands. “Our being able to confirm…” is a noun phrase.

@WofWca I came around to your way of thinking on a couple of points while editing. In the part about domain fronting rendezvous, I added that the front domain should be chosen to have value to the censor. And I added the chat bot idea as an example of another possible rendezvous method, alongside encrypted DNS.

@WofWca, @Vort, are you okay with being acknowledged in the paper? Do you prefer to use your username or something else?

@WofWca, @Vort, are you okay with being acknowledged in the paper? Do you prefer to use your username or something else?

Of course, you can mention me.
Using my nickname (Vort) is the preferred choice. With or without the @, it does not matter.

Yes! “WofWca” is fine.
I appreciate it!