Hi there,
Disclaimer: I tried to RTFM before asking here. I have read earlier posts on this kind of question (https://tor.stackexchange.com/questions/9127/tor-and-obfsproxy-packet-size, https://www.reddit.com/r/TOR/comments/64mqrz/tcp_packets_reassembled_over_tor/, etc.), as well as the related docs (Cells (messages on channels) - Tor Specifications; Tor: The Second-Generation Onion Router). Yet I am still quite confused about how these relate to what I am measuring.
(I think) I understand that, in theory, Tor traffic is divided into mostly fixed-size cells, 514 bytes long. As I understand it, this should help frustrate fingerprinting efforts that try to guess the nature of the traffic from the statistical distribution of packet lengths (e.g., HTTP traffic tending to consist of larger packets than non-webseed BitTorrent traffic). I also understand that, in practice, the lower layers of the TCP/IP stack may pack multiple cells together, meaning that, for example, a single Ethernet frame may carry more than one Tor cell.
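To make the bundling idea concrete, here is a minimal Python sketch of how a run of back-to-back cells could be split into TCP segments. It rests on assumptions that are not from any spec or from my captures: an MSS of 1460 bytes (a common value on Ethernet), no TLS record overhead, and all cells being written at once. The helper name `segment_sizes` is made up for illustration.

```python
def segment_sizes(n_cells, cell_size=514, mss=1460):
    """Split n_cells back-to-back cells into TCP payload sizes.

    Assumptions (illustrative only): TLS framing overhead is ignored,
    the MSS is 1460 bytes, and the sender fills each segment completely
    before starting the next one.
    """
    total = n_cells * cell_size  # bytes waiting in the send buffer
    sizes = []
    while total > 0:
        chunk = min(mss, total)  # a segment carries at most one MSS
        sizes.append(chunk)
        total -= chunk
    return sizes


print(segment_sizes(1))   # a lone cell fits in one small segment
print(segment_sizes(10))  # ten cells mostly fill full-size segments
```

Under these assumptions, segment boundaries do not line up with cell boundaries, so a fixed cell size would not by itself produce packets of a fixed size.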
Based on this understanding, I recently decided to put this to the test just to marvel at the process in action. Here is what I did:
• I picked a 50 MB Linux ISO file that I could fetch both via HTTP and via (non-webseeded) BitTorrent.
• I downloaded it both ways without using Tor while recording the traffic in Wireshark, and confirmed that the network conversations belonging to each download had different packet size distributions, as expected.
• I then performed the HTTP download via the Tor Browser, capturing the traffic.
• I was also naughty and did the BitTorrent download via Tor too (sorry!), by pointing KTorrent at a SOCKS5 proxy → 127.0.0.1:9150, that is, the Tor Browser.
• In both Tor measurements, I verified that the incoming traffic was coming from a Tor node listed on TOR Node List. I also repeated these captures looking only at loopback traffic, and could verify that traffic was incoming from 127.0.0.1:9150 if and only if I was running the traffic over Tor.
• Because I am aware that Tor Bridges may try to obfuscate some of the traffic, I turned the use of Tor Bridges OFF before recording the Tor traffic.
In the Tor captures, I expected the TCP packet sizes to be heavily centered around 514 bytes plus TCP header overhead. Instead, I measured something very different: over 90% of the HTTP/Tor packets were larger than 640 bytes, and over 60% were larger than 2560 bytes. I thought: “OK, this could just be bundling of Tor cells; nothing Tor can do about it”. More puzzlingly, however, I could easily tell that HTTP/Tor traffic had a packet size distribution very similar to plain HTTP traffic, while both BitTorrent and BitTorrent/Tor traffic consisted of consistently smaller packets.
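For clarity on what the percentages above mean, this minimal Python sketch shows the kind of calculation involved. The `sizes` list is made-up illustrative data, not my capture, and `fraction_larger` is a hypothetical helper name; in practice the lengths would come from the packet length column exported from Wireshark.

```python
def fraction_larger(sizes, threshold):
    """Return the fraction of packet sizes strictly greater than threshold."""
    if not sizes:
        raise ValueError("sizes must not be empty")
    return sum(1 for s in sizes if s > threshold) / len(sizes)


# Made-up packet lengths in bytes (NOT measured data).
sizes = [66, 583, 1514, 2962, 4410, 2962, 1514, 5858, 2962, 4410]

# Report the share of packets above each threshold used in the post.
for threshold in (640, 2560):
    share = fraction_larger(sizes, threshold)
    print(f"larger than {threshold} bytes: {share:.0%}")
```

Running the same calculation on each of the four captures (HTTP, HTTP/Tor, BitTorrent, BitTorrent/Tor) is what lets the distributions be compared side by side.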
Here are my questions:
• Is my understanding of Tor traffic fundamentally flawed? If so, how?
• Is my experiment fundamentally flawed? If so, how?
• If neither point above is the case (or if the discrepancy is small), does this not mean that Tor traffic between the Client and the Client Guard can be fingerprinted to some extent regardless of Tor cell size?