sites that give entropy figures are nonsense … let me count the ways
- very limited datasets
- data sets are heavily skewed by privacy-conscious users (FF, for example, is massively over-represented)
- data sets are further tainted by repeat visitors
- data sets are even further tainted by repeat visitors who change settings between visits
Sites that claim to provide entropy figures are absolute snake oil. They may be useful for seeing what is reported, but that's it. EFF's Cover Your Tracks had a purpose: to show that fingerprinting was a real threat - but they should add disclaimers about their BS figures
Stop making assumptions. Do you actually understand what is being tested, how it is being tested, and how the result is used to calculate anything?
Comparing tests is a waste of time (well, the entropy figures are nonsense for a start), as the tests and the purpose of each test can vary. For example, CYT detects some randomness and can thus return a static value for that test, such as “canvas: random”, but amiunique doesn't do this, so it will return a unique result for canvas, and thus an overall unique result (a sketch of that detection follows)
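To make that concrete, here's a minimal sketch (not CYT's actual code; the drawing and hashing details are assumptions) of how a test page can flag per-execution canvas randomization: draw the identical image twice and compare hashes.

```ts
// Sketch: render the same canvas twice; if the hashes differ, the browser is
// injecting per-read noise, so bucket the result as "canvas: random" rather
// than treating a unique hash as real entropy.
async function hashCanvas(): Promise<string> {
  const c = document.createElement("canvas");
  c.width = 220;
  c.height = 60;
  const ctx = c.getContext("2d")!;
  ctx.font = "16px Arial";
  ctx.fillStyle = "#069";
  ctx.fillText("fingerprint probe", 10, 30);
  const bytes = new TextEncoder().encode(c.toDataURL());
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");
}

async function classifyCanvas(): Promise<string> {
  const first = await hashCanvas();
  const second = await hashCanvas();
  return first === second ? first : "canvas: random";
}
```

Note this only catches per-execution noise; session-keyed noise needs repeat visits or cross-context comparisons to spot.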
A fingerprint is just a snapshot in time, and can be manipulated after the fact - it is not incumbent on sites to TELL you what is used and what isn't - or what can be bypassed or discarded in order to linkify it with other fingerprints. Always treat fingerprints as snapshots that can be fuzzed after the fact.
Let's look at this: Global Statistics - Am I Unique? - last 30 days
- 36% of users are using Firefox
- in reality we know that FF has about a 3% worldwide share, or 6% on desktop
 
- 72% are requesting en-* (English) - it's a shame this is not broken down by locale
- this is simply not true. We're talking users/profiles on the internet, not people in the world, so some languages will be under-represented, and a lot of users do use en-* as their second language. But almost three quarters of internet users being English is a stretch
 
- 22% are in timezone UTC0
- it's a shame this is not broken down by actual timezone name instead of classifying everything as UTC-something
- again, with internet users vs populations this is a bit vague - but 22% of users being in Greenwich Mean Time is bollocks
 
- and I could go on
Let's look at some more nonsense (but I get that these sites are using all visitors). On CYT, using TB (en-US) for Windows:
- userAgent: (FF115, Windows 10, 64-bit)
- CYT says 1 in 3.45 browsers have this value
- reality says FF is 3% worldwide (call it 1 in 33), Windows is 80% (1 in 1.25), and ESR is about 10% (1 in 10), so the real figure is approx 1 in 413 (see the arithmetic sketch after this list)
- you also can't hide the fact that you're using TB, or your OS, and TB has e.g. 1 million daily Windows desktop users, so the entropy (as far as we're concerned, within that barest of buckets - equivalency) is actually zero
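The back-of-envelope arithmetic behind that "approx 1 in 413", treating the shares above as independent:

```ts
// Rough shares from the list above, treated as independent
const firefox = 1 / 33;   // ~3% worldwide browser share
const windows = 1 / 1.25; // ~80% OS share
const esr = 1 / 10;       // ~10% of Firefox installs on ESR
console.log(`~1 in ${Math.round(1 / (firefox * windows * esr))}`); // ~1 in 413
```

versus the 1 in 3.45 CYT reports from its own skewed visitor pool.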
 
- this one might explain the zero entropy better
- CYT says my timezone of UTC is worth 2.35 bits of entropy
- ALL Tor Browser users report this value, so it's NIL (for our set) - the sketch below shows where these bit figures come from
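For anyone wondering where a figure like 2.35 bits comes from: it's just the surprisal of the value within the site's own visitor pool, log2(1/p). A minimal illustration (the 2.35 is CYT's figure; the rest is made up to show the point):

```ts
// Surprisal of one reported value: log2(1/p), where p is its share of the data set
const bits = (p: number): number => Math.log2(1 / p);

console.log(bits(1 / 2 ** 2.35)); // 2.35 bits - CYT's figure implies ~1 in 5 of ITS visitors report UTC0
console.log(bits(1));             // 0 bits - within a TB-only set, where every user reports UTC0
```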
 
The way we defeat fingerprinting linkability is to take each metric and reduce its entropy within our set (our set being TB users) - and there are some things you can't lie about (such as requesting web content in a language - e.g. if you need Arabic, then request Arabic) or hide (version, OS, fonts). So, for lack of a better word, we call this equivalency. E.g. if you have Windows fonts, that's equivalency of being on Windows (OS). Or if you have certain default fonts, that's equivalency of language, etc. We can randomize if we want (per execution, or per session + eTLD+1 - see the sketch below), but ultimately all randomizing can be detected. So this is not some magic bullet - it only exposes that some sites/scripts are lazy. We assume advanced scripts. So we protect each metric one by one, making it harder and more costly for scripts, until they give up and it becomes prohibitive - but we must balance that with usability and compat
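For illustration only (hypothetical helper, not how TB or any browser actually implements it), "per session + eTLD+1" randomization means deriving the noise from a per-session key plus the site, so one site sees a stable value all session, two sites can't be linked by it, and it changes next session:

```ts
import { createHmac, randomBytes } from "node:crypto";

// Hypothetical sketch of "randomize per session + eTLD+1": one key per browsing
// session, noise derived from (key, site), so the same site sees a stable value
// within the session while different sites and different sessions see different ones.
const sessionKey = randomBytes(32);

function siteNoise(eTLDplus1: string, range: number): number {
  const seed = createHmac("sha256", sessionKey).update(eTLDplus1).digest();
  return seed.readUInt32BE(0) % range;
}

// e.g. jitter some reported metric by 0..3 units, keyed to the requesting site
const reportedValue = (trueValue: number, site: string): number =>
  trueValue + siteNoise(site, 4);
```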
The way to determine how many values a metric may return is to test, collect and analyze the data (e.g. checking for equivalency or other external factors such as device pixel ratio), and then the only way to get any real-world entropy is to do a large-scale test collecting the data, one record per profile (so as to not taint the data set)
- for example - collect TB115-only fingerprint data: this immediately removes all non-TB noise, and e.g. UTC0 = all users = zero entropy (for us) - capisce? (see the entropy sketch below)
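A minimal sketch of what "entropy for our set" means (the counts are made up): Shannon entropy of a metric over a collected set, one record per profile - a value every profile shares contributes nothing.

```ts
// Shannon entropy (bits) of one metric over a collected set, one record per profile
function entropyBits(counts: Map<string, number>): number {
  const total = [...counts.values()].reduce((a, b) => a + b, 0);
  let h = 0;
  for (const n of counts.values()) {
    const p = n / total;
    h -= p * Math.log2(p);
  }
  return h;
}

// TB115-only set: every profile reports UTC0 => 0 bits (zero entropy for us)
console.log(entropyBits(new Map([["UTC0", 1_000_000]])));
// a mixed all-visitor set (made-up counts) spreads across timezones => ~1.5 bits
console.log(entropyBits(new Map([["UTC0", 220], ["UTC-5", 300], ["UTC+1", 480]])));
```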
tl;dr: stop comparing different sites’ results, stop using entropy figures from sites
I'm just going to stop here - I'm supposed to be writing this all up for some doc/blog