Tor over VPN 型隧道化匿名网络流量识别技术研究(Identification of Tor over VPN Tunneled Anonymous Network Traffic)

Tor over VPN 型隧道化匿名网络流量识别技术研究

Master's thesis, Nanjing University of Science and Technology (南京理工大学), Hu Lianghong (胡梁宏)

Identification of Tor over VPN Tunneled Anonymous Network Traffic
By
Lianghong Hu
Supervised by Associate Prof. Weiwei Liu

Nanjing University of Science & Technology
December, 2021

I can’t upload the full PDF file here, so I uploaded it to IPFS:

https://ipfs.io/ipfs/QmWiSYnbj9fhN6fk2vLdssSdmCNjTvNiBjiRvkYX4sZDfM


In general, a lot of theses of this kind come out of China. Most of them are neither noteworthy nor of high quality. Examples of some others:

Outside of theses, in research papers generally, there is no shortage of this type of paper, for example:

Several have even been discussed by the anti-censorship team:

Etc., etc., etc. There are many, many of these.

Unless there is something particularly noteworthy about an individual publication, there is no need to pay too much attention to them.

It may be helpful if you can skim this thesis and comment on the general technique: do they use machine learning, what algorithm, what is the experimental setup, what assumptions do they make?

It looks like CN116233013 is a patent that corresponds to this thesis.

The present invention belongs to the field of network security technology, and in particular relates to a method for identifying Tor Over VPN anonymous network traffic and its service type, which extracts the spatiotemporal features of Tor Over VPN traffic and combines them with CNN, Transformer, and other models for identification.

Step 1, split the input traffic sample into session flows based on the five-tuple, and perform marking, grouping, and numbering preprocessing operations on the traffic according to the traffic type; the five-tuple is the source address, destination address, source port, destination port, and protocol;

Step 2, extract the payload length, OpenVPN header protocol field, and heartbeat data packet features of the data packets with sequence numbers 0 to N1 flow by flow;

Step 3, extract the payload length, payload information, and polling data features of the data packets with sequence numbers N1 to N2 flow by flow, and convert these features into a two-dimensional grayscale image;

Step 4, extract the length, payload information, inter-packet delay, MSS packet ratio, and number of interactions of the data packets with sequence numbers N2 to N3 flow by flow to form a spatiotemporal feature vector;

Step 5, match the traffic using the features extracted in step 2 to identify the OpenVPN tunnel traffic;

Step 6, construct a Tor Over VPN anonymous network traffic identification model based on the two-dimensional grayscale image and the CNN model;

Step 7, construct a service type identification model based on the spatiotemporal feature vector and the Transformer model;

Step 8, for the traffic sample to be detected, execute steps 1 to 4, then identify the OpenVPN tunnel traffic according to step 5, and use the models constructed in steps 6 and 7 to identify the Tor Over VPN anonymous network traffic and service type respectively.
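To make step 1 concrete, here is a minimal sketch of what splitting traffic into sessions by five-tuple could look like. It is not code from the patent or the thesis: the `Packet` record, the function names, and the choice to merge both directions of a connection into one session are all assumptions made for illustration, and the patent text quoted here does not give values for N1, N2, or N3.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical minimal packet record, not a structure from the patent.
    src: str
    dst: str
    sport: int
    dport: int
    proto: str
    payload: bytes


def flow_key(p: Packet) -> tuple:
    # Order the two endpoints so that both directions of a connection
    # map to the same session key (an assumption; the patent only says
    # "session traffic based on the five-tuple").
    a, b = (p.src, p.sport), (p.dst, p.dport)
    return (min(a, b), max(a, b), p.proto)


def split_into_flows(packets):
    # Group packets by session key, preserving arrival order within a flow.
    flows = defaultdict(list)
    for p in packets:
        flows[flow_key(p)].append(p)
    return dict(flows)


def payload_lengths(flow, start, stop):
    # Payload lengths of packets start..stop-1 of one flow. The cutoffs
    # stand in for the patent's unspecified N1/N2/N3.
    return [len(p.payload) for p in flow[start:stop]]
```

Steps 2 to 4 would then each read a different packet range of every flow, for example `payload_lengths(flow, 0, n1)` for the tunnel establishment phase.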

Compared with the prior art, the present invention has the following significant advantages:

  1. By mining the length sequence, protocol fingerprint and heartbeat mechanism characteristics of the VPN tunnel establishment phase, the OpenVPN tunnel traffic is detected based on the rule matching method, which improves the accuracy of encrypted tunnel traffic detection.

  2. By converting attributes such as payload length, payload information, and polling data into grayscale images and combining them with the CNN model to identify Tor Over VPN traffic, the detection capability of Tor Over VPN traffic is improved.

  3. To address the problem of identifying the service type carried by Tor Over VPN traffic, we conduct an in-depth multi-dimensional analysis of the differences in the spatiotemporal characteristics of various types of traffic, select effective features to construct a unified feature vector, and use the Transformer spatiotemporal sequence model to perform refined identification, thereby improving the accuracy of service type identification at the Tor behavior level.
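The quoted text does not say what the "rule matching" on the OpenVPN header actually is. As a guess at the kind of rule meant, the sketch below reads the OpenVPN opcode (the high 5 bits of the first byte; over TCP the packet is preceded by a 2-byte length) and checks whether a flow opens with a client hard reset followed by a server hard reset. The opcode numbers are OpenVPN's own; the function names and the rule itself are assumptions, not the patent's method.

```python
# Opcode values used by the OpenVPN wire protocol.
P_CONTROL_HARD_RESET_CLIENT_V2 = 7
P_CONTROL_HARD_RESET_SERVER_V2 = 8
P_CONTROL_HARD_RESET_CLIENT_V3 = 10


def openvpn_opcode(payload: bytes, tcp: bool = False):
    # Over TCP, each OpenVPN packet is prefixed with a 2-byte length.
    offset = 2 if tcp else 0
    if len(payload) <= offset:
        return None
    # High 5 bits are the opcode, low 3 bits are the key ID.
    return payload[offset] >> 3


def looks_like_openvpn_handshake(first_payloads, tcp: bool = False) -> bool:
    # Hypothetical rule: the first packet (client to server) is a client
    # hard reset and the second (server to client) is a server hard reset.
    if len(first_payloads) < 2:
        return False
    client_op = openvpn_opcode(first_payloads[0], tcp)
    server_op = openvpn_opcode(first_payloads[1], tcp)
    client_resets = (P_CONTROL_HARD_RESET_CLIENT_V2,
                     P_CONTROL_HARD_RESET_CLIENT_V3)
    return (client_op in client_resets
            and server_op == P_CONTROL_HARD_RESET_SERVER_V2)
```

A rule like this only matches unobfuscated OpenVPN, which is worth keeping in mind when reading the accuracy claims.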

So it’s like 90% of papers of this kind. They take some random traffic features, throw them into some random machine learning classifiers, and get some output.

The part about transforming features into a grayscale bitmap and then running a CNN over the 2D bitmap is similar to what CN111753290 does for software classification.
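For anyone unfamiliar with the grayscale trick, it usually amounts to something like the sketch below: take the first bytes of a flow, pad or truncate to a fixed size, and treat each byte as one pixel. This is a generic illustration and not taken from either patent; the function name and the default 28×28 size are assumptions.

```python
def bytes_to_grayscale(data: bytes, side: int = 28):
    # Truncate or zero-pad to exactly side*side bytes, then reshape into
    # `side` rows of `side` pixel values in the range 0..255.
    n = side * side
    padded = data[:n].ljust(n, b"\x00")
    return [list(padded[r * side:(r + 1) * side]) for r in range(side)]
```

The resulting rows can be fed to any image classifier; the "image" has no real spatial structure beyond byte order, which is one reason to be skeptical of the approach.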