Revealing QoE of Web Users from Encrypted Network Traffic

We present a dataset covering a large set of popular pages (Alexa top-500), collected from probes in several ISP networks with different browsers (Chrome, Firefox) and viewport combinations, for over 200,000 experiments carried out in 2019.

We purposely collected two distinct sets with two different tools, namely Web Page Test (WPT) and Web View (WV), varying a number of relevant parameters and conditions, for a total of 200K+ web sessions, roughly equally split between WV and WPT. The dataset varies in geographical coverage, scale, diversity and representativeness (location, targets, protocol, browser, viewports, metrics).

For Web Page Test, we used the online service at different locations worldwide (Europe, Asia, USA) and private WPT instances in three locations in China (Beijing, Shanghai, Dongguan). The list of target URLs comprised the main pages and five random subpages from the Alexa top-500 lists for the world and for China. We varied the network conditions: native connections, the 4G, FIOS, 3GFast and DSL profiles, and custom shaping/loss conditions. The other elements of the configuration were fixed: Chrome browser on desktop with a fixed screen resolution, HTTP/2 protocol and IPv4.
For Web View, we collected experiments from three machines located in France. We selected two versions of two browser families (Chrome 75/77, Firefox 63/68) and two screen sizes (1920x1080, 1440x900), and employed different browser configurations (half of the experiments activate the AdBlock plugin) over two access technologies (fiber and ADSL). From a protocol standpoint, we used both IPv4 and IPv6, with HTTP/2 and QUIC, and performed repeated experiments with cached objects/DNS. Given the diversity of settings, we restricted the number of websites to about 50 among the Alexa top-500 websites, to ensure statistical relevance of the collected samples for each page.

The two archives `` and `` correspond respectively to the Web View experiments and to the Web Page Test experiments.

Each archive contains three files and one folder:
- `config.csv`: Description of parameters and conditions for each run
- `metrics.csv`: Values of the different metrics collected by the browser
- `progressionCurves.csv`: Progression curves of the bytes progress as seen by the network, from 0 to 10 seconds in steps of 100 milliseconds
- `listUrl` folder: Indexes the sets of URLs
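
As a minimal sketch of how one extracted archive could be read, the snippet below loads the three CSV files with Python's standard `csv` module. The function name `load_archive` and the directory argument are hypothetical, and the files are assumed to be comma-separated with a header row; adjust the delimiter if your copy differs.

```python
import csv
from pathlib import Path

# File names taken from the list above.
FILES = ("config.csv", "metrics.csv", "progressionCurves.csv")

def load_archive(directory):
    """Read the three CSV files of one extracted archive into lists of dicts.

    `directory` is the (hypothetical) folder where the archive was unpacked.
    """
    tables = {}
    for name in FILES:
        path = Path(directory) / name
        # Assumption: comma-separated values with a header row.
        with open(path, newline="", encoding="utf-8") as handle:
            tables[name] = list(csv.DictReader(handle))
    return tables
```

Each table is a list of rows keyed by column name, with every cell kept as a string, so numeric columns need an explicit conversion before use.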

Regarding `config.csv`, the columns are:
- index: Index for this set of conditions
- location: Location of the machine
- listUrl: List of URLs, located in the folder listUrl
- browserUsed: Internet browser and version
- terminal: Desktop or Mobile
- collectionEnvironment: Identification of the collection environment
- networkConditionsTrafficShaping (WPT only): Whether native conditions or traffic shaping (4G, FIOS, 3GFast, DSL, or custom Emulator conditions)
- networkConditionsBandwidth (WPT only): Bandwidth of the network
- networkConditionsDelay (WPT only): Delay in the network
- networkConditions (WV only): Network conditions
- ipMode (WV only): Requested L3 protocol
- requestedProtocol (WV only): Requested L7 protocol
- adBlocker (WV only): Whether adBlocker is used or not
- winSize (WV only): Window size
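
The columns above can be used to select the runs made under a given set of conditions. The following is only a sketch: `select_runs` is a hypothetical helper, and the cell values in the toy rows (`"ipv4"`, `"True"`, and so on) are made up for illustration, since the actual encoding of each column should be checked in `config.csv`.

```python
def select_runs(config_rows, **criteria):
    """Return the config rows whose columns equal the given values.

    Comparison is on raw strings, as produced by csv.DictReader.
    """
    return [
        row for row in config_rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]

# Toy rows with hypothetical values; real cells may be encoded differently.
rows = [
    {"index": "0", "ipMode": "ipv4", "adBlocker": "True"},
    {"index": "1", "ipMode": "ipv6", "adBlocker": "False"},
]
ipv6_runs = select_runs(rows, ipMode="ipv6")
```

A column that is absent from a row (for example a WV-only column in the WPT archive) never matches, so the same helper can be applied to both archives.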

Regarding `metrics.csv`, the columns are:
- id: Unique identification of an experiment (consisting of an index 'set of conditions' and an index 'current page')
- DOM Content Loaded Event End (ms): DOM time
- First Paint (ms) (WV only): First paint time
- Load Event End (ms): Page Load Time from W3C
- RUM Speed Index (ms) (WV only): RUM Speed Index
- Speed Index (ms) (WPT only): Speed Index
- Time for Full Visual Rendering (ms) (WV only): Time for Full Visual Rendering
- Visible portion (%) (WV only): Visible portion
- Time to First Byte (ms) (WPT only): Time to First Byte
- Visually Complete (ms) (WPT only): Visually Complete used to compute the Speed Index
- aatf: aatf using ATF-chrome-plugin
- bi_aatf: bi_aatf using ATF-chrome-plugin
- bi_plt: bi_plt using ATF-chrome-plugin
- dom: dom using ATF-chrome-plugin
- ii_aatf: ii_aatf using ATF-chrome-plugin
- ii_plt: ii_plt using ATF-chrome-plugin
- last_css: last_css using ATF-chrome-plugin
- last_img: last_img using ATF-chrome-plugin
- last_js: last_js using ATF-chrome-plugin
- nb_ress_css: nb_ress_css using ATF-chrome-plugin
- nb_ress_img: nb_ress_img using ATF-chrome-plugin
- nb_ress_js: nb_ress_js using ATF-chrome-plugin
- num_origins: num_origins using ATF-chrome-plugin
- num_ressources: num_ressources using ATF-chrome-plugin
- oi_aatf: oi_aatf using ATF-chrome-plugin
- oi_plt: oi_plt using ATF-chrome-plugin
- plt: plt using ATF-chrome-plugin
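
Because several metrics exist only in one of the two archives, a given column may be empty or missing for some rows. A minimal sketch of a summary statistic that tolerates this is shown below; `column_median` is a hypothetical helper, and treating empty or non-numeric cells as missing is an assumption about how absent values are stored.

```python
import math
import statistics

def column_median(rows, column):
    """Median of a numeric column, skipping missing or non-numeric cells."""
    values = []
    for row in rows:
        cell = (row.get(column) or "").strip()
        try:
            value = float(cell)
        except ValueError:
            continue  # empty or non-numeric cell: treated as missing
        if not math.isnan(value):
            values.append(value)
    return statistics.median(values) if values else None

# Toy rows standing in for metrics.csv; the values are made up.
toy_metrics = [
    {"id": "a", "Load Event End (ms)": "1200"},
    {"id": "b", "Load Event End (ms)": ""},
    {"id": "c", "Load Event End (ms)": "800"},
    {"id": "d", "Load Event End (ms)": "1000"},
]
median_plt = column_median(toy_metrics, "Load Event End (ms)")
```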

Regarding `progressionCurves.csv`, the columns are:
- id: Unique identification of an experiment (consisting of an index 'set of conditions' and an index 'current page')
- url: URL of the current page. SUBPAGE stands for a path.
- run: Current run (linked with index of the config for WPT)
- filename: Filename of the pcap
- fullname: Fullname of the pcap
- har_size: Size of the HAR for this experiment
- pagedata_size: Size of the page data report
- pcap_size: Size of the pcap
- App Byte Index (ms): Application Byte Index as computed from the HAR file (in the browser)
- bytesIn_APP: Total bytes in as seen in the browser
- bytesIn_NET: Total bytes in as seen in the network
- X_BI_net: Network Byte Index computed from the pcap file (in the network)
- X_bin_0_for_B_completion to X_bin_99_for_B_completion: X_bin_k_for_B_completion is the bytes progress reached after k*100 milliseconds
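
To illustrate how the 100 bins relate to a byte-index style metric, the sketch below rebuilds the progression curve of one row and integrates the remaining fraction of bytes over time with a left Riemann sum. It assumes the bins hold a completion fraction between 0 and 1 (divide by 100 first if they turn out to be percentages), and the helper names are hypothetical. The result is an approximation over the 10-second window and is not guaranteed to reproduce the `X_BI_net` column.

```python
STEP_MS = 100  # spacing between bins, per the column description
N_BINS = 100   # X_bin_0 ... X_bin_99

def curve_from_row(row):
    """Extract the byte progression curve of one row as a list of floats."""
    return [float(row[f"X_bin_{k}_for_B_completion"]) for k in range(N_BINS)]

def byte_index_ms(curve, step_ms=STEP_MS):
    """Left Riemann sum of (1 - completion) over the sampled window, in ms.

    Assumption: each value is a fraction in [0, 1]; values are clamped
    to that range before integration.
    """
    return sum((1.0 - min(max(x, 0.0), 1.0)) * step_ms for x in curve)

# Toy row: nothing received during the first second, everything afterwards.
toy_row = {
    f"X_bin_{k}_for_B_completion": ("0" if k < 10 else "1")
    for k in range(N_BINS)
}
toy_bi = byte_index_ms(curve_from_row(toy_row))
```

A page whose bytes arrive early yields a small value, while a page still downloading at the end of the window saturates near 10,000 ms.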

If you use these datasets in your research, please cite the following paper:

title={Revealing QoE of Web Users from Encrypted Network Traffic},
author={Huet, Alexis and Saverimoutou, Antoine and Ben Houidi, Zied and Shi, Hao and Cai, Shengming and Xu, Jinchun and Mathieu, Bertrand and Rossi, Dario},
booktitle={2020 IFIP Networking Conference (IFIP Networking)},