Curlie Dataset - Language-agnostic Website Embedding and Classification

Version 5 2023-01-24, 22:04

Version 4 2023-01-24, 21:55

Version 3 2022-03-24, 07:04

Version 2 2022-03-24, 06:53

Version 1 2022-03-23, 20:38

dataset

posted on 2022-03-24, 07:04 authored by Sylvain Lugeon, Tiziano Piccardi

**************** Full Curlie dataset ****************

This dataset contains the URLs scraped from curlie.org along with their multilingual labels. Each label corresponds to the sub-category under which the URL was referenced in Curlie. We also provide a mapping between English labels and labels from other languages for alignment. The URLs have been filtered to contain only homepages. Each distinct URL is indexed with a unique identifier (uid).

curlie.csv.gz > [url, uid, label, lang] x 2,275,150 samples
mapping.json.gz > [english_label, matchings] x 35,946 labels
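A minimal sketch of loading the URL table with pandas and summarising it. It assumes that curlie.csv.gz has a header row naming the four columns listed above (url, uid, label, lang); if the file has no header, pass names=[...] to read_csv instead. The function names are illustrative and not part of the release.

```python
import pandas as pd


def load_curlie(path):
    """Load the gzip-compressed URL table.

    Assumption: the file has a header row with the columns
    url, uid, label, lang (as listed in the dataset description).
    """
    return pd.read_csv(path, compression="gzip")


def samples_per_language(df):
    """Number of (url, label) rows per language code, largest first."""
    return df.groupby("lang")["uid"].count().sort_values(ascending=False)


def count_distinct_urls(df):
    """Number of distinct URLs, using the uid column as the URL identifier."""
    return df["uid"].nunique()


# Example usage (requires the downloaded file):
# df = load_curlie("curlie.csv.gz")
# print(samples_per_language(df).head())
# print(count_distinct_urls(df))
```

Because the same URL can appear under several labels and languages, the row count and the number of distinct uids generally differ.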

**************** Processed Curlie dataset ****************

This section contains the data used to train Homepage2Vec. The URLs were filtered further: websites listed under the Regional top-category were dropped, as were inaccessible websites. This filtering yields 1,018,207 valid URLs. The labels are aligned across languages and reduced to the 14 top-categories (classes).

Because a URL can belong to several classes, the classes of each URL are encoded as a binary vector. Grouping the labels by URL yields 885,582 distinct URLs, and we provide the HTML content for each of them. We also provide a visual encoding, obtained by forwarding a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet.

The training and testing sets are also given.

curlie_filtered.csv.gz > [url, uid, label, lang] x 1,018,207 samples

class_vector.json.gz > [uid, class_vector] x 885,582 samples
class_names.txt > [class_name] x 14 classes

html_content.json.gz > [uid, html] x 885,582 samples
visual_encoding.json.gz > [uid, visual_encoding] x 885,582 samples

train_uid.txt > [uid] x 797,023 samples

test_uid.txt > [uid] x 88,559 samples
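As a rough illustration of how the binary class vectors and the train/test uid lists fit together, the sketch below builds one multi-hot label matrix per split. It assumes that the uid text files hold one uid per line and that each class_vector is a list of 0/1 values; uids are compared as strings so the code works whether they are stored as numbers or text. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def read_uids(path):
    """Read a uid list (assumed: one uid per line) into a set of strings."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def split_labels(class_vector_path, train_uid_path, test_uid_path):
    """Return (train_matrix, test_matrix) of multi-hot class labels."""
    df = pd.read_json(
        class_vector_path, orient="records", lines=True, compression="gzip"
    )
    # Compare uids as strings so numeric and textual uids both match.
    df["uid"] = df["uid"].astype(str)

    train = df[df["uid"].isin(read_uids(train_uid_path))]
    test = df[df["uid"].isin(read_uids(test_uid_path))]

    def to_matrix(part):
        # One row per URL, one column per class (14 in the release).
        return np.array(part["class_vector"].tolist(), dtype=np.int8)

    return to_matrix(train), to_matrix(test)


# Example usage (requires the downloaded files):
# y_train, y_test = split_labels(
#     "class_vector.json.gz", "train_uid.txt", "test_uid.txt"
# )
# print(y_train.shape, y_test.shape)
```

Summing a matrix over its rows gives the number of URLs per class in that split, which is a quick check of class balance.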

**************** Pretrained Homepage2Vec ****************

h2v_1000_100.zip > Model pretrained on all features

h2v_1000_100_text_only.zip > Model pretrained only on textual features (no visual features from screenshots)

**************** Enriched Curlie dataset ****************

Using Homepage2Vec, we release an enriched version of Curlie. Each URL is associated with a class probability vector and an embedding in the latent space.

outputs.json.gz > [url, uid, prediction, embedding] x 885,582 samples
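The two fields of the enriched dataset lend themselves to two simple operations, sketched below: turning a class probability vector into a set of class names, and finding the websites whose embeddings are closest to a given one. Both are illustrative assumptions rather than the authors' procedure: the 0.5 threshold is an arbitrary choice, cosine similarity is one of several reasonable metrics, and the prediction vector is assumed to follow the order of class_names.txt.

```python
import numpy as np


def predicted_classes(prediction, class_names, threshold=0.5):
    """Names of the classes whose probability reaches the threshold.

    Assumptions: `prediction` is ordered like `class_names`, and 0.5 is
    only a default chosen for illustration.
    """
    return [name for name, p in zip(class_names, prediction) if p >= threshold]


def cosine_nearest(embeddings, query_index, k=5):
    """Indices of the k rows most similar to row `query_index`."""
    X = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = X @ X[query_index]
    sims[query_index] = -np.inf  # exclude the query itself
    return np.argsort(-sims)[:k]


# Example usage (requires the downloaded files):
# import pandas as pd
# out = pd.read_json("outputs.json.gz", orient="records",
#                    lines=True, compression="gzip")
# names = open("class_names.txt").read().split("\n")
# print(predicted_classes(out.loc[0, "prediction"], names))
# print(out.loc[cosine_nearest(out["embedding"].tolist(), 0), "url"])
```

The similarity search above is brute force and holds all embeddings in memory; for the full 885,582 rows an approximate nearest-neighbour index may be more practical.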

******** Notes ********

JSON files have one record per line and can be read with Pandas, e.g. pandas.read_json(file, orient='records', lines=True, compression='gzip')
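For the larger files (html_content.json.gz in particular), loading everything at once may not fit in memory. A possible alternative, sketched here, is to pass chunksize to pandas.read_json, which then returns an iterator of DataFrames instead of a single one. The function name and the default chunk size are illustrative choices.

```python
import pandas as pd


def iter_records(path, chunksize=10_000):
    """Yield records one at a time from a gzip-compressed JSON-lines file.

    With lines=True and a chunksize, pandas reads the file in batches
    of `chunksize` lines rather than loading it whole.
    """
    reader = pd.read_json(
        path,
        orient="records",
        lines=True,
        compression="gzip",
        chunksize=chunksize,
    )
    for chunk in reader:
        for record in chunk.to_dict(orient="records"):
            yield record


# Example usage (requires the downloaded file):
# for record in iter_records("html_content.json.gz"):
#     print(record["uid"], len(record["html"]))
#     break
```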