
Curlie Dataset - Language-agnostic Website Embedding and Classification

dataset
posted on 2023-01-24, 22:04, authored by Sylvain Lugeon, Tiziano Piccardi

**************** Full Curlie dataset **************** 

Curlie.org is presented as the largest human-edited directory of the Web. It contains over 3 million multilingual webpages classified in a hierarchical taxonomy that is language-specific but shares the same 14 top-level categories across languages. Unfortunately, the Curlie administrators do not provide a downloadable archive of this valuable content. Therefore, we decided to release our own dataset, the result of an in-depth scraping of the Curlie website.

This dataset contains webpage URLs along with the category path (label) under which they are referenced in Curlie. For example, the International Ski Federation website (www.fis-ski.com) is referenced under the category path Sports/Winter_Sports/Skiing/Associations. The category path is language-specific, and we provide a mapping between English and other languages for alignment. The URLs have been filtered to contain only homepages (URLs with an empty path). Each distinct URL is indexed with a unique identifier (uid).

curlie.csv.gz > [url, uid, label, lang] x 2,275,150 samples 

mapping.json.gz > [english_label, matchings] x 35,946 labels 

**************** Processed Curlie dataset **************** 

We provide here the ground-truth data used to train Homepage2Vec. The URLs have been further filtered: websites listed under the Regional top-level category are dropped, as are non-accessible websites. This filtering yields 933,416 valid entries.

The labels are aligned across languages and reduced to the 14 top-level categories (classes). There are 885,582 distinct URLs; the classes associated with each URL are represented as a binary class vector (a URL can belong to multiple classes). We provide the HTML content for each distinct URL. We also provide a visual encoding, obtained by forwarding a screenshot of the homepage through a ResNet deep-learning model pretrained on ImageNet. Finally, we provide the training and testing sets for reproducibility.

curlie_filtered.csv.gz > [url, uid, label, lang] x 933,416 samples 

class_vector.json.gz > [uid, class_vector] x 885,582 samples 

html_content.json.gz > [uid, html] x 885,582 samples 

visual_encoding.json.gz > [uid, visual_encoding] x 885,582 samples

class_names.txt > [class_name] x 14 classes 

train_uid.txt > [uid] x 797,023 samples

test_uid.txt > [uid] x 88,559 samples 
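A minimal sketch of decoding a binary class vector into class names. The category names and their ordering below are an assumption for illustration only; the authoritative list and order are in class_names.txt:

```python
# Assumed: the 14 top-level Curlie categories after Regional is dropped,
# in the order they would appear in class_names.txt (verify against the
# actual file before relying on this ordering).
CLASS_NAMES = [
    "Arts", "Business", "Computers", "Games", "Health", "Home",
    "Kids_and_Teens", "News", "Recreation", "Reference", "Science",
    "Shopping", "Society", "Sports",
]

def decode_class_vector(class_vector, class_names=CLASS_NAMES):
    """Return the class names whose flag is set in the binary vector."""
    return [name for name, flag in zip(class_names, class_vector) if flag]

# Hypothetical record mirroring the class_vector.json.gz schema
record = {"uid": 42, "class_vector": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]}
print(decode_class_vector(record["class_vector"]))  # ['Arts', 'Sports']
```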

**************** Enriched Curlie dataset **************** 

Thanks to Homepage2Vec, we release an enriched version of Curlie. For each distinct URL, we provide the class probability vector (14 classes) and the latent space embedding (100 dimensions). 

outputs.json.gz > [uid, url, score, embedding] x 885,582 samples
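Each record in outputs.json.gz carries both a 14-dimensional probability vector and a 100-dimensional embedding. A sketch of two typical uses, thresholding the scores into predicted classes and comparing embeddings by cosine similarity; the 0.5 threshold is an arbitrary assumption to be tuned per application:

```python
import math

def top_classes(score, threshold=0.5):
    """Indices of classes whose predicted probability reaches the threshold.

    The 0.5 cutoff is an illustrative default, not a value prescribed by
    the dataset authors.
    """
    return [i for i, s in enumerate(score) if s >= threshold]

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, e.g. to find
    websites with similar topical profiles in the latent space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Ranking all other URLs by cosine similarity to a query embedding gives a simple nearest-neighbour search over the enriched dataset.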

**************** Pretrained Homepage2Vec ****************

h2v_1000_100.zip > Model pretrained on all features

h2v_1000_100_text_only.zip > Model pretrained only on textual features (no visual features from screenshots)

**************** Notes **************** 

CSV files can be read with Python:

import pandas as pd

df = pd.read_csv("curlie.csv.gz", index_col=0)

JSON files have one record per line and can be read with Python:

import json

import gzip

with gzip.open("html_content.json.gz", "rt", encoding="utf-8") as file:

    for line in file:

        data = json.loads(line)

        ...  # process each record here
