PROCAT.zip (555.88 MB)

PROCAT: Product Catalogue Dataset for Implicit Clustering, Permutation Learning and Structure Prediction

Version 4 2021-11-01, 12:48
Version 3 2021-11-01, 12:42
Version 2 2021-07-10, 12:49
Version 1 2021-06-01, 08:58
dataset
posted on 2021-11-01, 12:48 authored by Mateusz Jurewicz, Leon Derczynski
We introduce PROCAT, a novel e-commerce dataset containing 10,000 expertly designed product catalogues, each consisting of individual product offers grouped into complementary sections. We aim to address the scarcity of existing datasets for set-to-sequence machine learning tasks, which require complex structure prediction through both implicit clustering and outputting permutations of the predicted clusters. The task's difficulty is further increased by the need to handle rare and unseen offer instances, as well as variable catalogue lengths and cluster counts. PROCAT provides catalogue data consisting of over 1.5 million offers across a four-year period, both in raw text form and as pre-processed features containing information about relative visual placement. In the related paper we include initial experimental results on a proposed benchmark task, testing a number of joint set encoding and permutation learning model architectures.
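
To make the set-to-sequence framing concrete, the following is a minimal sketch of how an ordered, sectioned catalogue could be turned into an unordered input set plus a target permutation and section assignment. Everything in it is an illustrative assumption: the toy offers, the function names `to_set_to_sequence_example` and `reconstruct`, and the representation itself are hypothetical and do not reflect the actual PROCAT file format or the models in the related paper. It also assumes offer names are unique within a catalogue and that no section is empty.

```python
import random

# Hypothetical toy catalogue, NOT the real PROCAT schema.
# Each inner list is one section (cluster); the order of the sections and of
# the offers inside each section is the structure a model would have to predict.
toy_catalogue = [
    ["pasta", "tomato sauce", "parmesan"],
    ["shampoo", "conditioner"],
    ["cola", "crisps", "chocolate", "peanuts"],
]


def to_set_to_sequence_example(catalogue, seed=0):
    """Return (shuffled_offers, permutation, section_ids) for one catalogue.

    shuffled_offers: the offers in random order, standing in for an input set.
    permutation:     for each output position, the index of the offer in
                     shuffled_offers that belongs there.
    section_ids:     for each output position, the section it falls into.
    """
    ordered = [offer for section in catalogue for offer in section]
    shuffled = ordered[:]
    random.Random(seed).shuffle(shuffled)

    # Assumes offers are unique, so each offer maps to exactly one index.
    index_of = {offer: i for i, offer in enumerate(shuffled)}
    permutation = [index_of[offer] for offer in ordered]
    section_ids = [s for s, section in enumerate(catalogue) for _ in section]
    return shuffled, permutation, section_ids


def reconstruct(shuffled, permutation, section_ids):
    """Rebuild the sectioned catalogue from the set and the two targets."""
    catalogue = []
    for offer_index, section in zip(permutation, section_ids):
        if section == len(catalogue):  # first offer of a new section
            catalogue.append([])
        catalogue[section].append(shuffled[offer_index])
    return catalogue


shuffled, permutation, section_ids = to_set_to_sequence_example(toy_catalogue)
print(reconstruct(shuffled, permutation, section_ids) == toy_catalogue)
```

In this sketch the permutation captures the ordering part of the task and the section ids capture the clustering part; a model would see only the shuffled offers and would have to predict both.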

For details regarding the dataset structure and format, please see:
https://github.com/mateuszjurewicz/procat#dataset-structure

Funding

Innovation Fund Denmark (grant number 9065-00017B)
