VoxEL.zip (264.32 kB)

VoxEL

dataset

posted on 2018-06-15, 03:00 authored by Henry Rosales Méndez, Aidan Hogan, Barbara Poblete

This dataset provides manual annotations, linked to Wikipedia, over the same text written in five languages: German (de), English (en), Spanish (es), French (fr) and Italian (it). It comprises 15 annotated news articles in each of the 5 languages (75 articles in total). The text has the same number of sentences in each language, and each sentence carries the same set of annotations as its counterparts in the other languages. Each language has a total of 94 sentences across the 15 articles.

We provide two annotated versions of the dataset: a strict version that annotates only persons, organizations and places (following, for example, traditional NER/MUC definitions of an entity), and a relaxed version that includes a larger number of annotations (e.g., capturing entity mentions such as “inflation” that have a corresponding Wikipedia article). Both versions have the same text in the same languages. The strict version has 204 annotations per language, while the relaxed version has 674 annotations per language.
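As a quick sanity check on the figures above, the following minimal sketch derives the corpus-wide totals from the per-language counts. It only does arithmetic on the numbers stated in this description; the constant and function names are illustrative assumptions, and nothing here reads or assumes the layout of the files inside VoxEL.zip.

```python
# Per-language figures copied from the dataset description.
# All names below are hypothetical and chosen only for this illustration.
LANGUAGES = ["de", "en", "es", "fr", "it"]
ARTICLES_PER_LANGUAGE = 15
SENTENCES_PER_LANGUAGE = 94
ANNOTATIONS_PER_LANGUAGE = {"strict": 204, "relaxed": 674}


def corpus_totals():
    """Multiply each per-language count by the number of languages."""
    n_languages = len(LANGUAGES)
    return {
        "articles": ARTICLES_PER_LANGUAGE * n_languages,
        "sentences": SENTENCES_PER_LANGUAGE * n_languages,
        "annotations": {
            version: count * n_languages
            for version, count in ANNOTATIONS_PER_LANGUAGE.items()
        },
    }


totals = corpus_totals()
print(totals)
```

Because every language carries the same sentences and annotations, the totals are simple multiples of the per-language counts; a loader for the actual files could compare its own counts against these values.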

If you use VoxEL in your research, please cite the following paper, which describes the dataset in detail:

Henry Rosales-Méndez, Aidan Hogan, Barbara Poblete. "VoxEL: A Benchmark Dataset for Multilingual Entity Linking". International Semantic Web Conference (ISWC), Monterey, United States, 2018. [http://aidanhogan.com/docs/voxel-entity-linking.pdf]

Funding

CONICYT-PCHA/Doctorado Nacional/2016-21160017, the Millennium Institute for Foundational Research on Data (IMFD) and Fondecyt Grant No. 1181896

Keywords

Multilingual Entity Linking Benchmark Dataset Applied Computer Science

Licence

CC BY 4.0
