Wikidata Reference

Dataset posted on 2025-03-17 by Sven Hertling and Nandana Mihindukulasooriya

Dataset Summary

The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

Data Fields

Each row in the dataset consists of the following fields:

  • subject (str): The subject entity of the knowledge graph triple.
  • rel (str): The relation that connects the subject and object.
  • object (str): The object entity of the knowledge graph triple.
  • text (str): A natural language sentence that entails the given triple.
  • validation (str): LLM-based validation results (see the parsing sketch after this list), including:
    • Fluent Sentence(s): TRUE/FALSE
    • Subject mentioned in Text: TRUE/FALSE
    • Relation mentioned in Text: TRUE/FALSE
    • Object mentioned in Text: TRUE/FALSE
    • Fact Entailed By Text: TRUE/FALSE
    • Final Answer: TRUE/FALSE
  • reference_url (str): URL of the web source from which the text was extracted.
  • subj_qid (str): Wikidata QID for the subject entity.
  • rel_id (str): Wikidata Property ID for the relation.
  • obj_qid (str): Wikidata QID for the object entity.
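
The exact serialization of the validation field is not spelled out in this card. Assuming it stores one "Label: TRUE/FALSE" entry per line, matching the labels listed above, a minimal parser could look like this:

```python
# Minimal sketch for turning the `validation` string into booleans.
# Assumption: one "Label: TRUE/FALSE" entry per line, using the labels
# listed above; any free-form chain-of-thought text is skipped.

def parse_validation(validation: str) -> dict:
    checks = {}
    for line in validation.splitlines():
        label, sep, verdict = line.partition(":")
        verdict = verdict.strip().upper()
        if sep and verdict in ("TRUE", "FALSE"):
            checks[label.strip()] = (verdict == "TRUE")
    return checks

example = (
    "Fluent Sentence(s): TRUE\n"
    "Subject mentioned in Text: TRUE\n"
    "Relation mentioned in Text: TRUE\n"
    "Object mentioned in Text: TRUE\n"
    "Fact Entailed By Text: TRUE\n"
    "Final Answer: TRUE"
)
assert parse_validation(example)["Final Answer"] is True
```

A row can then be kept only when parse_validation(row["validation"])["Final Answer"] is true.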

Dataset Creation

The dataset was created through the following process:

1. Triple-Reference Sampling and Extraction

  • All relations from Wikidata were extracted using SPARQL queries.
  • A sample of KG triples with associated reference URLs was collected for each relation (a query sketch follows this list).
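
The exact queries used are not reproduced in this card. As a minimal sketch, the snippet below samples referenced triples for one illustrative property (P69, "educated at") from the public Wikidata endpoint; statements are walked in their reified p:/ps: form because that is where prov:wasDerivedFrom references, and their pr:P854 reference URLs, are attached.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Sample triples of one relation together with their reference URLs.
# P69 ("educated at") is an illustrative choice; the dataset itself
# covers ~12.5K relations.
QUERY = """
SELECT ?subj ?obj ?refURL WHERE {
  ?subj p:P69 ?stmt .
  ?stmt ps:P69 ?obj ;
        prov:wasDerivedFrom/pr:P854 ?refURL .
}
LIMIT 100
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "triple-text-alignment-sketch/0.1"},
)
resp.raise_for_status()
for b in resp.json()["results"]["bindings"]:
    print(b["subj"]["value"], b["obj"]["value"], b["refURL"]["value"])
```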

2. Domain Analysis and Web Scraping

  • URLs were grouped by domain, and sampled pages were analyzed to determine their primary language.
  • English-language web pages were scraped and processed to extract plaintext content (see the sketch below).
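
The card does not name the scraping or language-identification tooling, so the sketch below uses requests, BeautifulSoup, and langdetect purely as stand-ins:

```python
import requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4
from langdetect import detect   # pip install langdetect

def fetch_english_plaintext(url: str):
    """Download a reference page and return its visible text,
    or None when the page is not detected as English.

    The libraries here are stand-ins; the pipeline's actual
    tooling is not specified in this card."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop non-visible markup before extraction
    text = soup.get_text(separator=" ", strip=True)
    return text if text and detect(text) == "en" else None
```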

3. LLM-Based Text Span Selection and Validation

  • LLMs were used to identify text spans from web content that correspond to KG triples.
  • A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple.
  • The validation process checked fluency, subject mention, relation mention, object mention, and final entailment (see the prompt sketch below).
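
The actual prompt wording and model are not reproduced in this card. As a rough sketch, a CoT validation prompt producing the checklist stored in the validation field might be assembled like this, where call_llm is a hypothetical placeholder for the model client:

```python
# Hypothetical shape of the CoT validation step; the real prompt and
# model used to build the dataset are not reproduced in this card.
VALIDATION_PROMPT = """\
Given a knowledge graph triple and a candidate text span, think step
by step, then answer each check with TRUE or FALSE.

Triple: ({subject}, {rel}, {object})
Text: {text}

Fluent Sentence(s):
Subject mentioned in Text:
Relation mentioned in Text:
Object mentioned in Text:
Fact Entailed By Text:
Final Answer:
"""

def validate_alignment(subject, rel, obj, text, call_llm):
    # `call_llm` is a hypothetical callable: prompt -> completion string.
    prompt = VALIDATION_PROMPT.format(
        subject=subject, rel=rel, object=obj, text=text
    )
    return call_llm(prompt)
```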

4. Final Dataset Statistics

  • 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs.
  • After filtering for English content, 458K triple-to-web-content pairs were processed with LLMs.
  • 80.5K validated triple-text alignments (roughly 18% of the LLM-processed pairs) were included in the final dataset.
