Wikidata Reference
Dataset Summary
The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.
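A minimal loading sketch using the Hugging Face `datasets` library; the repository id below is a placeholder, not the dataset's actual name:

```python
from datasets import load_dataset

# Placeholder repository id -- substitute the dataset's actual location on the Hub.
ds = load_dataset("<org>/triple-to-text-alignment", split="train")

row = ds[0]
print(row["subject"], row["rel"], row["object"])  # the aligned KG triple
print(row["text"])                                # the supporting text span
```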
Data Fields
Each row in the dataset consists of the following fields:
- subject (str): The subject entity of the knowledge graph triple.
- rel (str): The relation that connects the subject and object.
- object (str): The object entity of the knowledge graph triple.
- text (str): A natural language sentence that entails the given triple.
- validation (str): LLM-based validation results (see the parsing sketch after this list), including:
  - Fluent Sentence(s): TRUE/FALSE
  - Subject mentioned in Text: TRUE/FALSE
  - Relation mentioned in Text: TRUE/FALSE
  - Object mentioned in Text: TRUE/FALSE
  - Fact Entailed By Text: TRUE/FALSE
  - Final Answer: TRUE/FALSE
- reference_url (str): URL of the web source from which the text was extracted.
- subj_qid (str): Wikidata QID for the subject entity.
- rel_id (str): Wikidata Property ID for the relation.
- obj_qid (str): Wikidata QID for the object entity.
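Because the per-check results are packed into a single string, downstream code usually needs to split them apart. A minimal parsing sketch, assuming each check is serialized as a `Name: TRUE` or `Name: FALSE` line (the exact serialization may differ):

```python
import re

# Matches check names such as "Fluent Sentence(s)" followed by TRUE or FALSE.
_CHECK = re.compile(r"([A-Za-z() ]+?):\s*(TRUE|FALSE)")

def parse_validation(validation: str) -> dict:
    """Turn the raw validation string into a {check name: bool} mapping."""
    return {name.strip(): flag == "TRUE"
            for name, flag in _CHECK.findall(validation)}
```

Filtering on the `Final Answer` key would then keep only fully validated alignments.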
Dataset Creation
The dataset was created through the following process:
1. Triple-Reference Sampling and Extraction
- All relations from Wikidata were extracted using SPARQL queries.
- A sample of KG triples with associated reference URLs was collected for each relation.
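The exact queries used are not reproduced here; a minimal sketch of this sampling step, assuming SPARQLWrapper against the public Wikidata endpoint and the standard statement/reference model (`prov:wasDerivedFrom` links a statement to its reference node; property P854 holds the reference URL):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

WDQS = "https://query.wikidata.org/sparql"

def sample_referenced_triples(prop_id: str, limit: int = 100):
    """Sample (subject, object, reference URL) tuples for one Wikidata property."""
    query = f"""
    SELECT ?subject ?object ?refURL WHERE {{
      ?subject p:{prop_id} ?statement .
      ?statement ps:{prop_id} ?object ;
                 prov:wasDerivedFrom ?ref .
      ?ref pr:P854 ?refURL .        # P854 = "reference URL"
    }}
    LIMIT {limit}
    """
    client = SPARQLWrapper(WDQS, agent="triple-to-text-sampler/0.1")
    client.setQuery(query)
    client.setReturnFormat(JSON)
    bindings = client.query().convert()["results"]["bindings"]
    return [(b["subject"]["value"], b["object"]["value"], b["refURL"]["value"])
            for b in bindings]
```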
2. Domain Analysis and Web Scraping
- URLs were grouped by domain, and sampled pages were analyzed to determine their primary language.
- English-language web pages were scraped and processed to extract plaintext content.
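The scraping and language-identification tooling is not specified; a sketch of this step with `langdetect` and `trafilatura` standing in for whatever was actually used:

```python
from collections import defaultdict
from urllib.parse import urlparse

import trafilatura             # plaintext extraction (assumed tooling)
from langdetect import detect  # language identification (assumed tooling)

def group_by_domain(urls):
    """Bucket reference URLs by domain so pages can be sampled per domain."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return groups

def scrape_english_plaintext(url):
    """Return extracted plaintext if the page is English, else None."""
    downloaded = trafilatura.fetch_url(url)
    text = trafilatura.extract(downloaded) if downloaded else None
    if not text:
        return None
    try:
        return text if detect(text) == "en" else None
    except Exception:  # langdetect raises on very short or unusual input
        return None
```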
3. LLM-Based Text Span Selection and Validation
- LLMs were used to identify text spans from web content that correspond to KG triples.
- A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple.
- The validation process included checking for fluency, subject mention, relation mention, object mention, and final entailment.
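The published prompt itself is not reproduced here; an illustrative CoT validation prompt whose checks mirror the `validation` field described above might look like:

```python
# Illustrative prompt template (hypothetical wording, not the paper's exact prompt);
# the check names deliberately match the `validation` field.
COT_VALIDATION_PROMPT = """\
Triple: ({subject}, {rel}, {object})
Candidate text: {text}

Reason step by step, then answer each check with TRUE or FALSE:
- Fluent Sentence(s):
- Subject mentioned in Text:
- Relation mentioned in Text:
- Object mentioned in Text:
- Fact Entailed By Text:
- Final Answer:
"""

def build_validation_prompt(subject, rel, obj, text):
    return COT_VALIDATION_PROMPT.format(subject=subject, rel=rel,
                                        object=obj, text=text)
```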
4. Final Dataset Statistics
- 12.5K Wikidata relations were analyzed, yielding 3.3M triple-reference pairs.
- After filtering for English content, 458K triple-web-content pairs were processed with LLMs.
- 80.5K validated triple-text alignments were included in the final dataset.