Wikidata Reference

Dataset posted on 2025-03-17 by Sven Hertling and Nandana Mihindukulasooriya

Dataset Summary

The Triple-to-Text Alignment dataset aligns Knowledge Graph (KG) triples from Wikidata with diverse, real-world textual sources extracted from the web. Unlike previous datasets that rely primarily on Wikipedia text, this dataset provides a broader range of writing styles, tones, and structures by leveraging Wikidata references from various sources such as news articles, government reports, and scientific literature. Large language models (LLMs) were used to extract and validate text spans corresponding to KG triples, ensuring high-quality alignments. The dataset can be used for training and evaluating relation extraction (RE) and knowledge graph construction systems.

Data Fields

Each row in the dataset consists of the following fields:

  • subject (str): The subject entity of the knowledge graph triple.
  • rel (str): The relation that connects the subject and object.
  • object (str): The object entity of the knowledge graph triple.
  • text (str): A natural language sentence that entails the given triple.
  • validation (str): LLM-based validation results (see the parsing sketch after this list), including:
    • Fluent Sentence(s): TRUE/FALSE
    • Subject mentioned in Text: TRUE/FALSE
    • Relation mentioned in Text: TRUE/FALSE
    • Object mentioned in Text: TRUE/FALSE
    • Fact Entailed By Text: TRUE/FALSE
    • Final Answer: TRUE/FALSE
  • reference_url (str): URL of the web source from which the text was extracted.
  • subj_qid (str): Wikidata QID for the subject entity.
  • rel_id (str): Wikidata Property ID for the relation.
  • obj_qid (str): Wikidata QID for the object entity.
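
The exact serialization of the validation field is not spelled out in this card. Assuming it stores one "Label: TRUE/FALSE" entry per line, matching the labels listed above, a minimal parser could look like this:

```python
# Minimal sketch for turning the `validation` string into booleans.
# Assumption: one "Label: TRUE/FALSE" entry per line, using the labels
# listed above; any free-form chain-of-thought text is skipped.

def parse_validation(validation: str) -> dict:
    checks = {}
    for line in validation.splitlines():
        label, sep, verdict = line.partition(":")
        verdict = verdict.strip().upper()
        if sep and verdict in ("TRUE", "FALSE"):
            checks[label.strip()] = (verdict == "TRUE")
    return checks

example = (
    "Fluent Sentence(s): TRUE\n"
    "Subject mentioned in Text: TRUE\n"
    "Relation mentioned in Text: TRUE\n"
    "Object mentioned in Text: TRUE\n"
    "Fact Entailed By Text: TRUE\n"
    "Final Answer: TRUE"
)
assert parse_validation(example)["Final Answer"] is True
```

A row can then be kept only when parse_validation(row["validation"])["Final Answer"] is true.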

Dataset Creation

The dataset was created through the following process:

1. Triple-Reference Sampling and Extraction

  • All relations from Wikidata were extracted using SPARQL queries.
  • A sample of KG triples with associated reference URLs was collected for each relation (a query sketch follows this list).
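
The exact queries used are not reproduced in this card. As a minimal sketch, the snippet below samples referenced triples for one illustrative property (P69, "educated at") from the public Wikidata endpoint; statements are walked in their reified p:/ps: form because that is where prov:wasDerivedFrom references, and their pr:P854 reference URLs, are attached.

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Sample triples of one relation together with their reference URLs.
# P69 ("educated at") is an illustrative choice; the dataset itself
# covers ~12.5K relations.
QUERY = """
SELECT ?subj ?obj ?refURL WHERE {
  ?subj p:P69 ?stmt .
  ?stmt ps:P69 ?obj ;
        prov:wasDerivedFrom/pr:P854 ?refURL .
}
LIMIT 100
"""

resp = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "triple-text-alignment-sketch/0.1"},
)
resp.raise_for_status()
for b in resp.json()["results"]["bindings"]:
    print(b["subj"]["value"], b["obj"]["value"], b["refURL"]["value"])
```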

2. Domain Analysis and Web Scraping

  • URLs were grouped by domain, and sampled pages were analyzed to determine their primary language.
  • English-language web pages were scraped and processed to extract plaintext content (see the sketch below).
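
The card does not name the scraping or language-identification tooling, so the sketch below uses requests, BeautifulSoup, and langdetect purely as stand-ins:

```python
import requests
from bs4 import BeautifulSoup   # pip install beautifulsoup4
from langdetect import detect   # pip install langdetect

def fetch_english_plaintext(url: str):
    """Download a reference page and return its visible text,
    or None when the page is not detected as English.

    The libraries here are stand-ins; the pipeline's actual
    tooling is not specified in this card."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop non-visible markup before extraction
    text = soup.get_text(separator=" ", strip=True)
    return text if text and detect(text) == "en" else None
```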

3. LLM-Based Text Span Selection and Validation

  • LLMs were used to identify text spans from web content that correspond to KG triples.
  • A Chain-of-Thought (CoT) prompting method was applied to validate whether the extracted text entailed the triple.
  • The validation process checked fluency, subject mention, relation mention, object mention, and final entailment (see the prompt sketch below).
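
The actual prompt wording and model are not reproduced in this card. As a rough sketch, a CoT validation prompt producing the checklist stored in the validation field might be assembled like this, where call_llm is a hypothetical placeholder for the model client:

```python
# Hypothetical shape of the CoT validation step; the real prompt and
# model used to build the dataset are not reproduced in this card.
VALIDATION_PROMPT = """\
Given a knowledge graph triple and a candidate text span, think step
by step, then answer each check with TRUE or FALSE.

Triple: ({subject}, {rel}, {object})
Text: {text}

Fluent Sentence(s):
Subject mentioned in Text:
Relation mentioned in Text:
Object mentioned in Text:
Fact Entailed By Text:
Final Answer:
"""

def validate_alignment(subject, rel, obj, text, call_llm):
    # `call_llm` is a hypothetical callable: prompt -> completion string.
    prompt = VALIDATION_PROMPT.format(
        subject=subject, rel=rel, object=obj, text=text
    )
    return call_llm(prompt)
```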

4. Final Dataset Statistics

  • 12.5K Wikidata relations were analyzed, leading to 3.3M triple-reference pairs.
  • After filtering for English content, 458K triple-to-web-content pairs were processed with LLMs.
  • 80.5K validated triple-text alignments (roughly 18% of the LLM-processed pairs) were included in the final dataset.
