europepmc-corpus-master.zip (28.84 MB)

Europe PMC Full Text Corpus

Version 2 2023-05-25, 15:03
Version 1 2023-05-25, 14:46
dataset
posted on 2023-05-25, 15:03 authored by Santosh Tirunagari, Xiao Yang, Shyamasree Saha, Aravind Venkatesan, Vid Vartak, Johanna McEntyre

This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article has been manually annotated by curators with 3 core entity types: Gene/Protein, Disease and Organism.


Corpus Directory Structure


annotations/: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.
 

  • hypothesis/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
    • GROUP0/: contains raw manual annotations made by curator GROUP0.
    • GROUP1/: contains raw manual annotations made by curator GROUP1.
    • GROUP2/: contains raw manual annotations made by curator GROUP2.
  • IOB/: contains annotations in Inside–Outside–Beginning (IOB) tagging format, automatically extracted from the raw manual annotations in hypothesis/csv/.
    • dev/: contains IOB format annotations of 45 articles, intended for use as a development set in machine learning tasks.
    • test/: contains IOB format annotations of 45 articles, intended for use as a test set in machine learning tasks.
    • train/: contains IOB format annotations of 210 articles, intended for use as a training set in machine learning tasks.
  • JSON/: contains annotations in JSON format, automatically extracted from the raw manual annotations in hypothesis/csv/.
  • README.md: a detailed description of all the annotation formats.
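As an illustration of how files in IOB tagging format are typically consumed, here is a minimal Python sketch. It assumes one token and one tag per line separated by whitespace, with a blank line between sentences, and it uses hypothetical tag labels (GP, DS, OG). None of these details is confirmed here; annotations/README.md is the authoritative description of the actual layout.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_iob(lines: Iterable[str]) -> List[Sentence]:
    """Group (token, tag) pairs into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the tag is the last whitespace-separated field.
        token, tag = line.rsplit(None, 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collapse B-/I- runs into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    etype = None
    for token, tag in sentence:
        if tag.startswith("B-"):
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == etype:
            tokens.append(token)
        else:
            # "O" tag (or a stray I- tag) closes any open entity.
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities


# Hypothetical sample; the labels GP/DS/OG are illustrative only.
sample = "BRCA1\tB-GP\nmutations\tO\nin\tO\nbreast\tB-DS\ncancer\tI-DS\n\nHomo\tB-OG\nsapiens\tI-OG\n"
sentences = read_iob(sample.splitlines())
first_entities = extract_entities(sentences[0])
```

To read a real file, the same function can be passed an open file handle, for example read_iob(open(path, encoding="utf-8")), where path points to a file under IOB/train/.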


articles/: contains the full-text articles annotated in the Europe PMC corpus.
 

  • Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
  • XML/: contains XML articles fetched directly using the Europe PMC Articles RESTful API.
  • README.md: a detailed description of the sentencising and fetching of XML articles.
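For readers who want to retrieve an article's XML themselves, the following Python sketch builds a request URL for the public Europe PMC RESTful API. The endpoint pattern (rest/{PMCID}/fullTextXML) reflects the public API as I understand it and is an assumption here, not the corpus authors' documented procedure, which is described in articles/README.md. The helper names and the identifier PMC1234567 are hypothetical.

```python
import urllib.request

# Assumed base URL of the public Europe PMC RESTful API.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def full_text_xml_url(pmcid: str) -> str:
    """Build the full-text XML URL for a PMC identifier (with or without 'PMC')."""
    pmcid = pmcid.strip().upper()
    if not pmcid.startswith("PMC"):
        pmcid = "PMC" + pmcid
    return f"{BASE_URL}/{pmcid}/fullTextXML"


def fetch_full_text_xml(pmcid: str, timeout: float = 30.0) -> bytes:
    """Download the article XML; requires network access and an Open Access article."""
    with urllib.request.urlopen(full_text_xml_url(pmcid), timeout=timeout) as resp:
        return resp.read()


# Hypothetical identifier, used only to show the URL shape (no request is made).
example_url = full_text_xml_url("1234567")
```

Keeping URL construction separate from the network call makes the URL logic easy to check offline.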


docs/: contains related documents that were used for generating the corpus.
 

  • Annotation guideline.pdf: annotation guideline that is provided to curators to assist the manual annotation.
  • demo to molecular conenctions.pdf: annotation platform guideline that is provided to curators to help them get familiar with the Hypothes.is platform.
  • Training set development.pdf: initial document that details the paper selection procedures.


pilot/: contains annotations and articles that were used in a pilot study.
 

  • annotations/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
  • articles/: contains the full-text articles annotated in the pilot study.  
    • Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
    • XML/: contains XML articles fetched directly using the Europe PMC Articles RESTful API.
  • README.md: a detailed description of the sentencising and fetching of XML articles.


src/: contains source code for cleaning annotations and generating IOB files.
 

  • metrics/ner_metrics.py: Python script containing SemEval evaluation metrics.
  • annotations.py: Python script used to extract annotations from raw Hypothes.is annotations.
  • generate_IOB_dataset.py: Python script used to convert JSON format annotations to IOB tagging format.
  • generate_json_dataset.py: Python script used to extract annotations to JSON format.
  • hypothesis.py: Python script used to fetch raw Hypothes.is annotations.
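The general idea behind converting span annotations to IOB tags can be sketched in Python as follows. This is not the code in generate_IOB_dataset.py: the whitespace tokenisation, the (start, end, label) span layout, and the labels GP and DS are all assumptions made for illustration.

```python
import re
from typing import List, Tuple


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Whitespace tokenisation that keeps character offsets (an assumption)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_iob(tokens: List[Tuple[str, int, int]],
                 spans: List[Tuple[int, int, str]]) -> List[str]:
    """Tag each token B-/I-label if it lies fully inside a span, else O."""
    tags = ["O"] * len(tokens)
    for span_start, span_end, label in spans:
        first = True
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            if tok_start >= span_start and tok_end <= span_end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags


# Hypothetical sentence and annotations; offsets are computed, not hard-coded.
text = "Mutations in BRCA1 cause breast cancer"
gene_start = text.index("BRCA1")
disease_start = text.index("breast cancer")
spans = [
    (gene_start, gene_start + len("BRCA1"), "GP"),
    (disease_start, disease_start + len("breast cancer"), "DS"),
]
tokens = tokenize_with_offsets(text)
tags = spans_to_iob(tokens, spans)
```

A token that only partially overlaps a span is left as O in this sketch; a production converter would need an explicit policy for such boundary mismatches.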


License


CC BY


Feedback


For any comments, questions, or suggestions, please contact us at helpdesk@europepmc.org or via the Europe PMC contact page.

Funding

Europe PMC 2021-2026 (221523)

Wellcome Trust Open Targets
