Europe PMC Full Text Corpus
This repository contains the Europe PMC full-text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article is manually annotated by curators with three core entity types: Gene/Protein, Disease, and Organism.
Corpus Directory Structure
annotations/
: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.
hypothesis/csv/
: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
GROUP0/
: contains raw manual annotations made by curator GROUP0.
GROUP1/
: contains raw manual annotations made by curator GROUP1.
GROUP2/
: contains raw manual annotations made by curator GROUP2.
IOB/
: contains annotations automatically extracted from the raw manual annotations in hypothesis/csv/, in Inside–Outside–Beginning (IOB) tagging format.
dev/
: contains IOB-format annotations of 45 articles, intended for use as a development set in machine learning tasks.
test/
: contains IOB-format annotations of 45 articles, intended for use as a test set in machine learning tasks.
train/
: contains IOB-format annotations of 210 articles, intended for use as a training set in machine learning tasks.
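As a rough illustration of how the IOB files might be consumed, the sketch below reads token/tag lines and collapses them into entity spans. It assumes one token per line with the tag in the last whitespace-separated column and blank lines between sentences. This layout, the function names, and the labels GP, DS and OG are hypothetical, so check annotations/README.md for the actual format.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]        # (token, IOB tag)
Span = Tuple[str, int, int]    # (entity type, start token, end token exclusive)


def read_iob(lines: Iterable[str]) -> List[List[Token]]:
    """Group (token, tag) pairs into sentences; a blank line ends a sentence."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: first column is the token, last column is the tag.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob_to_spans(sentence: List[Token]) -> List[Span]:
    """Collapse B-/I- tags into entity spans over token indices."""
    spans: List[Span] = []
    start, label = None, None
    for i, (_, tag) in enumerate(sentence):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue  # the open entity simply grows by one token
        if label is not None:
            spans.append((label, start, i))  # close the open entity
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            # A stray I- tag without a matching B- is treated leniently as a start.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(sentence)))
    return spans


# Hypothetical sample in the assumed layout (labels are illustrative only).
sample = (
    "BRCA1\tB-GP\nmutations\tO\nin\tO\nbreast\tB-DS\ncancer\tI-DS\n"
    "\n"
    "Homo\tB-OG\nsapiens\tI-OG\n"
)
sentences = read_iob(sample.splitlines())
all_spans = [iob_to_spans(s) for s in sentences]
```

Working with token-index spans rather than raw tags makes it easier to compare gold and predicted entities later on.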
JSON/
: contains annotations automatically extracted from the raw manual annotations in hypothesis/csv/, in JSON format.
README.md
: a detailed description of all the annotation formats.
articles/
: contains the full-text articles annotated in the Europe PMC corpus.
Sentencised/
: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
XML/
: contains XML articles fetched directly using the Europe PMC Articles RESTful API.
README.md
: a detailed description of the sentencising and fetching of XML articles.
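For readers who want to retrieve an article themselves, the following minimal sketch shows one way to request full-text XML over HTTP. It assumes the public Europe PMC REST endpoint of the form `/rest/{PMCID}/fullTextXML`; the helper names and the example identifier are hypothetical, and this is not necessarily how the corpus files were fetched (see articles/README.md for that).

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the Europe PMC RESTful web service.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def full_text_xml_url(pmcid: str) -> str:
    """Build the full-text XML URL for a PMC identifier such as 'PMC1234567'."""
    pmcid = pmcid.strip().upper()
    if not pmcid.startswith("PMC"):
        pmcid = "PMC" + pmcid  # accept bare numeric identifiers too
    return f"{BASE_URL}/{quote(pmcid)}/fullTextXML"


def fetch_full_text_xml(pmcid: str, timeout: float = 30.0) -> bytes:
    """Download the article XML as raw bytes (requires network access)."""
    with urlopen(full_text_xml_url(pmcid), timeout=timeout) as response:
        return response.read()


# 'PMC1234567' is a placeholder identifier used only for illustration.
example_url = full_text_xml_url("pmc1234567")
```

Calling `fetch_full_text_xml` only works for articles in the Open Access subset, and the returned bytes can be written straight to an `.xml` file.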
docs/
: contains related documents that were used for generating the corpus.
Annotation guideline.pdf
: annotation guideline provided to curators to assist manual annotation.
demo to molecular conenctions.pdf
: annotation platform guideline provided to curators to help them become familiar with the Hypothes.is platform.
Training set development.pdf
: initial document that details the paper selection procedures.
pilot/
: contains annotations and articles that were used in a pilot study.
annotations/csv/
: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
articles/
: contains the full-text articles annotated in the pilot study.
Sentencised/
: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
XML/
: contains XML articles fetched directly using the Europe PMC Articles RESTful API.
README.md
: a detailed description of the sentencising and fetching of XML articles.
src/
: source code for cleaning annotations and generating IOB files.
metrics/ner_metrics.py
: Python script containing SemEval evaluation metrics.
annotations.py
: Python script used to extract annotations from raw Hypothes.is annotations.
generate_IOB_dataset.py
: Python script used to convert JSON format annotations to IOB tagging format.
generate_json_dataset.py
: Python script used to extract annotations to JSON format.
hypothesis.py
: Python script used to fetch raw Hypothes.is annotations.
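To give a feel for entity-level evaluation, the sketch below computes precision, recall and F1 under a "strict" matching scheme, where a predicted entity counts as correct only if both its boundaries and its type equal a gold entity. This is an independent illustration and not the contents of metrics/ner_metrics.py; the function name, the span representation and the labels are assumptions.

```python
from typing import Dict, Iterable, Tuple

# (entity type, start token index, end token index exclusive)
Span = Tuple[str, int, int]


def strict_scores(gold: Iterable[Span], predicted: Iterable[Span]) -> Dict[str, float]:
    """Strict matching: type and both boundaries must agree exactly."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical spans: one exact match, one boundary error, one spurious entity.
gold_spans = [("GP", 0, 1), ("DS", 3, 5)]
predicted_spans = [("GP", 0, 1), ("DS", 3, 4), ("OG", 7, 8)]
scores = strict_scores(gold_spans, predicted_spans)
```

More lenient schemes, which give credit for overlapping boundaries or for a correct type alone, would need overlap-based matching in place of the set intersection used here.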
License
CC BY
Feedback
For any comments, questions, or suggestions, please contact us at helpdesk@europepmc.org or through the Europe PMC contact page.