28 files

CoCoScore Supplementary Data v1.0

Dataset posted on 2018-10-13, 14:55, authored by Alexander Junge and Lars Juhl Jensen
Supplementary data for "CoCoScore: Context-aware co-occurrence scoring for text mining applications using distant supervision".

# Text mining dictionaries

The entities file (entities.tsv.gz), names file (names.tsv.gz), and groups file (groups.tsv.gz) were used to identify proteins/genes, diseases, and tissues in the PubMed + PMC corpus.
Please check the following README for usage of these files: https://bitbucket.org/larsjuhljensen/tagger/src/default/README.md
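
The three dictionary files are gzip-compressed and, judging by their extension, tab-separated. Before wiring them into the tagger, it can help to look at the first few rows. The following is a minimal sketch that assumes UTF-8 text and makes no assumption about what the columns mean (the README linked above is the reference for that). The helper name `peek_tsv_gz` is hypothetical.

```python
import csv
import gzip
from itertools import islice


def peek_tsv_gz(path, n_rows=5):
    """Return the first n_rows of a gzip-compressed TSV file as lists of strings.

    Assumptions (not stated in the dataset description): the file is UTF-8
    encoded, tab-separated, and uses no quoting.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # islice stops after n_rows, so large files are not read in full.
        return list(islice(reader, n_rows))
```

For example, `peek_tsv_gz("entities.tsv.gz", 3)` would return the first three rows of the entities file, provided the file has been downloaded to the working directory.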

# Datasets and pre-trained sentence classification models

## H. sapiens disease-gene associations

Training dataset: dataset_9606_-26_train.tsv.gz
Test dataset: dataset_9606_-26_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_9606_-26.ftz

## H. sapiens tissue-gene associations

Training dataset: dataset_9606_-25_train.tsv.gz
Test dataset: dataset_9606_-25_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_9606_-25.ftz

## H. sapiens functional protein-protein associations

Training dataset: dataset_9606_9606_train.tsv.gz
Test dataset: dataset_9606_9606_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_9606_9606.ftz

## D. melanogaster functional protein-protein associations

Training dataset: dataset_7227_7227_train.tsv.gz
Test dataset: dataset_7227_7227_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_7227_7227.ftz

## S. cerevisiae functional protein-protein associations

Training dataset: dataset_4932_4932_train.tsv.gz
Test dataset: dataset_4932_4932_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_4932_4932.ftz

## H. sapiens physical protein-protein interactions

Training dataset: dataset_9606_9606_train_physical.tsv.gz
Test dataset: dataset_9606_9606_test_physical.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_9606_9606_physical.ftz

## D. melanogaster physical protein-protein interactions

Training dataset: dataset_7227_7227_train_physical.tsv.gz
Test dataset: dataset_7227_7227_test_physical.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_7227_7227_physical.ftz

## S. cerevisiae physical protein-protein interactions

Training dataset: dataset_4932_4932_train_physical.tsv.gz
Test dataset: dataset_4932_4932_test_physical.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_4932_4932_physical.ftz
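
The dataset and model files listed above follow a regular naming pattern: two entity-type codes, a train/test split, and an optional `_physical` suffix. The sketch below parses a dataset file name and derives the name of the matching pre-trained model. The helper names are hypothetical. The label table is inferred from the section headings: 9606, 7227 and 4932 are the NCBI Taxonomy identifiers of the three organisms, while reading -26 as diseases and -25 as tissues is an assumption based on which heading each file appears under.

```python
import re
from typing import NamedTuple, Optional

# Labels inferred from the section headings of this description (assumption).
ENTITY_LABELS = {
    "9606": "H. sapiens proteins/genes",
    "7227": "D. melanogaster proteins/genes",
    "4932": "S. cerevisiae proteins/genes",
    "-26": "diseases",
    "-25": "tissues",
}


class DatasetFile(NamedTuple):
    first_type: str   # first entity-type code, e.g. "9606"
    second_type: str  # second entity-type code, e.g. "-26"
    split: str        # "train" or "test"
    physical: bool    # True for the physical-interaction datasets


_DATASET_RE = re.compile(
    r"^dataset_(-?\d+)_(-?\d+)_(train|test)(_physical)?\.tsv\.gz$"
)


def parse_dataset_name(filename: str) -> Optional[DatasetFile]:
    """Split a dataset file name into its components, or return None."""
    match = _DATASET_RE.match(filename)
    if match is None:
        return None
    first, second, split, physical = match.groups()
    return DatasetFile(first, second, split, physical is not None)


def model_name_for(dataset: DatasetFile) -> str:
    """Build the name of the pre-trained model listed for this dataset."""
    suffix = "_physical" if dataset.physical else ""
    return (
        "ft_model_CoCoScore_pretrained_"
        f"{dataset.first_type}_{dataset.second_type}{suffix}.ftz"
    )
```

Calling `parse_dataset_name("dataset_9606_-26_train.tsv.gz")` gives the codes `"9606"` and `"-26"`, which `ENTITY_LABELS` maps to H. sapiens proteins/genes and diseases. File names that do not match the pattern, such as the dictionary files, yield `None`.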

# Pre-trained word embeddings

Pre-trained fastText word embeddings can be found in: fasttext_sg_masked_dim_300_epoch_5_lr_0.05_minn_3_maxn_6_ws_5.vec.gz
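
fastText `.vec` files are plain text: a header line with the vocabulary size and the vector dimension, followed by one line per word holding the word and its vector components separated by spaces. The sketch below reads the first few vectors from the gzipped file without loading all of it. The function name `read_vec_gz` and the `max_words` limit are hypothetical, and UTF-8 encoding is an assumption. The `dim_300` part of the file name suggests 300-dimensional vectors, which the header line would confirm.

```python
import gzip
from itertools import islice


def read_vec_gz(path, max_words=1000):
    """Read up to max_words vectors from a gzipped fastText .vec file.

    Returns (n_words, dim, vectors), where n_words and dim come from the
    header line and vectors maps each word to a list of floats.
    """
    vectors = {}
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        # Header line: "<vocabulary size> <dimension>"
        n_words, dim = (int(value) for value in handle.readline().split())
        for line in islice(handle, max_words):
            # Split from the right so exactly dim components are taken.
            parts = line.rstrip().rsplit(" ", dim)
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return n_words, dim, vectors
```

Keeping `max_words` small is useful for a first look, since the full vocabulary may be large.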
