Supplementary Data: CoCoScore: Context-aware co-occurrence scoring for text mining applications using distant supervision
# Text mining dictionaries
The entities file (entities.tsv.gz), names file (names.tsv.gz), and groups file (groups.tsv.gz) were used to identify proteins/genes, diseases, and tissues in the PubMed + PMC corpus.
Please check the following README for usage of these files: https://bitbucket.org/larsjuhljensen/tagger/src/default/README.md
# Datasets and pre-trained sentence classification models
## H. sapiens disease-gene associations
Training dataset: dataset_9606_-26_train.tsv.gz
Test dataset: dataset_9606_-26_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_9606_-26.ftz
## H. sapiens tissue-gene associations
Training dataset: dataset_9606_-25_train.tsv.gz
Test dataset: dataset_9606_-25_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_9606_-25.ftz
## H. sapiens functional protein-protein associations
Training dataset: dataset_9606_9606_train.tsv.gz
Test dataset: dataset_9606_9606_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_9606_9606.ftz
## D. melanogaster functional protein-protein associations
Training dataset: dataset_7227_7227_train.tsv.gz
Test dataset: dataset_7227_7227_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_7227_7227.ftz
## S. cerevisiae functional protein-protein associations
Training dataset: dataset_4932_4932_train.tsv.gz
Test dataset: dataset_4932_4932_test.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_4932_4932.ftz
## H. sapiens physical protein-protein interactions
Training dataset: dataset_9606_9606_train_physical.tsv.gz
Test dataset: dataset_9606_9606_test_physical.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_9606_9606_physical.ftz
## D. melanogaster physical protein-protein interactions
Training dataset: dataset_7227_7227_train_physical.tsv.gz
Test dataset: dataset_7227_7227_test_physical.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_7227_7227_physical.ftz
## S. cerevisiae physical protein-protein interactions
Training dataset: dataset_4932_4932_train_physical.tsv.gz
Test dataset: dataset_4932_4932_test_physical.tsv.gz
fastText sentence classification model trained on the training dataset: ft_model_CoCoScore_pretrained_4932_4932_physical.ftz
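
The file names listed in this section follow a regular pattern: two identifiers, a train/test split, and an optional `_physical` suffix. The sketch below rebuilds those names from their parts. The helper `coco_files` and the constants are hypothetical names introduced for illustration only. The numeric codes are taken from the headings above; reading 9606, 7227, and 4932 as NCBI Taxonomy identifiers, and -26 and -25 as the disease and tissue entity types, is an assumption based on how the headings pair with the file names.

```python
from typing import Dict

# Identifiers as they appear in the file names above (mapping is an assumption
# derived from the section headings, not from a documented schema).
TAXA = {"H. sapiens": "9606", "D. melanogaster": "7227", "S. cerevisiae": "4932"}
DISEASE = "-26"
TISSUE = "-25"


def coco_files(first: str, second: str, physical: bool = False) -> Dict[str, str]:
    """Return the train/test/model file names for one association type.

    `first` and `second` are the two identifiers in the file name, e.g.
    ("9606", "-26") for H. sapiens disease-gene associations.
    `physical=True` selects the physical protein-protein interaction files.
    """
    stem = f"{first}_{second}"
    suffix = "_physical" if physical else ""
    return {
        "train": f"dataset_{stem}_train{suffix}.tsv.gz",
        "test": f"dataset_{stem}_test{suffix}.tsv.gz",
        "model": f"ft_model_CoCoScore_pretrained_{stem}{suffix}.ftz",
    }


if __name__ == "__main__":
    human = TAXA["H. sapiens"]
    for kind, name in coco_files(human, DISEASE).items():
        print(kind, name)
```

If the `fasttext` Python package is installed, a model file name returned by this helper could be passed to `fasttext.load_model(...)`. How input sentences need to be preprocessed before calling `predict` is not described in this file, so check the CoCoScore documentation first.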
# Pre-trained word embeddings
Pre-trained fastText word embeddings can be found in: fasttext_sg_masked_dim_300_epoch_5_lr_0.05_minn_3_maxn_6_ws_5.vec.gz
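
A minimal sketch for reading the embeddings with the Python standard library, assuming the file uses the usual fastText text `.vec` layout: a header line with the vocabulary size and dimension, followed by one line per word with its vector components separated by spaces. The function name `load_vec` and the `limit` parameter are hypothetical and exist only for this example.

```python
import gzip
from typing import Dict, List, Optional


def load_vec(path: str, limit: Optional[int] = None) -> Dict[str, List[float]]:
    """Read a fastText-style text .vec file, gzip-compressed or plain.

    Assumes the first line is "<vocab_size> <dim>" and every later line is
    "<word> <v1> ... <v_dim>". `limit` caps the number of words read, which
    helps when the full vocabulary does not fit comfortably in memory.
    """
    opener = gzip.open if path.endswith(".gz") else open
    vectors: Dict[str, List[float]] = {}
    with opener(path, "rt", encoding="utf-8") as handle:
        header = handle.readline().split()
        dim = int(header[1])
        for line in handle:
            if limit is not None and len(vectors) >= limit:
                break
            # Drop the newline and any trailing space after the last value.
            parts = line.rstrip().split(" ")
            if len(parts) < dim + 1:
                continue  # skip malformed lines
            # The last `dim` fields are the vector; the rest is the word.
            word = " ".join(parts[:-dim])
            vectors[word] = [float(value) for value in parts[-dim:]]
    return vectors


if __name__ == "__main__":
    path = "fasttext_sg_masked_dim_300_epoch_5_lr_0.05_minn_3_maxn_6_ws_5.vec.gz"
    embeddings = load_vec(path, limit=1000)
    print(len(embeddings), "vectors loaded")
```

Passing a small `limit` first is a cheap way to confirm that the dimension in the header matches the `dim_300` in the file name before loading everything.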