figshare

PPI prediction from sequence, gold standard dataset

Version 4 2025-02-10, 11:37
Version 3 2023-10-23, 12:27
Version 2 2023-06-16, 16:25
Version 1 2022-11-21, 13:37
dataset
posted on 2023-10-23, 12:27 authored by Judith Bernett

Gold Standard Dataset for sequence-based PPI prediction:

  • Big dataset: 163,192 training points (Intra-1), 59,260 validation points (Intra-0), and 52,048 test points (Intra-2), plus the corresponding protein sequences from Swissprot
  • No direct data leakage: the protein sets of training, validation, and test are pairwise disjoint, so no protein appears in more than one of them
  • Minimized sequence similarity between training, validation, and test: the whole human proteome was split with KaHIP so that sequence similarity between the parts, measured by length-normalized bitscores, is minimized
  • Redundancy reduction with CD-HIT: within each dataset, no two proteins have >40% pairwise sequence similarity
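
The "no direct data leakage" property can be checked by confirming that the protein sets of the three splits are pairwise disjoint. The following is a minimal sketch, not the author's code: it assumes each split has already been loaded as a list of (protein ID, protein ID) pairs, and the helper names `proteins_in` and `find_leakage` as well as the toy IDs are hypothetical.

```python
from itertools import combinations


def proteins_in(pairs):
    """Collect every protein ID that occurs in a list of (id_a, id_b) pairs."""
    return {protein for pair in pairs for protein in pair}


def find_leakage(splits):
    """Return {(split_a, split_b): shared_ids} for each pair of splits sharing proteins.

    `splits` maps a split name to its list of protein pairs.
    An empty result means the splits are pairwise disjoint.
    """
    protein_sets = {name: proteins_in(pairs) for name, pairs in splits.items()}
    leaks = {}
    for a, b in combinations(sorted(protein_sets), 2):
        shared = protein_sets[a] & protein_sets[b]
        if shared:
            leaks[(a, b)] = shared
    return leaks


# Toy data with made-up IDs (assumption: real data would be read from the dataset files).
toy_splits = {
    "train": [("P1", "P2"), ("P2", "P3")],
    "val": [("P4", "P5")],
    "test": [("P6", "P7")],
}

print(find_leakage(toy_splits))
```

For the real dataset, the pair lists for Intra-1, Intra-0, and Intra-2 would take the place of the toy entries; an empty dictionary is the expected outcome if the description above holds.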
