Supplementary data to the paper: Transcription factor prediction using protein 3D structures
Motivation: Transcription factors (TFs) are DNA-binding proteins that regulate the expression of genes in an organism. Hence, it is important to identify novel TFs. Traditionally, novel TFs have been identified by their sequence similarity to the DNA-binding domains (DBDs) of known TFs. However, this approach can fail to identify a novel TF that has no sequence similarity to any known DBD. Hence, computational methods that do not rely on known DBDs have been developed for the TF prediction task. These existing methods train a machine learning model on protein sequences to capture the sequence patterns of known TFs, and then use the trained model to predict novel TFs. Because the three-dimensional (3D) structure of a protein captures more of its functional characteristics than its sequence does, using 3D protein structures can predict novel TFs more accurately.
Results: We propose StrucTFactor, a deep convolutional neural network pipeline based on protein 3D structures, for TF prediction and compare it with the existing state-of-the-art TF prediction method, which relies only on protein sequences. We use 12 datasets, spanning ~550,000 proteins, that capture different aspects of data bias (including sequence redundancy and 3D protein structural quality) that can influence the training of a machine learning model. We find that, across all datasets, StrucTFactor significantly (p-value < 0.001) outperforms the existing state-of-the-art method for TF prediction, with performance differences of up to 23% in terms of the Matthews correlation coefficient. Our results show the importance of using 3D protein structures in the TF prediction task. We provide the StrucTFactor computational pipeline for the scientific community.
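
For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of how the Matthews correlation coefficient is computed from the counts of a binary confusion matrix (TF vs. non-TF). The function name `mcc_from_counts` and the example counts are hypothetical and chosen for illustration only; they are not taken from the StrucTFactor pipeline or its results. Returning 0.0 when the denominator is zero is a common convention, assumed here.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Hypothetical helper for illustration; not part of the StrucTFactor code.
    Returns 0.0 when any marginal sum is zero (assumed convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts: 80 TFs found, 900 non-TFs rejected, 20 false alarms, 20 TFs missed.
example_score = mcc_from_counts(tp=80, tn=900, fp=20, fn=20)
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect prediction). Because it uses all four counts, it stays informative when one class is much rarer than the other.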