Supplementary data to the paper: Transcription factor prediction using protein 3D structures

Version 2 2024-06-18, 15:57

Version 1 2024-03-13, 10:19

dataset

posted on 2024-03-13, 10:19 authored by Fabian Neuhaus, Jeanine Liebold, Jan Baumbach, Khalique Newaz

Motivation: Transcription factors (TFs) are DNA-binding proteins that regulate the expression of genes in an organism. Hence, it is important to identify novel TFs. Traditionally, novel TFs have been identified by their sequence similarity to the DNA-binding domains (DBDs) of known TFs. However, this approach can fail to identify a novel TF whose sequence is not similar to any of the known DBDs. Hence, computational methods that do not rely on known DBDs have been developed for the TF prediction task. Such existing methods train a machine learning model on protein sequences to capture sequence patterns of known TFs, and then use the trained model to predict novel TFs. Because the 3-dimensional (3D) structure of a protein captures more functional characteristics of the protein than its sequence does, using 3D protein structures can predict novel TFs more accurately.

Results: We propose a protein 3D structure-based deep convolutional neural network pipeline (named StrucTFactor) for TF prediction and compare it with the existing state-of-the-art TF prediction method that relies only on protein sequences. We use 12 datasets, spanning ~550,000 proteins, that capture different aspects of data bias (including sequence redundancy and 3D protein structural quality) that can influence the training of a machine learning model. We find that, over all datasets, StrucTFactor significantly (p-value < 0.001) outperforms the existing state-of-the-art method for TF prediction, showing performance differences of up to 23% based on the Matthews correlation coefficient. Our results show the importance of using 3D protein structures in the TF prediction task. We provide the StrucTFactor computational pipeline for the scientific community.
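
For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of how the Matthews correlation coefficient (MCC) is computed from the confusion matrix of a binary classifier (here, TF versus non-TF). The function name and the example counts are illustrative assumptions; they are not taken from the StrucTFactor pipeline or from the paper's results.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return the MCC for a binary confusion matrix.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives, and false negatives. The value ranges from
    -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: if any marginal sum is zero the MCC is
    # undefined, and 0.0 is returned instead.
    if denominator == 0:
        return 0.0
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical counts for an imbalanced test set (few TFs, many non-TFs);
    # these numbers are made up for illustration only.
    print(matthews_corrcoef(tp=40, tn=900, fp=10, fn=50))
```

Because MCC uses all four cells of the confusion matrix, it stays informative when one class is much rarer than the other, which is typically the case for TFs among all proteins.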

Funding

This work was supported by Universität Hamburg and HamburgX grant LFF-HHX-03 to the Center for Data and Computing in Natural Sciences (CDCS) from the Hamburg Ministry of Science, Research, Equalities and Districts.

This work was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the e:Med research and funding concept (grants 01ZX1910D and 01ZX2210D).

This work was developed as part of the PoSyMed project and is funded by the German Federal Ministry of Education and Research (BMBF) under grant number 031L0310A.

Keywords

transcription factor, transcription factor prediction, Protein secondary structures

Licence

CC BY 4.0
