Predicting protein coding boundaries using recurrent neural networks

2018-02-20T05:27:23Z (GMT) by Saket Choudhary
We explore how recurrent neural networks (RNNs) can be used to predict protein coding domains in a gene. We first demonstrate that long short-term memory (LSTM) RNNs with one-hot encoding have limited predictive power. We then demonstrate that a word embedding approach combined with bidirectional LSTMs gives promising results on the entire pool of human protein coding genes, achieving an overall accuracy of 0.67. This model is then used to predict protein coding domains in a different species, mouse, where it achieves an overall accuracy of 0.70 when tested on non-orthologous genes (where orthology implies that a gene in mouse shares significant sequence with a human gene owing to descent from a common ancestor).
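
To make the contrast between the two input representations concrete, here is a minimal Python sketch. It is illustrative only: the abstract does not say how sequences were tokenized for the word embedding model, so treating overlapping k-mers as "words" is an assumption, as are the choice of k = 3 and the function names (`one_hot_encode`, `build_kmer_vocab`, `kmer_token_ids`). The token ids produced by the second representation are the kind of input an embedding layer in front of a bidirectional LSTM would typically consume.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def one_hot_encode(seq):
    """Map each nucleotide to a 4-dimensional indicator vector.

    Unknown symbols (for example 'N') become an all-zero vector.
    """
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(NUCLEOTIDES)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


def build_kmer_vocab(k):
    """Assign an integer id to every possible k-mer; id 0 is reserved for unknowns."""
    kmers = ("".join(p) for p in product(NUCLEOTIDES, repeat=k))
    return {kmer: i + 1 for i, kmer in enumerate(kmers)}


def kmer_token_ids(seq, k, vocab):
    """Slide a window of width k over the sequence and look up each k-mer's id.

    Assumption: overlapping k-mers with stride 1 play the role of "words".
    """
    seq = seq.upper()
    return [vocab.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    sequence = "ATGCGT"  # toy sequence, not real data
    vocab = build_kmer_vocab(3)
    print(one_hot_encode(sequence))
    print(kmer_token_ids(sequence, 3, vocab))
```

With one-hot encoding each position carries only the identity of a single base, whereas k-mer tokens give every position a short local context that an embedding can map to a learned dense vector. That difference is one plausible reason for the gap in predictive power described above, though the abstract itself does not state the cause.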