Probabilistic modeling of protein:RNA interaction data identifies functional Transcript States

Freeberg, Mallory

doi:10.6084/m9.figshare.1588841.v1

GI_CSHL_poster_MFreeberg.pdf (1.36 MB)

poster

Posted on 2015-10-30, 00:39; authored by Mallory Freeberg

CSHL Genome Informatics 2015 poster

UV-induced crosslinking and immunopurification of an RNA-binding protein (RBP) followed by deep sequencing of its bound RNAs (CLIP-seq and derivative protocols) is an increasingly popular method for identifying in vivo transcriptome-wide sites of RBP interactions at nucleotide resolution. Consequently, a large collection of published deep-sequencing datasets is available representing precise RNA interaction sites for hundreds of RBPs. Initial analyses of RBP:RNA interaction sites for individual RBPs have revealed important mechanistic insights into RBP-mediated post-transcriptional gene regulation. However, comprehensive integration of interaction data for multiple RBPs is lacking, resulting in an underappreciation of the importance of RBP:RNA interactions in the context of other factors.

Inspired by the identification of chromatin states (e.g., promoters, enhancers) from ChIP-seq data of histone modifications, transcription factor binding, and RNA Pol II occupancy, we have identified Transcript States across the Saccharomyces cerevisiae transcriptome. First, we obtained empirical evidence of direct interactions between RNAs and over 80 yeast RBPs (represented by over 140 CLIP-seq experiments). Next, we transformed aligned read count data into a binarized matrix representing presence or absence of each RBP across the transcriptome. Finally, we built and trained a hidden Markov model to learn Transcript States from the binarized data.
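
The binarization and state-assignment steps can be illustrated with a minimal sketch. It is not the poster's implementation: the read-count threshold, the three states, and every probability below are hypothetical values chosen for illustration, and the function names (binarize, log_emission, viterbi) are invented here. The sketch also assumes that each state emits each RBP's presence independently (one Bernoulli probability per RBP per state), and it only decodes the most likely state path with fixed parameters; the training step (for example, Baum-Welch) is omitted.

```python
import math

def binarize(counts, threshold=1):
    """Turn per-RBP read counts into per-position presence/absence tuples.

    counts: one list of read counts per RBP, all of the same length.
    threshold: hypothetical cutoff; a count >= threshold means "bound".
    """
    n_positions = len(counts[0])
    return [tuple(1 if track[i] >= threshold else 0 for track in counts)
            for i in range(n_positions)]

def log_emission(obs, probs):
    """Log-probability of one binary observation vector in one state,
    assuming independent Bernoulli emissions (probs must be in (0, 1))."""
    return sum(math.log(p if o else 1.0 - p) for o, p in zip(obs, probs))

def viterbi(observations, start, trans, emit):
    """Most likely state path for a sequence of binary vectors."""
    n_states = len(start)
    scores = [math.log(start[s]) + log_emission(observations[0], emit[s])
              for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for obs in observations[1:]:
        prev, scores, pointers = scores, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            scores.append(prev[best] + math.log(trans[best][s])
                          + log_emission(obs, emit[s]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy example: two hypothetical RBPs over six transcript positions.
counts = [[0, 0, 5, 7, 0, 0],   # RBP A read counts
          [0, 0, 0, 0, 3, 4]]   # RBP B read counts
matrix = binarize(counts)

# Hypothetical model: state 0 = unbound, 1 = RBP A site, 2 = RBP B site.
start = [0.8, 0.1, 0.1]
trans = [[0.8, 0.1, 0.1],
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8]]
emit = [[0.05, 0.05],   # P(RBP present | state), one row per state
        [0.90, 0.05],
        [0.05, 0.90]]

print(viterbi(matrix, start, trans, emit))
```

In this toy setup each position is assigned to the state whose emission profile best matches the RBPs observed there, while the transition probabilities discourage frequent state switches. In a trained model the states would not be named in advance; they would be interpreted afterwards from their learned emission profiles.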

Preliminary results revealed Transcript States associated with functional regulatory elements such as intron 5’ splice sites, 3’ splice sites, and branch points. Importantly, our approach can easily relearn Transcript States as additional CLIP-seq datasets become available. Additionally, we can apply our methods to learn Transcript States from CLIP-seq data for any organism.