Human and Mouse UTRomes
Overview
This dataset contains BED and GTF files representing the cleavage sites and 3'UTR isoform annotations derived from reprocessing Microwell-seq data. These objects are part of the minimum dataset required for verifying the analysis reported in Fansler et al., bioRxiv, 2023.
Description
The BED files contain candidate cleavage sites from the Mouse Cell Atlas and Human Cell Landscape datasets. In brief, paired-end reads were merged with PEAR when overlapping, cell barcodes were extracted with umi_tools, poly-A tails were removed with cutadapt, and the remaining reads were mapped to the hg38 or mm10 genome using HISAT2. Reads were partitioned into cell types according to annotations from the original publications. Per cell type, the 5' ends of alignments were summarized, counts were merged to the mode within 30 nts, and sites were finally filtered to a minimum threshold of 5 TPM. The resulting BED files identify the cell type cluster in the name column and the number of observed reads in the score column.
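The merge-and-filter step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the repositories under Data Generation for that). The greedy strategy, the function names `merge_to_mode` and `filter_by_tpm`, and the choice of `total_reads` as the TPM denominator are all assumptions made for illustration.

```python
def merge_to_mode(counts, radius=30):
    """Greedy sketch of merging sites to a local mode.

    `counts` maps position -> read count for one chromosome and strand.
    Assumption: positions are visited from most to least supported, and
    each one is folded into the nearest already-selected mode within
    `radius` nt; otherwise it becomes a new mode.
    """
    merged = {}
    for pos, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        nearby = [m for m in merged if abs(m - pos) <= radius]
        if nearby:
            # Nearest mode wins; ties go to the smaller coordinate.
            target = min(nearby, key=lambda m: (abs(m - pos), m))
            merged[target] += n
        else:
            merged[pos] = n
    return merged


def filter_by_tpm(merged, total_reads, min_tpm=5.0):
    """Keep sites at or above `min_tpm`.

    Assumption: TPM is computed as count / total_reads * 1e6, where
    `total_reads` is the total for the cell type being processed.
    """
    return {
        pos: n
        for pos, n in merged.items()
        if n / total_reads * 1e6 >= min_tpm
    }


# Toy example with hypothetical counts for a single cell type.
toy_counts = {100: 50, 110: 5, 125: 3, 500: 2, 1000: 40}
toy_merged = merge_to_mode(toy_counts)
toy_kept = filter_by_tpm(toy_merged, total_reads=1_000_000)
```

In the toy example, the sites at 110 and 125 fold into the mode at 100, and the weakly supported site at 500 falls below the 5 TPM threshold.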
The GTF files are augmentations of GENCODE vM25 (mouse) and v39 (human) with novel cleavage sites, with each transcript then truncated to 500 nt. In brief, the sites provided in the BED files were harmonized across cell types by merging to the mode within 30 nts. The candidate sites were then serially classified as (1) "validated" if already in GENCODE; (2) "supported" if found in PolyASite2.0 at 3 TPM or higher; (3) "likely" if cleanUpdTSeq scored the posterior probability of being an internal priming site below 0.0001%; or (4) "unlikely" otherwise. The "supported" and "likely" candidates were then used to augment GENCODE annotations of protein-coding transcripts, and each transcript was truncated to the final 500 nts at its 3' end. The final annotations identify the regions where the scUTRquant pipeline will quantify scRNA-seq data.
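The serial classification and the 3' truncation can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the pipeline's code: the function names and argument names are hypothetical, the 0.0001% threshold is read literally as a fraction of 1e-6, and the truncation is assumed to count spliced (exonic) length using 1-based inclusive GTF coordinates.

```python
def classify_site(in_gencode, polyasite_tpm, internal_priming_prob,
                  min_polyasite_tpm=3.0, max_priming_prob=1e-6):
    """Apply the four checks in order; the first match wins.

    `polyasite_tpm` is None when the site is absent from PolyASite2.0.
    `max_priming_prob` is 0.0001% written as a fraction (assumption).
    """
    if in_gencode:
        return "validated"
    if polyasite_tpm is not None and polyasite_tpm >= min_polyasite_tpm:
        return "supported"
    if internal_priming_prob < max_priming_prob:
        return "likely"
    return "unlikely"


def truncate_3prime(exons, strand, length=500):
    """Return the exon segments covering the last `length` nt of a transcript.

    `exons` is a list of (start, end) pairs, 1-based and inclusive.
    Assumption: length is counted over exons only, walking from the 3' end.
    """
    # The 3' end is the highest coordinate on "+", the lowest on "-".
    ordered = sorted(exons, reverse=(strand == "+"))
    kept, remaining = [], length
    for start, end in ordered:
        if remaining <= 0:
            break
        size = end - start + 1
        if size <= remaining:
            kept.append((start, end))
            remaining -= size
        else:
            # Keep only the 3'-most part of this exon.
            if strand == "+":
                kept.append((end - remaining + 1, end))
            else:
                kept.append((start, start + remaining - 1))
            remaining = 0
    return sorted(kept)


# Toy examples with hypothetical values.
labels = [
    classify_site(True, None, 0.5),
    classify_site(False, 3.0, 0.5),
    classify_site(False, None, 1e-7),
    classify_site(False, 1.0, 0.2),
]
plus_trunc = truncate_3prime([(100, 199), (1000, 1449)], "+")
minus_trunc = truncate_3prime([(100, 199), (1000, 1449)], "-")
```

Because the checks are serial, a site already present in GENCODE is labeled "validated" regardless of its PolyASite2.0 or cleanUpdTSeq values.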
Data Generation
All code required to generate these files is available at:
- https://github.com/Mayrlab/mca-utrome (https://doi.org/10.5281/zenodo.8118416)
- https://github.com/Mayrlab/hcl-utrome (https://doi.org/10.5281/zenodo.8118411)
Funding
Tri-Institutional Training Program in Computational Biology and Medicine
National Institute of General Medical Sciences: 3'UTR-mediated protein-protein interactions determine protein functions
National Institute of General Medical Sciences: Regulation of protein multi-functionality by 3'UTRs
National Institute of General Medical Sciences: Function and therapeutic targeting of 3'UTR-dependent protein localization (UDPL) in cancer
Pershing Square Foundation