Core data
Sequence Data Processing
For endogenous mRNA, raw FASTQ files were analyzed using the SLAM-Dunk4 package, mapping reads to the human GRCh38 genome. To count yeast spike-in reads, the raw data were mapped to the Saccharomyces cerevisiae R64-1-1 genome, and all yeast gene counts were summed to obtain a total spike-in count for normalization. The resulting gene counts and T-to-C mutation counts were first normalized to the spike-in counts. The Mean Ribosome Density (MRD) was calculated by multiplying the normalized and scaled counts of the four polysome fractions by the vector [0, 1, 4, 9], which represents the approximate number of ribosomes in each fraction, and summing the four products. For time-lapse 4sU labeling at the 30 min, 1 h, and 2 h timepoints, the MRD values were used to calculate the area under the curve (AUC) with the trapz function in R (trapezoidal integration).
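The MRD and time-course AUC steps can be sketched in a few lines of Python. This is an illustration only: the function names are ours, and the exact scaling applied after spike-in normalization is not specified in this section, so scaling to per-gene fraction proportions is assumed here.

```python
# Approximate number of ribosomes in each of the four polysome fractions.
RIBOSOME_WEIGHTS = [0, 1, 4, 9]

def mean_ribosome_density(fraction_counts, spike_in_counts):
    """Normalize each fraction's count to its spike-in total, scale the
    four values to proportions (an assumed scaling), and take the
    weighted sum over the fractions."""
    normalized = [c / s for c, s in zip(fraction_counts, spike_in_counts)]
    total = sum(normalized)
    scaled = [n / total for n in normalized]
    return sum(w * x for w, x in zip(RIBOSOME_WEIGHTS, scaled))

def auc_trapezoid(timepoints_h, mrd_values):
    """Trapezoidal AUC over the labeling time course, mirroring trapz in R."""
    return sum(
        (t1 - t0) * (y0 + y1) / 2
        for (t0, y0), (t1, y1) in zip(
            zip(timepoints_h, mrd_values),
            zip(timepoints_h[1:], mrd_values[1:]),
        )
    )
```

With equal counts in all four fractions, the MRD reduces to the mean of the weights (3.5); the AUC is then taken over the 0.5 h, 1 h, and 2 h MRD values.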
To obtain isoform-level MRD and AUC, we first used SLAM-Dunk's Read Separator module to split the BAM files into new mRNA and old mRNA reads. These reads were then re-mapped to the human GRCh38 genome using STAR, and isoform counts were generated with Cufflinks. To correct the HEK293T and THP-1 isoform annotations, Nanopore sequencing datasets (SRR8181090 and PRJNA639136) were each used to create an isoform expression reference5, and the Cufflinks isoforms were filtered against the corresponding reference.
The isoform-level MRD and AUC were calculated using the same methods as for the gene-level analysis. DynaRDS (MRD-diff) was calculated by subtracting the old mRNA MRD from the new mRNA MRD. Similarly, DynaRDS (AUC-diff) was calculated by subtracting the old mRNA AUC from the new mRNA AUC, for both gene-level and isoform-level analyses.
For the 5′ UTR library, we developed a Python script to extract the 5′ UTR sequences and Unique Molecular Identifiers (UMIs) from the FASTQ files; reads carrying identical UMIs and corresponding to the same 5′ UTR were merged. To count each 5′ UTR sequence in each ribosome fraction, the 5′ UTR counts were first normalized by dividing them by the spike-in counts and then scaled for each fraction and each 5′ UTR. The Mean Ribosome Density (MRD) was calculated by multiplying the normalized and scaled counts of the four fractions by the vector [0, 1, 4, 9], which represents the approximate number of ribosomes in each polysome fraction, and summing the four products. DynaRDSsyn for the 5′ UTR library was calculated as MRD(2 h) − MRD(1 h).
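The UMI-merging step amounts to counting distinct UMIs per 5′ UTR. A minimal sketch, assuming reads have already been parsed into (5′ UTR identifier, UMI) pairs (the original script works directly on FASTQ files):

```python
from collections import defaultdict

def collapse_umis(reads):
    """reads: iterable of (utr_id, umi) tuples extracted from FASTQ.
    Reads with identical UMIs for the same 5' UTR are merged, so the
    result is the deduplicated molecule count per 5' UTR."""
    seen = defaultdict(set)
    for utr_id, umi in reads:
        seen[utr_id].add(umi)  # a set keeps each (UTR, UMI) pair once
    return {utr_id: len(umis) for utr_id, umis in seen.items()}
```

For example, two reads of `("utr1", "AAAA")` count as one molecule, so `[("utr1", "AAAA"), ("utr1", "AAAA"), ("utr1", "CCCC")]` yields two molecules for `utr1`.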
RNA Feature Analysis
We first extracted transcript regions using the extract-transcript-regions6 tool, then estimated the RNA feature values using transcriptome-properties6. m6A levels in HEK293T cells were obtained from GLORI data7 by inferring the mean m6A level on each transcript's 5′ UTR, 3′ UTR, CDS, exons, and introns; m6A levels in THP-1-derived macrophages were obtained from Pinello et al.'s m6A-IP-Seq data8. For gene-level RNA features, the value from the highest-expressing transcript was used to represent the gene. For transcript-level analysis, the corrected transcript regions were used to infer transcript-level RNA features. RNA half-life data were taken from Lugowski et al.9
To assess the effect sizes between the high-coupling and low-coupling groups across multiple features, we calculated Cliff's delta for each feature using the cliff.delta function from the effsize package in R. Cliff's delta ranges from −1 to 1: values close to −1 or 1 indicate a large effect size, and values near 0 suggest minimal difference between the groups. The function takes two numeric vectors of observations from the high-coupling and low-coupling groups and returns the estimated effect size along with confidence intervals.
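For intuition, Cliff's delta is the difference between the probability that a high-coupling observation exceeds a low-coupling one and the reverse probability. A minimal pure-Python version (the analysis itself used the R effsize package, which additionally reports confidence intervals):

```python
def cliffs_delta(high, low):
    """Cliff's delta over all cross-group pairs:
    (#(x > y) - #(x < y)) / (n_high * n_low); ties contribute zero."""
    gt = sum(1 for x in high for y in low if x > y)
    lt = sum(1 for x in high for y in low if x < y)
    return (gt - lt) / (len(high) * len(low))
```

Two fully separated groups give delta = 1 (or −1), and two identical groups give delta = 0.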
To determine which features contribute most to coupling, we employed machine learning methods to predict coupling from the features. The dataset was divided into training and test sets, using 70% of the genes for training and 30% for testing. The algorithms were implemented with the following R packages: rpart for Decision Trees, gbm for Gradient Boosting Machines (GBM), ranger for Random Forests, the glm function for Generalized Linear Models (GLM), and xgboost for Extreme Gradient Boosting (XGB). We then obtained the feature-importance matrix and ROC curves to evaluate predictive performance.
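The ROC evaluation used above can be summarized by its area under the curve, which equals the probability that a randomly chosen positive gene receives a higher predicted score than a randomly chosen negative one (ties counting half). A small Python sketch of that rank-based formulation (the actual analysis used the R packages listed above):

```python
def roc_auc(scores, labels):
    """ROC AUC via the probabilistic (Mann-Whitney) formulation:
    the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0, a fully inverted one 0.0, and a constant score 0.5.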
RNA functional analysis
Gene Ontology (GO) analysis was performed using the Metascape website, and gene set enrichment analysis (GSEA) was conducted using the GSEABase and enrichplot packages in R. The GSEA gene sets (including hallmark, curated, and immunologic signature gene sets) were downloaded from the Molecular Signatures Database (www.gsea-msigdb.org). Data on gene type and protein location were obtained from the Metascape analysis.
Prediction and Design of 5′ UTRs
We developed a convolutional neural network (CNN) model to classify 5′ UTR sequences. The network begins with a linear transformation, implemented as a fully connected layer, that projects the input RNA sequences from their one-hot encoding (4 dimensions) into a higher-dimensional space (24 dimensions).
The transformed sequences are then processed by two sequential convolutional layers. The first convolutional layer uses a 2D convolution with 8 filters, each of size 1 × 8, and a stride of 1, followed by a ReLU activation function. The second convolutional layer increases the number of filters to 16, with each filter of size 2 × 8 and a stride of 1, also followed by a ReLU activation. Each convolutional layer is followed by a max pooling layer with a kernel size of 2 and a stride of 2, which reduces the spatial dimensions of the feature maps. After the convolution and pooling operations, batch normalization is applied to stabilize and accelerate the training process.
The flattened output is passed through two fully connected layers. The first fully connected layer consists of 880 input neurons and 512 output neurons, with a ReLU activation function and a dropout rate of 10% applied to prevent overfitting. The second fully connected layer contains 512 input neurons and 64 output neurons, also followed by a ReLU activation function and the same dropout rate. Finally, the output layer produces predictions using a fully connected layer with 64 input neurons and 1 output neuron.
The forward propagation of the input through the network starts by applying the initial linear transformation, followed by the two convolutional layers and corresponding max pooling operations. The output of the second convolutional layer is batch normalized before being flattened and passed through the fully connected layers. The final output is a scalar value produced by the last fully connected layer.
This architecture was implemented using the PyTorch library and trained using cross-entropy loss. The model was optimized to classify RNA sequences, achieving robust performance across multiple test sets.
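A hedged PyTorch sketch of the architecture described above. The input sequence length is not stated in this section; a length of 66 nt is assumed here because it makes the flattened feature size come out to the 880 units described above (16 channels × 5 × 11 after the second pooling step). Class and attribute names are ours.

```python
import torch
import torch.nn as nn

class UTRCNN(nn.Module):
    """Sketch of the described CNN: one-hot (4) -> linear (24) -> two
    conv + pool blocks -> batch norm -> three fully connected layers."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(4, 24)                        # one-hot -> 24 dims
        self.conv1 = nn.Conv2d(1, 8, kernel_size=(1, 8), stride=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=(2, 8), stride=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.bn = nn.BatchNorm2d(16)
        self.fc1 = nn.Linear(880, 512)                       # 880 = 16 * 5 * 11
        self.fc2 = nn.Linear(512, 64)
        self.out = nn.Linear(64, 1)
        self.drop = nn.Dropout(0.1)                          # 10% dropout
        self.relu = nn.ReLU()

    def forward(self, x):                      # x: (batch, length, 4) one-hot
        x = self.embed(x)                      # (batch, length, 24)
        x = x.permute(0, 2, 1).unsqueeze(1)    # (batch, 1, 24, length)
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = self.bn(x)                         # batch norm before flattening
        x = torch.flatten(x, 1)                # (batch, 880) for length 66
        x = self.drop(self.relu(self.fc1(x)))
        x = self.drop(self.relu(self.fc2(x)))
        return self.out(x)                     # scalar logit per sequence
```

With a single output neuron, the cross-entropy training described above would correspond to a binary (sigmoid) cross-entropy on this logit.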
The designed mRNA 5′ UTRs were selected by fixing the MRD and choosing 5′ UTRs with high or low DynaRDSsyn. Each sequence was ordered as a pair of forward and reverse oligonucleotides, which were annealed and ligated into the 5′ UTR library vector for in vitro transcription (IVT) and GFP intensity measurement.