MATLAB software suite that process raw RNA-seq transcriptome dataset Wenfa Ng 07 September 2020.pdf (182.41 kB)

MATLAB software suite that process raw RNA-seq transcriptome dataset on a personal computer

preprint
posted on 2020-09-07, 10:56 authored by Wenfa Ng

RNA-seq has emerged as the dominant approach for profiling differential gene expression in cells at different cell states. Because sequencing can be purchased from a vendor, leaving only sample preparation to the lab, RNA-seq is accessible to many labs. However, data processing is not trivial and typically requires high-performance computing resources, because an RNA-seq workflow typically yields millions to tens of millions of sequenced reads. Such a dataset cannot be processed on a personal computer within a reasonable amount of time. To still extract biologically meaningful conclusions from an RNA-seq experiment, researchers could profile 10% of the entire set of sequenced reads on a personal computer, which reduces the computational cost. This work sought to develop a MATLAB software suite that helps the experimentalist handle the entire workflow of processing RNA-seq transcriptome data stored in a FASTQ file. The computational pipeline divides the original data file into batches that are individually processed by the software. For example, one million reads are read into random access memory, of which 10% are selected for processing in each batch. Each read is checked against every gene in the genome of the microbial species of interest. If a match is found, the gene is deemed to have been transcribed and its expression count is increased by one. Iterating through the set of selected reads yields a tabulation of the expression count for each gene in the genome of the microbe, which is sorted in descending order to highlight highly expressed genes. The final output is an Excel file that contains the gene abbreviation, gene function, gene sequence, and expression count. The suite also includes programmes that tally the expression count of each gene across different batches of the data processing run.
Overall, the described MATLAB software suite for RNA-seq data processing should find use in a variety of biological studies that seek to harness differential gene expression to probe cellular physiology and responses to different nutritional and environmental conditions.
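
The batch-and-subsample counting described in the abstract can be illustrated with a short sketch. It is written in Python purely for illustration and is not the author's MATLAB code. Several points are assumptions not stated in the abstract: the function names `read_batches` and `count_expression` are hypothetical, a "match" is taken to mean that the read occurs as an exact substring of the gene sequence, the 10% subset is drawn at random, reverse complements are ignored, and the FASTQ file is assumed to use the common four-line record layout.

```python
import io
import random
from collections import Counter
from itertools import islice


def read_batches(handle, batch_size):
    """Yield lists of read sequences, batch_size reads at a time.

    Assumes four lines per FASTQ record: header, sequence, '+', qualities.
    """
    while True:
        record_lines = list(islice(handle, 4 * batch_size))
        if not record_lines:
            return
        # The sequence is the second line of every four-line record.
        yield [line.strip() for line in record_lines[1::4]]


def count_expression(handle, genes, batch_size=1_000_000, fraction=0.10, seed=0):
    """Return (gene, count) pairs sorted by descending expression count.

    genes maps a gene name to its nucleotide sequence. In each batch a
    random subset of the reads (fraction of the batch) is checked against
    every gene; exact substring matching is an assumption of this sketch.
    """
    rng = random.Random(seed)
    counts = Counter({name: 0 for name in genes})
    for batch in read_batches(handle, batch_size):
        n_selected = max(1, round(len(batch) * fraction))
        for read in rng.sample(batch, n_selected):
            for name, sequence in genes.items():
                if read in sequence:
                    counts[name] += 1
    # Highly expressed genes come first, as in the described output table.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Tiny made-up example: two genes and three reads, all reads selected.
    demo_genes = {"geneA": "ATGGCTAGCTAGGCTTAA", "geneB": "ATGCCCGGGAAATTTTGA"}
    demo_fastq = io.StringIO(
        "@r1\nGCTAGC\n+\nIIIIII\n"
        "@r2\nCCCGGG\n+\nIIIIII\n"
        "@r3\nTAGGCT\n+\nIIIIII\n"
    )
    for gene, count in count_expression(demo_fastq, demo_genes, batch_size=2, fraction=1.0):
        print(gene, count)
```

Because only one batch is held in memory at a time, peak memory use depends on the batch size rather than on the size of the whole FASTQ file, which is what makes the approach workable on a personal computer. The nested loop over reads and genes remains the dominant cost, which is why subsampling each batch reduces run time roughly in proportion to the selected fraction.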

Funding

No funding was used in this work.

History