Poplar_Isoform Expression_matrix_AND_Isoform_GTF

dataset

posted on 2020-04-07, 20:46 authored by William Barbazuk

A matrix of isoform expression values (FPKM) for each replicate sample from an unstructured population of 268 Populus deltoides individuals, and a GTF file describing the transcriptome. Isoforms were discovered as follows.

Three transcript assemblers were used in order to maximize isoform detection: (i) Cufflinks version 2.2 with parameters “--library-type fr-firststrand -u -F 0.05 --max-intron-length 12000 --no-faux-reads -g”; (ii) StringTie version 1.3.3 with parameters “-f 0.05 -j 2 --rf”; and (iii) Trinity version 2.3.2 in genome-guided mode with parameters “--genome_guided_bam --genome_guided_max_intron 12000 --full_cleanup --SS_lib_type RF --min_contig_length 50”. The Cufflinks and StringTie isoforms detected for each sample were merged with StringTie merge using parameters “-F 1 -f 0.05”. PASA version 2.0.2-r20151207 was used to reconcile this merged assembly with the Trinity assembly using parameters “-C -R -t --cufflinks_gtf -I 12000 --ALT_SPLICE --ALIGNER gmap,blat”.

The assemblies generated by PASA were then filtered by requiring that (i) all splice junctions be supported by at least 2 reads, and (ii) retained introns have a median read coverage of at least 2 (Python scripts are stored at github.com/jdLikesPlants/poplar_AS). Requiring minimum read support for retained introns reduces the chance that sequenced pre-mRNA is misidentified as an intron retention event.

Finally, the filtered PASA assemblies for each sample were merged with Cuffmerge (Cufflinks version 2.2.1) to generate a master transcriptome that represents all of the potential alternative splicing (AS) events and transcript isoforms for the population. The resulting assembly was then reformatted and annotated using gffcompare version 0.9.9c (https://github.com/gpertea/gffread). This transcriptome was subjected to a secondary expression-based filtering pipeline to remove artifacts generated during the merge.
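
The read-support filter applied to the PASA assemblies can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the poplar_AS repository: the function names, the junction keys, and the input structures (a dictionary of per-junction read counts and a per-base coverage list) are hypothetical and chosen only for the example.

```python
from statistics import median

def junctions_supported(junctions, junction_reads, min_reads=2):
    """Return True if every splice junction has at least min_reads reads.

    junctions: iterable of hashable junction keys (hypothetical format).
    junction_reads: dict mapping junction key -> supporting read count.
    """
    return all(junction_reads.get(j, 0) >= min_reads for j in junctions)

def retained_intron_supported(intron, coverage, min_median=2):
    """Return True if median per-base coverage over the intron is >= min_median.

    intron: (start, end) as 0-based, half-open coordinates (an assumption).
    coverage: list of per-base read depths for the sequence.
    """
    start, end = intron
    depths = coverage[start:end]
    # An empty interval has no coverage to support it.
    return bool(depths) and median(depths) >= min_median

def keep_transcript(junctions, retained_introns, junction_reads, coverage):
    """Keep a transcript only if both read-support criteria are met."""
    return (junctions_supported(junctions, junction_reads)
            and all(retained_intron_supported(i, coverage)
                    for i in retained_introns))
```

A transcript with no retained introns is judged on its splice junctions alone, because `all()` over an empty sequence is true.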
Cufflinks version 2.2 was used in quantification mode (parameters: --library-type fr-firststrand -G -u -F 0.05 --max-intron-length 12000) to measure the expression of each transcript in the merged assembly in each sample. To minimize the presence of incorrectly assembled transcripts in the merged transcriptome assembly, each transcript was required to have an FPKM (fragments per kilobase of exon model per million reads mapped) above 3 in at least two of three biological replicates of a given individual, and to meet this criterion in at least 3 individuals in the population. This final merged and filtered transcriptome was used in all downstream analyses. It is included here as 'Polpar_deltoides_gffcmp.annotated_3geno_filt_cleaned.gtf'. Additionally, any individual that did not have at least 15 million reads in at least two of three replicates, as well as any individual for which only one replicate was sequenced, was removed from the analysis, resulting in a final set of 268 individuals. Expression quantification for each isoform is provided in the file Poplar_Isoform_Expression_matrix.zip.
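
The two thresholds described above (per-transcript expression and per-individual sequencing depth) can be expressed as a short sketch. It assumes a hypothetical layout in which each sample name maps to one individual; the actual column layout of the expression matrix in the zip file is not described here, and the function and parameter names are illustrative only.

```python
from collections import defaultdict

def passes_expression_filter(fpkm_by_sample, sample_to_individual,
                             min_fpkm=3.0, min_reps=2, min_individuals=3):
    """Return True if a transcript exceeds min_fpkm in at least min_reps
    replicates of an individual, for at least min_individuals individuals.

    fpkm_by_sample: dict mapping sample name -> FPKM for one transcript.
    sample_to_individual: dict mapping sample name -> individual ID.
    """
    reps_above = defaultdict(int)
    for sample, fpkm in fpkm_by_sample.items():
        if fpkm > min_fpkm:  # strictly above the threshold
            reps_above[sample_to_individual[sample]] += 1
    qualifying = sum(1 for n in reps_above.values() if n >= min_reps)
    return qualifying >= min_individuals

def individual_passes_depth(read_counts, min_reads=15_000_000, min_reps=2):
    """Return True if at least min_reps replicates have >= min_reads reads.

    read_counts: list of read totals, one per sequenced replicate.
    An individual with a single replicate can never reach min_reps=2,
    so it is excluded by the same test.
    """
    return sum(1 for c in read_counts if c >= min_reads) >= min_reps
```

Treating "above FPKM 3" as a strict inequality is an interpretation of the description; a value of exactly 3 would not count under this sketch.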