#PdTT Version 1.0 (January 2015)
#This document describes the workflow for phylostratigraphy analysis and transcriptome age index (TAI) calculation
#Please cite:
#Cheng X, Hui JHL, Lee YY, Kwan HS. 2015. A 'developmental hourglass' in fungi.

#Please make sure the following software package is properly installed:
#1) BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
#To calculate the TAI and statistically assess the TAI profile, please also install the following:
#2) Perl module: Statistics-R-0.33 (http://search.cpan.org/~fangly/Statistics-R-0.33/)
#3) R: http://www.r-project.org/

#####Step 1#####
#Prepare the following files:
#1) The gene list of the genome of interest
#2) The protein sequences of the genome of interest
#3) Download the NCBI non-redundant (nr) protein sequence database, http://www.ncbi.nlm.nih.gov/refseq/, to your local server
#4) Download the gi_taxid_prot.dmp.gz file from ftp://ftp.ncbi.nih.gov/pub/taxonomy/ to your local server
#5) Download the protein sequences of the reference genomes to your local server
#6) Prepare a file describing the reference genomes: ref_genome_ps.tab. The format of the file should be:
Allomyces_macrogynus_ATCC_38327 3
Mucor_circinelloides_1006PhL 3
Rhizopus_oryzae 3
...
# where the first column is the name of the reference genome and the second column is the phylostratum level of that organism

#####Step 2#####
#Step 2.1#
#Run blastp against the NCBI nr protein database. $in_fasta is the fasta file of protein sequences of the genome of interest, $ncbi is the NCBI nr protein sequence database, and $in_fasta.blastp.tab is the name of the output file:
blastall -i $in_fasta -d $ncbi -o $in_fasta.blastp.tab -m 8 -p blastp

#Step 2.2#
#Retrieve the NCBI taxon ID (taxid) for each protein GI number.
#[in_blastp_tab] is the tabular output file of blastp against the NCBI nr protein sequences, [gi_taxid_prot] is the gi_taxid_prot.dmp file downloaded from the NCBI taxonomy database
#This step will generate two files: "*.blastp.tab.taxid" and "*.taxid.list"
perl mapping_to_ncbi_taxonomy.pl [in_blastp_tab] [gi_taxid_prot]

#Step 2.3#
#Get the full taxid lineage at http://www.ncbi.nlm.nih.gov/Taxonomy/TaxIdentifier/tax_identifier.cgi
#Browse file "*.taxid.list" => select full taxid lineage, exclude common names => Save in file => "tax_report.txt"

#Step 2.4#
#Append the full lineage to each gene. [blastp_taxid_file] is the "*.blastp.tab.taxid" file generated in Step 2.2
#This step will generate one file: "taxid_lineage.txt"
perl append_lineage.pl [blastp_taxid_file] [tax_report.txt]

#####Step 3#####
#Run blastp against each of the reference genomes. $in_fasta is the fasta file of protein sequences of the genome of interest, $db_genome.fasta is the fasta file of protein sequences of a reference genome, and $db_genome.blastp.tab is the name of the output file.
#The name of the output file $db_genome.blastp.tab should begin with the name of the reference genome and end with .blastp.tab
blastall -i $in_fasta -d $db_genome.fasta -o $db_genome.blastp.tab -m 8 -p blastp

#####Step 4#####
#Run phylostratigraphy.pl
#[gene_list] is the gene list from Step 1, [ref_organism_ps.tab] is the reference genome file prepared in Step 1 (ref_genome_ps.tab), and [taxid_lineage.txt] is the file generated in Step 2.4
perl phylostratigraphy.pl [gene_list] [blastp_dir] [ref_organism_ps.tab] [taxid_lineage.txt] [blastp e-value cutoff]

#####Step 5#####
#Calculate the transcriptome age index (TAI)
#Run calculate_TAI_p.pl
#The [in_file] file should have the following format:
#Gene_id PS Intensity_stage1 Intensity_stage2 Intensity_stage3 ...
#where
#Column 1 = The gene ID
#Column 2 = The phylostratum level of that gene
#Column 3 onward = The expression levels of that gene across developmental stages
perl calculate_TAI_p.pl [in_file]
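
#####Illustration: what the TAI measures#####
#The sketch below is not the code of calculate_TAI_p.pl and does not reproduce its statistical assessment.
#It only illustrates the standard TAI definition for a developmental stage s:
#TAI_s = sum_i(ps_i * e_is) / sum_i(e_is), where ps_i is the phylostratum level of gene i and e_is is its expression level at stage s.
#Assumptions: the input follows the [in_file] layout of Step 5 with whitespace-separated columns, and any row whose values are not numeric (for example a header row) is skipped. The function name compute_tai is hypothetical.

```python
from typing import Iterable, List


def compute_tai(lines: Iterable[str]) -> List[float]:
    """Return one TAI value per developmental stage (expression-weighted mean PS)."""
    weighted: List[float] = []  # per-stage sum of ps_i * e_is
    totals: List[float] = []    # per-stage sum of e_is
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or incomplete row
        try:
            ps = float(fields[1])
            expr = [float(x) for x in fields[2:]]
        except ValueError:
            continue  # assumed header row such as "Gene_id PS ..."
        if not totals:
            weighted = [0.0] * len(expr)
            totals = [0.0] * len(expr)
        if len(expr) != len(totals):
            raise ValueError("inconsistent number of stages: " + line.strip())
        for s, e in enumerate(expr):
            weighted[s] += ps * e
            totals[s] += e
    # A stage with zero total expression raises ZeroDivisionError.
    return [w / t for w, t in zip(weighted, totals)]


# Made-up two-gene, two-stage example (not real data).
example_rows = [
    "Gene_id PS Intensity_stage1 Intensity_stage2",
    "gene_a 1 10 30",
    "gene_b 3 30 10",
]
print(compute_tai(example_rows))  # [2.5, 1.5]
```

#In the made-up example, stage 1 is dominated by the PS 3 gene (TAI 2.5) and stage 2 by the PS 1 gene (TAI 1.5): a lower TAI means that evolutionarily older genes contribute more to the transcriptome at that stage.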