Phylogenetic analyses of 34 syntenic gene families in visual opsin gene-bearing chromosome regions

<p>This file set contains sequence-based phylogenetic analyses of 34 vertebrate gene families identified in an analysis of conserved synteny in chromosome regions containing the genes for visual opsins, the G-protein alpha subunit families for transducin (GNAT) and adenylyl cyclase inhibition (GNAI), the oxytocin and vasopressin receptors (OT/VP-R), and the L-type voltage-gated calcium channels (CACNA1-L). For each gene family, amino acid sequences were predicted from the Ensembl genome browser and used to create sequence alignments and phylogenetic trees. Vertebrate gene families were defined based on Ensembl protein family predictions. Database identifiers, location data, genome assembly information and annotation notes for all identified protein families and sequences are included in 'Supplemental Table 705852.xlsx' (Excel spreadsheet). This spreadsheet also includes information on 7 gene families that were discarded from the analyses. Gene families are identified by unique abbreviations based on approved HUGO Gene Nomenclature Committee (HGNC) gene symbols, or known aliases from the NCBI Entrez Gene database.</p> <p><strong>File information:</strong></p> <p>For each gene family, an alignment file '...align.fasta', a neighbor joining tree '...NJ.phb' and a phylogenetic maximum likelihood tree '...PhyML.phb' are included. Alignments are included in FASTA format with the extension '.fasta'. This file format can be opened by most sequence analysis applications as well as text editors. Phylogenetic tree files are included in Phylip/Newick format with the extension '.phb'. This file format can be opened by freely available phylogenetic tree viewers such as FigTree and TreeView. Corresponding figures for all phylogenetic trees are also included as PDF files. 
Sequence names/leaf names include species abbreviations (see below) as well as chromosome/linkage group/genomic scaffold numbers, with lowercase letters to distinguish sequences located on the same chromosome, linkage group or scaffold. For the human sequences the full HGNC gene symbol is included.</p> <p>The species included in these analyses were (abbreviations and common names in parentheses): <em>Homo sapiens</em> (Hsa, human), <em>Mus musculus</em> (Mmu, mouse), <em>Monodelphis domestica</em> (Mdo, grey short-tailed opossum), <em>Gallus gallus</em> (Gga, chicken), <em>Danio rerio</em> (Dre, zebrafish), <em>Oryzias latipes</em> (Ola, medaka), <em>Gasterosteus aculeatus</em> (Gac, three-spined stickleback), <em>Tetraodon nigroviridis</em> (Tni, green spotted pufferfish), <em>Ciona intestinalis</em> (Cin, tunicate), <em>Ciona savignyi</em> (Csa, tunicate) and <em>Drosophila melanogaster</em> (Dme, fruit fly). In some analyses the following additional species were used: <em>Sarcophilus harrisii</em> (Sha, Tasmanian devil), <em>Taeniopygia guttata</em> (Tgu, zebra finch), <em>Anolis carolinensis</em> (Aca, Carolina anole lizard), <em>Xenopus (Silurana) tropicalis</em> (Xtr, Western clawed frog), <em>Takifugu rubripes</em> (Tru, Japanese pufferfish), <em>Branchiostoma floridae</em> (Bfl, Florida lancelet) and <em>Caenorhabditis elegans</em> (Cel, nematode).</p> <p>The following vertebrate gene families are included in this file set:</p> <p>ATP2B: ATPase, Ca++ transporting, plasma membrane<br>B4GALNT: Beta-1,4-N-acetyl-galactosaminyl transferase<br>CACNA2D: Calcium channel, voltage-dependent, alpha 2/delta subunit<br>CAMK1: Calcium/calmodulin dependent protein kinase<br>CDK: Cyclin-dependent kinase, members 16, 17 and 18<br>CELSR: Cadherin, EGF LAG seven-pass G-type receptor (flamingo homolog, Drosophila)<br>CNTN: Contactin precursor<br>COPG: Coatomer protein complex, subunit gamma<br>ERC: ELKS/RAB6-interacting/CAST family<br>FLN: Filamin<br>GXYLT: Glucoside 
xylosyltransferase<br>IKBKE: Kinase epsilon and TANK-binding kinase<br>IQSEC: IQ motif and Sec7 domain containing<br>KDM: Lysine specific demethylase 5<br>KLHDC: Kelch domain containing 8<br>L1CAM: L1 cell adhesion molecule<br>LRRN: Leucine rich repeat neuronal<br>MAGI: Membrane associated guanylate kinase, WW and PDZ domain containing<br>PHTF: Putative homeodomain transcription factor<br>PLG: Plasminogen ortholog<br>PLXNA: Plexin A<br>PPM1: Protein phosphatase, Mg2+/Mn2+ dependent<br>PRICKLE: Prickle homolog<br>PTPN: Protein tyrosine phosphatase, non-receptor type<br>RBM: RNA binding motif protein<br>RSBN: Round spermatid basic protein<br>SEMA3: Sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin)<br>SRGAP: SLIT-ROBO Rho GTPase activating protein<br>SYP: Synaptophysin<br>TIMM: Translocase of inner mitochondrial membrane 17<br>TWF: Twinfilin<br>UBA: Ubiquitin-like modifier activating enzyme, members 1 and 7<br>USP: Ubiquitin specific peptidase, members 4, 11, 15 and 19<br>WNK: WNK lysine deficient protein kinase</p> <p><strong>Method details:</strong></p> <p>Alignments were created using the ClustalW sequence alignment algorithm with the following settings: Gonnet weight matrix, gap opening penalty 10.0 and gap extension penalty 0.20.</p> <p>Phylogenetic analyses were carried out based on the included alignments using bootstrap-supported neighbor joining (NJ) as well as phylogenetic maximum likelihood (PhyML) methods supported by approximate likelihood ratio tests (aLRT). Where possible, phylogenetic trees are rooted with identified <em>Drosophila melanogaster</em> (fruit fly) sequences. Otherwise, trees are rooted with other identified invertebrate sequences (see Supplemental Table 1). The B4GALNT, PLG, PTPN, RBM, SEMA3 and USP trees are presented as midpoint-rooted trees in the figures (PDF); however, the corresponding phylogenetic tree files (.phb) are unrooted. 
NJ trees were made using standard settings in ClustalX 2.0.12, supported by a non-parametric bootstrap analysis with 1000 replicates. PhyML trees were made using the PhyML3.0 algorithm with the following settings: amino acid frequencies (equilibrium frequencies), proportion of invariable sites (with optimised p-invar) and gamma shape parameters were estimated from the alignments; the number of substitution rate categories was set to 8; BIONJ was chosen to create the starting tree; both NNI and SPR tree optimization methods were considered; and both tree topology and branch length optimization were chosen. The amino acid substitution model was chosen based on ProtTest3.2 results. The JTT model was applied for all gene families except B4GALNT, CACNA2D, COL, L1CAM, PLG, PPP, QSOX and UBA, where the WAG model was chosen, and RPL and TWF, where the LG model was chosen. PhyML trees are supported by approximate likelihood ratio tests (aLRT) with SH-like branch supports applied through PhyML.</p> <p>For the CAMK and GXYLT gene families the PhyML trees were repeated (same settings) using a non-parametric bootstrap analysis with 100 replicates rather than aLRT in PhyML. These trees did not improve on the aLRT-supported tree topologies.</p>
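<p>The model assignments described in the method details can be summarised as a small lookup. The following Python sketch is illustrative only and is not part of the original analysis pipeline: the function name <code>substitution_model</code> and the dictionary <code>PHYML_SETTINGS</code> are hypothetical, and the dictionary keys are descriptive labels rather than actual PhyML option names.</p>

```python
# Hypothetical helper: encodes the substitution model choices stated in the
# method details (JTT by default, WAG and LG for the listed exceptions).
WAG_FAMILIES = {"B4GALNT", "CACNA2D", "COL", "L1CAM", "PLG", "PPP", "QSOX", "UBA"}
LG_FAMILIES = {"RPL", "TWF"}


def substitution_model(family: str) -> str:
    """Return the amino acid substitution model named in the text for a family."""
    name = family.strip().upper()
    if name in WAG_FAMILIES:
        return "WAG"
    if name in LG_FAMILIES:
        return "LG"
    return "JTT"  # applied to all other gene families


# Settings shared by the PhyML runs, as described in the text.
# Keys are descriptive labels (assumptions), not PhyML command-line options.
PHYML_SETTINGS = {
    "amino_acid_frequencies": "estimated from alignment",
    "proportion_invariable_sites": "estimated from alignment",
    "gamma_shape": "estimated from alignment",
    "rate_categories": 8,
    "starting_tree": "BIONJ",
    "tree_search": ("NNI", "SPR"),
    "optimise": ("topology", "branch lengths"),
    "branch_support": "aLRT, SH-like",
}

if __name__ == "__main__":
    for fam in ("ATP2B", "PLG", "TWF"):
        print(fam, substitution_model(fam))
```

<p>Abbreviations are matched case-insensitively, and any abbreviation not listed as an exception falls back to JTT, mirroring the wording of the method description.</p>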