Identifying subset errors in multiple sequence alignments

Roy, Aparna; Taddese, Bruck; Vohra, Shabana; Thimmaraju, Phani K.; Illingworth, Christopher J.R.; Simpson, Lisa M.; Mukherjee, Keya; Reynolds, Christopher A.; Chintapalli, Sree V.

doi:10.6084/m9.figshare.825912.v2

tbsd_a_770371_sm1530.pdf (475.56 kB)


Version 2 2014-01-03, 14:17

Version 1 2014-01-20, 15:48

journal contribution

posted on 2014-01-03, 14:17 authored by Aparna Roy, Bruck Taddese, Shabana Vohra, Phani K. Thimmaraju, Christopher J.R. Illingworth, Lisa M. Simpson, Keya Mukherjee, Christopher A. Reynolds, Sree V. Chintapalli

Multiple sequence alignment (MSA) accuracy is important, but there is no widely accepted method of judging the accuracy that different alignment algorithms give. We present a simple approach to detecting two types of error, namely block shifts and the misplacement of residues within a gap. Given an MSA, subsets of very similar sequences are generated with a redundancy filter, typically using a 70–90% sequence identity cut-off. The subsets thus produced are usually small and degenerate, so errors can be detected easily, even by manual examination. The errors, albeit minor, are inevitably associated with gaps in the alignment, so the procedure is particularly relevant to homology modelling of protein loop regions. The usefulness of the approach is illustrated in the context of the universal but little-known [K/R]KLH motif that occurs in intracellular loop 1 of G protein-coupled receptors (GPCRs); other issues relevant to GPCR modelling are also discussed.
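The subset idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The function names (`pairwise_identity`, `similar_subsets`, `gapped_columns`), the definition of identity (matches over columns where both sequences have a residue), the single-linkage grouping, and the toy sequences are all assumptions made for this example. The redundancy filter used in the paper may define identity and subsets differently.

```python
from __future__ import annotations
from itertools import combinations


def pairwise_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumption: this is one of several common definitions of sequence identity.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def similar_subsets(msa: dict[str, str], cutoff: float = 0.8) -> list[list[str]]:
    """Group sequences linked by identity >= cutoff (single linkage, union-find)."""
    parent = {name: name for name in msa}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for a, b in combinations(msa, 2):
        if pairwise_identity(msa[a], msa[b]) >= cutoff:
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for name in msa:
        groups.setdefault(find(name), []).append(name)
    # Singletons carry no information about alignment consistency.
    return [members for members in groups.values() if len(members) > 1]


def gapped_columns(msa: dict[str, str], subset: list[str]) -> list[int]:
    """0-based columns where some, but not all, members of the subset have a gap."""
    length = len(msa[subset[0]])
    flagged = []
    for i in range(length):
        gaps = sum(msa[name][i] == "-" for name in subset)
        if 0 < gaps < len(subset):
            flagged.append(i)
    return flagged


# Hypothetical toy alignment: seqA and seqB place the same residue on
# opposite sides of a gap, which is suspicious for near-identical sequences.
msa = {
    "seqA": "MKT-AYIAKQR",
    "seqB": "MKTA-YIAKQR",
    "seqC": "MKTAAYIAKQR",
    "seqD": "GGGWWPLLNNC",
}

for subset in similar_subsets(msa, cutoff=0.8):
    print(subset, gapped_columns(msa, subset))
```

Columns flagged this way are only candidates for inspection. Deciding whether they reflect a block shift, a misplaced residue, or a genuine insertion still requires examining the subset, as the abstract suggests.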