Identifying subset errors in multiple sequence alignments

Multiple sequence alignment (MSA) accuracy is important, but there is no widely accepted method of judging the accuracy that different alignment algorithms give. We present a simple approach to detecting two types of error, namely block shifts and the misplacement of residues within a gap. Given a MSA, subsets of very similar sequences are generated through the use of a redundancy filter, typically using a 70–90% sequence identity cut-off. Subsets thus produced are typically small and degenerate, and errors can be easily detected even by manual examination. The errors, albeit minor, are inevitably associated with gaps in the alignment, and so the procedure is particularly relevant to homology modelling of protein loop regions. The usefulness of the approach is illustrated in the context of the universal but little known [K/R]KLH motif that occurs in intracellular loop 1 of G protein coupled receptors (GPCR); other issues relevant to GPCR modelling are also discussed.

In principle, dynamic programming (Needleman & Wunsch, 1970) permits the determination of an optimum alignment, but for more than about 10-20 sequences, CPU demands dictate the use of heuristic approaches, as in Clustal (Larkin et al., 2007;Thompson, Gibson, Plewniak, Jeanmougin, & Higgins, 1997). However, the optimum defined by the objective function does not necessarily guarantee that the MSA has optimal biological meaning, making it difficult to assess MSAs. In this multiple sequence context, reliability is usually sought by choosing an alignment program that performs well against a series of curated MSA benchmarks such as BAliBASE, Oxbench and Prefab (Aniba, Poch, Marchler-Bauer, & Thompson, 2010a;Aniba et al., 2010b;Edgar, 2004;Raghava, Searle, Audley, Barber, & Barton, 2003). There are systematic approaches for identifying badly aligned residues in MSAs (where there is no reference to e.g. benchmarks) (Blouin, Perry, Lavell, Susko, & Roger, 2009;Dickson et al., 2010;Lassmann & Sonnhammer, 2007), and there are several approaches for scoring alignments and for removing badly aligned sequences or correcting poorly aligned regions (Lassmann & Sonnhammer, 2005a;Muller, Creevey, Thompson, Arendt, & Bork, 2010;Thompson, Thierry, & Poch, 2003;Thompson, Plewniak, Ripp, Thierry, & Poch, 2001). The approach presented here, inspired by the work of Baldwin (Baldwin, 1993;Baldwin, Schertler, & Unger, 1997), is fundamentally different as the focus is on generally well-aligned subsets of very similar sequences and while the errors are relatively minor, they are nevertheless distinctly unwanted in some applications such as homology modelling. It involves splitting the MSA into subsets of sequences with high sequence identity, in which subsets errors can be easily identified. Here we have analysed alignments from a number of sources to illustrate the type of errors and the error rate that may be expected in both database alignments and in alignments generated in house.

Multiple sequence alignments
MSAs were generated with default parameters for 17 protein sequences of known structure, collected following a BLAST search (Altschul et al., 1997) terminated at an E value of 0.001 using ClustalX v1.8 (Thompson, Higgins, & Gibson, 1994;Thompson et al., 1997), Kalign v2.04 (Lassmann & Sonnhammer, 2005b), MaFFT (Katoh, Misawa, Kuma, & Miyata, 2002), Muscle v3.7 (Edgar, 2004) and T-coffee v7.97 (Notredame, Higgins, & Heringa, 2000). Our version of T-coffee did not process more than 100 sequences and so for T-coffee MSAs of greater than 100 sequences, a Clustal alignment was initially used and the sequences were then ordered according to sequence identity using Jalview (Clamp, Cuff, Searle, & Barton, 2004). The alignment was then split into groups of 99 sequences that were aligned using T-coffee and these alignments were then merged using a profile alignment in Clustal (in order of decreasing sequence identity). In some cases, alignments were improved using RASCAL (Thompson et al., 2003) and scored using norMD (Thompson et al., 2001). The quality of the MSAs was evaluated using norMD, but it should be stressed that nor-MD is assessing different aspects of the MSA quality to those described below. In addition, a MSA of 1000 class B G protein coupled receptors (GPCRs) was obtained from the GPCRDB (Horn et al., 1998) and realigned using Clustal; 50 subsets were generated at the 70% sequence identity level. An alignment of 278 sequences from the class B calcitonin family was generated and analysed in a similar way.

Generating subsets to identify subset errors
The MSAs were processed through the ExPASy 'remove redundancy' filter (Notredame, 2007), or in-house code, to remove sequences with 90% and 70% sequence identity. Together the sequences removed, along with the sequence left in the alignment to which they were similar, formed subsets of similar (or very similar in the case of 90% identity) sequences. Because of the high similarity within the subsets, the errors in the alignment were identified very readily. Such subset errors are clearly visible in Figure 1(a). While we show here that these errors can be significant, they usually appear in less important loop regions and have previously largely The motif is more apparent when the reader is aware of the misplacement of a residue within a gap, for example, R in XP_001494282 and block-shift errors, for example, in b7z7u7_human. (b) Errors in the subset have been corrected using Jalview to reveal errors in sequences a4kp32_danre and a9c3t8_danre. (c) The gap in the helical region has been removed by realigning sequences a4kp32_danre and a9c3t8_danre. been ignored as they can appear trivial, particularly as they do not usually affect the NORMD score,

Typical errors
Typical block shift and misplacement errors in the [R/K] KLH motif are shown in Figure 1(a). The errors involve the mis-placement of a residue, which commonly occurred either side of a gap, for example, R in XP-001,494,282, q4rxp0_dadre, celr2_rat and XP002751246, which should move to the right hand side of the gap to complete the motif (and thus align with other Rs in the same column). The errors also involve shifts in a short block of residues e.g. RML in celr1_human and RTL in q571l9_mouse (so that the R and L can align on the right hand side), In this subset, shifts of short blocks of 2 or 4 residues are also observed, for example, RS in q5y190_human and RM in v7z7u7_human, RTLR in XP_001196967 and XP_002661517 and KHLR. Sequence q4kp32_danre has two errors. Frame shift errors (block shift errors that continue to the end of the sequence) can also be observed but these are far less common. In general, such alternative positions, for example, for R in Figure 1(a), would be equally viable, especially if both columns contain R. When analysed within the context of a MSA, these errors are largely invisible. However, they become more apparent when subsets are analysed. Thus, Figure S1 shows a subset where the motif aligns perfectly, along with other subsets that contain similar errors to those in Figure 1(a). However, when the errors occur in highly similar subsets it is imperative that the alignment is corrected, because of the very nature of degenerate sequences.
We note that a block shift of 1 residue could give rise to an error of over 5 Å in the position of a C β atom in a homology model. Since the errors are inevitably associated with gaps, this procedure of error detection is particularly relevant to homology modelling when the loop regions of the modelled protein are of interest, since gaps are frequently more prominent in loop regions. This homology modelling problem is well illustrated with the class B GPCR alignments, which contain a [K/R] KL 1.63 H motif in intracellular loop 1 (Figures 1 and S1); this motif is universal  in the sense that it is also seen in class E and class A GPCRs, where the X-ray structures containing this motif all have a well-defined structure in which the L (or similar) at position 1.63 (generic numbering (Ballesteros & Weinstein, 1995)) makes an interaction with F (or similar) at position 8.50 on helix 8, which also contacts Y 7.53 (or similar) in the inactive structures; this hydrophobic interaction is often accompanied by a polar interaction between 1.61 or 1.62 and 8.49 . Although the motif is not well-known within class A or class B GPCRs as the percentage identity is not high, it came to light in an alignment of class A and class B GPCRs where it provides one of the more obvious similarities between the classes . Its role in GPCR structure and stability will be discussed elsewhere (Simpson et al., 2013). In many of the 50 subsets, the motif is well aligned, and various alternative forms of the motif are apparent, for example, RRLH, RSLR, SNLH ( Figure S1, Table 1). In some subsets, the motif is split by gaps (with no block shift errors etc., as in Figure S1a); this may be one of the reasons why this motif has not previously been widely discussed in the literature. In a small proportion of these subsets, however, there are block shift errors, as discussed above and shown in Figure 1(a) and Figures S1b-S1f, and this could disrupt a homology model. The most likely outcome in homology modelling is that the very specific structural form of the KKLH motif, typically seen in rhodopsin, is replaced by an unstructured loop.
The [R/K]KLH motif is not the only place where these errors occur. Additional errors in the alignment of the N-terminus of the calcitonin receptor are shown in Figure S2. It is difficult to say anything about the error in Figure S2a as it lies in a region where there is no structural information as the sequences of the known structures within this family are not as long as the ones in Figure S2a. The error in Figure S2b lies at the interface between the N-terminus and transmembrane helix 1. There is interest in modelling class B GPCR structures by combining models of the transmembrane region with X-ray crystal structures of the N-terminus (Dong et al., 2011;Parthier, Reedtz-Runge, Rudolph, & Stubbs, 2009). However, modelling the hinge region of these models could be limited by errors such as those in Figure S2b.
An error in the alignment of extracellular loop 2 (EL2) is shown in Figure S3. EL2 plays a key role in receptor activation and ligand binding Ivanov, Barak, & Jacobson, 2009) and errors in EL2 have been shown to limit the development and use of GPCR homology models (Ivanov et al., 2009), which can otherwise be useful in both the study of receptor activation and in docking studies to identify novel ligands (Carlsson et al., 2010;Katritch et al., 2010;Taddese, Simpson, Wall, Blaney, & Reynolds, 2013).
Errors that could result in insertions in intracellular loop 2/transmembrane helix 4 are shown in Figure S4; incorrect treatment of such errors may leave the helix or adjoining loop out of register, with key residues facing in the wrong direction. While deletions in helix 2 of class A GPCRs have been proposed as major evolutionary events (Chabbert et al., 2012), similar subsets should not differ in this regard. The GWG motif ( Figure S4a) or its GYG equivalent ( Figure S4b) are well known helical motifs Lagerstrom & Schioth 2008).
While the best place to insert the gap in Figures S4a and S4b. may not be obvious, it is probably better to insert the gap sufficiently to the left of the GWG/GYG motif to ensure that it is in the loop region. However, this

Comparing alignment programs
The ranking of the alignment programs, as judged by the total number of subsets with errors, is Kalign > MAFFT Clustal > Muscle ( Table 2). The order is surprising as Kalign, Muscle and MAFFT were developed to align large numbers of sequences rather than with accuracy as the sole criterion. The order presented here is not the same as that presented elsewhere (Katoh et al., 2002;Lassmann & Sonnhammer, 2005b), because the criteria are very different: here, we are assessing only the small number of subset errors that occur in very similar sequences rather than an alignment score based on the whole MSA that may be heavily influenced by divergent sequences. In general, the most appropriate alignment algorithm depends on the nature of the sequences (Thompson, Plewniak, & Poch, 1999) and the choice of alignment program should not be made on the basis of these results, but the results should influence the effort made to identify and correct subset errors. The norMD ranking (MaFFT > Kalign > Muscle > Clustal, Table S1) does not necessarily correlate with the number of subset errors because norMD scores the complete MSA whereas we only assess a small part of the MSA. The widely used Clustal program generally had a low error rate, but Kalign is unique in having almost no block shift or misplacement errors. It is interesting to note that different numbers of subsets and norMD scores were generated for each alignment even though the slightly different alignments were all generated from the same sequences. The comparison between Clustal and T-coffee, given in Table 2, was not limited to alignments of 50/100 sequences but rather larger T-coffee alignments were determined as described in the methods section. Here, Tcoffee generally performs considerably better than Clustal according to the norMD scores in Table S2, but Clustal gave fewer subset errors than T-coffee. Surprisingly, one of the Clustal alignments (1agrA) constructed by combining shorter Clustal alignments using profile alignments (as for T-coffee), gave a higher norMD score (0.698, Table S2) than that obtained for the standard Clustal alignment (0.630)possibly because the propagation of errors throughout the alignment was limited.
Errors are more likely to be observed when the sequence identity level is decreased, for example, from 90 to 70%, as shown in Table 3. Even more errors are likely to be observed when the sequence identity level is decreased, for example, down to 50 or 40%, (results not shown), but under these circumstances the method is far less valid as the sequences are no longer highly similar and structural alignments and sequence alignments are unlikely to be equivalent. However, we note that a large block shift might cause a sequence to be incorrectly omitted from a given subset, whereas it might be identified when the sequence identity criterion is relaxedfor Table 2. The number of similar subsets and the number of subsets containing errors for 90% degeneracy, arising from MSAs generated using Clustal, Muscle, MAFFT and K-align; the total number of errors is given in parentheses (). The lead sequence used to generate the alignment via blast and MSA is given by the PDB code (and chain) for the corresponding structure. For each MSA, the best method is indicated in bold and the worst is italicized.

Codes
Seqs. this reason a poor alignment could score better than a good alignment by our criteria.

Correcting MSAs
A simple way to correct the errors would be to edit the subsets and add them back to the whole MSA using a profile alignment. Thus, Figure 1(b) shows corrections to the alignment shown in Figure 1(a) enacted by using Jalview as an editor by (i) moving the various parts of the KKLH motif to coincide with the correct form of the motif as shown in Figure S1a and (ii) removing a single gap from each sequence between helix 1 and the motif (position 10). This leaves a gap of three residues in two sequences (a4kp32_danre and a9c3t8_danre) at around position 10, which is problematic because it occurs in the helix. This second error is only apparent when the first error is corrected; the second error can be fixed by moving the helical residues three places to the right, as shown in Figure 1(c). This subset can now be added back to main alignment using a profile alignment. Ideally, subsets should be generated at multiple degeneracy levels to identify all the block-shift and misplacement errors. For example, in the 1he8A alignment, the Clustal alignment contains one unique error that can be detected at any degeneracy level between 40 and 85%, while the T-coffee alignment contains three unique errors, one of which can be detected at 65-85%, 2 can be detected at 50-60%, while 3 can be detected at 40-45%. Different alignment programs insert gaps at different placeshence the different number of errors for the different methods in Tables 2 and 3. Corrections to errors where a residue is misplaced within a gap should be undertaken with reference to similar residues within both the subset and the whole MSA but they should clearly be in the same position within a highly similar subset.

Conclusions
Here we have presented a simple procedure for identifying block shift and misplacement errors in MSAs by the analysis of subsets of (highly) similar sequences in which the determination of the errors is relatively straightforward. The approach has been applied to a number of MSAs generated using Clustal, Kalign, MaFFT, Muscle and T-coffee. MSAs generated using Kalign had almost no block shift and misplacement errors while those generated using Muscle and T-coffee had the most. However, these errors are not the sole determinant of alignment quality as alignments with higher norMD scores sometimes had more block shift and misplacement errors. For this reason, the method is complementary to those that assess the whole alignment, including the more divergent sequences. The block shift and misplacement errors are inevitably associated with gaps and so may pose a threat to the loop regions of homology models since such loop regions generally correspond to regions of a MSA that are rich in gaps. It is in homology modeling where minor alignment errors can have large deleterious effects that this method is likely to be most beneficial. Moreover, it is possible that the [R/ K]KLH motif in intracellular loop 1 of class A class B and class R GPCRs, has not been previously discussed because these errors have helped to obscure it. The procedure for correcting these errors will depend on the degree of degeneracy within the subset. For highly similar subsets, identical residues involved in misplacement errors should almost certainly reside at the same Table 3. The number of subsets and the number of subsets containing errors arising from MSAs generated using Clustal and Tcoffee; the total number of errors is given in parentheses (). The lead sequence used to generate the alignment via blast and MSA is given by the PDB code (and chain) for the corresponding structure. For each MSA, the best method is indicated in bold and the worst is italicized. The number of errors refers to the number of subsets containing an error; when a subset contains more than one error the total number of errors is given in parentheses.
Clustal ( position; the alignment should be edited with reference to similar residues within both the subset and the whole MSA.

Supplementary material
The supplementary material for this paper is available online at http://dx.doi.10.1080/07391102.2013.770371.