BetaSuperposer: superposition of protein surfaces using beta-shapes

The comparison between two protein structures is important for understanding a molecular function. In particular, the comparison of protein surfaces to measure their similarity provides another challenge useful for studying molecular evolution, docking, and drug design. This paper presents an algorithm, called the BetaSuperposer, which evaluates the similarity between the surfaces of two structures using the beta-shape which is a geometric structure derived from the Voronoi diagram of molecule. The algorithm performs iterations of mix-and-match between the beta-shapes of two structures for the optimal superposition from which a similarity measure is computed, where each mix-and-match step attempts to solve an NP-hard problem. The devised heuristic algorithm based on the assignment problem formulation quickly produces a good superposition and an assessment of similarity. The BetaSuperposer was fully implemented and benchmarked against popular programs, the Dali and the Click, using the SCOP models. The BetaSuperposer is freely available to the public from the Voronoi Diagram Research Center (http://voronoi.hanyang.ac.kr).


Introduction
Similar molecules are evolutionarily related. If a new molecule is evolutionarily related with a known molecule, the function of the new molecule can be inferred from that of the known molecule with a degree of accuracy depending on the evolutionary distance between them (Holm & Sander, 1996). In general, there are different ways to define a similarity between two objects. As a simple example, two points in the plane may have different measures of similarity defined by the distance between them. Different metrics (e.g. l 1 -, l 2 -, and l ∞ -metrics) produce different distances from the two points. Likewise, there are different ways to define similarity between two molecular structures even if the measure of similarity is mathematically well defined. A most immediate similarity is the distance between the sequences of DNAs, RNAs, or proteins. Ever since Needleman and Wunsch used dynamic programming (DP) to measure the sequence similarity between two proteins (Needleman & Wunsch, 1970), a sequence similarity has been very popular for comparing proteins.
It is generally agreed that a structure is more informative in determining a molecular function, because it is more conserved than a sequence during evolution. Hence, a structural similarity can facilitate better understanding of a new molecule by referring to a known molecule whose function has been well studied and it may detect evolutionary links between them that cannot be identified by a sequence analysis only. Therefore, measuring a structural similarity between two molecules is important in understanding molecules. A structural similarity between two molecules in this paper is the similarity between their geometric shapes of three-dimensional molecular structures. There are two different points of view about the geometric similarity. The first is about an entire molecular structure including both the interior and the boundary of a molecule and the second is only about the molecular boundary. In both views, it is necessary to have two proteins appropriately "aligned" and "superposed" before the distance between them is measured.
While a protein structure may experience a considerable topology change during evolution due to the sequence change caused by genetic mutations and crossovers, residues constituting a catalytically active site tends to be conserved (Chothia & Lesk, 1986;Fischer, Bachar, Nussinov, & Wolfson, 1991;Fowle & Stillman, 1997;Holm & Sander, 1996) and a protein function occurs predominantly on or near a protein surface (Holm & Sander, 1996;Via, Ferré, Brannettia, & Helmer-Citterich, 2000). For example, Fischer et al. showed the wellknown similarity between the active sites of trypsin-like and subtilisin-like serine protease families even though they did not fold into similar structures in their entirety (Fischer, Wolfson, Lin, & Nussinov, 1994). Kauvar and Villar reported cases of nonhomologous proteins leading to an identical active site and proteins with a similar structure leading to different active sites with different functions (Kauvar & Villar, 1998). Rost reported the existence of the twilight zone of the sequence alignment and the structure alignment for proteins (Rost, 1999). Hence, the strength of evolutionary persistence is in the order of surface, structure, and sequence and thus measuring the surface similarity can be of priority.
This paper presents an effective and efficient algorithm, called the BetaSuperposer, to measure the similarity between two protein surfaces. We define a match (usually called the "equivalence") between two protein boundaries and perform a mix (usually called the "transformation") between their boundaries by rigid-body motions (i.e. translations and rotations) of one with respect to the other. Because a pair of these two steps iterates, we call it a mix-and-match. Each mix-and-match step attempts to solve an NP-hard problem. This implies that there should be two important considerations in the design of an algorithm: (i) How to reduce the problem size and (ii) how to devise a good heuristic. This paper contributes to both. In the previous studies, the first issue was handled by selecting a subset of molecule such as alpha-carbons, backbone atoms, fragments of residues, or secondary structure elements. This paper proposes a more effective and efficient approach of using protein boundary extracted by the beta-shape, which is a derivative structure of the Voronoi diagram. Regarding on the second issue, traditional approaches frequently used DP to compute a new match. This paper introduces a formulation using an assignment problem (AP), a special case of integer linear programming (ILP) problem. The Beta-Superposer algorithm also plays a core role in the recently developed docking software BetaDock by quickly finding a shape-complementarity between a ligand and a pocket on a protein boundary (Kim et al., 2011). The BetaSuperposer was fully implemented and benchmarked against the popular programs Dali (Holm & Park, 2000) and Click (Nguyen, Tan, & Madhusudhan, 2011) using a set of 24 structures from the SCOP database. Even if we describe the algorithm in terms of proteins, we emphasize that the BetaSuperposer can also be used for other macromolecules such as DNAs and RNAs. The BetaSuperposer program, for both 32-bit and 64-bit Linux platforms, is freely available from the Voronoi Diagram Research Center website (http://voronoi. hanyang.ac.kr).

Structure comparison vs. sequence comparison
From a computational point of view, comparing protein structures is more complicated than comparing their sequences (Via, Ferré, Brannettia, & Helmer-Citterich, 2000). Sequence comparison problem is one-dimensional (Frenkel, 2008) whereas structure comparison problem is three-dimensional with increased algorithmic complication and computational burden. It is not a surprise in that the alignment and superposition problems in a structure comparison are NP-hard, because many problems related to sequences are already NP-hard. Structure comparison involves both geometry and topology, whereas sequence comparison involves symbols and their relative orders in computational process. Structure comparison requires a systematic representation of topology among objects constituting proteins. Sequence comparison does not have any floating point arithmetic and thus there is no room for numerical error. Structure comparison, on the other hand, almost always involves floating point arithmetic, and therefore the accumulation of round-off errors always matters. Sequence comparison has relatively well-defined measures of similarity and its current main research concern lies in computational efficiency. In fact, a sequence comparison has a theoretical justification based on the extreme value distribution of the maxima of finite sequences of random variables (Coles, 2001). However, we are not aware of its counterpart for structure comparison.
While an amino acid is represented by a symbol in a sequence, it usually consists of dozens of atoms in a molecular structure. Hence, in structure comparison, different methods are used to reduce computational burden depending on an abstraction method used (Kar, Sherman, & Johnson, 1994;Kini & Evans, 1992;Lee, Suh, Kim, & Kim, 1999;Zhang & Wang, 2010). For example, atom groups, residues, secondary structures, tertiary structures, quaternary structures, domains, alpha-carbons, or backbone atoms are different units of abstraction.
Sequence comparison uses sequence data available from UniPROT that can be safely assumed complete and accurate without any noise. On the other hand, structure comparison uses structure data available from PDB, but the data are almost always incomplete and contains noise. The noise in the structure data can cause a significant twist in the topology due to its sensitivity to noise. The main source of the noise is as follows: (i) instances of a protein usually have different conformations, (ii) errors are involved in a measurement, and (iii) a subset of structure is usually missing. Carugo showed that there is a (linear) relationship between the RMSD and the resolution of PDB files in comparing domains (Carugo, 2003). Figure 1 is a screen capture of PDB website that shows the sequence of Antithrombin (one of the serpin proteins, PDB accession code: 1ant), which has 432 residues. The bar over the sequence denotes the residues existing in the PDB file. In other words, the three intervals marked by the ellipses denote the missing residues (40 residues in total) in the PDB file.

Surface comparison vs. entire structure comparison
Early studies of structure comparison were done mostly for entire molecular structure by aligning sequences based on structural criteria primarily based on inter-residue distances. Usually assuming the collinearity of two sequences, DP was used for the minimization of distance between two-aligned sequences based on the inter-residue Euclidean distances. It is important to note that DP preserves the sequence collinearity both before and after the optimization process. Comparing entire structure is usually applied to measure the similarity between proteins that are believed to be already somewhat similar.
Holm and Sander reported the Dali for pairwise alignments of structures based on a residue-residue distance matrix, where each residue was represented by alpha-carbons (Holm & Sander, 1993). Then, similar contact patterns in the matrix were compared by Monte Carlo method. Gerstein and Levitt reported a structural alignment algorithm using the backbones of two structures with iterations of DP and least-squares fitting (Gerstein & Levitt, 1998). Shindyalov and Bourne reported an incremental combinatorial extension algorithm for the alignment between two structures using DP and Monte Carlo simulation (Shindyalov & Bourne, 1998). Wohlers et al. reported an algorithm for the alignment of inter-residue matrices by building upon a maximum contact map overlap (Wohlers, Domingues, & Klau, 2010). Taylor and Orengo proposed a double DP which has two levels of DP running O(n 4 ) in the worst and average cases where n is the number of residues (Taylor & Orengo, 1989).
A surface comparison is different from the approaches above in that a sequence collinearity should not be a constraint in the minimization of the distance between two structures. This is because surface comparison ignores the correspondence among atoms internal of the boundaries. Hence, a surface comparison problem can be formulated as a combinatorial optimization problem using an ILP minimizing the interdistance for the equivalence set from all possible pairs of atoms in two protein structures. It turns out that determining the equivalence is a particular case of ILP called the AP that can be solved in O(n 3 ) time in the worst case using the Hungarian method (Kuhn, 1955;Edmonds & Karp, 1972), where n is the number of atoms in the equivalence. Note that an ILP takes an exponential time to solve despite that there is a reliable software like CPLEX (IBM, 2012a). Once the equivalence is determined, the superposition can be done by employing a coordinate transformation. We emphasize the process of determining the equivalence and call it equivalencing.
The reduction of the problem size in a formulation is important. Traditionally, the comparison of an entire structure was done using an abstraction such as alpha-carbons or backbone atoms. However, in the surface comparison, such an abstraction may not be very effective, because it can significantly twist protein surface. Therefore, it is desirable, if not necessary, to use the entire atom set. On the other hand, it is also desirable to use as small a dataset as possible for computational efficiency, and therefore methods to abstract a protein boundary is also necessary. The theory of the beta-shape provides a rigorous and efficient framework for these needs.

Beta-shape
Suppose that A = {a 1 , a 2 , …, a n } is a protein, where a i = (p i , r i ) is a three-dimensional spherical atom with the center p i and radius r i . Two atoms may intersect, but one cannot contain another. Let VC(a i ) denote a Voronoi cell for a i defined as VC(a i ) = {x 2 R 3 | d(x, p i )Àr i 6 d(x, p j )Àr j , i ≠j} where d(x, y) denotes the Euclidean distance between two points x and y. Then, the Voronoi diagram VD(A) is defined as VD(A) = {VC(a 1 ), VC(a 2 ), …, VC (a n )}. Note that VD(A) is the Voronoi diagram of spheres and is different from the ordinary Voronoi diagram of points. VD(A) is represented as a quadruplet VD( …} are the set of vertices, edges, faces, and cells in VD(A), respectively. The topology among these entities is also stored in a data structure. A Voronoi vertex is the center of an empty sphere tangent to the boundaries of four nearby atoms; a Voronoi edge is a locus of points equidistant from the boundaries of three nearby atoms; a Voronoi face is the mid-surface between the boundaries of two nearby atoms. Readers are referred to Kim, Kim, Cho, & Sugihara, 2006) for more details on the properties and the algorithms.
It is well known that the dual of the Voronoi diagram of points is the Delaunay triangulation and a simplicial complex which can be stored in a compact data structure and the topology traversal among the simplices in the complex is quite easy and efficient (Munkres, 1984;Boissonnat & Yvinec, 1998). However, the dual structure of the Voronoi diagram of atoms, called a quasi-triangulation, is not always a valid triangulation. However, the quasi-triangulation is mostly triangulation and can be stored in Interworld data structure after the anomalies are identified. The conversion between the Voronoi diagram and the quasi-triangulation takes O(m) time in the worst case, where m is the number of geometric entities in either structure. In the molecular world, m = O(n) for n atoms. Hence, the topologies of the Voronoi diagram and the quasi-triangulation are equivalent from both a computational and a informational point of view. For the details of quasi-triangulation and the duality between the Voronoi diagram and the quasi-triangulation, readers are referred to .
The constructs called the beta-shape and beta-complex have been recently proposed, and their properties and computational algorithms are reported in Kim, Cho, Sugihara, Ryu & Kim, 2010). The beta-hull is defined in a way similar to the alpha-hull (Edelsbrunner & Mücke, 1994). Consider a three-dimensional space filled with a soft matter with some spherical atoms of varying radii scattered within the matter. Carving out the matter with an omnipresent spherical eraser whose radius is β will result in a shape which we call a beta-hull. Since the eraser is omnipresent, there can be interior voids as well. If the spherical eraser is the probe for a water molecule, the boundary of the beta-hull is indeed the molecular surface, often called the Connolly surface (Connolly, 1983a(Connolly, , 1983b. The spherical eraser is called a probe.
Suppose that we have a beta-hull of A. We can straighten the surface of the beta-hull by substituting straight edges for the circular arcs and planar triangles for the spherical triangles, where the vertices are the centers of the atoms contributing to the boundary of the beta-hull. The straightened object bounded by planar facets is the beta-shape of A. In this paper, we consider that the beta-shape is connected to form a single component. Otherwise, each connected component of the beta-shape may be handled separately. The beta-shape is nonmanifold and may have dangling edges, dangling triangles, and even dangling tetrahedra.
We note the following: (i) the boundary of the betashape is the exterior boundary of the beta-complex and (ii) the beta-complex is a subset of the quasi-triangulation which lies within the boundary of the corresponding beta-shape. The beta-shape represents the proximity among the atoms on the boundary of a molecule and the beta-complex represents the proximity among all atoms in a molecule.
Above, we described the concepts of a beta-shape and a beta-complex using an infinite number of probes. However, we note here that their computation is not done in this way, but can be done very efficiently via the analysis of shapes of quasi-triangulation simplices with respect to the size of atoms and the probe. The algorithm to compute the beta-complex (and therefore the beta-shape as well) consists of three steps and is summarized as follows: (i) the computation of the Voronoi diagram of a molecule (taking O(n 3 ) time for general spheres and O(n 2 ) time for molecules both in the worst case), (ii) the conversion from the Voronoi diagram to the quasi-triangulation (taking O(n) time in the worst case), and (iii) the search for simplices for the beta-complex (taking O(log n + k) time in the worst case, where there are k simplices in the beta-complex). We note here that the first step, the computation of the Voronoi diagram for all proteins we tested in PDB, shows empirically O(n) time for n atoms.
We want to note that the Voronoi diagram and the quasi-triangulation are used for various applications including one reported in this paper. Hence, the first two steps, the computation of the Voronoi diagram and the corresponding quasi-triangulation, can be done offline as pre-processing and stored in the database. Therefore, the computation of the beta-complex corresponding to a particular value of β requires the third step only. Figure 2(a) shows a two-dimensional molecule A consisting of nine two-dimensional atoms, and Figure 2 (b) shows the vdw-boundary of A. Note that the vdwboundary contains two internal voids, a few shallow depressions, and a deep depression on the exterior boundary of the model. Figure 2(c) is the molecular surface of A corresponding to a small probe (shown in the black circle). Note that only one interior void is left in Figure 2(c), which corresponds to the larger void in the vdw-boundary of Figure 2(b). The void corresponding to the smaller one in the vdw-boundary has disappeared. Note also that the shape of the void in the molecular Figure 2. Protein, the boundaries of the protein, the beta-shape, and the beta-complex. (a) a set of nine atoms in R2, (b) the vdwboundary, (c) the molecular surface corresponding to a small probe, (d) the corresponding beta-shape, (e) the corresponding betacomplex, (f) a beta-shape corresponding to a larger probe, (g) the corresponding beta-complex with simplices removed from the quasi-triangulation, (h) the protein hormone receptor (1nav, 1984 atoms), and (i) the beta-shape of 1nav corresponding to β = 1.4 Å.
surface is different from that of the vdw-boundary. Hence, the molecular surface removes some local shape characteristics which are usually unnecessarily detailed for applications such as finding drug candidates. Regions where a water molecule cannot fit may not usually interact with other meaningful chemicals. Figure 2(d) shows the beta-shape corresponding to the molecular surface of Figure 2(c) where the shaded region is the interior of the beta-shape. The beta-shape in Figure 2(d) has an interior void corresponding to the void of the molecular surface and a dangling edge corresponding to a pair of atoms that are exposed to or touched by the probe. The dangling edge corresponds to the deep depression in the external boundary of the molecular surface. The boundary of the beta-shape has nine vertices and nine edges (six on the exterior boundary and three on the interior void). Figure 2(e) shows the beta-complex corresponding to the beta-shape in Figure 2(d). Note that each vertex of the beta-shape and beta-complex corresponds to an atom. Figure 2(f) and (g) show the beta-shape and beta-complex corresponding to a larger probe, respectively. Note that both the dangling edge and the internal void in the beta-shape of Figure 2(d) have now disappeared. The dotted line segments in Figure 2(g) form the quasi-triangulation of the molecule A together with the beta-complex in Figure 2(g). Figure 2(h) shows a threedimensional protein model, hormone receptor (PDB id: 1nav), consisting of 1984 atoms, and Figure 2(i) shows the beta-shape of 1nav corresponding to a probe of the radius 1.4 Å, i.e. a water molecule. The white spot in the middle of the beta-shape in fact denotes a tunnel passing through the molecular surface of the protein. Note that the three-dimensional beta-shape is not necessarily manifold, either. Details on the beta-shape and beta-complex can be found in (Kim et al., 2010;Kim, Seo, Kim, Ryu, & Cho, 2006).

Abstraction of protein boundary via beta-shape
Being an abstraction of protein boundary corresponding to a β-probe, the beta-shape is a good method to extract a smaller set of atoms that can represent protein boundary. The number of the vertices on the beta-shape can be significantly smaller than the number of atoms in the entire protein if the β-value is properly chosen. Let n be the number of protein atoms and N β be the number of vertices on the beta-shape boundary corresponding to β. In other words, N β is the number of boundary atoms touchable by a probe of radius β. Let r b ¼ N b n . We have selected a test set of 100 PDB files (see Table 1 in the Supplementary material for details). For a given β-value, we computed r β for each of the 100 test proteins, where the β-value is incrementally given in the range of 0 6 β 6 1000 Å. Let μ(r β ) be the average of the 100 r β 's at each β-value. Black circles in Figure 3 denote the curve of μ(r β ) vs. β for 0 6 β 6 15 Å. Note that μ(r β ) shows a very smooth, nonincreasing function.
The nonlinear regression analysis using the statistical package R (The R Project, 2012) yields the regression curve of μ(r β ) aŝ l(r b ) ¼ 2:1710 (b þ 1:1843) 1:3796 þ :0519 (1) which fits the μ(r β ) curve with the residual sum of squares .04614. We employ the followinĝ to get a simpler and more intuitive formula approximating Equation (1). Further analysis finds that α = 1 fits the dataset quite well. Figure 3 shows curves of Equation (2) with different values of α. Hence, a good approximation of N β for a protein with n atoms can be quickly obtained by For example, for water molecule that is approximated by a spherical probe with the radius 1.4 Å, N 1.4 ≈ .47n. Similarly, N 2.0 ≈ .38n, N 3.0 ≈ .30n, etc. We also conducted the analysis of variance of the 100 r β 's for each value of β between 0 and 1000 Å. Figure 4 shows the standard deviation σ of the 100 r β 's vs. β where 0 6 β 6 15 Å. The maximum of σ = .081 occurs at β = 1.9 Å and monotonically converges to an asymptotic value of .027. Definition 2 If i 1 < i 2 < … < i r and j 1 < j 2 < … < j r , A andB are called collinear.

Definition 3
Given E (A, B), the transformation of A towards a fixed B such that the distance defined on E(A, B) is minimized is called the superposition.
Hence, an equivalence is the precondition of superposition. The problem to find the optimal alignment of a protein sequence to a known protein structure is called the threading and was shown NP-hard (Lathrop, 1994;Lathrop & Smith, 1996). In structure, even the alignment problem is NP-hard (Eidhammer & Jonassen, & Taylor, 2004, p. 187). However, a surface comparison problem assumes no prior knowledge of what amino acids are equivalent and thus this problem is even harder. Suppose that two proteins A and B are given and E(A, B) is known. Let E(A, B) = {(α 1 , β 1 ), (α 2 , β 2 ), …, (α r , β r )}. Suppose that α i and β i represent the abstractions of substructures of A and B, respectively. For notational simplicity, suppose that α i and β i also denote their center coordinates. Assuming B fixed, we move A via a rigid-body transformation. Structure superposition finds the optimal transformation so that the distance Δ between A and B is minimized, where Á ¼ P k d k ; d ¼ jja k À b k jj; 1 6 k 6 r. The formalization of the superposition problem follows.

Problem 1. Find the transformation T such that
The best result can be obtained when r = m 6 n in Equation (5). However, Equation (4) is NP-hard and is not tractable even for moderate-sized proteins. Suppose that where R and t are a rotation matrix and a translation vector, respectively. Hence, Equation (5) minimizes when Equation (6) minimizes. If t is appropriately chosen and fixed, the problem can be reduced to a problem to find an optimal rotation R. The most common, reasonable practice for t is to use the difference vector between the centers of masses of atom sets in E(A, B). Then, Equation (6) reduces to find the rotation R such that

Minimize
where α' i is α i after the translation by t. Equation (7) solves Equation (5) provided that t is sufficiently good and w i = 1. Therefore, the problem now reduces to find a rotation matrix R in Equation (7). Because the choice of t is heuristic, an iterative process (between the equivalence set selection and superposition) is usually employed to obtain a satisfactory solution of Equation (6). There are different ways to to find R: Diamond determined the best transformation between two sets of coordinates and factorized it into a symmetric and an orthogonal matrices (Diamond, 1976); Kabsch determined the best rotation of one point set into a second point set by minimizing the weighted sum of squared deviations (Kabsch, 1976); Hendrickson determined the linear orthogonal transformation to superpose two point sets (Hendrickson, 1979); Kearsley solved the problem by minimizing the sum of squared distances between corresponding points for two point sets using a constrained least-squares method solved as an eigenvalue problem in quaternion (Kearsley, 1989).

Match: an equivalencing
Suppose A and B are proteins consisting of different number of atoms and positioned in a common coordinate system. In a surface comparison, there can be two ways to find an equivalence E(A,B): the DP approach and the AP formulation approach. Let the d ij be the distance between a i 2 A and b j 2 B. Consider the DP formulation of the problem where H i,j is the score function which can be achieved by aligning A and B. The time complexity of DP approach is O(n 2 ) in the both average and worst cases. However, the equivalence E(A,B) thus found is constrained so that the subsetsÃ andB are collinear. The problem can also be formulated as an ILP problem as follows.

Minimize
Subject to X i x ij ¼ 1; for all j; 1 6 j 6 r 6 n X j x ij ¼ 1; for all i; 1 6 i 6 r 6 m x ij ¼ 1; if a i and b j are aligned; 0; otherwise: E(A,B) found by Equation (9) is globally minimal without collinearity. While solving Equation (9), using an optimization library such as CPLEX takes an exponential time, this problem is the AP that can be solved by Hungarian method in O(n 3 ) time in the worst case.

Experiments and discussion
The proposed algorithm was validated through an extensive experiment, was implemented into a program called the BetaSuperposer. Then, the performance of the Beta-Superposer was compared with two other methods: the Dali (Holm & Park, 2000) and the Click (Nguyen, Tan, & Madhusudhan, 2011). Three experiments were conducted for the following purposes: (i) to validate the superiority of an AP formulation to a DP; (ii) to validate the superiority of the beta-shape abstraction of structures; (iii) the benchmark test. For the first two experiments, we created a test set consisting of 100 arbitrary PDB structures. For the benchmark test, we created a test set consisting of 24 PDB structures from four different superfamilies in the SCOP database so that each of the four classes (i.e. alpha, beta, alpha/beta, and alpha + beta) had a representative superfamily in the test set (Andreeva et al., 2008;Hubbard, Ailey, Brenner, Murzin, & Chothia, 1999). All metals and co-factors were removed from the structures before conducting the experiments. We used the cluster computer of the Voronoi Diagram Research Center consisting of 118 nodes; each node has two CPU's with 2 GB RAM where each CPU has AMD Opteron Dual Core 2.2 GHz.

Why an AP formulation?
To compare the AP and DP approach, we performed an experiment with a test set of arbitrarily chosen 100 structures from PDB. Given the test structures in the order of their sizes, we first chose the 50th in the set (1QXH) as the reference structure. Then, we compared all the structures in the test set against the reference structure. Figure 5 shows the comparison between DP and AP approaches for E(A,B) after 100 iterations (for the experimental purpose) of the mix-and-match. Figure 5(a) and (b) show the computation time and the corresponding RMSD, respectively, where circles denote the AP approach and the triangles denote the DP approach. In both figures, the horizontal axes denote proteins in the order of molecular size. Figure 5(a) shows that AP takes more computation than DP and Figure 5(b) shows that the RMSD of AP is significantly smaller than that of DP. This is because the worst-case time complexity of AP is O(n 3 ), whereas one for DP is O(n 2 ). Note that AP finds the global optimal solution. We note the performance of the AP approach around where the reference structure is located (This means that the surface comparison is done between similar sized proteins). Note that the AP approach takes only slightly more computation while its result are significantly better in RMSD than those of the DP approach.

Why the beta-shape abstraction?
Comparison of different abstraction methods is in order. Abstraction method influences the size of model and so does on solution quality and computation time. We selected three abstraction methods to compare two proteins: all atoms, alpha-carbons, and the beta-shape. Figure 6(a) shows the computation time of the three different abstraction methods using the AP approach taken by 100 iterations of the mix-and-match. The rectangles (topmost) denote using all atoms; the circles (in the middle) denote the beta-shape abstraction where β = 1.4 Å; the triangles (bottommost) denote using alphacarbons. Note that the beta-shape abstraction takes significantly smaller computation time than using all atoms while it still takes slightly more computation than using alpha-carbons. Figure 6(b) shows the solution quality of the three abstraction methods (measured in RMSD that is defined by the distance between the pair of atoms in the equivalence set at the final iteration). The symbols used in Figure 6(a) apply in Figure 6(b), too. The quality of solution using all atoms is obviously the best and the alpha-carbon atoms approach shows the worst quality. Figure 6(c) shows a more clear comparison among solution quality of the three abstraction methods. Let RMSD (β À all) denote the RMSD of the beta-shape method minus the RMSD of the all atoms method and RMSD(α À all) denote the RMSD of the alpha-carbon method minus the RMSD of the all atoms method. In Figure 6 (c), the circle denotes RMSD(β À all) (which has the mean value .508 and the standard deviation .827) and the triangle denotes RMSD(α À all) (which has the mean value 1.233 and the standard deviation 2.315). Note that the circles are much better distributed around the horizontal axis than the triangles. To summarize, the quality of solution using the beta-shape shows the solution quality close to the quality of all atoms despite that this approach requires much less computational requirement. For the details of data, see Figure 1 in the Supplementary material. Figure 7 shows the surface superposition computed by the AP algorithm using the beta-shapes of different β-values. Figure 7(a) shows the computation time for the 100 structures after 100 iterations. The topmost, rectangles denote when β = .0 Å where the number of vertices on the beta-shape boundary is almost always identical to the number of entire atoms in proteins. As the value of β increases, the number of vertices on the beta-shape boundary decreases and so does the computation time. The circles correspond to β = 1.4 Å; the triangles correspond to β = 3.0 Å, etc. Figure 7(b) shows the computation results in RMSD. Note that the RMSD for larger β-value is greater than that of a smaller βvalue. Figure 7(c) shows a more clear comparison among solution quality among different β values. In the figure, there are four curves: the circle denotes RMSD(β 1.4 À β .0 ) (which has the mean value .508 and the standard deviation .827), the triangle denotes RMSD(β 3.0 À β .0 ) (which has the mean value .230 and the standard deviation .438), the cross denotes RMSD(β 5.0 À β .0 ) (which has the mean value .285 and the standard deviation .744), and the diamond denotes RMSD(β 10.0 À β .0 ) (which has the mean value .509 and the standard deviation .843). Hence, β = 3.0 seems a good choice of the probe radius for the surface comparison using the beta-shape. See Figure 2 in the Supplementary material for details.
This experiment verifies the following two points: (i) AP approach is much better suited than the DP approach and (ii) the beta-shape abstraction is very well suited for the surface comparison between proteins when the value of β is appropriately chosen.

Benchmark test
The test set for the benchmark test consisted of 24 PDB structures from four different superfamilies in the SCOP database so that each of the four classes (i.e. alpha, beta, alpha/beta, and alpha + beta) had a representative superfamily in the test set. We chose two different families from each superfamily and three different structures from each family, where each structure was a chain in the corresponding protein. The details of the selected structures (e.g. the name of the superfamilies, the PDB IDs, the resolution, and the numbers of residues and atoms) are given in the Table 1. The structure 1G84 (Test Code: 9) does not have the resolution value, because it was determined from an NMR data. With this dataset, we conducted all pairwise comparisons using the three programs: the BetaSuperposer, the Dali, and the Click. Note that both Dali and Click use alpha-carbons only for the structure comparison whereas the BetaSuperposer uses boundary atoms. The comparison between two structures belonging to one superfamily is called an "intra-superfamily" comparison. Otherwise, it is called an "inter-superfamily" comparison. We note that the BetaSuperposer has two system parameters which are based on the previous two experiments. First, the βvalue, corresponding to the probe radius, controls the problem size in that it is related with the number of boundary atoms. Second, the number of mix-and-match iterations controls the computation time taken for the superposition. In this experiment, we used β = 1.4 Å and 50 mix-and-match iterations for each pair of structures. Figure 8 shows the solution quality (measured in RMSD) after the superposition of all structure pairs.  Table 1. Hence, each distinct structure pair corresponds to a pixel in the top view. The RMSD of each structure pair is depicted by the height in the bird's eye view and by the intensity in the top view. The darker a pixel is in the top view, the smaller the corresponding RMSD value is. Figure 8  The Click (Figure 8(b) and (e)) had an extremely narrow range of the RMSD (i.e. 0-2.510 Å) regardless that a pair consists of structures from same or different superfamily. Hence, the Click could not distinguish whether it was an inter-or intra-superfamily comparison. On the other hand, the Dali (Figure 8(c) and (f)) had a reasonable range of the RMSD (i.e. 0-11.4 Å) for intrasuperfamily comparisons. However, the Dali did not produce any result for inter-superfamily comparisons. The BetaSuperposer (Figure 8(a) and (d)) showed the following properties: (i) it produced a reasonable range of the RMSD (i.e. 0-14.816 Å) similar to the Dali for intrasuperfamily comparisons and ii) it discriminated intersuperfamily comparisons from intra-superfamily ones with a quantitative measure of structural similarity. Figure 9 provides visual examples of the observation above. Figure 9(a)-(c) are 1DLY (Test Code: 5, Superfamily Code: S1, Class: Alpha) in Globin-like superfamily, 1DN0 (Test Code: 7, Superfamily Code: S2, Class: Beta) in Immunoglobulin superfamily, and 1TQJ (Test Code: 15, Superfamily Code: S3, Class: Alpha/Beta) in Ribulose-phosphate binding barrel superfamily. In each figure, the left shows an entire structure whereas the right shows its backbone. All three structures belong to different superfamilies. However, the two structures in Figure 9(a) and (c) have relatively similar shapes whereas those in Figure 9(b) and (c) have significantly different shapes.
From the structure comparison between Figure 9(a) and (c), the BetaSuperposer and the Click produced 2.580 and 2.180 Å, respectively. Note that both structures in the Figure 9(a) and (c) have globular shapes despite that they have some difference in the number of atoms. From the structure comparison between Figure 9(b) and (c), the BetaSuperposer and the Click produced 12.368 and 2.310 Å, respectively. Note that both structures in the Figure 9(b) and (c) have significantly different shapes whereas they have similar number of atoms. The Dali did not produce any result for both comparisons because two structures belong to different superfamilies. Hence, this example verifies the computation result in Figure 8.
Recall that the RMSD values from the Click and Dali were computed from pairs of alpha-carbons, whereas the RMSD from the BetaSuperposer was computed from pairs of atoms. It is desirable to produce the RMSDs in a unified measure. Hence, we recomputed the RMSDs of the Click and Dali by all atom pairs by solving the AP. Figure 10 shows the results of the re-computation and Table 2 reports the detailed statistics.   The observations are as follows: (i) the range of the RMSD values of the BetaSuperposer (i.e. 0-14.816 Å) is narrower than those of the Click (i.e. 0-34.889 Å) and the Dali (i.e. 0-25.465 Å); (ii) within each superfamily, the BetaSuperposer has significantly lower RMSD values (for both the average and the standard deviation) than the other two programs, except the superfamily S3.
Refer to the Table 2. We further compared the performance of the BetaSuperposer with that of the Dali, because it had been most frequently used. For the Beta-Superposer, the averages of each superfamily were 2.558, 5.031, 3.281, and 2.644 Å, respectively, and thus the grand average was 3.379 Å. For the Dali, the averages of each superfamily were 2.835, 10.939, 2.696, and 3.275, respectively, and the grand average was 4.936 Å. Hence, we claim that the BetaSuperposer produced lower RMSDs than the Dali did for the intra-superfamily comparison. We verified this claim through the Wilcoxon signed rank test (Wilcoxon, 1945), which was a nonparametric statistical hypothesis test, for comparing two RMSDs (i.e. one from the BetaSuperposer and the other from the Dali) for each structure pair. This test assesses whether the population mean ranks of the BetaSuperposer and those of the Dali differ.
Each superfamily consisted of six structures and thus there were 15 structure pairs to be compared by each software for the intra-superfamily comparison. Therefore, there were 60 structure pairs in total to be compared by each software. We used the statistical package SPSS (IBM, 2012b) with the 60 RMSDs for each of the Beta-Superposer and the Dali for the test and obtained the pvalue of .001. This p-value implied that the distributions of RMSDs for the two softwares were significantly different from the statistical point of view. Hence, we claim that the BetaSuperposer is superior to the Dali.
For each of the four types of the intra-superfamily comparisons, we show two comparison examples with graphical visualization in the Figures 3-10 of the Supplementary material. Figure 3 of the Supplementary material shows the superposition of 1FSL (Test Code: 3) and 1DLW (Test Code: 4) in the superfamily S1. Figure 3 (a) and (b) are for the BetaSuperposer and the Dali, respectively. In each figure, the left, the middle, and the right columns denote three views from X-, Y-, and Zdirections, respectively; the upper row shows the Connolly surface of the two superposed structures (with respect to water probe); the lower row shows the ribbon diagram of the two superposed structures. All the other figures are given with the same convention. The Figures In addition to its superiority for comparing the boundaries, we make the following observations regarding on the secondary structure: • In general, the BetaSuperposer results in a better fit for the boundary, whereas the Dali results in a better fit for the secondary structure. This is shown in the superfamily S1 (Figures 3 and 4). • The BetaSuperposer results in a better fit for the secondary structure than the Dali when the structures are far from globular. This is shown in the superfamily S2 (Figures 5 and 6). • Suppose that the structures are globular and have one or more flapping loop on the structure boundary. If the size of the structures is sufficiently big, the Dali results in a better fit for the secondary structure than the BetaSuperposer does. This is because of the following reason: the influence of the flapping loop is significant for the BetaSuperposer because it considers the boundary only. On the other hand, this influence for the Dali is less significant if the structure size is sufficiently big. This is shown in the superfamily S3 (Figures 7  and 8). • Suppose that the structures are globular and have one or more flapping loop. If the structure size is sufficiently small, the BetaSuperposer results in a better fit for the secondary structure than the Dali does. This is because the influence of the loop has relatively is larger for the Dali when the structure size becomes smaller.

Conclusions
A molecular structure is important for determining its function. Because protein functions occur predominantly on or near protein surfaces, the geometric appearance or shape is important for many applications. Hence, the comparison of protein surfaces is important for both understanding molecular evolution and function and for drug design (Chang et al., 2011;Huang et al., 2010). In this paper, we proposed an effective and efficient algorithm, called the BetaSuperposer, to measure the similarity between two protein surfaces based on the beta-shape which is a derivative structure from the Voronoi diagram of atoms in molecules. The idea of the Beta-Superposer was the mix-and-match among atoms in two protein structures where each mix-and-match step attempted to solve an NP-hard problem. Therefore, we used the beta-shape as a good abstraction of protein boundary thus significantly reducing problem size. This is because the number of vertices on the beta-shape boundary is significantly smaller than other abstraction such as all atoms or alpha-carbons. We have also proposed the AP formulation to find an equivalence, because the collinearity is not a constraint in surface comparison of proteins.
The BetaSuperposer was fully implemented and benchmarked against popular programs, the Dali and the Click. The core algorithms were extensively tested with a set of 100 PDB structures. The AP formulation was superior to the ordinary DP approach for the match step. The beta-shape was superior to the other abstraction methods such as alpha-carbons or backbone atoms. The benchmark test revealed that the BetaSuperposer outperformed the Dali and the Click from the experiment using a set of 24 structures (six from each of the four superfamilies in the SCOP database). The BetaSuperposer is freely available to the public from the Voronoi Diagram Research Center (http://voronoi.hanyang.ac.kr).