Protein similarity from knot theory and geometric convolution

Shape similarity is one of the most elusive and intriguing questions of nature and mathematics. Proteins provide a rich domain in which to test theories of shape similarity. Proteins can match at different scales and in different arrangements. Sometimes the detection of common local structure is sufficient to infer global alignment of two proteins; at other times it provides false information. Proteins with very low sequence identity may share large substructures, or perhaps just a central core. There are even examples of proteins with nearly identical primary sequences in which alpha-helices have become beta-sheets. Shape similarity can be formulated (i) in terms of global metrics, such as RMSD or Hausdorff distance, (ii) in terms of subgraph isomorphisms, such as the detection of shared substructures with similar relative locations, or (iii) purely topologically, in terms of structure preserving transformations. Existing protein structure detection programs are built on the first two types of similarity. The third forms the foundations of knot theory. The thesis of this paper is this: Protein similarity detection leads naturally to algorithms operating at the metric, relational, and isotopic scales. The paper introduces a definition of similarity based on atomic motions that preserve local backbone topology without incurring significant distance errors. Such motions are motivated by the physical requirements for rearranging subsequences of a protein. Similarity detection then seeks rigid body motions able to overlay pairs of substructures, each related by a substructure-preserving motion, without necessarily requiring global structure preservation. This definition is general enough to span a wide range of questions: One can ask for full rearrangement of one protein into another while preserving global topology, as in drug design; or one can ask for rearrangements of sets of smaller substructures, preserving local but not global topology, as in protein evolution. 
In the appendix, we exhibit an algorithm for answering the general rearrangement question. That algorithm has the complexity of robot motion planning. In the text, we consider a more common case in which one seeks protein similarity by rearrangements of relatively short peptide segments. We exhibit two algorithms, one based on writhing numbers and one based on line weavings. The algorithms have time complexities O(n^4) and O(s^11), respectively, where n is the maximum number of residues in the proteins being compared and s is the number of secondary structure elements. In practice, the running times were nearly interactive. We report results obtained with a dozen pairs of proteins, exhibiting a range of typical features.


Paper Outline
Section 2 reviews related work on structural alignment and discusses the role of metrics. Section 3 provides an intuitive introduction to topological similarity, structural isotopies, and line weavings. Section 4 reviews the basics of knot theory. Section 5 is the technical heart of the paper. That section defines isotopies, similarity, and a precise version of the structure problem, then proves computability of that problem in the Appendix. Section 6 provides the connection between the general structure problem and our approximation based on writhing numbers, describes the writhing-based algorithm, and reports results. Section 7 describes our approach based on line weavings, and reports results. Finally, Section 8 discusses future directions and Section 9 summarizes.
2 Related Work

Structural Alignment
There are three major structural alignment tools in use today: DALI, VAST, and CE. All three are accessible off the PDB webpage. Since the appearance of these methods in the late 1990s, a host of other methods have appeared, which generally compare themselves to these three. One we have found useful is 3DSEARCH. We review these four methods here briefly.
DALI [25,27,26] aligns protein substructures using distance matrices. Distances are invariant to rigid body transformations, thereby avoiding the need for spatial alignment. DALI considers distances between alpha carbons; the distance matrices are indexed in residue order. Substructures that appear in similar relative spatial locations in the two proteins give rise to similar patterns between blocks of the distance matrices. DALI uses a clever Monte Carlo method to detect these patterns. It begins with small hexapeptides, then repeatedly merges similarly related protein fragments into larger common substructures. One important aspect of DALI is an elastic similarity score; the significance of errors in distance alignments decreases with increasing distance. Consequently, substructures separated by larger distances can tolerate greater relative global motion, while residues nearer to each other must better preserve local shape. DALI is probably the gold standard for protein structure comparisons. Its main disadvantage is its relatively ad-hoc Monte Carlo structure and complexity.
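The invariance of distance matrices under rigid body motion can be checked directly. The following is a minimal sketch, not DALI's implementation: the coordinates are made up for illustration, and the helper names (`distance_matrix`, `rigid_motion`, `max_abs_difference`) are our own assumptions.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances between 3D points, indexed in residue order."""
    return [[math.dist(p, q) for q in points] for p in points]

def rigid_motion(points, angle, shift):
    """Rotate points about the z-axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in points]

def max_abs_difference(a, b):
    """Largest entrywise difference between two equally sized matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

# Hypothetical alpha-carbon coordinates (angstroms) for a five-residue fragment.
fragment = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.5, 3.4, 0.0),
            (8.9, 4.2, 1.9), (10.1, 7.7, 2.8)]
moved = rigid_motion(fragment, angle=1.1, shift=(12.0, -4.0, 7.5))

# The coordinates change, but the distance matrix does not.
deviation = max_abs_difference(distance_matrix(fragment), distance_matrix(moved))
print(deviation)  # floating-point noise only
```

Because the two matrices agree up to rounding, a comparison based on them needs no prior superposition of the two structures.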
CE [52,53] searches for protein fragments in one protein that are locally similar to protein fragments in another protein. It then extends these local alignments by a sequential scan down the protein backbones. This scan is reminiscent of dynamic programming in sequence alignment, but CE actually employs a clever greedy algorithm. CE uses distances between alpha carbons and rigid body superposition to define similarity and to guide the extension scan. A limitation of CE is its requirement that matching substructures occur in sequential backbone order.
VAST [21,22] and 3DSEARCH [54,55] focus on elements of secondary structure to align proteins. Both methods begin with building blocks that are pairs of secondary structure elements, one pair in each protein. VAST matches pairs of secondary structural elements that have a similar type, relative orientation, and connectivity, then builds larger structures by considering substructure similarities that are statistically surprising. This probabilistic similarity function is both a strong advantage and a potential limitation of VAST; a class of "similar" structures is significant, but not necessarily easily circumscribed.
3DSEARCH first finds pairs of secondary structure vectors in one protein that match well with pairs of vectors in the other protein. These initial alignments repeatedly seed a dynamic programming algorithm for aligning all secondary structure vectors in one protein with those in the other protein. Atom-level alignment occurs subsequently. 3DSEARCH is potentially limited by its set of initial vector alignments.

Metrics
One of the difficulties with many alignment methods is the vagueness of their global similarity measures. Locally, these methods often measure similarity by the root-mean-square deviation (RMSD) between aligned atoms, or some related variation. RMSD of aligned atom coordinates is a wonderful measure of similarity for two shapes that are nearly identical. However, RMSD is a poor measure when the two shapes being compared differ significantly, particularly when the two shapes contain some matching and some nonmatching subshapes. Existing alignment methods address this issue by seeding their routines with small matching subshapes, then repeatedly merging these into larger shapes. This process often succeeds well, but it is purely procedural. As a result, automatic classification of proteins remains brittle.
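A small numerical illustration of this weakness follows. It is a sketch under stated assumptions: the coordinates are invented, the atoms are taken to be already paired and superimposed (no optimal rotation is computed), and the function name `rmsd` is ours.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equally long lists of aligned 3D points."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))

# Hypothetical data: ten aligned atoms on a line, 3.8 angstroms apart.
reference = [(3.8 * i, 0.0, 0.0) for i in range(10)]

# Case 1: every atom is displaced by 0.5 angstroms (nearly identical shapes).
jittered = [(x, y + 0.5, z) for x, y, z in reference]

# Case 2: eight atoms match exactly, the last two are displaced by 10 angstroms.
partial = [p if i < 8 else (p[0], p[1] + 10.0, p[2])
           for i, p in enumerate(reference)]

rmsd_jittered = rmsd(reference, jittered)  # 0.5
rmsd_partial = rmsd(reference, partial)    # sqrt(2 * 100 / 10) = sqrt(20)
print(rmsd_jittered, rmsd_partial)
```

In the second case 80% of the atoms coincide exactly, yet the RMSD is roughly nine times larger than in the first case. The single number hides the fact that a large matching subshape exists.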
One possible alternative is to compare proteins using more general shape metrics, such as Hausdorff metrics [28]. More appropriate for proteins may be invariants derived from knot theory. Røgen and Fain [48] suggest a metric based on curve invariants. Given a protein, they compute 30 different curve invariants, thereby mapping the protein to a point in $\mathbb{R}^{30}$. They argue that this 30-dimensional measure satisfies the triangle inequality, and thus is a good method for grouping protein shapes into similarity classes at multiple levels of granularity. They demonstrate this claim empirically by classifying 20,937 protein domains into multiple levels, achieving 96% agreement with the CATH 2.4 classification [40,39] (both SCOP and CATH are widely accepted protein classification databases, created by a combination of automatic and human judgments). The primary invariant in [48] is the writhing number of a curve; the others are built from this. Section 4.3 examines writhing numbers in detail.

Figure 1: Two TIM-barrels. On the left is Xylose Isomerase (PDB code: 1xis), minus its tail. On the right is Narbonin (PDB code: 1nar). Both proteins are displayed in RASMOL's ribbon format [50].

Figure 2: Two alignments of 1xis (blue) with 1nar (red). On the left (panel "Optimal Alignment") is the optimal DALI alignment, on the right (panel "Alternate Alignment") an alternate alignment formed by rotating 1xis approximately 1/3 turn about the TIM-barrel. Both proteins are displayed in RASMOL's backbone format.
3 Goals and Intuition

Topology
The long-range goal of our research is to develop compact representations of protein geometry useful for structure comparison. The most fundamental representations of geometry are topological in nature, providing information about incidence, relative location, and allowable motions. For proteins, the elements of knot theory are likely to be useful, not because proteins are or are not knots, but because the geometry and motions of protein backbones may be modeled using techniques from knot theory.
Topology offers high-level descriptions of shape and motion. While precise folding paths of proteins depend intimately on the details of steric constraints, electrostatic potentials, and biochemical entropies, fundamental fold descriptions should not. One should be able to recognize the similarity in folds between two proteins based purely on topological considerations. By way of intuition, the weaving of the threads in my shirt is a characteristic of that shirt, independent of whether I am hunched over my terminal, standing straight, tugging on my shirt, or allowing it to hang loosely. The threads will move and turn, but their relative topological relationships will remain unchanged, so long as I do not tear the shirt. Similarly, two proteins may have very different three-dimensional coordinates yet be instances of the same fold. Not tearing the threads in my shirt is analogous to the assumption that a protein will not break the covalent bonds in its backbone as it moves.
The key idea is that two proteins or subsegments of proteins are similar if there is a motion that transforms one into the other while avoiding backbone self-collisions. The role of knot theory is to offer simple descriptors (called invariants) by which one can assess the similarity of two proteins rapidly.

Invariants
Discovering useful invariants is at the core of modern knot theory. It is easy to find invariants that do not change as a curve deforms smoothly in space. It is much more difficult to find invariants that are sensitive enough to act as characteristics, meaning: (i) the invariant of a curve does not change with smooth deformations of that curve and (ii) the invariant can discriminate between two curves that are topologically dissimilar. (Two closed curves are topologically dissimilar if the curves cannot be deformed into each other except by tearing/cutting.) Research in modern knot theory entails discovering ever more sensitive knot invariants; finding a true characteristic is an open research question. We point to the nice introduction by Louis Kauffman [29]. In the context of proteins, we also point to the work of Taylor [57] on defining fundamental arrangements of protein shapes and the work by Willett [23] on tertiary structure graphs.

A Range of Problems
This paper constitutes our first step in developing topological shape descriptors for proteins. There are several lines of attack, with different levels of topological emphasis. Section 5 defines protein structural similarity in terms of collision-free motions. The key result of that section is the construction of a metric in protein space and a proof of its computability (in Appendix A). Section 6 then offers a more practical approach, using writhing numbers as the basis of a structure comparison algorithm. Finally, Section 7 returns to the topological foundation, developing an approach for structure comparison based on line weavings of secondary structure elements.
It is instructive to realize that the existence of collision-free motions is a purely topological concept, while the definition of a metric is dependent on the precise coordinates of the proteins' atoms. Similarly, the set of crossing numbers associated with a line weaving is a topological concept, while the set of writhing numbers associated with a polygonal curve is dependent on embedding coordinates. We thus have the following list of problems:
• KNOT EQUIVALENCE: Decide whether two closed curves are topologically equivalent, that is, whether one curve can be transformed into the other using a smooth collision-free motion.
• POLYGONAL CURVE SIMILARITY: Determine the smooth collision-free motion with least excursion that transforms one polygonal curve with n vertices into another polygonal curve with n vertices, preserving the existence and number of vertices during the motion. By the "excursion" of a curve we mean the maximum distance any vertex moves from its start or final position; Section 5 will define this notion precisely in terms of "(R, δ)-isotopies".
• WEAVING EQUIVALENCE: Decide whether two arrangements of infinite lines are isotopic to each other, that is, whether one arrangement of lines can be transformed into the other without causing any of the moving lines to intersect or become parallel.
• EMBEDDING SIMILARITY: Decide whether two polygonal curves with equal number of vertices are everywhere locally similar. By local similarity we mean that two edges in one curve have nearly the same relative separation and orientation as their corresponding edges in the other curve. A special case of this problem is the limit in which "nearly the same" means "exactly the same". That special case asks whether two curves are completely the same shape, merely transformed by a rigid body motion.
An approach for deciding KNOT EQUIVALENCE exists, though with unknown complexity and uncertainty about its computability in the μ-recursive sense [24]. A variant of this problem is UNKNOT, the problem to decide whether a closed curve is topologically equivalent to the unknotted loop. That problem is known to lie in NP and co-NP, but it is not known whether the problem is polynomial-time decidable [24,3,2]. If a closed curve is known to be the unknot then it can be flattened quickly [12,11]. Observe that KNOT EQUIVALENCE is a purely topological question.
POLYGONAL CURVE SIMILARITY is both a simplification and an elaboration of KNOT EQUIVALENCE. Simplifying, the curves are now piecewise linear with an equal number of vertices and the transformation preserves the existence and number of vertices throughout the motion. Elaborating, the curves need not be closed and the problem asks for a motion that minimizes the greatest excursion any vertex needs to make in order to establish the similarity.
The main theoretical result of the current paper is that this problem is effectively computable. As an aside, the proof shows that a simplified version of KNOT EQUIVALENCE, in which the curves are polygonal and the number of vertices remains constant during motion, lies in PSPACE. We also point to [45,14,5] for related PSPACE-hardness and -completeness results. Observe that POLYGONAL CURVE SIMILARITY has a strong topological component, but the precise distance value computed is dependent on embedding coordinates.
WEAVING EQUIVALENCE is an open problem. It is not even known how many different isotopy classes exist for a given number of lines, when the number of lines is large. For small numbers of lines the problem is well understood, and the isotopy classes are characterized by simple invariants. Our weaving-based algorithm uses such small sets of lines as seeds to match up the secondary structure elements of two proteins. Observe that WEAVING EQUIVALENCE is a purely topological question.
EMBEDDING SIMILARITY is the general curve recognition problem. In this paper we address the problem using writhing numbers. Observe that a solution to this problem depends on the embedding coordinates of the two curves.
We view our algorithms for EMBEDDING SIMILARITY and WEAVING EQUIVALENCE as approximations to the general POLYGONAL CURVE SIMILARITY problem. These algorithms therefore provide a basis for detecting common protein substructures. The main practical contribution of the current paper is the use of writhings and weavings to generate protein structure alignments.

Structural Alignment Isotopies
We will illustrate our topological goals using the two proteins shown in Figure 1. On the left is the core of Xylose Isomerase (PDB code: 1xis), an enzyme that catalyzes the conversion of glucose into fructose. On the right is Narbonin (PDB code: 1nar), a plant seed protein with no known enzymatic function. Both proteins are TIM-barrels, and thus are structurally alignable in a variety of ways. Approximately 70% of the residues are structurally similar, even though the two proteins have only 7% sequence identity.
Figure 2 shows two alignments. On the left is the optimal DALI alignment. On the right is an alternate alignment, in which Xylose Isomerase has been rotated approximately 120 degrees about the TIM-barrel. Alignments similar to these would likely appear in the top ten list produced by any comprehensive structural alignment program.
Structural alignment occurs at multiple scales, ranging from global superpositions to local residue alignments, possibly with a variety of scales in between, such as secondary structure superposition. Figure 2 displays its two alignments as rigid body superpositions. Such superpositions tell part of the story. From a biochemical perspective, structural alignment programs must also produce pairings at the residue level.
Geometrically, one may think of structural alignment as a sequence of motions that establishes similarity by transforming one protein shape into another. For residue alignments one must therefore exhibit motions that transform segments of one protein's backbone into corresponding segments of the other protein's backbone. In order to avoid geometrically and biochemically silly alignments we require these motions to avoid self-collisions. Such motions are called isotopies. Focusing on isotopies rather than arbitrary motions and alignments provides a basis for believing that the shapes are inherently similar, as opposed to coincidentally similar, from a topological perspective.

Figure 3: [...] The isotopy for the right pair of helices in each frame is essentially a rotation about an axis perpendicular to the helices, followed by a loop rearrangement near the top of the blue helix.

Figure 3 shows isotopies between two helices of Xylose Isomerase and the corresponding two helices of Narbonin. There is one isotopy for each pairing of helices; each isotopy morphs a backbone segment of Xylose Isomerase into a backbone segment of Narbonin. The start of the two isotopies is given by a high-level rigid body superposition of the two proteins, in this case the optimal DALI alignment of Figure 2. The amount of motion required by each local isotopy provides a rough measure of how similar the two pairs of helices are to each other, as requested by the POLYGONAL CURVE SIMILARITY problem.
It is instructive to observe that the two local isotopies are very different from each other. To first approximation, one is a rotation about one of the helix axes, while the other is a rotation about a perpendicular axis. Neither isotopy determines the other. Instead, both isotopies are motions that occur subsequent to a global rigid body motion that roughly superimposes one pair of helices on the other pair. Determining such a global rigid body superposition is analogous to selecting a convenient origin in motion space, around which one can then compute finer-grained local motions to establish shape similarity between subsegments of the two proteins. In practice, the two scales influence each other. A rigid body superposition may suggest local isotopies. Conversely, a collection of local isotopies may suggest a global rigid body superposition. Section 5 will make this notion precise, by defining isotopies and shape similarity relative to rigid body motions.

Line Isotopies
In this subsection we illustrate the basic principles of our long-range goal, to develop topological characterizations of protein shape similarity. The exposition focuses on α-helices but applies as well to β-strands.

Figure 4: The alignments of Figure 2, now showing only the helix axes as line segments. 1xis is in blue, 1nar in red. Some of the line segments are labeled with identifiers ("xi" for helices in 1xis and "ni" for helices in 1nar). These generate the line weavings of Figures 5 and 6.

Helix Line Weavings
Figure 4 displays line segments that model the helix axes of the two alignments shown earlier in Figure 2. Ideally, one would like to describe this arrangement of lines in a compact fashion that reveals commonalities and differences. One possibility is to look at small subsets of lines and decide whether they are topologically equivalent to each other as (oriented) line arrangements, meaning that there is an isotopy that transforms one arrangement into the other. By an isotopy of a line arrangement one means motions of the lines in which the lines remain skew.
For instance, we have labeled four pairs of the helix axes in the left panel of Figure 4. Imagine drawing infinite lines through these axes; Figure 5 shows the results in two panels, with blue lines for Xylose Isomerase and red lines for Narbonin. Each panel describes a line weaving. It is clear from the figure that the two weavings are topologically equivalent, meaning that we can move the lines of one color without collisions in such a way that they are completely identical to the lines of the other color. We have labeled the lines with their backbone orientations and the crossings with six crossing numbers, which happen to all be "+" in this case. Crossing numbers will be explained in more detail in Section 4.1. For now, what matters is that these six labels are identical for the blue and red weavings, indicating that the two arrangements of four lines are topologically equivalent.
In contrast, consider the three labeled pairs of helix axes in the right panel of Figure 4. The two associated line weavings are shown in Figure 6 (from a rotated perspective for better viewing). It is clear that the weavings are different. In fact, the alternate alignment of Xylose Isomerase with Narbonin is generally quite good, but there is one topologically troublesome helix-pairing as the two line weavings of Figure 4 indicate.

Arrangements of Lines
We have just described the rudiments of the theory of line arrangements, an area closely related to knot theory. For a more comprehensive introduction see [60,61]. Research in this area seeks to classify the topological equivalence classes of line arrangements under isotopy. There is exactly 1 topological equivalence class consisting of 2 skew (unoriented) lines, 2 classes of 3-lines, 3 classes of 4-lines, 7 classes of 5-lines, 19 classes of 6-lines, and 74 classes of 7-lines. The classification of general collections of skew lines is an open research question. One approach is to transform line arrangements into elements of braid groups, construct the links induced by the braids, and apply methods from knot theory [41,42].
The potential application to protein structure comparison arises in three contexts. First, structural alignment programs often represent proteins by their secondary structure vectors [21,54,58]. Classifying such vector arrangements might provide simple invariants by which to label protein folds, as suggested by our previous examples. Second, the peptide plane bond vectors (such as N-CA, N-H, and N-C(O)) fully determine a protein's shape. Again, a classification of the possible arrangements of these vectors might provide simple means for recognizing the shapes of unknown proteins. For instance, the orientations of these vectors relative to a global axis can be discerned using NMR [59,4,31,19]. This may provide an efficient method for distinguishing proteins experimentally. Third, the techniques from line classifications may carry over to more general structures. The key idea is to consider the space of transformations that preserve certain topological properties, such as non-intersection, then to discover invariants that distinguish the induced equivalence classes.

Crossing and Linking Numbers
One of the fundamental invariants of knot theory is the crossing number. Imagine viewing two oriented curve segments in space. For some viewing directions these curve segments will seem to cross each other. One segment will be closer to the viewer than the other. Thus if one projects the curves into a plane perpendicular to the viewing direction, one curve will seem to cross over the other. This relationship defines a crossing number, written $\varepsilon$, with value -1 or +1. Specifically, imagine rotating the top curve so that its forward tangent at the crossing is parallel to the forward tangent of the bottom curve. Then $\varepsilon$ is given by the sign of the smallest angle required. See Figure 7.
Observe that for two oriented, skew, infinite, straight lines in three-dimensional space the crossing number does not depend on the viewing direction. It is a purely topological property of the line directions and their relative locations in space. We saw these crossing numbers earlier, in the form of "+" and "-" labels in Figures 5 and 6. For two distinct closed oriented curves $c_1$ and $c_2$ in 3D space one can define the linking number $\mathrm{Lk}(c_1, c_2)$ of the two curves as the sum of the crossing numbers divided by two. It turns out that this number does not depend on the viewing direction. Moreover, it is a topological invariant. This means that $\mathrm{Lk}(c_1, c_2)$ is invariant to any smooth collision-free deformations of the two curves.
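For two oriented skew lines, the crossing number can be computed from a scalar triple product, without choosing a viewing direction. The sketch below is illustrative only: the function name `crossing_number`, the point-plus-direction representation of a line, and the sign convention (a right-handed crossing counts as +1) are assumptions of this example rather than definitions taken from the text.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def crossing_number(point_a, dir_a, point_b, dir_b, tol=1e-12):
    """Crossing number (+1 or -1) of two oriented skew lines.

    Each line is given by a point on it and a direction vector.
    Assumed convention: right-handed crossings are +1.
    """
    triple = dot(cross(dir_a, dir_b), sub(point_a, point_b))
    if abs(triple) <= tol:
        raise ValueError("lines are parallel or intersecting, not skew")
    return 1 if triple > 0 else -1

# Hypothetical example: line a passes one unit above line b.
a = ((0.0, 0.0, 1.0), (1.0, 1.0, 0.0))
b = ((0.0, 0.0, 0.0), (-1.0, 1.0, 0.0))

eps = crossing_number(*a, *b)
eps_swapped = crossing_number(*b, *a)  # order of the lines does not matter
eps_reversed = crossing_number(a[0], (-1.0, -1.0, 0.0), *b)  # flip one orientation
eps_slid = crossing_number((5.0, 5.0, 1.0), a[1], *b)  # another point on line a
print(eps, eps_swapped, eps_reversed, eps_slid)
```

The value is symmetric in the two lines, changes sign when one orientation is reversed, and does not depend on which point is used to represent a line.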
For simple curves the linking number provides a rough measure of how linked the curves are. See Figure 8 for examples. And in our previous discussion involving crossings of helix lines, we were essentially treating infinite lines as "half"-curves closed at infinity.
We caution that while crossing and linking numbers are topological invariants, they are not discriminating enough to be characteristics. For instance, the Whitehead link has linking number zero yet consists of two inseparable loops [29]. In the case of lines, there is an arrangement of six (unoriented) lines which is not isotopic to its mirror image, yet both the given arrangement and its mirror image have matrices of crossing numbers that lie in the same switching class [7]. This shows that crossing numbers are insufficient for classifying arbitrary arrangements of unoriented lines. According to folklore there is a similar example for oriented lines but we do not have a reference.
Also interesting in the case of oriented lines is an example in which a single line with specified crossing numbers relative to a set of fixed lines generates multiple isotopy classes [35]. Moreover, Chazelle et al. [16] conjecture that there are examples in which the orientation class of a single line may have O(n^2) isotopy classes, where n is the number of fixed lines.
Fortunately, for small collections of oriented lines (e.g., 5 or fewer), crossing numbers fully characterize the isotopy classes. Consequently, if we see two weavings generated by a small set of oriented lines with permutation-equivalent crossing matrices, then we know there exists an isotopy that transforms one weaving into the other. Thus such weavings are good anchors by which to ground a search for global rigid body alignments. We will return to this topic in Section 7.
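The comparison of small weavings by their crossing matrices can be sketched as follows. This is a brute-force illustration, not the algorithm of Section 7: the helper names, the sign convention (right-handed crossings are +1), the example lines, and the exhaustive search over all relabelings are assumptions made for this example. The search is only sensible for the small sets of lines discussed here.

```python
from itertools import permutations

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def crossing_matrix(lines):
    """Symmetric matrix of crossing numbers for oriented lines (point, direction)."""
    n = len(lines)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            (pa, da), (pb, db) = lines[i], lines[j]
            triple = dot(cross(da, db), sub(pa, pb))
            if triple == 0:
                raise ValueError("lines %d and %d are not skew" % (i, j))
            m[i][j] = m[j][i] = 1 if triple > 0 else -1
    return m

def permutation_equivalent(m1, m2):
    """True if some relabeling of the lines turns m1 into m2."""
    n = len(m1)
    if len(m2) != n:
        return False
    return any(all(m1[i][j] == m2[perm[i]][perm[j]]
                   for i in range(n) for j in range(n))
               for perm in permutations(range(n)))

def move(line):
    """Quarter turn about the z-axis followed by a translation (a rigid motion)."""
    (x, y, z), (dx, dy, dz) = line
    return ((-y + 3, x - 2, z + 5), (-dy, dx, dz))

def mirror(line):
    """Reflection in the plane z = 0 (not a rigid motion)."""
    (x, y, z), (dx, dy, dz) = line
    return ((x, y, -z), (dx, dy, -dz))

# Hypothetical weaving of three mutually skew oriented lines.
weaving = [((0, 0, 0), (1, 0, 0)), ((0, 0, 1), (0, 1, 0)), ((0, 0, 2), (1, 1, 0))]
moved = [move(weaving[2]), move(weaving[0]), move(weaving[1])]  # relabeled copy
mirrored = [mirror(line) for line in weaving]

same = permutation_equivalent(crossing_matrix(weaving), crossing_matrix(moved))
different = permutation_equivalent(crossing_matrix(weaving), crossing_matrix(mirrored))
print(same, different)
```

The rigidly moved and relabeled copy is recognized as equivalent, while the mirror image, whose crossing numbers all change sign, is not.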

Gauss Integrals
It turns out that the linking number of two curves can be computed as a continuous integral. Formally, suppose $c_1$ and $c_2$ are two closed non-intersecting curves in 3D space, specifically disjoint embeddings of $S^1$ into $\mathbb{R}^3$. Let $G$ be the Gauss map applied to the difference between the curves, that is, the function $G : S^1 \times S^1 \to S^2$ given by $G(s,t) = (c_2(t) - c_1(s)) / \|c_2(t) - c_1(s)\|$. Then the linking number of the two curves can be written in terms of the Gauss integral:

$$\mathrm{Lk}(c_1, c_2) = \frac{1}{4\pi} \int_{S^1 \times S^1} G^{*}\omega. \qquad (1)$$

Here $\omega$ is the differential 2-form measuring area on $S^2$ and $G^{*}\omega$ is its pullback by $G$ to $S^1 \times S^1$. Amazingly, for two distinct closed curves, this integral is always an integer. To gain some intuition, consider two closed curves in space (see also Figure 9 and imagine that each of the edges is tangent to a curve). Place a finger on each curve and consider the unit direction vector pointing from one fingertip to the other. This is a point on the unit sphere. Sum up the signed area covered on the sphere for all possible finger placements on the curves, with sign given locally by the crossing number $\varepsilon$ of the two curve tangents. This is the value computed by the integral.
With some effort one sees that the net area covered is the linking number of the two curves as previously defined. In particular, if the two curves are not linked, as in the leftmost frame of Figure 8, then the net area covered on the sphere will be zero. If the two curves are linked once as in the middle frame, then the sphere will be fully covered once, and so forth. Intuitively, for proteins, the extent to which the sphere is covered locally will provide us with a measure of the relative location and orientation of pairs of peptide segments.
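The claim that the Gauss integral yields an integer can be checked numerically. The following sketch approximates the double integral by a midpoint rule for two closed polygons. The curves, the function names, and the quadrature resolution are assumptions of this example. The integrand is written in the classical form $(r_1 - r_2) \cdot (dr_1 \times dr_2) / \|r_1 - r_2\|^3$, which we take to agree with the pullback $G^{*}\omega$ up to the orientation convention.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def midpoint_samples(vertices, per_edge):
    """Split each edge of a closed polygon into `per_edge` small steps.

    Returns one (midpoint, step_vector) pair per step.
    """
    samples = []
    n = len(vertices)
    for k in range(n):
        a, b = vertices[k], vertices[(k + 1) % n]
        step = tuple((b[i] - a[i]) / per_edge for i in range(3))
        for m in range(per_edge):
            mid = tuple(a[i] + (m + 0.5) * step[i] for i in range(3))
            samples.append((mid, step))
    return samples

def gauss_linking_estimate(curve1, curve2, per_edge=40):
    """Midpoint-rule approximation of the Gauss linking integral."""
    total = 0.0
    samples2 = midpoint_samples(curve2, per_edge)
    for r1, d1 in midpoint_samples(curve1, per_edge):
        for r2, d2 in samples2:
            diff = (r1[0] - r2[0], r1[1] - r2[1], r1[2] - r2[2])
            total += dot(diff, cross(d1, d2)) / dot(diff, diff) ** 1.5
    return total / (4.0 * math.pi)

# Hypothetical curves: a square in the xy-plane and a second square threaded through it.
square = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (1.0, 1.0, 0.0), (-1.0, 1.0, 0.0)]
threaded = [(0.0, 0.0, -1.0), (2.0, 0.0, -1.0), (2.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
far_away = [(x + 5.0, y, z) for x, y, z in threaded]

lk_linked = gauss_linking_estimate(square, threaded)
lk_unlinked = gauss_linking_estimate(square, far_away)
print(lk_linked, lk_unlinked)  # magnitude near 1 and near 0, respectively
```

The sign of the linked value depends on the chosen orientations of the two squares; its magnitude is close to one, while the separated pair gives a value close to zero.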

Writhing and Linking
The writhing number of a curve measures the curve's self-linking. Previously we defined the linking number for two curves. Linking and writhing are related by the following famous Călugăreanu-Fuller-White formula [18,20,62] defined for closed orientable ribbons in three-dimensional space:

$$\mathrm{Lk} = \mathrm{Tw} + \mathrm{Wr}.$$

Here Lk is the linking number of the two boundary curves of the ribbon, Wr is the writhing number of the central spine, and Tw is the twist of the two boundary curves. While Lk is a purely topological number, the other two numbers are not; they depend on the embedding of the ribbon. However, they are invariant to a large class of transformations, such as rigid body motions, even conformal (angle-preserving) mappings. We note in passing that the writhing number and the twist are almost never integers.
It turns out that the writhing number of a curve has the same algebraic form as the linking number. If $c : S^1 \to \mathbb{R}^3$ is a closed curve in space, then its writhing number is simply $\mathrm{Wr}(c) = \mathrm{Lk}(c, c)$. Of course, in this case the function $G$ is not well-defined on the diagonal (when $t = s$). A priori the integral $\mathrm{Lk}(c, c)$ need not exist. Dealing with this issue leads to the twist Tw [36]; it is a torsion-dependent term measuring how much one boundary curve intertwines with the other. We will not have any need for it, and will not discuss it further. Instead, our focus will be on matching subsegments of proteins by comparing writhings.

Protein Fragments:
The definitions continue to make sense for open curves, that is, 3D embeddings of intervals rather than circles. In particular, we will find the component writhing numbers, $\mathrm{Lk}(c_1, c_2)$, of short protein backbone fragments, $c_1$ and $c_2$, to be useful shape indicators.

Writhing of Polygonal Curves
We will represent protein backbones as open polygonal curves¹, connecting sequential residues via their alpha carbons.² For a very nice exposition on writhing numbers of polygonal curves see [1]. That paper developed a clever O(n^1.6) algorithm and a sweepline algorithm for computing the writhing of a polygonal curve, then applied the second algorithm to various proteins. Considerable work has used knot theory to understand the supercoiling and knotting behaviors observed in DNA, another polygonal curve (see [49] for a sample). Also, see [30,43] for some very interesting applications of robot motion planning to polygonal knot theory.

¹ "Open" means that the start and endpoints are distinct; "polygonal" means that the curve is piecewise linear.
² In other contexts, e.g., NMR structure determination, amide protons ($^{1}\mathrm{H}^{N}$) are more natural [17,8].
Polygonal curves simplify calculation of Equation (1). The integral becomes a finite sum:

$$\mathrm{Lk}(c_1, c_2) = \frac{1}{4\pi} \sum_{i} \sum_{j} A_{ij},$$

where $A_{ij}$ is the $\epsilon$-signed area on the sphere covered by vectors pointing from edge $e_i$ on the first curve to edge $e_j$ on the second curve.
Definition 1 We will refer to $A_{ij}/4\pi$ as the edge-edge writhing of the two edges $e_i$ and $e_j$.
Computing $A_{ij}$ is straightforward. Figure 9 illustrates the process. Algebraically, suppose the start and end points of the oriented edge $e_i$ are $p_1$ and $p_2$, and suppose the start and end points of oriented edge $e_j$ are $q_1$ and $q_2$. Consider the four extremal cross directions between the two edges:

$$d_1 = q_1 - p_1, \quad d_2 = q_1 - p_2, \quad d_3 = q_2 - p_2, \quad d_4 = q_2 - p_1.$$

For skew edges $e_i$ and $e_j$, the four directions $d_1, d_2, d_3, d_4$ define the vertices of a parallelogram $P_{ij}$ in three-dimensional space whose supporting plane does not intersect the origin.
Projecting the parallelogram onto the unit sphere creates a spherical parallelogram. Its vertices are the unit direction vectors obtained from $d_1, d_2, d_3, d_4$, its edges are arcs of great circles connecting these vertices, and its absolute area multiplied by the crossing number of the two edges is the desired signed area $A_{ij}$. Computing the area of a spherical quadrilateral is also straightforward; one simply sums the interior angles of the quadrilateral and subtracts $2\pi$. Observe that $A_{ij} = A_{ji}$.
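The spherical-area computation just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results reported later. The function name `edge_edge_writhing`, the tolerance `tol`, and the sign convention (the sign of a triple product of the edge directions and a connecting vector) are assumptions; the paper's own crossing-number convention may differ by a global sign. The area is obtained from arcsines of the dot products of adjacent great-circle normals, which equals the sum of interior angles minus $2\pi$.

```python
import numpy as np

def edge_edge_writhing(p1, p2, q1, q2, tol=1e-12):
    """Signed edge-edge writhing A_ij / (4*pi) of oriented edges p1->p2 and q1->q2.

    The sign convention is an assumption made for this sketch.
    """
    p1, p2, q1, q2 = (np.asarray(v, dtype=float) for v in (p1, p2, q1, q2))

    # The triple product vanishes exactly when the edges are coplanar
    # (including parallel and colinear edges); the signed area is then zero.
    triple = np.dot(np.cross(q2 - q1, p2 - p1), q1 - p1)
    if abs(triple) < tol:
        return 0.0

    # Corners of the parallelogram P_ij, listed in cyclic order.
    corners = [q1 - p1, q1 - p2, q2 - p2, q2 - p1]

    # Unit normals of the four great-circle planes that bound the
    # spherical quadrilateral.
    normals = []
    for k in range(4):
        n = np.cross(corners[k], corners[(k + 1) % 4])
        normals.append(n / np.linalg.norm(n))

    # Absolute area: each arcsine is (interior angle - pi/2), so the sum
    # equals (sum of interior angles) - 2*pi.
    area = sum(
        np.arcsin(np.clip(np.dot(normals[k], normals[(k + 1) % 4]), -1.0, 1.0))
        for k in range(4)
    )
    return float(np.sign(triple) * area / (4.0 * np.pi))
```

For two perpendicular edges of length 2 at unit separation, the parallelogram is one face of a cube centered at the origin, so the absolute area is $4\pi/6$ and the writhing has magnitude 1/6. Reversing the orientation of either edge flips the sign.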

Polygonal Curve Isotopies & the Structure Problem
As suggested by the SCOP definition, detecting protein similarity entails finding collections of paired substructures which are located roughly in the same relative locations in space.
Let us make this idea more precise. Recall that a polygonal curve is a piecewise linear embedding of the unit interval I into 3D space, $c : I \to \mathbb{R}^3$. In particular, the curve is not self-intersecting. We can represent the curve as a sequence of representative points $\{p_1, \ldots, p_n\}$, namely the endpoints of the linear segments. In our case the points are the coordinates of a protein's alpha carbons. Any consecutive subsequence of a polygonal curve's representative points also defines a polygonal curve.
Definition 2 Suppose that $p = \{p_1, \ldots, p_n\}$ and $q = \{q_1, \ldots, q_m\}$ are two polygonal curves. Suppose that E is a Euclidean rigid body motion on $\mathbb{R}^3$ (a rotation and translation). Let $\delta > 0$ be some positive number. We will say that curve p is $(E,\delta)$-isotopic to curve q if the following two conditions are satisfied:
(ii) There is a polygonal-curve isotopy h mapping E(p) to q such that no representative point moves further than $\delta$ from its initial or final location. More precisely, we require a continuous function $h : I \to (\mathbb{R}^3)^n$, written as $h(t) = (h_1(t), \ldots, h_n(t))$, such that:
(a) $h_i(0) = E(p_i)$, for all $i = 1, \ldots, n$.
(c) The sequence $\{h_1(t), \ldots, h_n(t)\}$ is a polygonal curve for all t, meaning that the points $h_1(t), \ldots, h_n(t)$ define a curve that is not self-intersecting for all times $t \in I$.
The $\delta$ appearing in this definition is the "excursion" to which we referred in the intuitive introduction of Section 3. We will presently use this definition to compare subsegments of curves. The motivating intuition is to regard two proteins as structurally similar if there is some rigid body transformation that places one protein on top of the other well enough that $\delta$-perturbations of local coordinates permit atom alignment without backbone self-collisions. The isotopy requirement mirrors formally the intuition of Sections 3.2 and 3.3: it measures similarity via classes of motions that preserve structure. Thus, for instance, two helices might match if and only if one can be transformed into the other without backbone self-collisions. Observe that the transformation could be quite large, depending on $\delta$, but at all times preserves the backbone topology. (We note in passing a generalization: it might be interesting to restrict the class of isotopies further by requiring that the polygonal curve h(t) not intersect the rest of the protein at any time t.)
For large n and medium-sized $\delta$, condition (ii) can be complicated to check. It basically entails solving a high-degree-of-freedom motion planning problem. Fortunately, for many short protein fragments and small $\delta$, the condition is similar to enforcing low RMSDs of the final alignments. The definition therefore addresses a wide tunable range of possible structural similarity questions.

Monotonic Curve Isotopies
Given a point $p_i$ and a line $\ell$ in 3D space one can project the point orthogonally onto the line. One can do the same for all representative points $\{p_1, \ldots, p_n\}$ of some polygonal curve. The curve is said to be monotonic with respect to line $\ell$ if the order of the projected points is the same as the order of the points in the curve. This order orients the line. Short protein segments, such as α-helices and β-strands, are often monotonic with respect to their best-approximating lines.
PROOF. Imagine drawing a line between $p_i$ and $\pi_i$ for each i. Define a homotopy that moves each $p_i$ to $\pi_i$ along these lines. The homotopy preserves the polygonal curve (and thus is an isotopy) since the curve is monotonic. •
Lemma 2 Suppose p and q are two polygonal curves with equal numbers of points, each monotonic with respect to some line. Let $\pi = \{\pi_1, \ldots, \pi_n\}$ and $\sigma = \{\sigma_1, \ldots, \sigma_n\}$ be the projections of the two curves onto their respective lines. Then

$$d(p, q) \le d(p, \pi) + d(q, \sigma) + \inf_E \max_i \|\sigma_i - E(\pi_i)\|,$$

where E is taken from the set of rigid body motions that align the two oriented lines.

PROOF. See Appendix B. •
The bound in Lemma 2 is often generous. The lemma tells us that two monotonic curves whose line-projections are similar in 1D are also readily isotopic in 3D.
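Monotonicity with respect to a best-approximating line is easy to test numerically. The sketch below is illustrative only: it assumes the best-approximating line is the least-squares (principal-axis) line through the points, which the text does not specify, and the names `best_line`, `line_coordinates`, `is_monotonic`, and `ideal_helix` are hypothetical. The helix parameters (radius 2.3 Å, rise 1.5 Å, 100° per residue) are textbook values for alpha-carbon positions in an ideal α-helix.

```python
import numpy as np

def best_line(points):
    """Least-squares line through the points: (point on line, unit direction).

    Assumption: 'best-approximating' is taken to mean the principal axis.
    """
    pts = np.asarray(points, dtype=float)
    origin = pts.mean(axis=0)
    # The first right-singular vector is the direction of greatest spread.
    _, _, vt = np.linalg.svd(pts - origin)
    return origin, vt[0]

def line_coordinates(points, origin, direction):
    """Signed position of each point's orthogonal projection along the line."""
    return (np.asarray(points, dtype=float) - origin) @ direction

def is_monotonic(points, origin, direction):
    """True if the projections appear along the line in curve order."""
    steps = np.diff(line_coordinates(points, origin, direction))
    return bool(np.all(steps > 0) or np.all(steps < 0))

def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Alpha-carbon positions of an idealized alpha-helix along the z axis."""
    i = np.arange(n)
    angle = np.radians(turn_deg) * i
    return np.column_stack([radius * np.cos(angle), radius * np.sin(angle), rise * i])
```

An 11-residue ideal helix is monotonic with respect to its fitted axis, whereas a curve that doubles back on itself is not:

```python
helix = ideal_helix(11)
print(is_monotonic(helix, *best_line(helix)))
```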
For polygonal curves with equal numbers of points, d measures the spatial difficulty of transforming one curve into the other. It provides no such information for curves with different numbers of points. Instead, we now define structural similarity as the detection of local isotopies. We need one piece of additional notation. Suppose $p = \{p_1, \ldots, p_n\}$ is a polygonal curve; let us define $p_i^k$ as the polygonal subcurve $\{p_{i-k}, \ldots, p_i, \ldots, p_{i+k}\}$ whenever $k+1 \le i \le n-k$. In other words, $p_i^k$ is the curve segment centered at $p_i$, extending backwards and forwards by k points.

Definition 4
Suppose that $p = \{p_1, \ldots, p_n\}$ and $q = \{q_1, \ldots, q_m\}$ are two polygonal curves. Let $\delta > 0$ be a positive number, k a nonnegative integer, and X some set of index pairs $\{(i, j)\}$. We say that p is $\delta$-structurally similar to q with k-strength alignment X if there exists some rigid body transformation E such that $d_E(p_i^k, q_j^k) < \delta$ for all pairs $(i, j) \in X$.
In English, this definition requires one curve to move rigidly over the other curve such that two paired collections of subcurves are nearly identical to each other, as measured by subsequent isotopy deformations. For k = 0, this definition is similar to aligning pointsets. For large k, the definition amounts to detecting overall curve similarity. In between, the definition captures the notion of structural alignment with rearrangements. In particular, the order of indices in the index set X need not be sequential. This leads to the following:
Structure Problem: For given curves p and q, for $\delta$ positive and k a nonnegative integer, compute all index sets X and their associated rigid body transformations E satisfying Definition 4.

Theorem 2
The Structure Problem is effectively computable.
PROOF. Follows from the proof of Theorem 1. •
Although computable, the algorithm derived from our proof of Theorem 1 is horrendously exponential [13,14,32,51]. One possibility is to use a motion planner specialized for knots, such as the untangling planner of [30]. Alternatively, for our purposes, Lemmas 1 and 2 suggest a simplification: In the next two sections we will examine one approach based on edge-edge writhings and a second approach based on line weavings, both of which attack the Structure Problem by aligning line projections of peptide segments.

Protein Similarity from Geometric Convolution
In this section we examine more closely the construction of Figure 9. Our observations will motivate us to define a self-convolution data structure for detecting structural similarity in proteins.

Writhing and Convolution
Definition 5 Suppose X and Y are two sets of points in $\mathbb{R}^3$. Then the geometric convolution of Y with X is the set of points $Y \ominus X = \{y - x \mid x \in X \text{ and } y \in Y\}$. (Sometimes this is defined by saying that the geometric convolution of Y with X is the Minkowski sum of Y and $-X$. There are again strong connections to robot motion planning [33, 34]. In particular, $Y \ominus X$ defines the set of translations of X that cause collisions with Y.)
Lemma 3 Assume $e_i$, $e_j$, and $P_{ij}$ are as defined at the end of Section 4.3. Then $P_{ij} = e_j \ominus e_i$.
PROOF. Definitional: $P_{ij}$ is the set of all vectors pointing from a point on $e_i$ to a point on $e_j$. •
Corollary 1 The edge-edge writhing $A_{ij}/4\pi$ of two oriented edges $e_i$ and $e_j$ is the absolute area of the convolution $e_j \ominus e_i$ projected onto the sphere $S^2$ times the crossing number $\epsilon$ of the two edges, divided by $4\pi$.

Corollary 2 Suppose edges ei and ej are given. The following four possibilities exist:
(a) The edges are skew. In this case $P_{ij}$ is a 2D polygon whose plane of support does not include the origin. The edge-edge writhing $A_{ij}/4\pi$ is therefore well-defined and nonzero.
(b) The edges are coplanar but not parallel. In this case $P_{ij}$ is again a 2D polygon, but now its plane of support does include the origin. The polygon $P_{ij}$ may or may not touch the origin. $P_{ij} \setminus \{0\}$ projects to a great-circle arc on the sphere, and the writhing $A_{ij}/4\pi$ is therefore zero.
(c) The edges are parallel but not colinear. In this case the polygon $P_{ij}$ degenerates to colinear line segments lying on a line that does not pass through the origin. The writhing $A_{ij}/4\pi$ is zero.
(d) The edges are colinear. In this case the polygon $P_{ij}$ degenerates to colinear line segments lying on a line that passes through the origin. The polygon $P_{ij}$ may or may not touch the origin. $P_{ij} \setminus \{0\}$ projects to one or two points on the sphere and the writhing $A_{ij}/4\pi$ is again zero.
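The four cases of Corollary 2 can be distinguished with two cross products and one triple product. The following sketch is illustrative; the function names `convolution_polygon` and `classify_edges`, the returned labels, and the absolute tolerance `tol` are assumptions. It uses the fact that the supporting plane of the polygon has normal $u \times v$ (the cross product of the two edge directions) and contains the origin exactly when the triple product with a connecting vector vanishes.

```python
import numpy as np

def convolution_polygon(p1, p2, q1, q2):
    """Vertices of P_ij = e_j (-) e_i in cyclic order, for e_i = p1->p2, e_j = q1->q2."""
    p1, p2, q1, q2 = (np.asarray(v, dtype=float) for v in (p1, p2, q1, q2))
    return np.array([q1 - p1, q1 - p2, q2 - p2, q2 - p1])

def classify_edges(p1, p2, q1, q2, tol=1e-9):
    """Return 'skew', 'coplanar', 'parallel', or 'colinear' (cases a-d)."""
    p1, p2, q1, q2 = (np.asarray(v, dtype=float) for v in (p1, p2, q1, q2))
    u, v, w = p2 - p1, q2 - q1, q1 - p1
    normal = np.cross(u, v)
    if np.linalg.norm(normal) > tol:
        # Edges are not parallel. The plane of P_ij has normal u x v and
        # passes through the origin exactly when normal . w == 0.
        return "skew" if abs(np.dot(normal, w)) > tol else "coplanar"
    # Edges are parallel; they are colinear when w is also parallel to u.
    return "parallel" if np.linalg.norm(np.cross(u, w)) > tol else "colinear"
```

Only the skew case yields a nonzero edge-edge writhing, so a classifier like this can serve as a cheap filter before any spherical area is computed.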

Corollary 3
The edges $e_i$ and $e_j$ intersect if and only if polygon $P_{ij}$ touches the origin.
Corollary 3 tells us that we can count edge incidence by counting polygons touching the origin. Suitably generalized, that hints at a method for determining structural similarity.

Self-Convolution
Earlier we observed that many successful structural alignment programs compare arrangements of pairs of lines. We now extend that idea to writhing polygons. In reading Lemma 4, imagine that we are comparing a pair of peptide segments in one protein with another pair in another protein.
Lemma 4 Consider four oriented edges: $e_1, e_2, f_1, f_2$. There is a rigid body transformation E mapping the edges $(e_1, e_2)$ to the edges $(f_1, f_2)$ if and only if there is a rotation R about the origin such that $R(e_2 \ominus e_1) = f_2 \ominus f_1$ while preserving vertex correspondence.

PROOF. See Appendix C. •
Corollary 4 If R is a rotation such that the maximum distance between corresponding vertices of the two polygons $R(e_2 \ominus e_1)$ and $f_2 \ominus f_1$ is $\delta$, then there is a rigid body transformation E such that $e_1$ and $e_2$ are $(E,\delta)$-isotopic to $f_1$ and $f_2$, respectively.

PROOF. See Appendix D. •
When Corollary 4 applies we say that the polygons are $\delta$-isotopic.

Definition 6 If p is a polygonal curve, we define the geometric self-convolution of p, written $\circledast(p)$, to be the generating polygons of $p \ominus p$:

$$\circledast(p) = \{P_{ij} \mid P_{ij} = e_j \ominus e_i, \text{ with } e_i \text{ and } e_j \text{ edges in the curve } p\}.$$

A writhing polygon $P_{ij}$ delineates internal translations of a polygonal curve that cause self-collisions, namely of edge $e_i$ with edge $e_j$. The self-convolution $\circledast(p)$ therefore describes internal translations that may change the topological shape of the curve p.
Given two curves p and q, we will seek structural similarity by comparing the curves' self-convolutions. Lemma 4 suggests that we mod out by rotations and translations, and focus instead on comparing the configurations of the polygons $\{P_{ij}\}$. Corollary 4 relates configuration similarity to isotopy distance. A writhing polygon has six configuration parameters: the two edge lengths, the angle between the edges, the distance from the origin, and two orientation parameters describing the polygon normal. We have found it useful to cluster using two features: edge-edge writhing and distance from the origin. Writhing provides a mixed measure of all six degrees of configuration freedom; retaining distance mitigates the roughly inverse-square effect of distance on writhing. Similarity is easily checked, using for instance a best-aligning rotation in Corollary 4.
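One way to obtain a best-aligning rotation for Corollary 4 is the standard SVD-based least-squares fit (often called the Kabsch method), restricted here to rotations about the origin with no translation. The text does not say which fitting method was used, so this choice is an assumption, as are the names `best_rotation` and `max_vertex_deviation`. Note that the fit minimizes the sum of squared vertex distances rather than the maximum distance, so the reported deviation is an upper bound on the smallest achievable $\delta$.

```python
import numpy as np

def best_rotation(source, target):
    """Rotation R about the origin minimizing sum_i ||R s_i - t_i||^2.

    `source` and `target` are (n, 3) arrays of corresponding polygon vertices.
    """
    A = np.asarray(source, dtype=float)
    B = np.asarray(target, dtype=float)
    U, _, Vt = np.linalg.svd(A.T @ B)
    # Guard against reflections: force det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def max_vertex_deviation(source, target, R):
    """Largest distance between a rotated source vertex and its target vertex."""
    A = np.asarray(source, dtype=float)
    B = np.asarray(target, dtype=float)
    return float(np.max(np.linalg.norm(A @ R.T - B, axis=1)))
```

Two writhing polygons would then be declared $\delta$-isotopic in the sense of Corollary 4 when `max_vertex_deviation` for the fitted rotation does not exceed the chosen $\delta$.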

Comparing Self-Convolutions
We now combine the isotopy and self-convolution ideas to implement an algorithm for detecting common protein structure. There is one additional wrinkle, needed to deal with the segment length parameter k in Definition 4. When constructing the self-convolution $\circledast(p)$, we replace the polygon $P_{ij}$ with a polygon formed from the best-line projections of the peptide segments $p_i^k$ and $p_j^k$, as motivated by Lemmas 1 and 2. Denote this polygon by $P_{ij}^k(p)$. For the writhing number we use the true writhing of the two peptide segments, that is, $w_{ij}^k = \mathrm{Lk}(p_i^k, p_j^k)$.
2. Hash the polygons $\{P_{ij}^k(p)\}$ and $\{P_{ij}^k(q)\}$ based on $w_{ij}^k$ and $d_{ij}$, ignoring near zeros.
3. For each nonempty (or sufficiently full) hash bucket $B_{wd}$ of polygons do the following:
• For each pair of $\delta$-isotopic polygons $P \in \circledast^k(p)$ and $Q \in \circledast^k(q)$ in $B_{wd}$, compute the rigid map E implied by Corollary 4. Hash the rigid map with its generating polygons.
The generating polygons and rigid maps associated with a hash bucket in Step 3 offer an approximate solution (X, E) to the Structure Problem. The entire hash table describes all nontrivial alignments at the given hash table resolutions. We ignore polygons with near zero writhing or distance to avoid degeneracies. The solutions are approximate in the sense that the polygons $P_{ij}^k$ are based on best-approximating edges and the maps E are clustered, potentially dilating $\delta$.
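The hashing step can be pictured with a small sketch. It assumes the writhing and distance features have already been computed for every polygon, and it keeps one table per protein rather than a single shared table. The function names, the `(label, writhing, distance)` tuple layout, the separate resolutions `eps_w` and `eps_d`, and the near-zero threshold `min_abs` are all hypothetical. The $\delta$-isotopy test on each candidate pair is deliberately left out.

```python
import math
from collections import defaultdict

def bucket_key(writhing, distance, eps_w, eps_d):
    """Quantize the two features to a hash-table cell."""
    return (math.floor(writhing / eps_w), math.floor(distance / eps_d))

def hash_polygons(features, eps_w, eps_d, min_abs=1e-3):
    """Build a table {cell: [labels]} from (label, writhing, distance) triples.

    Polygons with near-zero writhing or distance are skipped.
    """
    table = defaultdict(list)
    for label, w, d in features:
        if abs(w) < min_abs or abs(d) < min_abs:
            continue
        table[bucket_key(w, d, eps_w, eps_d)].append(label)
    return table

def candidate_pairs(table_p, table_q):
    """Yield (polygon of p, polygon of q) pairs that share a cell."""
    for key in table_p.keys() & table_q.keys():
        for a in table_p[key]:
            for b in table_q[key]:
                yield a, b
```

A fixed grid misses pairs that straddle a cell boundary; probing neighboring cells as well is a common remedy, at the cost of more candidate pairs.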
Figure 10 shows the magnitudes of the writhings $\{w_{ij}^k\}$ obtained from the self-convolution structures of 1xis and 1nar. These writhings were generated using polypeptide segments consisting of 11 residues, that is, with k = 5. The self-writhings of helices are evident in the bright red and orange bands along the diagonals of the matrices. The writhings of different β-strands appear as magenta off-diagonal peaks. The 8-fold symmetry of the TIM-barrel is clearly evident. Finally, the green speckle patterns indicate writhings of α-helices with β-strands.

Analysis
The convolution-based algorithm runs in time $O(k^2 n^2 + k^2 m^2 + s^2/\epsilon_P^2 + 1/\epsilon_E^6)$ and space $O(n^2 + m^2 + 1/\epsilon_P^2 + 1/\epsilon_E^6)$, where n and m are the number of points in p and q, k is the half-length of a peptide segment, s is the maximum number of pairwise similar polygons appearing in a polygon hash bucket, and $\epsilon_P$ and $\epsilon_E$ are the resolutions of the polygon and rigid body hash tables, respectively.
In practice, k and $\epsilon_E$ are constants. We took k = 5 and $\epsilon_E = 0.1$. $1/\epsilon_E^6$ is the size of the hash table for Euclidean transformations. We represented each transformation as a 4D quaternion and a 3D translation, projected the quaternion into 3D, then hashed the resulting 6 numbers. Although s can be $O(n^2)$, it depends on $\epsilon_P$. Choosing this carefully, the ratio $s/\epsilon_P$ can become $O(n)$. In that case, the algorithm has $O(n^2)$ behavior, with n the maximum protein length. The hash table resolutions constrain the observable distance $\delta$. The hash tables could be replaced by k-D trees, Voronoi diagrams, or other clustering methods [44], but we did not do so.
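A possible reading of the six-number rigid-map hash is sketched below. The text does not specify how the quaternion is projected into 3D; this sketch assumes the simplest choice, namely normalizing the quaternion, flipping its sign so the scalar part is nonnegative (since q and −q encode the same rotation), and keeping the vector part. The function name `rigid_map_key`, the `(w, x, y, z)` component order, and the use of one resolution `eps` for all six coordinates are assumptions.

```python
import math

def rigid_map_key(quaternion, translation, eps=0.1):
    """Quantize a rigid map to a 6-tuple of integer cell indices.

    quaternion:  (w, x, y, z), not necessarily normalized.
    translation: (tx, ty, tz).
    """
    w, x, y, z = quaternion
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = (c / norm for c in (w, x, y, z))
    # q and -q represent the same rotation; pick the representative with w >= 0.
    if w < 0:
        x, y, z = -x, -y, -z
    six = (x, y, z) + tuple(translation)
    return tuple(math.floor(c / eps) for c in six)
```

Rigid maps that fall into the same cell are treated as one cluster, which is why clustering can dilate the effective excursion.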

Results from Self-Convolution
We implemented the algorithm in (an old 8-bit) Lisp on a 1 GHz Windows PC. Running times for proteins with 300 residues were typically a minute or two, half of that garbage collection. (We chose that particular implementation simply because the author had written an extensive geometric and numerical library over the years in it, permitting easy interactive prototyping of ideas. We expect that a production-quality implementation in C++ would likely be 10-100 times faster.) Here are three interesting pairs of proteins:
5at1_A vs. 8atc_A: These are two different conformations of the catalytic chain A in Aspartate Carbamoyltransferase (ATC), a famous allosteric protein involved in the synthesis of pyrimidine nucleotides [56]. Chain A has two domains that rotate with respect to each other as part of the process. Two loops change conformation drastically. Our algorithm detects both the similarities and the differences. The rigid map with the greatest number of aligned segments lies within 2° in rotation and 0.6 Å in translation of the correct alignment. Our subsequent atom-alignment code assigns 289 of the 310 residues with RMSD 1.0 Å; the remaining residues constitute the two non-alignable loops. See Figure 11.
3adk vs. 1gky: Adenylate Kinase (PDB code: 3adk) and Guanylate Kinase (PDB code: 1gky) are two transferases catalyzing two ATP-dependent phosphorylations. These two proteins have a mere 19% sequence identity, are different lengths (194 vs. 186 residues), and include both matching and nonmatching secondary structures. Our code finds the alignment shown in Figure 12. The rigid map lies within 5° and 0.5 Å of the CE-alignment. Our subsequent atom-alignment assigns 165 atoms with RMSD 2.9 Å, closely matching CE.
1xis vs. 1nar: These are the two TIM-barrels we used extensively to illustrate the ideas of Section 3. We considered the 321 residues of 1xis without its tail versus the 289 residues of 1nar. The two protein chains have 7% sequence identity. As we mentioned earlier, there are several possible alignments, related by rotation around the central barrel (see again Figure 2). This pair of proteins is interesting because even in optimal alignment there are significant angular differences between aligned helices. Such comparisons originally motivated our isotopy definitions. Our code finds an alignment with RMSD 3.3 Å, differing by 14° and 1.5 Å from the optimal DALI-alignment. See Figure 13.

Comparing Crossing Numbers
Section 4 discussed line weavings of helix axes as a topological gauge of similarity. We have implemented that idea using line weavings derived from a protein's secondary structure elements, namely its α-helices and β-strands. An α-helix generates an oriented line representing the helix axis, while a β-strand generates an oriented line that best approximates the strand.
In order to deal with geometric singularities we model crossing numbers using three values, namely -1, 0, and +1. We assign the value 0 whenever two lines are nearly coplanar. Our code considers pairs, triples, and quadruples of lines, depending on the number of secondary structure elements available. Pairs of lines generate a single crossing number, triples generate three crossing numbers, and quadruples generate six crossing numbers. The code hashes sets of lines based on the number of their positive and negative crossing numbers, allowing 0 to act as a wild card. For instance, the three sets of crossing numbers {+1,+1,+1,+1,+1,-1}, {0,+1,+1,+1,+1,-1}, and {+1,+1,+1,+1,+1,0} are all comparable. All three sets of crossing numbers might represent essentially the same quadruple of lines, except that two lines are (nearly) coplanar in two of the quadruples. The code looks for topologically similar weavings between proteins first by checking that their crossing numbers hash to the same bucket, then by checking whether their crossing matrices are related by a permutation matrix. Throughout, crossing number 0 acts as a wild card.
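The three-valued crossing number and the wild-card comparison of crossing counts can be sketched as follows. This is an illustration under stated assumptions: the sign convention (sign of the triple product of the two unit directions and a connecting vector), the coplanarity threshold `tol`, and all function names are hypothetical. Only the count-based comparison is shown; the subsequent permutation-matrix check is omitted.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def crossing_number(a, u, b, v, tol=0.5):
    """Crossing number in {-1, 0, +1} of oriented lines a + s*u and b + t*v.

    Returns 0 when the lines are nearly coplanar (|triple product| < tol).
    """
    a, u, b, v = (np.asarray(x, dtype=float) for x in (a, u, b, v))
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    triple = np.dot(np.cross(u, v), b - a)
    if abs(triple) < tol:
        return 0
    return 1 if triple > 0 else -1

def crossing_signature(lines, tol=0.5):
    """Crossing numbers of all pairs from a list of (point, direction) lines."""
    return [crossing_number(a, u, b, v, tol)
            for (a, u), (b, v) in combinations(lines, 2)]

def comparable(sig_a, sig_b):
    """True if the two sets of crossing numbers can be matched one-to-one,
    with 0 acting as a wild card that matches -1, 0, or +1."""
    if len(sig_a) != len(sig_b):
        return False
    ca, cb = Counter(sig_a), Counter(sig_b)
    return ca[1] <= cb[1] + cb[0] and ca[-1] <= cb[-1] + cb[0]
```

Under this rule the three example sets {+1,+1,+1,+1,+1,-1}, {0,+1,+1,+1,+1,-1}, and {+1,+1,+1,+1,+1,0} are pairwise comparable, while a set of six +1 values is not comparable to a set of six -1 values.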
Each pairing of topologically similar line weavings between proteins generates a rigid map that aligns one quadruple (or triple or pair) of lines in one protein with a quadruple (or triple or pair) of lines in the other protein as well as is possible using a rigid map. Our code discards rigid maps that do not properly align their generating lines within some tolerance when viewed as points in the space of lines. Given such a core alignment of generating secondary structure elements, the code then extends the alignment to other secondary structure elements by looking for nearby neighbors.
In summary, the basic algorithm is:

Weaving-based Matching Algorithm
Given two proteins, detect structural similarity as follows:
1. Compute approximating lines for secondary structure elements in the two proteins.
2. Generate weavings of such lines in each protein (primarily quadruples).
4. Extend the matchings to other secondary structures in the two proteins.
In short, instead of hashing on geometric writhing numbers as in Section 6, we now hash on a topological invariant of the line weavings. Our results for the three pairs of proteins mentioned earlier are very similar using the two approaches. Table 1 lists some more; details are in Section 7.2.
We note in passing that Step 4 can be performed in many ways. We use a bipartite graph matching algorithm, in which the underlying cost function is an L2 measure, described further in Section 7.2.3. In addition, at various locations the code uses a variety of measures to prune or extend alignments. We omit the details.
The complexity of the weaving-based approach is potentially high: there are $O(s^4)$ quadruple-line weavings in a protein, where s is the number of secondary structure elements in the protein, leading potentially to $O(s^8)$ comparisons between proteins. Extending an alignment of one pairing of weavings to all the secondary structure elements in the two proteins may require $O(s^2)$ effort to compute similarity and $O(s^3)$ to run an optimizing bipartite graph matcher. This suggests an overall complexity of $O(s^{11})$ for a straightforward implementation. In practice, we did not encounter exorbitant runtimes. In fact, with some exceptions, we generally found that our weaving-based matcher executed much faster than our writhing-based matcher. For many examples, the code ran in seconds to minutes, despite being implemented in an old 8-bit Lisp, though for proteins with large numbers of secondary structures the code sometimes ran for 20-60 minutes. Again, a production-quality implementation in C++ would likely be 10-100 times faster.
One reason we observed reasonable runtimes is that we restricted the focus of our weaving-based matcher in the following three ways:
(a) When generating quadruples (or triples or pairs) of lines, the code requires the underlying secondary structure elements to lie spatially within some distance cutoff of each other. The precise distance is an input parameter to our code. We consistently used 30 Å, which is about three-quarters the diameter of a typical protein domain.
(b) When generating quadruples and their associated rigid maps, the code first considers quadruples of α-helices, turning to quadruples of β-strands only if there are insufficiently many helices, and then turning to mixed quadruples of helices and strands, if necessary.
(c) When extending alignments from quadruples to all the secondary structure elements in the two proteins, the code only matches secondary structure elements of the same type (α to α and β to β). (Of course, it would be easy to remove that restriction.)
Restriction (a) in particular is good at limiting the number of generating quadruples. Since secondary structure elements are physical, the number of such elements that can be packed into a volume 30 Å across is bounded by some constant. Thus the algorithm effectively only considers $O(s)$ weavings in each protein, leading to an overall complexity of $O(s^4)$ to $O(s^5)$, depending on how Step 4 is implemented. We note in passing that geometric hashing could reduce this complexity even further.

Results from Line Weavings
Table 1 shows the alignments obtained by our weaving-based matching algorithm. As explained in the previous subsection, the code first matches weavings of a small number of lines in one protein with topologically equivalent weavings in the other protein. Each such pairing of line weavings seeds a routine that computes alignments between larger sets of secondary structure elements in the two proteins. The remainder of this section explains Table 1 further.

Alignment Rankings and Backbone Sequentiality
"Rank" in the table refers to an ordering given by the similarity measure $\rho_{12}$. Section 7.2.4 describes this measure further.
The table depicts two alignments of 1wsy_A with 2rus_A, namely those ranked #1 and #15. These proteins are TIM-barrels, exhibiting considerable rotational symmetries. The helices and strands are analogous to teeth in a gear, with consequent symmetry. The nominally correct alignment, as determined by CE, happens to rank #15. Interestingly, it is the first alignment in the ranking that preserves backbone sequentiality. If one asks the code to favor backbone order-preserving alignments, then the nominally correct alignment appears as the overall winner.
The comparison of 3adk with 1gky also has an ambiguity in its ranking. The #1 ranked alignment differs slightly from the nominally correct alignment. Again, this alignment does not completely preserve backbone sequentiality. The first alignment that does preserve backbone sequentiality is indeed the nominally correct alignment, which happens to be ranked #2.
In all other cases, the first ranked alignment is also the nominally correct one, as measured by DALI, CE, and/or 3DSEARCH.

Crossing Consistency
An alignment between a set of n secondary structures in one protein and a set of n secondary structures in a second protein generates an associated crossing matrix in each protein. Each protein's crossing matrix contains the crossing numbers associated with the infinite lines that represent the aligned secondary structures. Each matrix is an $n \times n$ symmetric matrix with zeros on the diagonal.
For each alignment, one can compare the crossing numbers in the two crossing matrices generated by that alignment. The entries "Bad/Sig:Tot" in Table 1 do just that. "Tot" counts the number of crossings, that is, the number of entries in the upper triangle of the crossing matrix; it has value $n(n-1)/2$, where n is the number of secondary structures in each protein that have been aligned. Some of these entries will be 0, indicating (nearly) coplanar lines. "Sig" counts the number of corresponding entries that are nonzero in both crossing matrices. "Bad" counts the number of these entries that are inconsistent, meaning that two secondary structure lines have crossing number "+1" in one protein while their aligned counterparts have crossing number "-1" in the other protein.

Table 1: Each row represents an alignment of two protein chains. The alignments were seeded using line weavings as explained in the text.
The left set of columns lists the protein chain names (Prot1 and Prot2), their sequence similarity as a percentage, and the number of secondary structure elements (SSEs) eligible for alignment in each chain. The code only considers α-helices with at least five residues and β-strands with at least three residues.
The middle set of columns depicts the results of an alignment: the rank of the alignment, the number of lines matched between the two proteins (|SSE|), a measure of the deviation between paired lines (L2), a measure of the line crossing consistency (Bad/Sig:Tot), and a cumulative similarity measure ($\rho_{12}$). L2 measures a deviation, so small values are preferred; 0 is the smallest possible value. $\rho_{12}$ measures similarity, so large values are preferred; 1 is the largest possible value. The overall "Rank" is based on $\rho_{12}$.
The right set of columns assesses the accuracy of the results obtained. The first two columns show the deviation, in terms of distance offset and angular rotation, between the rigid map inferred directly from the line alignments and the optimal rigid map obtained from DALI, CE, or 3DSEARCH. The last column shows the RMSD between aligned CA atoms (alpha carbons), as computed by our atom alignment code (this alignment code starts with a rigid map computed from the line alignments, then tries to align both proteins, not just the secondary structures, using an iterative bipartite-graph closest-point routine).

Figure 14: Panel (a): Line weavings for the optimal alignment of 1wsy_A (blue) and 2rus_A (red). Panel (b): Overlay of the crossing matrices for the two weavings. An entry is blank if one or both of the crossing numbers is zero, it is the sign of the crossing number if the crossing numbers agree, and it is a red X if they disagree.
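The Bad/Sig:Tot entries can be computed directly from the two crossing matrices of an alignment. The sketch below is a minimal illustration; the function name `crossing_consistency` is hypothetical, and the matrices are assumed to be symmetric arrays with entries in {-1, 0, +1}, indexed so that row and column i of one matrix correspond to row and column i of the other.

```python
import numpy as np

def crossing_consistency(cross_a, cross_b):
    """Return (bad, sig, tot) for two n x n crossing matrices.

    tot: number of upper-triangle entries, n(n-1)/2.
    sig: entries that are nonzero in both matrices.
    bad: significant entries whose signs disagree.
    """
    a = np.asarray(cross_a)
    b = np.asarray(cross_b)
    upper = np.triu_indices(a.shape[0], k=1)
    a, b = a[upper], b[upper]
    significant = (a != 0) & (b != 0)
    bad = int(np.sum(significant & (a != b)))
    return bad, int(np.sum(significant)), int(a.size)
```

For an alignment of four lines, for example, `tot` is 6; a single sign disagreement among four mutually nonzero entries would be reported as 1/4:6.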
The weavings used to seed an alignment always have fully consistent crossing matrices. However, one would not necessarily expect the crossing matrices corresponding to an overall alignment induced by that seed to be consistent. After all, α-helices and β-strands are actually finite-length polypeptide segments, not infinite lines. Thus a motion of a helix or strand could preserve the overall topology of a protein but change the crossing numbers associated with the protein's representation by infinite lines. It thus comes as a pleasant discovery that crossing matrices generally are indeed fairly consistent globally for good structural alignments.
By way of example, Panel (a) of Figure 14 depicts the line weavings for the correct alignment of 1wsy_A with 2rus_A. Panel (b) shows the overlay of the crossing matrices for the two weavings. It is interesting to observe both the roughly hyperbolic shape formed by the line weavings as well as the block diagonal structure of the crossing matrix. The first 8 rows and columns in the matrix represent lines of α-helices; the last 8 rows and columns represent lines of β-strands. Internal to each of these two sets of lines, the crossings are primarily positive. Crossings across sets, that is, between an α-line and a β-line, are primarily negative. The reason for this is the symmetry of the TIM-barrel and the fact that the backbones of α-helices and β-strands are oriented oppositely relative to the barrel axis, as inspection of the proteins shows.

The L2 Measure
In ranking and extending alignments, the code considers various error measures, including the length of the alignment and an L2 measure of line embeddings, which we now explain. While weavings are constructed from infinite lines, the L2 measure is based on finite line segments that represent the protein's secondary structure elements. A finite line segment is a straight-line embedding of the unit interval [0,1] into 3D space. Given two oriented line segments $h : [0,1] \to \mathbb{R}^3$ and $k : [0,1] \to \mathbb{R}^3$, a standard least-squares metric for measuring their similarity is: Given a collection of line segments $\{h_1, \ldots, h_n\}$ in one protein, paired with a corresponding collection of line segments $\{k_1, \ldots, k_n\}$ in a second protein, one can measure the goodness of the alignment as follows:

L2 _ n
The value "L2" thus obtained appears in Table 1. It is an analogue for oriented line-alignments of the RMSD measure often used for atom-alignments.

Similarity and Rank
In Table 1, the value P12 provides yet another measure of how well one protein (Protl) may be aligned with a second protein (Prot2).The "Rank" column of Table 1 refers to a ranking by P12 value.The value lies in the range [0,1], with 1 optimal.It combines three different measurements, namely the number of aligned secondary structure elements, the Ul measure, and the crossing consistency, as follows: Here S\ is the ratio of secondary structure elements in Protl to the number of elements appearing in the optimal alignment, s 2 is L2/(4A), and s 3 is 10*Bad/Sig.When combining multiple measures, small exponents reduce the significance of any one deviation, while large exponents increase the significance; we use exponent 4 to amplify any deviations above 1 in the values {51, s 2l S3}.Thus the divisor 4A in s 2 simply asserts that deviations below 4A are not terribly significant; similarly the multiplier 10 in 53 asserts that crossing errors exceeding 10% are significant.We picked these numbers without any tuning, based simply on intuition developed in observing protein alignments.Likely other values would be equally good or better.
The precise value of pi 2 is not significant; we caution against reading too much into its absolute value.Instead, it is a rough qualitative dimensionless number for assessing how well Weaving-based   2 and 3. Blue shows the weaving-based similarity, while red shows the percentage of residues aligned by CE.The data is taken from row #6 in each of Tables 2  and 3.The axis "Protein2" uses integers; these refer to the columns in which each protein appears in the tables.the lines of Protl may be placed onto the lines of Prot2.The number is purposefully not symmetric.For instance, a small protein domain might appear as a subdomain of another protein.This would mean that p i2 might be near 1, while the opposite comparison p 2 i might be considerably less than 1.For good alignments in which the Ul and Bad/Sig values are low, pi2 effectively measures the fraction of secondary structure elements in Protl that have been aligned.
Table 2 depicts the similarity values for all possible comparisons between 12 of the 24 proteins from Table 1, using the version of the weaving-based alignment algorithm that favors preserving backbone order. The values of 100·p12 in a single row give a rough relative comparison of how well one protein matches all the others. For some proteins a clear baseline is apparent, indicating an alignment of four secondary structure elements, the minimum possible using quadruples. For instance, the value 57 appears frequently in the row for 1a6m, indicating an alignment of 4 of the 7 possible secondary structure elements.
We ran the same comparisons using CE, obtaining qualitatively similar results. Table 3 depicts the results, showing for each protein the percentage of its residues aligned with the other proteins. This data roughly mirrors the data of Table 2. For instance, Figure 15 graphs the alignment data for 1pbg_A, a large protein for which the two methods agree quite well. As a reminder, p12 values are based only on secondary structure alignments, whereas the CE-derived percentages are based on all residues.

Protein Descriptions
In selecting proteins we examined SCOP, focusing primarily on classes "a+b" and "a/b", plus a few others. We chose proteins with a range of sequence similarities. Two of the protein pairs, (1a6m, 1lhs) and (1d9c_A, 2rig), are all-helical. One protein (1mjc) is all-sheet. The others contain a mix of α-helices and β-sheets. Here is a brief description of all the proteins appearing in Tables 1 and 2:
(5at1_A, 8atc_A) The taut and relaxed conformations, respectively, of the catalytic chain A in Aspartate Carbamoyltransferase, a protein involved in the synthesis of pyrimidine nucleotides. Chain A has 310 residues, forming two domains, each consisting of a β-sheet and several α-helices. The two domains are joined at a hinge.
(1fpk_A, 1fpk_B) These are chains A and B of the dimer Fructose-1,6-Bisphosphatase, a hydrolase involved in gluconeogenesis in the liver. Each chain consists of 335 residues, forming three β-sheets and several α-helices.
(1ki7_A, 1qhi_A) These are two different complexes of thymidine kinase from the herpes simplex virus. Thymidine kinase is a phosphotransferase. Chain A consists of 374 residues (329 of which are represented in the PDB file), forming one β-sheet and numerous α-helices.
(1hyq_A, 1cp2_A) 1hyq is a bacterial cell-division regulator (MinD). Chain A consists of 263 residues (233 of which are represented in the PDB file), forming one large β-sheet surrounded by several α-helices. 1cp2 is a nitrogenase iron protein. Chain A consists of 269 residues, again forming a large β-sheet surrounded by α-helices.
(1atn_A, 3hsc) 1atn is Actin from rabbit, while 3hsc is Heat Shock Cognate 70 from cow. Chain A of 1atn consists of 372 residues, forming five β-sheets and several α-helices. 3hsc consists of 384 residues, forming five β-sheets and several α-helices. The two proteins share a common ATPase domain [26].
(1d9c_A, 2rig) 1d9c is Interferon-Gamma from cow, while 2rig is Interferon-Gamma from rabbit. Chain A of 1d9c consists of 121 residues, while 2rig consists of 119 residues, in both cases forming an all-helical protein.
(1wsy_A, 2rus_A) Chain A of Tryptophan Synthase (1wsy) consists of 265 residues, forming a TIM-barrel in the Tryptophan family. Chain A of 2rus consists of 457 residues, forming a TIM-barrel in the RuBisCo family; there are several additional β-strands and there is additional domain structure outside the common barrel motif.
(1mjc, 1a62) 1mjc is Cold Shock protein from E. coli with 70 residues. It is an all-β protein, with a six-strand barrel fold. 1a62 is the RNA binding domain of E. coli rho factor (often called Rho 130). It consists of two subdomains, an amino terminal helical region and a β-barrel carboxy terminal domain. The β-barrel domain binds either ssDNA or RNA and is structurally homologous to the oligonucleotide-saccharide binding domain. Thus, 1mjc should appear homologous to the β domain of 1a62.
8 Future Work
Topology: We suspect that considerable additional information may be gained by focusing more closely on the pure topology of proteins. For instance, a purely topological hashing scheme would look more closely at the crossing matrix. In the case of quadruples, the following two numbers fully characterize isotopy classes of four oriented pairwise-skew lines: (i) the sum of all the triple linking numbers in the weaving, and (ii) the cardinality of the positive (or negative) crossing numbers. (A triple linking number is the product of the three crossing numbers defined by some triple of lines. Invariant (i) is the sum of all such products for all possible triples of lines in a given weaving. See [60].) As mentioned, we currently hash on invariant (ii), expanded to accommodate coplanar lines, then compare permutations of crossing matrices to select topologically equivalent weavings. We subsequently discard line alignments that do not make physical sense. Our approach is therefore currently not purely topological, but takes rough account of the line embeddings. Future research should explore further in both directions: the more topological direction as well as the geometrically more specific direction. As we have indicated, for large numbers of lines, many of the topological problems are wide open.
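The two invariants described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. Each oriented line is represented by a hypothetical (point, direction) pair. The sign convention for the crossing number (the sign of the triple product of the two directions and the offset between the lines) is an assumption, and parallel or intersecting lines are simply assigned crossing number 0.

```python
import math
from itertools import combinations


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def crossing_number(line_a, line_b):
    """Sign (+1, -1, or 0) of the crossing of two oriented lines.

    ASSUMED convention: sign of (ua x ub) . (pb - pa); it is symmetric
    in the two lines and is 0 for parallel or intersecting lines.
    """
    (pa, ua), (pb, ub) = line_a, line_b
    offset = tuple(b - a for a, b in zip(pa, pb))
    t = dot(cross(ua, ub), offset)
    return (t > 0) - (t < 0)


def weaving_invariants(lines):
    """Return (sum of triple linking numbers, number of positive crossings)."""
    n = len(lines)
    c = {(i, j): crossing_number(lines[i], lines[j])
         for i, j in combinations(range(n), 2)}
    triple_sum = sum(c[i, j] * c[i, k] * c[j, k]
                     for i, j, k in combinations(range(n), 3))
    positives = sum(1 for value in c.values() if value > 0)
    return triple_sum, positives


# Four skew lines stacked along z, each rotated 40 degrees from the last.
lines = [((0.0, 0.0, float(k)),
          (math.cos(math.radians(40 * k)), math.sin(math.radians(40 * k)), 0.0))
         for k in range(4)]
print(weaving_invariants(lines))
```

Reversing the orientation of a single line flips its three crossing numbers, which changes invariant (ii) but leaves invariant (i) unchanged, since every triple contains either zero or two crossings involving that line.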

Sheets: This paper approximated both α-helices and β-strands using lines, then developed an algorithm for matching lines and line weavings. While successful, we suspect that such an approach only captures part of the structure contained in β-sheets. Such sheets have both a two-dimensional surface structure and a component one-dimensional line structure. In other contexts, such as our work on detecting protein similarities from sparse NMR data [19], we have discovered and used very natural two-dimensional polytope structures for representing β-sheets, based on hydrogen-bonding. We suspect that higher-dimensional generalization of line weavings to surface foliations may provide additional useful topological information. Such generalizations may prove particularly useful in contexts where β-strands are only poorly approximated by straight lines, due to twisting and bending.
General Loops: Our topological approach currently is limited in a practical sense to secondary structure elements. Section 5 outlined the theoretical foundation of a general approach able to deal with arbitrary polypeptide segments, not just those defining α-helices and β-strands. Future work needs to extend the current results in that direction. The driving goal should be to derive compact topological descriptors characteristic of protein shapes and to circumscribe the hypervolume of potential topological shapes actually inhabited by proteins.

Summary
This paper introduced the notion of isotopy deformations into structural alignment. The paper explored the relationship between writhing and self-convolution. Self-convolution compactly describes edge-edge interactions and extends naturally to interactions of curve segments. Writhing and separation are useful shape descriptors for clustering pairs of curve segments. The paper presented an algorithm for matching substructures by clustering similar segment pairs, then clustering among the induced rigid maps. The paper also explored line weavings as a means for characterizing protein structures by arrangements of α-helices and β-strands. The paper presented an algorithm for matching proteins based on line weaving topology. Future work should extend these knot theoretic ideas to include surface representations and general loops, then classify protein shapes topologically.

10 Acknowledgment
I am very grateful to Dr. Gordon S. Rule in the Department of Biological Sciences for countless wonderful conversations and discussions regarding protein structure and biochemistry over the past six years.

Figure 3: Isotopies of two pairs of helices, shown at three different snapshots in time. Two helices of 1xis (blue) morph into their counterparts in 1nar (red). The isotopy for the left pair of helices in each frame is essentially a quarter-turn rotation about the helical axis. The isotopy for the right pair of helices in each frame is essentially a rotation about an axis perpendicular to the helices, followed by a loop rearrangement near the top of the blue helix.

Figure 4: This figure again depicts the two alignments of Figure 2, now showing only the helix axes as line segments. 1xis is in blue, 1nar in red. Some of the line segments are labeled with identifiers ("xi" for helices in 1xis and "ni" for helices in 1nar). These generate the line weavings of Figures 5 and 6.

Figure 5: Line weavings generated from the four labeled pairs of edges shown in the optimal alignment of Figure 4. Each labeled helix edge generates a thick infinite line in the weaving. The yellow arrows indicate the backbone directions. The viewing perspective is the same as in Figure 4, looking square at the paper, only from further back so all crossings are visible.

Figure 6: Line weavings generated from the three labeled pairs of edges shown in the alternate alignment of Figure 4. Again, each labeled helix edge generates a thick infinite line in the weaving. The viewing perspective is from the right side of the drawing depicted in Figure 4, looking tangential to the paper.

Figure 7: The two types of crossings and their crossing numbers.

Figure 8: The linking of two curves is defined as the sum of the crossing numbers divided by two. This figure shows a pair of unlinked curves, a pair of singly linked curves, and a pair of doubly linked curves.

Figure 9: Edges e_i and e_j generate a parallelogram P_ij of interedge directions, with vertices indexed 1 through 4. The absolute area of the parallelogram projected onto the unit sphere is εA_ij, where ε is the crossing number of the two edges (in the figure, ε = +1). The edge-edge writhing is defined to be A_ij/(4π).
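The signed spherical area in this caption can be evaluated in closed form. The sketch below is a minimal illustration under stated assumptions: it uses the standard arcsine formula for the area of a spherical quadrilateral (the sum of arcsines of dot products of consecutive unit normals of the bounding great-circle planes), which is a swapped-in textbook technique rather than necessarily the paper's own computation. The function name `edge_edge_writhe`, the endpoint arguments, and the sign convention are all hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def edge_edge_writhe(p1, p2, p3, p4):
    """Signed area A_ij / (4*pi) for edge p1->p2 against edge p3->p4.

    ASSUMPTIONS: arcsine area formula for the spherical quadrilateral of
    inter-edge directions; sign taken from a triple product (convention
    chosen here for illustration). Degenerate or coplanar pairs give 0.
    """
    r13, r14 = _sub(p3, p1), _sub(p4, p1)
    r23, r24 = _sub(p3, p2), _sub(p4, p2)
    normals = [_cross(r13, r14), _cross(r14, r24),
               _cross(r24, r23), _cross(r23, r13)]
    units = []
    for n in normals:
        length = math.sqrt(_dot(n, n))
        if length == 0.0:
            return 0.0  # an endpoint is collinear with the other edge
        units.append((n[0] / length, n[1] / length, n[2] / length))
    # Unsigned area: sum of arcsines of consecutive unit-normal dot products.
    area = sum(math.asin(max(-1.0, min(1.0, _dot(units[k], units[(k + 1) % 4]))))
               for k in range(4))
    t = _dot(_cross(_sub(p4, p3), _sub(p2, p1)), r13)
    sign = (t > 0) - (t < 0)
    return sign * area / (4.0 * math.pi)


# Two perpendicular edges of length 2 passing very close to each other.
w = edge_edge_writhe((-1, 0, 0), (1, 0, 0), (0, -1, 0.001), (0, 1, 0.001))
print(abs(w))
```

For two edges that pass very close to one another, the projected parallelogram covers nearly half the sphere, so the magnitude approaches 1/2; for edges far apart relative to their lengths it approaches 0.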

Figure 11: Alignment of 5at1_A (blue) and 8atc_A (red) found by our convolution-based algorithm. The backbones match nearly perfectly, except where they should not, namely two loops that undergo significant conformational change (these appear near the top left and the top right in the figure).

Figure 12: Alignment of 3adk (blue) and 1gky (red). The proteins have a mere 19% sequence identity and include both matching and nonmatching secondary structures. Roughly 80% of the two proteins should align. One can see this in the figure, with the left parts matching well and some of the right clearly not.

Figure 15: Overlay of p12 and residue percentages for alignments of 1pbg_A with the proteins listed in Tables 2 and 3. Blue shows the weaving-based similarity, while red shows the percentage of residues aligned by CE. The data is taken from row #6 in each of Tables 2 and 3. The axis "Protein2" uses integers; these refer to the columns in which each protein appears in the tables.
and only if p and q are not isotopic for any (E,S), e.g., if the number of representative points differs. Computing d is the problem we called POLYGONAL CURVE SIMILARITY in the intuitive introduction of Section 3.
Theorem 1. d is a metric and d is effectively computable. PROOF. See Appendix A. ∎

Table 1: Alignment of proteins from weaving topologies.

Table 2: Weaving-based similarities for cross comparisons of 12 proteins with each other. The table depicts 100·p12, producing values in the range [0,100], with 100 optimal. For otherwise good alignments, p12 is roughly the fraction of Protein1's secondary structure elements that have been aligned. For reference, the column labeled "Table 1" refers to the nominally-correct comparison of Protein1 with its counterpart in Table 1.

Table 3: CE alignments. The table depicts the percentage of Protein1's residues aligned by CE with each of the other proteins. "size" is the number of residues considered by CE in Protein1.