Protein similarity from knot theory and geometric convolution

Shape similarity is one of the most elusive and intriguing questions of nature and mathematics. Proteins provide a rich domain in which to test theories of shape similarity. Proteins can match at different scales and in different arrangements. Sometimes the detection of common local structure is sufficient to infer global alignment of two proteins; at other times it provides false information. Proteins with very low sequence identity may share large substructures, or perhaps just a central core. There are even examples of proteins with nearly identical primary sequence in which α-helices have become Β-sheets.Shape similarity can be formulated (i) in terms of global metrics, such as RMSD or Hausdorff distance, (ii) in terms of subgraph isomorphisms, such as the detection of shared substructures with similar relative locations, or (iii) purely topologically, in terms of the cohomology induced by structure preserving transformations. Existing protein structure detection programs are built on the first two types of similarity. The third forms the foundations of knot theory.The thesis of this paper is: Protein similarity detection leads naturally to an algorithm operating at the metric, relational, and homotopic scales. The paper introduces a definition of similarity based on atomic motions that preserve local backbone topology without incurring significant distance errors. Such motions are motivated by the physical requirements for rearranging subsequences of a protein. Similarity detection then seeks rigid body motions able to overlay pairs of substructures, each related by a substructure-preserving motion, without necessarily requiring global structure preservation. This definition is general enough to span a wide range of questions: One can ask for full rearrangement of one protein into another while preserving global topology, as in drug design; or one can ask for rearrangements of sets of smaller substructures, each of which preserves local but not global topology, as in protein evolution.In the appendix, we exhibit an algorithm for answering the general question. That algorithm has the complexity of robot motion planning. In the text, we consider a more common case in which one seeks protein similarity by rearrangements of relatively short peptide segments. We exhibit an algorithm based on writhing numbers that runs in time O(n2) to O(n4). We define and use a new datastructure, called geometric self-convolution, within this algorithm.Contributions: We believe that this is the first paper to consider carefully the need for combining metric and homotopic qualities in seeking protein similarity. We provide a parameterized definition of similarity that leads naturally to a metric in protein space. We exhibit algorithms for computing the metric and detecting similarity. We report results obtained with three pairs of proteins, each pair exhibiting different typical characteristics.


Introduction
Determining structural similarity between proteins is one of the most central and common problems within proteomics, yet there exist no simple universally accepted algorithms for solving this problem.Indeed, the most widely used existing 3D structural alignment tools (e.g., Dali [15], Vast [12], CE [35], and 3dSearch [37]) are likely to disagree in their specific atomic alignments and sometimes even in their top-scoring secondary structure alignments when presented with proteins that have low sequence similarity and low structural similarity.
As of late August 2003 the Protein Data Bank (PDB) [31,5] contained in excess of 22,000 protein structures, up approximately 1,000 since early June.Many of these proteins are highly similar structures.There are only approximately 4,000 different folds represented in the PDB, roughly a ratio of 1:5 (fold:structure).Given a new protein, the probability is high that it is similar to an existing protein.Detecting such similarity quickly is essential for classifying a protein and understanding its biological function.
More importantly, as the growth in new structures outpaces the growth in new folds, it is likely that the role of structural similarity will need to become much more fine-grained than it is today.Biological discoveries will lie in unusual, possibly very sparse, structural similarities, rather than in rough fold-level classifications.For instance, in looking at the backbone CA atoms of a β-sheet, one can easily detect two roughly orthogonal families of curves in the sheet, one family parallel to the constituent β-strands, the other perpendicular to the strands.This observation raises at least the geometric question whether there are two proteins that place their β-sheets, viewed as two-dimensional sheets, in the same spatial locations, even though the underlying makeup of the one-dimensional strands is orthogonal.Such proteins would hint at some very interesting biochemical/genetic rearrangements.It is unclear whether any existing structural alignment tool today could detect such similarities automatically.It is easy to construct less exotic but equally interesting questions.As X-Ray and NMR methodologies enter high-throughput capability, the questions arise easily, yet many go unanswered.
Lacking is a good definition of "similarity", even for today's alignment tools.The Structural Classification of Proteins (SCOP) website [25] offers the following: "Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections."This sounds good, it is intuitive, and it is applied every day to classify proteins.But what really does it mean?When is a secondary structure "major", when is a collection of secondary structures in the "same arrangement", and which "topological connections" are really relevant?
This paper focuses on topological incidence and polygonal writhing as a gauge of geometric similarity.We take our inspiration from a recent fundamental paper [32] that classifies protein structures in terms of Gauss integrals, motivated by ongoing work on knot invariants [4].In this paper we explore the connection further, leading naturally to a metric in protein space and a datastructure for representing geometric self-convolutions of polygonal curves.We comment on connections with methods from Robot Motion Planning.Finally, we implement an algorithm for detecting substructural similarities between proteins and report results on three pairs of proteins.

Structural Alignment (Related Work)
There are three major structural alignment tools in use today: Dali, Vast, and CE.All three are accessible off the PDB webpage.Since the appearance of these methods in the late 1990s, a host of other methods have appeared, which generally compare themselves to these three.One we have found useful is 3dSearch.We review these four methods here briefly.
Dali [15,17,16] aligns protein substructures using distance matrices.Distances are invariant to rigid body transformations, thereby avoiding the need for spatial alignment.Dali considers distances between alpha carbons; the distance matrices are indexed in residue order.Substructures that appear in similar relative spatial locations in the two proteins give rise to similar patterns between blocks of the distance matrices.Dali uses a clever Monte Carlo method to detect these patterns.It begins with small hexapeptides then repeatedly merges similarly related protein fragments into larger common substructures.One important aspect of Dali is an elastic similarity score; the significance of errors in distance alignments decreases with increasing distance.Consequently, substructures separated by larger distances can tolerate greater relative global motion, while residues nearer to each other must better preserve local shape.Dali is probably the gold standard for protein structure comparisons.Its main disadvantage is its relatively ad-hoc Monte Carlo structure and complexity.
CE [35,36] searches for protein fragments in one protein that are locally similar to protein fragments in another protein.It then extends these local alignments by a sequential scan down the protein backbones.This scan is reminiscent of dynamic programming in sequence alignment, but CE actually employs a clever greedy algorithm.CE uses distances between alpha carbons and rigid body superposition to define similarity and to guide the extension scan.A limitation of CE is its requirement that matching substructures occur in sequential backbone order.
Vast [12,13] and 3dSearch [37,38] focus on elements of secondary structure to align proteins.Both methods begin with building blocks that are pairs of secondary structure elements, one pair in each protein.Vast matches pairs of secondary structural elements that have a similar type, relative orientation, and connectivity, then builds larger structures by considering substructure similarities that are statistically surprising.This probabilistic similarity function is both a strong advantage and a potential limitation of Vast; a class of "similar" structures is significant, but not necessarily easily circumscribed.
3dSearch first finds a pair of secondary structure vectors in one protein that best matches a pair of vectors in the other protein.This initial alignment seeds a loop whose basic step is a dynamic programming algorithm for aligning the two proteins' secondary structure vectors with each other.Atom-level alignment occurs subsequently.3dSearch is limited by its initial best-pairing decision.

Writhing and Knot Invariants
One of the difficulties with the previous alignment methods is the vagueness of their global similarity measures.Locally, these methods often measure similarity by the root-mean-square-deviation (RMSD) between aligned atoms, or some related variation.As Røgen and Fain [32] point out, RMSD of aligned atom coordinates is a wonderful measure of similarity for two shapes that are nearly identical.However, RMSD is a poor measure when the two shapes being compared differ significantly, particular when the two shapes contain some matching and some nonmatching subshapes.Existing alignment methods address this issue by seeding their routines with small matching subshapes, then repeatedly merging these into larger shapes.This process often succeeds well, but it is purely procedural.As a result, automatic classification of proteins remains brittle.
One possible alternative is to compare proteins using more general shape metrics, such as Hausdorff metrics [18].More appropriate for proteins may be invariants derived from knot theory.Røgen and Fain [32] offer such a new perspective.Given a protein, they compute 30 different curve invariants, thereby mapping the protein to a point in 30 .They argue that this 30-dimensional measure satisfies the triangle inequality, and thus is a good method for grouping protein shapes into similarity classes at multiple levels of granularity.They demonstrate this claim empirically by classifying 20,937 protein domains into multiple levels, achieving 96% agreement with the CATH2.4classification [27,26] (both SCOP and CATH are widely accepted protein classification databases, created by a combination of automatic and human judgments).
The primary invariant in [32] is the writhing number of a curve; the others are built from this.The writhing number of a curve essentially measures self-linking of a curve.It is connected to the linking number of two curves by the following famous Cȃlugȃreanu-Fuller-White formula [9,11,44]: The formula applies to a narrow closed orientable ribbon in three-dimensional space.Lk is the linking number of the two boundary curves of the ribbon, W r is the writhing number of the central spine, and T w is the twisting number of the two boundary curves.Lk is a purely topological number.Purely topological means that it is invariant to any smooth deformation that avoids self-intersections.The other two numbers are not topological invariants; they depend on the embedding of the ribbon.However, they are invariant to a large class of transformations, such as rigid body motions, even conformal (angle-preserving) mappings.
To gain intuition, suppose we orient the ribbon by orienting its spine.For proteins, the backbone plays the role of the spine, and it is naturally oriented by residue order.Now imagine projecting the ribbon onto a 2D plane orthogonal to a randomly chosen direction.The curves defining the ribbon will seem to cross each other at some locations in the plane of projection.At each crossing we keep track of which curve lies above the other, then assign the crossing a number, called , with value −1 or +1.The value depends on the orientation of the two curve segments at the crossing as well as on the over-under relationship.Specifically, imagine rotating the top curve locally so that its forward tangent at the crossing is parallel to the forward tangent of the bottom curve.Then is the sign of the smallest angle required.See Figure 1.Keeping this geometric picture in mind, the linking number Lk counts the sum of signed crossings between the ribbon's two boundary curves, divided by two.This sum is independent of projection direction.The writhing number W r counts the sum of signed self-crossings of the ribbon's spine, now averaged over all projection directions.Finally, the twist T w is a torsion-dependent term that measures how much one boundary curves intertwines with the other.We will not have any need for it, and will not discuss it further.Instead, our focus will be on matching subsegments of proteins by comparing writhings.
The intuition that linking and writhing numbers count crossings is very important from the perspective of knot theory.From our perspective, we need an additional, related, piece of intuition.Consider two closed curves in space (see also Figure 2 and imagine that each of the edges is tangent to a curve).Place a finger on each curve and consider the unit direction vector pointing from one fingertip to the other.This is a point on the unit sphere.Sum up the signed area covered on the sphere for all possible finger placements, with sign given locally by the crossing number as a function of finger placements.With some effort one sees that the net area covered is the linking number Lk of the two curves.Amazingly, this number turns out always to be an integer.If the two curves are not linked, the net area covered will be zero, if the two curves are linked once (such as two magician's rings), the sphere will be covered once, and so forth.Intuitively, for proteins, the extent to which the sphere is covered locally will provide us with a measure of the relative location and orientation of pairs of peptide segments.
Linking: Formally, suppose c 1 and c 2 are two closed non-intersecting curves in three space, specifically disjoint embeddings of S 1 into 3 .Let G be the Gauss map applied to the difference between the curves, i.e., the function G : Then the linking number of the two curves can be written in terms of the Gauss integral: Here ω is the differential 2-form measuring area on S 2 and G * ω is its pullback by G to S 1 × S 1 .
Writhing: It turns out that the writhing number of a curve has the same algebraic form; if c : S 1 → 3 is a closed curve in space, then its writhing number is simply W r(c) = Lk(c, c).
In other words, "writhing is self-linking".Of course, in this case the function G is not welldefined on the diagonal (when t = s).A priori the integral Lk(c, c) need not exist.Dealing with this issue leads to the twist T w mentioned earlier [24].The writhing number is almost never an integer.Protein Fragments: The definitions continue to make sense for open curves, i.e., 3D embeddings of intervals rather than circles.In particular, we will find the component writhing numbers Lk(c 1 , c 2 ) of short protein backbone fragments, c 1 and c 2 , to be useful shape indicators.

Writhing of Polygonal Curves
We will represent protein backbones as open polygonal curves1 , connecting sequential residues via their CA atoms (alpha carbons). 2For a very nice exposition on writhing numbers of polygonal curves see [1].That paper developed a clever O(n 1.6 ) algorithm and a sweepline algorithm for computing the writhing of a polygonal curve, then applied the second algorithm to various proteins.Considerable work has used knot theory to understand the supercoiling and knotting behaviors observed in DNA, another polygonal curve (see [33] for a sample).Also, see [19,29] for some very interesting applications of robot motion planning to polygonal knot theory.Polygonal curves simplify calculation of Equation (1).The integral becomes a finite sum: where A ij is the -signed area on the sphere covered by vectors pointing from edge e i on the first curve to edge e j on the second curve.
Definition 1: We will refer to A ij as the edge-edge writhing of the two edges e i and e j .
Computing A ij is straightforward.Figure 2 illustrates the process.Algebraically, suppose the start and end points of the oriented edge e i are p 1 and p 2 , and suppose the start and end points of edge e j are q 1 and q 2 .Consider the four extremal cross directions between the two edges: For skew edges e i and e j , the four directions d 1 , d 2 , d 3 , d 4 define the vertices of a parallelogram P ij in three-dimensional space whose supporting plane does not intersect the origin.Projecting the parallelogram onto the unit sphere creates a spherical parallelogram.Its vertices are the unit direction vectors obtained from d 1 , d 2 , d 3 , d 4 , its edges are arcs of great circles connecting these vertices, and its absolute area multiplied by the crossing number of the two edges is the desired signed area A ij .Computing the area of a spherical quadrilateral is also straightforward; one simply sums the interior angles of the quadrilateral and subtracts 2π.Observe that A ij = A ji .

Arrangements of Lines
Closely related to knot theory and potentially to protein structure similarity is the theory of line arrangements.For an introduction to this area see [42,43].Two arrangements of skew lines in three-dimensional space are said to be topologically equivalent if there is an isotopy that transforms one arrangement into the other.By an isotopy one means motions of the lines in which the lines remain skew.There is exactly 1 topological equivalence class consisting of 2 skew lines, 2 classes of 3-lines, 3 classes of 4-lines, 7 classes of 5-lines, 19 classes of 6-lines, and 74 classes of 7-lines.The classification of general collections of skew lines is an open research question.One approach is to transform line arrangements into elements of braid groups, construct the links induced by the braids, and apply methods from knot theory [28].
The potential application to protein structure comparison arises in three contexts.First, structural alignment programs often represent proteins by their secondary structure vectors [12,37,40].Classifying such vector arrangements might provide simple invariants by which to label protein folds.Second, the peptide plane bond vectors (such as N-CA, N-H, and N-C(O)) fully determine a protein's shape.Again, a classification of the possible arrangements of these vectors might provide simple means for recognizing the shapes of unknown proteins.For instance, the orientations of these vectors relative to a global axis can be discerned using NMR [41,2,20,10].This may provide an efficient method for distinguishing proteins experimentally.Third, the techniques from line classifications may carry over to more general structures.The key idea is to consider the space of transformations that preserve certain topological properties, such as non-intersection, then to discover invariants that distinguish the connected components of this space of transformations.

Polygonal Curve Homotopies and the Structure Problem
As suggested by the SCOP definition, detecting protein similarity entails finding collections of paired substructures which are located roughly in the same relative locations in space.
Let us make this idea more precise.Recall that a polygonal curve is a piecewise linear embedding of the unit interval I into 3D space, c : I → 3 .In particular, the curve is not selfintersecting.We can represent the curve as a sequence of representative points {p 1 , . . ., p n }, namely the endpoints of the linear segments.In our case the points are the coordinates of a protein's alpha carbons.Any consecutive subsequence of a polygonal curve's representative points also defines a polygonal curve.
Definition 2: Suppose that p = {p 1 , . . ., p n } and q = {q 1 , . . ., q m } are two polygonal curves.Suppose that E is a Euclidean rigid body motion on 3 (a rotation and translation).Let δ > 0 be some positive number.We will say that curve p is (E, δ)-homotopic to curve q if the following two conditions are satisfied: (ii) There is a polygonal-curve homotopy h mapping E(p) to q such that no representative point moves further than δ from its initial or final location.More precisely, we require a continuous function h : (b) h i (1) = q i , for all i = 1, . . ., n.
(c) The sequence {h 1 (t), . . ., h n (t)} is a polygonal curve for all t, meaning that the points h 1 (t), . . ., h n (t) define a curve that is not self-intersecting for all times t ∈ I.
We will presently use this definition to compare subsegments of curves.The motivating intuition is to regard two proteins as structurally similar if there is some rigid body transformation that places one on top of the other well enough that δ-perturbations of local coordinates permit atom alignment.The homotopy requirement is phrased in a way that mimics the definition of isotopic lines we saw in Section 3.3, that is, via classes of motions that preserve structure.Thus, for instance, two helices might match if and only if one can be transformed into the other without backbone selfcollisions.Observe that the transformation could be quite large, depending on δ, but at all times preserves the backbone topology.(We note in passing a generalization: it might be interesting to restrict the class of homotopies further by requiring that the polygonal curve h(t) not intersect the rest of the protein at any time t.) For large n and medium-sized δ, condition (ii) can be complicated to check.It basically entails solving a high-degree-of-freedom motion planning problem.Fortunately, for many short protein fragments and small δ, the condition is similar to enforcing low RMSDs of the final alignments.The definition therefore addresses a wide tunable range of possible structural similarity questions.
Thus d(p, q) = ∞ iff p and q are not homotopic for any (E, δ), e.g., if the number of points differs.
Theorem 1: d is a metric and d is effectively computable.Proof: See Appendix B.1.

Monotonic Curve Homotopies
Given a point p i and a line in 3D space we can project the point orthogonally onto the line.We can do the same for all representative points of some polygonal curve.The curve is said to be monotonic with respect to line if the order of the projected points is the same as the order of the points in the curve.This order orients the line.Short protein segments, such as β-strands and α-helices, are often monotonic with respect to their best-approximating lines.
Lemma 1: Suppose p = p 1 , . . ., p n is a polygonal curve monotonic with respect to line .Let π = π 1 , . . ., π n be the polygonal curve obtained by projecting p onto .Then Proof: Imagine drawing a line between p i and π i for each i.Define a homotopy that moves each p i to π i along these lines.The homotopy preserves the polygonal curve since the curve is monotonic.
Lemma 2: Suppose p and q are two polygonal curves with equal numbers of points, each monotonic with respect to some line.Let π = π 1 , . . ., π n and σ = σ 1 , . . ., σ n be the projections of the two curves onto their respective lines.Then d(p, q) ≤ d(p, π) + d(q, σ) , where E is taken from the set of rigid body motions that align the two oriented lines.
See Appendix B.2 for a proof.The bound is often generous.The lemma tells us that two monotonic curves whose line-projections are similar in 1D are also readily homotopic in 3D.
For polygonal curves with equal numbers of points, d measures the spatial difficulty of transforming one curve into the other.It provides no such information for curves with different numbers of points.Instead, we now define structural similarity as the detection of local homotopies.We need one piece of additional notation.Suppose p = p 1 , . . ., p n is a polygonal curve; let us define p k i as the polygonal subcurve p i−k , . . ., p i , . . ., p i+k whenever k+1 ≤ i ≤ n−k.In other words, p k i is the curve segment centered at p i , extending backwards and forwards by k points.Definition 4: Suppose that p = {p 1 , . . ., p n } and q = {q 1 , . . ., q m } are two polygonal curves.Let δ > 0 be a positive number, k a nonnegative integer, and I some set of index pairs {(i, j)}.We say that p is δ-structurally similar with k-strength alignment I if there exists some rigid body transformation E such that d E (p k i , q k j ) ≤ δ for all pairs (i, j) ∈ I.
In English, this definition requires one curve to move rigidly over the other curve such that two paired collections of subcurves are nearly identical to each other, as measured by subsequent homotopy deformations.For k = 0, this definition is similar to aligning pointsets.For large k, the definition amounts to detecting overall curve similarity.In between, the definition captures the notion of structural alignment with rearrangements.In particular, the order of indices in the index set I need not be sequential.This leads to the following: Structure Problem: For given curves p and q, for δ positive and k a nonnegative integer, compute all index sets I and their associated rigid body transformations E satisfying Definition 4.

Theorem 2:
The Structure Problem is effectively computable.Proof: Follows from Theorem 1.
Although computable, the algorithm derived from our proof of Theorem 1 is horrendously exponential [6,7,21,34].One possibility is to use a motion planner specialized for knots, such as the untangling planner of [19].Alternatively, for our purposes, Lemmas 1 and 2 suggest a simplification: In the next two sections we will examine an approach based on edge-edge writhings that attacks the Structure Problem by aligning line projections of peptide segments.

Understanding Protein Similarity by Geometric Convolution
In this section we examine more closely the construction of Figure 2. Our observations will motivate us to define a self-convolution datastructure for detecting structural similarity in proteins.

Writhing and Convolution
Definition 5: Suppose X and Y are two sets of points in R 3 .Then the geometric convolution of Y with X is the set of points Y X = {y − x | x ∈ X and y ∈ Y }. (Sometimes this is defined by saying that the geometric convolution of Y with X is the Minkowski sum of Y and −X.There are again strong connections to robot motion planning [22,23]).Lemma 3: Assume e i , e j , and P ij are as defined at the end of Section 3.2.Then P ij = e j e i .
Proof: Definitional: P ij is the set of all vectors pointing from a point on e i to a point on e j .

Corollary 1:
The edge-edge writhing A ij of two edges e i and e j is the area of the convolution e j e i projected onto the sphere S 2 times the crossing number of the edges' supporting lines.

Corollary 2:
Suppose edges e i and e j are given.The following four possibilities exist: (a) The edges are skew.In this case P ij is a 2D polygon whose plane of support does not include the origin.The writhing A ij is therefore well-defined and nonzero.

(b)
The edges are coplanar but not parallel.In this case P ij is again a 2D polygon, but now its plane of support does include the origin.The polygon P ij may or may not touch the origin.P ij \{0} projects to a great-circle arc on the sphere, and the writhing A ij is therefore zero.
(c) The edges are parallel but not colinear.In this case the polygon P ij degenerates to colinear line segments lying on a line that does not pass through the origin.The writhing A ij is zero.

(d)
The edges are colinear.In this case the polygon P ij degenerates to colinear line segments lying on a line that passes through the origin.The polygon P ij may or may not touch the origin.It projects to one or two points on the sphere and the writhing A ij is again zero.

Corollary 3:
The edges e i and e j intersect if and only if polygon P ij touches the origin.
Corollary 3 tells us that we can count edge incidence by counting polygons touching the origin.Suitably generalized, that hints at a method for determining structural similarity.

Self-Convolution
Earlier we observed that many successful structural alignment programs compare arrangements of pairs of lines.We now extend that idea to writhing polygons.In reading Lemma 4 imagine that we are comparing a pair of peptide segments in one protein with another pair in another protein.
Lemma 4: Consider four edges: e 1 , e 2 , f 1 , f 2 .There is a rigid body transformation E mapping the edges (e 1 , e 2 ) to the edges (f 1 , f 2 ) if and only if there is a rotation R about the origin such that R(e 2 e 1 ) = f 2 f 1 while preserving vertex correspondence.Proof: See Appendix B.3.

Corollary 4:
If R is a rotation such that the maximum distance between corresponding vertices of the two polygons R(e 2 e 1 ) and f 2 f 1 is δ, then there is a rigid body transformation E such that e 1 and e 2 are (E, δ)-homotopic to f 1 and f 2 , respectively.Proof: See Appendix B.4.
When Corollary 4 applies we say that the polygons are δ-homotopic.Definition 6: If p is a polygonal curve, we define the geometric self-convolution of p, written ⊗(p), to be the generating polygons of p p: , with e i and e j edges in the curve p }.
Given two curves p and q, we will seek structural similarity by comparing the curves' self-convolutions.Lemma 4 suggests that we mod out by rotations and translations, and focus instead on comparing the configurations of the polygons {P ij }.Corollary 4 relates configuration similarity to homotopy distance.A writhing polygon has six configuration parameters: the two edge lengths, the angle between the edges, the distance from the origin, and two orientation parameters describing the polygon normal.We have found it useful to cluster using two characteristics: edge-edge writhing and distance from the origin.Writhing provides a mixed measure of all six degrees of configuration freedom; retaining distance mitigates the roughly inverse-square effect of distance on writhing.Similarity is easily checked, using for instance a best-aligning rotation in Corollary 4.

Structure Detection
We now combine the homotopy and self-convolution ideas to implement an algorithm for detecting common protein structure.There is one additional wrinkle, needed to deal with the segment length parameter k in Definition 4. When constructing the self-convolution ⊗(p), we replace the polygon P ij with a polygon formed from the best-line projections of the peptide segments p k i and p k j , as motivated by Lemmas 1 and 2. Denote this polygon by For the writhing number we use the true writhing of the two peptide segments, that is, Algorithm: Given polygonal curves p and q, distance δ > 0, and integer k ≥ 1, detect structural similarity as follows: 1. Compute ⊗ k (p) and ⊗ k (q).
2. Hash the polygons {P k ij (p)} and {P k ij (q)} based on w k ij and d ij , ignoring near zeros.
3. For each nonempty (or sufficiently full) hash bucket B wd of polygons do the following: • For each pair of δ-homotopic polygons P ∈ ⊗ k (p) and Q ∈ ⊗ k (q) in B wd , compute the rigid map E implied by Corollary 4. Hash the rigid map with its generating polygons.
The generating polygons and rigid maps associated with a hash bucket in Step 3 • offer an approximate solution (I, E) to the Structure Problem.The entire hash table describes all nontrivial alignments at the given hash table resolutions.We ignore polygons with near zero writhing or distance to avoid degeneracies.The solutions are approximate in the sense that the polygons P k ij are based on best-approximating edges and the maps E are clustered, potentially dilating δ.

Analysis
The Algorithm runs in time O(k 2 n 2 +k 2 m 2 +s 2 / 2 P +1/ 6 E ) and space O(n 2 +m 2 +1/ 2 B +1/ 6 E ) where n and m are the number of points in p and q, k is the half-length of a peptide segment, s is the maximum number of pairwise similar polygons appearing in a polygon hash bucket, and P and E are the resolutions of the polygon and rigid body hash tables, respectively.
In practice, k and E are constants.We took k = 5 and E = 0.1.1/ 6 E is the size of the hash table for Euclidean transformations.We represented each transformation as a 4D quaternion and a 3D translation, projected the quaternion into 3D, then hashed the resulting 6 numbers.Although s could be Θ(n 2 ), it depends on P .Choosing this carefully, the ratio s/ P becomes O(n).Thus, the algorithm has O(n 2 ) behavior, with n the maximum protein length.The hash tables could be replaced by k-D trees, Voronoi diagrams, or other clustering methods [30], but we did not do so.

Results
We implemented the algorithm in Lisp on a 1GHz PC running Windows.Running times for proteins with 300 residues were typically 1 minute or less, half of that garbage collection.
Here are three interesting pairs of proteins (see Appendix A for larger figures): 5at1 A vs. 8atc A: These are two different conformations of the catalytic chain A in Aspartate Carbamoyltransferase (ATC), a famous allosteric protein involved in the synthesis of pyrimidine nucleotides [39].
Chain A has two domains, that rotate with respect to each other as part of the process.Two loops change conformation drastically.Our algorithm detects both the similarities and the differences.The rigid map with the greatest number of aligned segments lies within 2 • in rotation and 0.6 Å in translation of the correct alignment.Our subsequent atom-alignment code assigns 289 of the 310 residues with RMSD 1.0 Å; the remaining residues constitute the two non-alignable loops.See figure to left (large view in Figure 3).3adk vs. 1gky: These two proteins have mere 19% sequence identity, are different lengths (194 vs. 186 residues), and include both matching and nonmatching secondary structures.Our code finds the alignment shown in the figure to the left (large view in Figure 4).The rigid map lies within 5 • and 0.5 Å of the CE-alignment.Our subsequent atom-alignment assigns 165 atoms with RMSD 2.9 Å, closely matching CE.
1xis vs. 1nar: These are two TIM barrels, with 7% sequence identity.We considered the 321 residues of 1xis without its tail versus the 289 residues of 1nar.
There are several possible alignments, related by rotation around the central barrel (Figure 5 shows two).This pair of proteins is interesting because even in optimal alignment there are significant angular differences between aligned helices.Such comparisons motivated our homotopy definitions.Our code finds an alignment with RMSD 3.3 Å, differing by 14 • and 1.5 Å from the optimal Dali-alignment.See figure to left (large view in Figure 6).

Summary
This paper introduced the notion of homotopy deformations into structural alignment.The paper explored the relationship between writhing and self-convolution.Self-convolution is a compact way of registering edge-edge interactions, and extends naturally to interactions of curve segments.Writhing and separation are useful shape descriptors for clustering pairs of curve segments.The paper presented an algorithm for matching substructures by clustering similar segment pairs, then clustering among the induced rigid maps.and homotopies e, f establishing that p is (E, d(p, r) + /2)-homotopic to r and that r is (F, d(r, q) + /2)-homotopic to q.Now let H = F • E and define h : I → ( 3 ) n by the rules h(t) = F (e(2t)) for t ∈ [0, 1  2 ] and h

Appendix A: Large Alignment Figures
. h is continuous since e, f , and F are continuous and since h( 12 )≡F (e(1)) = F (r) = f (0)≡h( 12 ).Let's look at the conditions (ii) of Definition 2 with data H and h: )) or f (2t − 1), both of which define polygonal curves.
(d) First, suppose that t ∈ [0, 1  2 ].In that case: Thus h establishes that p is (H, δ)-homotopic to q with δ = d(p, r) + d(r, q) + .Since is arbitrary this shows that d(p, q) ≤ d(p, r) + d(r, q).Part 2 (computability): The relevant decision question is: If p and q are polygonal curves and s is a rational number, is d(p, q) < s ?This question can be formulated as a sentence in the first order theory of the reals, hence is decidable.Here is a sketch of the proof: We can assume again that the two curves each have n points.Suppose that E ∈ SE (3) and δ ≥ 0 are given.We then have a robot motion planning for n point robots moving in three dimensions.The start configuration for robot #i is E(p i ), the goal configuration is q i .Robot #i is constrained to move within the intersection of two balls of radius δ, one centered at its start, the other at its goal.Moreover, edges drawn between two different pairs of successively indexed points may not move so as to intersect.More precisely, let h i (t) be the location of robot #i at time t, and let e i (t) be the line segment [h i (t), h i+1 (t)], for i = 1, . . ., n− 1.Then for all times t, we require that e i (t) e j (t) = ∅ if 1 ≤ i ≤ j − 2 ≤ n − 3 and e i (t) e i+1 (t) = {h i+1 (t)} if 1 ≤ i ≤ n − 2. This problem is effectively decidable as a question within the first order theory of the reals, in fact it lies within Pspace.See, for instance, [6,7,21,14].We thus have a procedure for deciding whether p is (E, δ)-homotopic to q.Let P (p, q, E, δ) be the corresponding predicate, then formulate the following sentence: ∃δ, ∃E : (0 ≤ δ) ∧ (δ < s) ∧ P (p, q, E, δ) For suitable parameterization of SE (3), this sentence is a rewording of the original decision problem as a problem within the first order theory of the reals and thus is decidable.If we want, we can further quantify over s to isolate the value d(p, q) as accurately as we desire.

Appendix B.4 [Corollary 4, p. 10]
Corollary 4: If R is a rotation such that the maximum distance between corresponding vertices of the two polygons R(e 2 e 1 ) and f 2 f 1 is δ, then there is a rigid body transformation E such that e 1 and e 2 are (E, δ)-homotopic to f 1 and f 2 , respectively.
Proof: We will use similar notation, algebra, and techniques as in the proof of Lemma 4.

Figure 1 :
Figure 1: The two types of crossings and their crossing numbers.

Figure 2 :
Figure 2: Two edges, e i and e j , generate a parallelogram P ij of directions, with vertices d 1 , d 2 , d 3 , d 4 , whose -signed spherically projected area A ij is their edge-edge writhing.(In this figure the edges are configured so that = +1.)

Definition 3 :
Define functions d E and d on pairs of polygonal curves as follows:

Figure 3 :
Figure 3: Alignment of 5at1 A (blue) and 8ATC A (red) found by our writhing-convolutionbased Algorithm.The backbones match nearly perfectly, except where they shouldn't, namely two loops that undergo significant conformational change.These loops appear near the top left and top right in the figure.

Figure 4 :
Figure 4: Alignment of 3ADK (blue) and 1GKY (red).The proteins have mere 19% sequence identity and include both matching and nonmatching secondary structures.Roughly 80% of the two proteins should align.One can see this in the figure, with the left parts matching well and some of the right clearly not.
Denote the resulting combinatorial structure consisting of all {(P k ij (p), w k ij (p), d ij (p))} by the symbol ⊗ k (p).