Predicting Protein Folding Kinetics via Temporal Logic Model Checking

We present a novel approach for predicting protein folding kinetics using techniques from the ﬁeld of model checking . This represents the ﬁrst time model checking has been applied to a problem in the ﬁeld of structural biology. The protein’s energy landscape is encoded symbolically using Binary decision diagrams and related data structures. Questions regarding the kinetics of folding are encoded as formulas in the temporal logic CTL. Model checking algorithms are then used to make quantitative predictions about the kinetics of folding. We show that our approach scales to state spaces as large as 10 23 when using exact algorithms for model checking. This is at least 14 orders of magnitude larger than the number of conﬁgurations considered by comparable techniques. Furthermore, our approach scales to state spaces at least as large as 10 32 unique conﬁgurations when using approximation algorithms for model checking. We tested our method on 19 test proteins. The quantitative predictions regarding folding rates for these test proteins are in good agreement with experimentally measured values, achieving a correlation coefﬁcient of 0.87.


Introduction
In the world of proteins, form usually follows function.Consequently, proteins are often studied in terms of their atomic-resolution structures.A detailed analysis of an enzyme's active site, for example, may reveal the mechanism by which it catalyzes a given reaction.Protein structures are not static, however, and conformational changes often play important functional roles.Moreover, large-scale conformational changes are also associated with a number of diseases, most notably the prion-related diseases.For these reasons, and others, it is interesting to study how a given protein moves between conformations.Such examinations may provide valuable insights into basic biology and pathology, as well as to the design of therapeutic or preventative interventions for certain classes of disease.
In this paper, we focus on what is typically the largest conformational change a protein will exhibit -folding.By folding we refer to the act of moving from a completely denatured form to the so-called native configuration.Unfortunately, there is no experimental technology that can provide atomic-resolution detail into the entire process of folding (or any other large-scale conformational change, for that matter).For this reason, computational methods are used to study large-scale conformational changes, including folding.Our work builds on prior research on the protein unfolding problem.In contrast to the well-known protein folding problem, the unfolding problem assumes that the native structure is already known.The computational challenge is to find low energy pathways between the unfolded and folded states.More specifically, we consider the Gō theory of (un)folding [12] wherein the folding process is driven by the formation of native contacts between residues (i.e, those present in the native structure).Non-native interactions deemed negligible, and are therefore ignored.Obviously, this is a highly simplified theory of folding.Nevertheless, this theory has been shown capable of making accurate quantitative predictions regarding the kinetics of folding (e.g., [1,8,11,16]).
Like previous algorithms for Gō-like theories, our algorithms operate on finite-state models of the protein's energy landscape.The primary contribution of our work lies in the observation that finite-state models of folding can be formally analyzed using techniques from the field of model checking [10].Model checking refers to a family of algorithms for automatically verifying dynamic properties of concurrent reactive processes.Historically, model checking has been used to verify the correctness and safety of circuit designs, communications protocols, device drivers, and other classes of software.More recently, model checking algorithms have been introduced for analyzing the properties of stochastic systems.Such model checking algorithms for stochastic systems have been used in the field of systems biology to verify properties of biochemical and regulatory networks (e.g., [15]).To our knowledge, however, model checking has not been applied to any problem within the field of structural biology.This paper is the first to do so.
There are three primary advantages of a model-checking approach to studying protein folding pathways: First, model checking algorithms compute over symbolic representations of finite state models, not explicit representations.The computational complexity of model checking algorithms is polynomial in the size of the encoding of the finite-state model.Thus, if a given finite-state model can be compressed, extremely large state spaces can be considered.Unfortunately, finding a minimal encoding for an arbitrary finite-state model is NP-hard.However, good heuristics for finding compact encodings exist.For example, model checking algorithms have been able to verify properties of systems having more than 10 20 states since 1990 [7], and have been applied to systems with as many as 10 120 states [5,6].In this paper, we show that using exact algorithms for model checking, energy landscapes with as many as 10 23 states are tractable.This is at least 14 orders of magnitude larger than has been attempted by comparable algorithms for studying protein folding pathways.We also show that energy landscapes with at least 10 32 states are tractable when using approximation algorithms for model checking.Second, model checking relies on formulas in a temporal logic to express precise queries about the behavior of the finite-state model.Temporal logics are very expressive and can be used to ask many questions of interest to protein folding.Third, model checking algorithms are exact; they are not simply a means for sampling or simulating the behavior of a system.There are, however, finite-state models that are too large for traditional model checking algorithms.For these, we use an algorithm for performing approximate model checking [19] which provides a guarantee on the quality of the computed result.
The organization of this paper is as follows: In Section 2, we define our model of protein folding.In Section 3, we briefly discuss model checking, and demonstrate how to encode the protein folding problem in a form suitable for model checking.In Section 4, we report the results of applying our method to 19 proteins and show that our quantitative predictions of folding rates are well-correlated with experimental values.We conclude with a discussion of ongoing work in applying model checking to the study of protein folding pathways.

A Simplified Model of Protein (Un)Folding
In this section we describe our model of protein folding; it is identical to that used in [16] and very similar to those reported elsewhere [1,8,11].
The thermodynamics of folding is governed by the Gibbs free-energy: ∆G = ∆E − T ∆S.Here, E is the energy (in kcal mole −1 ) of inter-residue interactions (e.g., hydrogen bonds, hydrophobic interactions, etc), S is the configurational entropy (in kcal mole −1 K −1 ), and T is the absolute temperature (in Kelvin).Free energy is a balance between the stabilizing contributions of inter-residue interactions and the loss of configurational entropy as the protein folds.
Definitions: Let P = a 1 , a 2 , ..., a n be a protein with n amino acids (aka residues) and m atoms.Let C ⊂ R 3m be the set of possible configurations/embeddings of P such that each C i ∈ C is consistent with the laws of physics.Let C F ∈ C be the native configuration of P as determined by, say, X-ray crystallography.Following [16] we define a contact as two non-hydrogen atoms from two different residues that are within 4 Å of each other 1 .Contacts between residues (i, i ± 1) and (i, i ± 2) are ignored because they tend to be present in every configuration of P .A contact map, M, is an n × n matrix where element M(i, j) is the number of contacts between residues i and j.We define a separate contact strength map, M S , that is the same size as M but whose elements are obtained by mapping the elements of M as follows: 1-5 contacts → 1; 6-10 contacts → 2; 11-15 contacts → 3; 16-20 contacts → 4. Intuitively, M S classifies contacts as being weak, medium, strong, or very strong.
The model assumes that folding is driven by the formation of the native contacts, and that nonnative interactions are negligible.Therefore, the state space of the protein can be modeled using a binary string, B ∈ {0, 1} n .Here, B(i) is 0 if the ith residue is completely unfolded and 1 if it is folded.There is an entropic penalty for each 1 in B which must be compensated for by the stabilizing energies of the native contacts.In particular, if B(i) = B(j) = 1, then we assume that the contacts between residues i and j (if any) are formed, and that the energy of that interaction can be used to offset the entropic penalty.
Under this model, there are 2 n possible states.Let B U be the bit string containing all 0's, and let B F be the bit string of all 1's.B U and B F correspond to the unfolded and folded states, respectively.Every other bit string corresponds to a partially folded state.Each state can be mapped to its free energy as follows: where α is the strength of a single contact and β is the entropic penalty for folding a single residue2 .The Boltzmann factor (i.e., weight) for any given configuration is a function of its energy, the gas constant (R) and the temperature, T; it is given by: w(B) = exp (−G(B)/RT ).Since we are only interested in changes in free energy (i.e., ∆G), we arbitrarily set G(B U ) = 0.A protein's energy landscape is constructed by applying Eq. 1 to every possible configuration.In this paper, it can be thought of as an n-dimensional discrete function.Computationally, our task is to find a low-energy path (or a set of paths) between B U and B F in the energy landscape.Thus, we must define a set of allowable transitions.Under the model, state s can only transition to those states that are similar.In practice, this means that transition are only allowed between pairs of states whose bit vector representations have small Hamming distance.In this paper, we allow transitions between pairs of states with Hamming distance 1.A toy example of the model for a 3-residue protein is shown in Figure 1.

Kinetics
The reaction kinetics of folding are described in terms of an energy profile along a chosen reaction coordinate.A reaction coordinate is a projection of the energy landscape onto one of lowerdimension.The energy profile tracks the total energy for each position along the reaction coordinate.Given an appropriately chosen reaction coordinate, one can make quantitative predictions regarding the rate of folding from the energy profile.There are a number of potentially relevant reaction coordinates from which to chose when studying protein folding including radius of gyration, solvent accessible area, number of folded residues, and so forth.Following Muñoz and Eaton [16], we will use the number of folded residues (i.e., the number of 1's in B) as our reaction coordinate.
For each position 0 ≤ k ≤ n, there are n k binary strings, each with its own energy.Let

(-2)
Figure 1: A toy example of the protein folding model.This finite-state model corresponds to a 3-residue protein.The state variables and the energy (in parens) are placed inside each node.The state labeled 000 is the unfolded state; the state labeled 111 is the folded state.In our experiments, we considered proteins with between 16 and 107 residues.
The Boltzman-weighted total energy for each position k along the reaction coordinate is G k = −RT ln( b∈B k w(b)).The energy profile for FKBP-12 is shown in Figure 2. In theory, it is possible to construct the energy profile by explicitly enumerating all 2 n binary strings.In practice, it is common to sample from the set of possible configurations.The algorithms reported in [1,8,11,16], for example, operate on state space ranging in size from 10 4 to 10 9 configurations.In contrast, we seek to consider the entire space of binary strings by adopting symbolic techniques from the field of model checking.
We note that because the Boltzmann weight of a configuration is exponentially related to the negative energy of its configuration, we can compute an upper bound for each G k by considering only the smallest-energy configurations for each k.It is these low-energy configurations we identify via model checking.Specifically, we seek to find the energy of the lowest-energy configuration for each k. 3 We will denote the lowest energy as Gk .
Given the value of Gk for all 0 ≤ k ≤ n, there are a number of ways to predict folding rates.Under a transition-state theory, for example, the folding rate, k ∝ k 0 exp(−∆G ‡ /RT ) where k 0 is a constant and ∆G ‡ = argmax k Gk − G0 .In this paper, we use a more accurate way to predict the folding rate in terms of the rate of decay of the average number of folded residues starting from the folded state [16].

Path length
Figure 2: Energy profile for FKBP-12, as computed by our method.

Model Checking
The field of model checking was born from a need to formally verify the correctness of hardware designs.Since its inception in 1981, it has expanded to encompass a wide range of techniques for formally verifying finite-state transition systems, including those with stochastic behavior.Model checking algorithms are simultaneously theoretically very interesting and very useful in practice.Significantly, they have become the preferred method for formal verification in industrial settings over traditional verification methods like theorem proving, which often need guidance from an expert human user.A complete discussion of model checking theory and practice is beyond the scope of this paper.The interested reader is directed to [10] for a detailed treatment of the subject.

Modeling Concurrent Systems
Let AP be a set of atomic propositions.An atomic proposition, a, is a Boolean predicate referring to some property of the system.A Kripke strucutre, M , over AP is a tuple, M = (S, S 0 , R, L).
Here, S is a finite set of states, S 0 ⊆ S is the set of initial states, R ⊆ S × S is a total transition relation between states, and L : S → 2 AP is a labeling function that labels each state with the set of atomic propositions that are true in that state.Variations on the basic Kripke structure exist.For example, if the system is stochastic, then we replace the transition relation, R, with a stochastic transition matrix, T where element T (i, j) contains either a transition rates (for continuous-time Markov models) or a transition probability (for discrete-time Markov models).

Application to Energy Landscapes
The Kripke structure used in this paper closely follows the model of protein folding described in Section 2. The set of states, S, is isomorphic to the set of n-bit binary strings.The set of initial states, S 0 , corresponds to (B U ).The transition relation, R, allows transitions between pairs of states whose bit-vector representations have Hamming distance 1.
The labeling function, L, maps each state to an energy and works as follows: Recall that B k is the set of bit strings where k bits are 1 and n − k bits are 0. In this paper, our atomic propositions are generally of the form: "is the minimum energy of B ∈ B k = c?".An interesting property of proteins is that that the energies of folding are bounded to a relatively small, constant-size range.In particular, the difference between G(B U ) and G(B F ) is generally 1 to 10 kcal mol −1 .The energy barrier which separates the unfolded and folded states is also typically 10 kcal mol −1 or smaller at room temperature.Indeed, the energy barrier must be small, or else folding won't occur.Thus, the domain of possible energies is, in effect, bounded by a constant of around 20 kcal mol −1 .This range is not related to the size of the protein.The set of possible states, on the other hand, is exponential in the size of the protein.Due to the discrete nature of our energy function and the fixed precision of the parameters α and β in Eq. 1, we can then apply the pigeonhole principle and conclude that the number of unique energy values is also constant.This will ultimately lead to a very efficient representation of the labeling function, as discussed in the next section.
In summary, assuming a Gō-like model of folding, we have shown that a protein's energy landscape can be encoded as a Kripke structure.In the model checking literature, Kripke structures are not represented explicitly, but rather symbolically.In the next section we discuss techniques for representing Kripke structures symbolically.

Symbolic Encodings of Kripke Structures
The basis for symbolic encodings of Kripke structures, which ultimately facilitated industrial applications of model checking, is the reduced ordered Binary Decision Diagrams (BDDs) introduced by Bryant [4].BDDs are directed acyclic graphs that symbolically and compactly represent binary functions, f : {0, 1} n → {0, 1}.While the idea of using decision trees to represent boolean formulae arose directly from Shannon's expansion for Boolean functions, two key extensions made to it were the use of a fixed variable ordering and the sharing of sub-graphs.The first extension made the data structure canonical while the second one allowed for compression in its storage.A third extension, also introduced in [4], is the development of an algorithm for applying Boolean operators to pairs of BDDs, as well as an algorithm for composing the BDD representations of pairs of functions.Briefly, if f and g are Boolean functions, the algorithms implementing operators APPLY(f ,g,op) and COMPOSE(f ,g) compute directly on the BDD representations of the functions in time proportional to O(|f ||g|), where |f | is the size of the BDD encoding f .BDDs can be generalized to Multi-terminal BDDs (MTBDD) [9], which encode discrete, real-valued functions of the form f : {0, 1} n → R. Significantly, MTBDDs can be used to encode real-valued vectors and matrices, and algorithms exist for performing matrix addition and multiplication over MTBDDs [9].These algorithms play an important role in several model checking algorithms for stochastic systems [3].

Application to Energy Landscapes
As previously mentioned, we can encode energy landscapes using Kripke structures.It follows, therefore, that energy landscapes can be encoded symbolically using a combination of BDDs and MTBDDs.In particular, the transition relation, R, and the labeling function, L, can be encoded using BDDs and MTBDDs, respectively.
In practice, the construction of the BDDs and MTBDDs is done automatically from a highlevel language describing the finite-state system and its behavior.Here, we use the specification formalism of reactive modules [2] as provided in the model checking tool PRISM [13].Briefly, each residue is modeled as a separate two-state process (i.e., folded or unfolded).Residues change state asynchronously, and only one residue is allowed to change at any given time (thereby enforcing the Hamming-distance rule).The set of possible states of the system corresponds exactly to the set of n-bit strings.The set of allowable transitions is ultimately encoded as a BDD and the labeling function is encoded as a MTBDD.

Temporal Logics
Temporal logic is a formalism for describing behaviors in finite-state systems.They have been used since 1977 to reason about the properties of concurrent programs [18].There are a number of different temporal logics from which to chose, and different logics have different expressive powers.In this paper, we use a small subset of the Computation Tree Logic (CTL).CTL formulae can express properties of computation trees.The root of a computation tree corresponds to the set of initial states (i.e., S 0 ) and the rest of the (infinite) tree corresponds to all possible paths from the root.A complete discussion of CTL and temporal logics is beyond the scope of this paper.The interested reader is directed to [10] for more information.
The syntax of CTL is given by the following minimal grammar: Here, a ∈ AP , is an atomic proposition (e.g., "does state s have energy c?"); "true" is a Boolean constant; ¬ and ∨ are the normal logical operators; E is the existential path quantifier (i.e., "there exists some path from the root in the computation tree"); and X and U are temporal operators corresponding to the notions of "in the next state" and "until", respectively.Given these, additional operators can be derived.For example, "false" can be derived from "¬true" and the universal quantifier, AXφ, can be defined as ¬EX¬φ.

Application to Protein Folding
Clearly, CTL formulas can express a rich set of properties concerning reachability (e.g., "will the protein end up in a particular configuration?") and/or the logical ordering of events (e.g., "will the second residue fold before the first one?").Numerous extensions to CTL exist which facilitate questions regarding explicit timings (e.g., "will the protein fold within t milliseconds?")or likelihoods (e.g., "what is the probability that the protein fold within t milliseconds?").In this paper, we only consider CTL formulas of the following form: let a kc ∈ AP be an atomic proposition that asks "does the state s have k folded residues and have energy c?", the CTL formula E[true U a] asks "Is there a path from S 0 to some other state, s ∈ S, such that s |= a?" To find the minimum energy state for fixed k, we can perform a binary search over different values of c. 4Recall, that we argued that the range of energies is bounded by a constant and that the number of unique energy values is also constant.Therefore, the cost of the binary search is O(1).

Model Checking Algorithms
Having defined a Kripke structure and the CTL formula, we can then use existing model checking algorithms for verifying the formula, given a symbolic encoding of the Kripke structure.A model checking algorithm takes a Kripke structure, M = (S, S 0 , R, L), and a temporal logic formula, φ, and finds the set of states in S that satisfy φ: {s ∈ S | M, s |= φ}.The complexity of model checking algorithms varies with the temporal logic and the operators used.For kinds of formulas used in this paper (i.e., E[φ 1 Uφ 2 ]), an explicit state model checking algorithm requires time linear in the size of the finite-state model and in the length of the formula ([10] p 35-36).
Of course, for very large state spaces, even linear time is unacceptable.Symbolic model checking algorithms operate directly on BDD/MTBDD encodings of the Kripke structure and CTL formula.Briefly, the temporal operators of CTL can be characterized in terms of fixpoints.Let P(S) be the powerset of S. A set S ⊆ S is a fixpoint of a function τ : P(S) → P(S) if τ (S ) = S .Symbolic model checking algorithms define an appropriate function, based on the formula, and then iteratively find the fixpoint of the function.This is done using set operations that operate directly on BDDs/MTBDDs.The fixpoint of the function corresponds exactly to {s ∈ S | M, s |= φ}.The interested reader is encouraged to read [10], ch.6 for more details.
Explicit-state and symbolic model checking algorithms are exact.There are also approximation algorithms for model checking algorithms (e.g., [19]), which employ sampling techniques and hypothesis testing.Such algorithms provide guarantees, in terms of the probability of the property being true, and can scale to much larger state spaces.These do not use BDDs/MTBDDs, but rather operate on the high-level language description of the finite-state model (see Sec. 3.2).We explored the use of both exact and approximate algorithms for model checking in our experiments.

Experiments and Results
We replicated the experiments of Muñoz and Eaton [16] who made predictions on 19 proteins 5The largest protein in that set, FKBP-12 (PDB id 1FKB), has 107 residues.Muñoz and Eaton consider state spaces in the range of size O(10 3 ) to O(10 9 ) states.In contrast, we have successfully performed exact model checking on state spaces of size 2 76 ≈ 10 23 using 2GB of memory on a single processor of a 4-node cluster.The time taken for these experiments is shown in Table 1.For proteins up to 74 residues, the longest runtime was under 30 minutes.Then, there is a jump to almost 7 hours for a 76-residue protein.The increase in time is due to thrashing of virtual memory.In general, The computation time is dominated by the time to construct the MTBDD.The actual cost of performing the model checking is under 90 seconds.Both load time and model checking time are correlated with the length of the protein for proteins up to 74 residues, with a correlations of 0.77 and 0.78, respectively, (p = .02).However, these are not monotonically related to length.No significant correlations between load times, model checking time and actual folding rates were observed.We also ran experiments with an approximation algorithm for model checking [19].These all completed in under 11 minutes.The time to perform approximate model checking is strongly correlated with protein length (R = 0.97, p 0.001).The largest state space we considered using the approximation algorithm has 2 107 ≈ 10 32 states.
Figure 2 shows one sample energy profile computed using model checking for the protein FKBP-12.Using the technique described in [16] for transforming the free-energy profile into a quantitative prediction of folding time, we predicted the folding times for each of the 19 proteins.The correlations between the logarithms of the predicted folding rates and the experimentally measured values [14] are shown in Figure 3.The correlation coefficient between predicted and experimental values is 0.87.By comparison, Muñoz and Eaton achieve correlation coefficients between 0.83 and 0.87 on the same proteins, depending on which approximation was used.Plaxco and co-workers developed a simple method for predicting folding rates based on contact order (a length-normalized average sequential distance between contacting residues) [17].Their correlation coefficient on 18 of the 19 proteins studied in this paper was 0.64.The mean absolute error of our predictions is 1.55.In comparison, the mean errors reported for two different techniques on a similar, but not identical, set of proteins in [8] was 2.77 and 3.42, respectively.

Conclusions and Future Work
We have presented an approach to predict the rate of folding using techniques from the field of model checking.We believe this paper represents the first application of model checking to a problem in structural biology.The key advantages of this approach are that it scales to extremely large state spaces and that it is exact.In terms of accuracy, our predictions of folding rate are well-correlated with experimentally determined values.However, it remains to be seen whether such levels of accuracy can be obtained when analyzing significantly larger proteins.
There are numerous extensions to this work that we intend to pursue.First, we have only begun to explore the kinds of queries that can be encoded in temporal logics.Second, a more thorough analysis of the relationship between the answers obtained via exact and approximate model checking is necessary.Finally, our model does not actually include any stochastic behavior.We have developed stochastic variants of our model of folding and we intend on applying model checking algorithms for stochastic systems to these.A comparison between the stochastic and non-stochastic techniques is planned.

MTBDD
be the set of bit strings with k 1's and n − k 0's.

Figure 3 :
Figure 3: Scatter plot of log predicted (y-axis) and actual (x-axis) folding rates.The correlation coefficient is 0.87, p 0.001

Table 1 :
Build MC Time Approximate MC PDB Id Residues Time (seconds) (seconds) Performance Statistics.MC = model checking.Column 3 indicates whether exact or approximate model checking was used.MTBDD build times are only relevant to exact MC because approximate MC does not use MTBDDs.The approximation error bound was set to 1% of the energy for these experiments.