Towards an accurate prediction of the thermal stability of homologous proteins

Protein thermostability has been the focus of growing research interests in the last decades since its understanding and control play important roles in the optimization of a wide series of bioprocesses of academic and industrial importance. The complexity of this issue is rooted in the fact that the mechanisms ensuring thermal resistance are not unique and specific, but rather family- or even protein-dependent. Therefore, and despite the amount of research already accomplished, obtaining fast and precise thermal stability predictions is still a challenge, especially on a large scale. This article deepens the study of protein thermal stability and is focused on the prediction of its best descriptor, the melting temperature Tm. The relations between Tm and a series of factors that are expected to influence the protein stability are analyzed and discussed. Different Tm-prediction methods that utilize these factors, sometimes with additional information about homologous proteins, are introduced, and their individual performances are evaluated. The best methods are based on temperature-dependent statistical potentials, on the environmental temperature of the host organism, on the fraction of charged residues, and on the number of residues. They are combined to build an improved prediction method with significantly increased score. The root mean square deviation between the computed and experimental Tm-values for 45 proteins of known structure from 11 families is about 7°C in cross-validation and decreases to 5°C when 10% outliers are removed. The associated linear correlation coefficients are equal to .91 and .95, respectively.


Introduction
The understanding of the rules that govern the thermostability remains one of the important issues of protein science, whose achievement would be extremely valuable for many applications. In particular, if one were able to rationally design enzymes of biotechnological interest that remain stable and active at temperatures that are higher or lower than their physiological temperatures, the bioprocesses in which they are involved could be optimized and made more efficient. From a theoretical point of view, it is furthermore interesting to understand at the protein level, the complex mechanisms that allow the extremophile organisms to adapt to their extreme environment.
Unfortunately, the analyses aiming at explaining such mechanisms show a strong family dependence, and the contribution of the different types of amino acid interactions to the thermal stability is still debated (see for example (Berezovsky, 2011;Chakravarty & Varadarajan, 2002;Folch, Dehouck, & Rooman, 2010;Kumar & Nussinov, 1999Kumar, Tsai, & Nussinov, 2001Thompson & Eisenberg, 1999;Vogt, Woell, & Argos, 1997)). Such understanding is made extremely intricate by the temperature dependence of the amino acid interactions, as some of them are more stabilizing than others in the high-temperature regime and vice versa at low temperature, and by the general lack of temperature-dependent energy functions.
The complex T-dependent behavior of the atomiclevel forces is reflected at the protein level by the existence of several scenarios for modulating the resistance to temperature (Nojima, Ikai, Oshima, & Noda, 1977;Razvi & Scholtz, 2006). The stability of a protein is defined by the folding free energy ΔG°(T) as a function of the temperature, which shows an inverted bell-like shape; its thermodynamic stability is defined by the value of ΔG°(T R ) at room temperature T R , whereas its thermal stability is defined by the value of T m , the temperature midpoint at which the (un)folding transition occurs and where ΔG°vanishes. One strategy a protein uses for increasing its T m consists in decreasing its folding free energy ΔG°(T) at all temperatures. Another is to shift the whole stability curve ΔG°(T) toward higher temperatures, and the last widens the bell-shape of the ΔG°(T)-curve. Only in the first scenario, the thermodynamic and thermal stabilities are correlated. In the others, the situation is subtler.
Despite the fundamental and applicative importance of the thermal stability topic, this issue remains poorly investigated. Indeed, only few tools Ghosh & Dill, 2010;Gorania, Seker, & Haris, 2010;Ku et al., 2009;Pucci, Dhanani, Dehouck, & Rooman, 2014;Robertson & Murphy, 1997) have been built to predict on a large scale the melting temperature T m which is the best descriptor of the thermal stability. In contrast, the thermodynamic stability, or more precisely the thermodynamic stability changes upon mutations, has been much more studied and a lot of prediction tools are available in the literature. The reason for this difference is rooted in the fact that it is much more difficult to predict T m than ΔG°(T R ), as the former requires knowing the T-dependence of the amino acid interactions, whereas the second not and can quite well be estimated using classical, T-independent, energy functions.
In trying to fill this gap, we describe and compare different strategies for studying the thermal stability. Some of them have already been presented elsewhere Ghosh & Dill, 2010;Gorania et al., 2010;Ku et al., 2009;Robertson & Murphy, 1997), and others are new or have not yet been tested. The guiding thread consists in analyzing systematically and on a common test set the performance of each individual method and to combine the best of them with the aim of building a tool that yields the most accurate T m -predictions. The corner stone of our combined method is the use of temperature-dependent statistical potentials, derived from sets of proteins with different thermal properties. These potentials are to the best of our knowledge the only energy functions that estimatealbeit in a simplified mannerthe temperature dependence of amino acid interactions.

Methods
We present and analyze eight different T m -prediction methods, which are trained on different data sets and tested on a common subset of 45 proteins with known T m, belonging to 11 homologous families. These are myoglobin, α-amylase, acylphosphatase, α-lactalbumin, lysozyme, adenylate kinase, cell 12A endoglucanase, cold shock protein, ribonuclease, cytochrome P450, and β-lactamase. The list of all proteins with their specific characteristics is given in Table S1 of the Supplementary Material. They have been chosen by requiring their three-dimensional (3D) structure and melting temperature to be experimentally determined. Moreover, they have to belong to a homologous family with at least three entries. When different experimental T m -values were available for a single protein, we chose the one measured at the pH closest to 7; if different measurements were performed at the same pH value, we considered their average. Buffer compositions and ionic strength, as well as other environmental conditions at which the melting temperatures were determined, were not considered in the construction of the data set (see  and Supplementary Table S1).
The input data is, according to the T m -prediction method, the amino acid sequence of the target protein, its experimental 3D structure, and/or the environmental temperature of its host organism (T env ). More precisely, the first two strategies are based on T env and the third method on the sequence similarity between the proteins. Two additional methods are based on the statistical potential formalism in its standard formulation (4th) and in a modified version that takes into account more properly the thermal properties of the proteins (5th). Methods 6 and 7 are derived from the amino acid composition of the target protein and from its number of residues, respectively. The last method makes use of the distribution of B-factor values determined by X-ray crystallography. We assessed the performances of these different methods using a leave-one-out cross-validation procedure. The protein whose T m is predicted is thus excluded from both the optimization of the model's parameters and the derivation of the statistical potentials.
Methods 1 and 2: deriving T m from T env It is well known that there is a relation between the melting temperature T m of a given protein and the environmental temperature of its host organism T env . Here, we define T env as the optimal growth temperature for the micro-and cool-blooded organisms and as the body temperature for the warm-blooded organisms. This relation is due to the fact that thermophilic organisms host thermostable proteins even if the contrary is not true: mesophilic organisms can indeed host proteins that remain active for temperatures well above their T env . Altogether, a quite good linear correlation between T m and T env is observed in general.
For Method 1, we used the formula relating T m and T env that has been derived in Dehouck et al. (2008) from a set of 127 monomeric proteins with known melting temperature and X-ray structure, which belong to different homologous families. It reads as: One can expect the correlation between T m and T env to increase when restricted to a given family of homologous proteins, as proteins of the same family tend to show similar characteristics. To investigate this dependence, we computed T m -T env regression lines inside each of the 11 abovementioned families; they are reported in Supplementary Table S2. Method 2 is based on these familydependent T m -T env regression lines. Its drawback is the statistical significance of these lines, which is questionable due to the small number of members in each family.

Method 3: deriving T m from sequence similarity
This method simply amounts to assigning as T m of a target protein, the T m of the protein that has the closest sequence and belongs to the same family. Indeed, the thermostability properties are in principle related to the primary sequence, and thus, proteins with higher sequence similarity should show similar thermal behaviors. The drawback of this method is the small number of proteins in each family, which makes the statistical significance of the analysis difficult to assess.
Method 4: deriving T m from statistical potentials Here, motivated by the fact that the thermodynamic and thermal stabilities are frequently correlated, we predicted the melting temperature using as main ingredient the value of the folding free energy computed using the standard statistical potential formalism. This formalism is well known and often used for predicting thermodynamic stability changes upon point mutations, protein structure, the ranking of protein-ligand interactions, and proteinprotein binding affinities (Dehouck et al., 2009;Dehouck, Gilis, & Rooman, 2006;Dehouck, Kwasigroch, Gilis, & Rooman, 2011;Dehouck, Kwasigroch, Rooman, & Gilis, 2013;Gohlke, Hendlich, & Klebe, 2000;Hamelryck et al., 2010;Huang & Zou, 2006;Parthiban, Gromiha, & Schomburg, 2006;Shen & Sali, 2006). In its standard derivation, a potential of mean force (PMF) is extracted from the frequencies of observation of sequence-structure associations in a data set of 3D protein structures. In this analysis, we used the data set of 166 proteins with known structure and melting temperature introduced in . Let c and s be structure and sequence elements, respectively. Using the inverse Boltzmann law, we computed the PMF as: where F represent the relative frequencies of c and/or s elements, which are then expressed as a function of their number of occurrences n; k B is the Boltzmann constant and T the absolute temperature. Note that we chose to model the unfolded state in Equation (2) by a reference state in which the structure element c does not depend on the sequence. More specifically, we built distance potentials that describe tertiary interactions and measure the propensity of a residue a i or a given pair of residues (a i , a j ) (i and j denote positions along the polypeptide chain) to be separated by a certain spatial distance d ij from the other residues or from each other. We also considered torsion and solvent accessibility potentials that describe local (secondary) interactions along the polypeptide chain and measure the propensity of residues a i or (a i , a j ) to be associated with a certain backbone torsion angle domain t k or to a certain value of the solvent accessibility h k (Dehouck et al., 2006). For details on the derivation of the data set, the construction of potentials, and the choice of the parameters, we refer the reader to .
We thus derived different potentials from the data set two distance potentials ΔΩ(a i , a j , d ij ) and ΔΩ(a i , d ij ), three torsion potentials ΔΩ(a i , a j , t k ), ΔΩ(a i , t k ), and ΔΩ(a i , t j , t k ), and three solvent accessibility potentials ΔΩ(a i , a j , h k ), ΔΩ(a i , h j , h k ), and ΔΩ(a i , h k )and used them to evaluate the folding free energy of a target protein that belong to the family f as: where N is the number of residues of the target protein, and positions i and j are in a window of 17 residues centered around k for the potentials ΔW 3 , ΔW 4 , … ΔW 8 . For the distance potential, the d values between 3.0 and 8.0 Å were grouped in 25 bins of .2 Å each, while two additional bins consider the distances larger than 8.0 and smaller than 3.0 Å. The values of the solvent accessibility h, defined as the ratio between the solvent accessible surface in the given structure and in the extended tripeptide Gly-X-Gly conformation and computed by in-house program (Dalkas, Teheux, Kwasigroch, & Rooman, 2014), have been discretized into five different bins (0-5, 5-15, 15-30, 30-50, and 50-100%) in the derivation of the solvent accessibility potentials. Finally, b α and Z f are free parameters to be optimized. Note that in the current method, only some of the energetic contributions listed in Equation (4), more specifically ΔW 1 , ΔW 2 , ΔW 3 , ΔW 4 , have been utilized in the definition of the folding free energy DG f to avoid overfitting problems in the determination of the parameters .
Finally, assuming that the thermal and the thermodynamic stabilities are perfectly correlated, and thus that T m and ΔG°are linearly anticorrelated, we computed the melting temperature as a linear function of the folding free energy: The numerical coefficients Z f , b α , c 1 , and c 2 were optimized so as to minimize the root mean square deviation between experimental and computed melting temperatures, called σ, for the data set of 45 proteins; note that the normalization coefficient Z f is family dependent. This procedure was performed in cross-validation: each target protein whose T m we want to predict was removed both from the data set from which the potentials were derived and from the parameter optimization of Equations (3-5) (see  for further details).
Method 5: deriving T m from temperature-dependent statistical potentials We used here a modified version of the statistical potential formalism, which has been introduced in Folch, Rooman, & Dehouck (2008), Folch et al. (2010), . The basic idea behind this method is that the temperature dependence of the amino acid interactions must be taken into account to study protein thermostability more appropriately.
To consider such dependence, we built different data sets in which only proteins with given thermal properties were considered. Since only a limited number of experimental protein structures with known T m are currently available, the data set was divided into two subsets (only). In the first subset, only mesostable proteins with T m below the threshold value of about 65°C (chosen to have an equal number of proteins in the two subsets) were considered, while in the second subset, only the thermostable proteins with T m larger than the threshold were included.
From this point on, the strategy proceeds along the same steps as the previous method. First, we derived the different potentials as described in Equation (4) from each of the two data sets, with the only difference that now the values of the number of occurrences, the frequencies, and the resulting potentials depend on a temperature, called T m , which is defined as the mean value of the melting temperature of the proteins belonging to the data set from which they are extracted ( T m % 50 C for the mesostable ensemble and T m % 80 C for the thermostable one). We then extended Equation (3) to estimate the T m -dependent folding free energy of a given protein that belongs to the family f as: where, the sum is over the same four potentials as in Method 4. We assumed that the difference in folding free energy at different temperatures, denoted as ΔY: is related to the melting temperature through a linear relation: In Figure 1, the correlation between ΔY and T m is schematically illustrated. Again, all the coefficients were optimized in cross-validation, so as to minimize the root mean square deviation σ between the experimental and computed melting temperatures.
Method 6: deriving T m from the amino acid composition The sixth method consists in deriving the melting temperature from the amino acid composition of a given protein. In the literature, there have been a lot of efforts devoted to understand the relation between thermostability properties and the amino acid composition, usually by comparing the composition of thermophilic proteins Figure 1. Schematic illustration of the folding free energy difference ΔY between the thermostable free energy ΔG°computed at about 80°C and the mesostable free energy ΔG°c omputed at about 50°C. Such quantity is used as a key ingredient in the prediction Method 5 and in the combined method.
In the data set of 166 proteins considered here, we observe only a weak correlation between the melting temperature and the fraction of charged residues (linear correlation coefficient r = .23, P-value .0024), an anticorrelation with the fraction of polar residues (r = −.21, P-value .0057) and no correlation with the fraction of hydrophobic residues (r = .04).
Despite these weak correlations, we tried to use the amino acid composition he a i (with α = 1 … 20) to predict T m through the following simple expression: The twenty-one coefficients, b α , were fixed in crossvalidation by a standard linear regression analysis that minimizes the σ-value for the whole set of 166 proteins.

Method 7: deriving T m from the number of residues
The seventh method is based on the number of residues N of the protein. Indeed, since this quantity is related to both the heat capacity and the enthalpy changes upon folding (ΔC p and ΔH m , respectively) and to the compactness of the protein structure, one can expect a good correlation between its value and the melting temperature (Ghosh & Dill, 2010;Robertson & Murphy, 1997). Moreover, there is a strong correlation between N and the change in surface area of the protein upon folding (ΔASA), which seems also to be related to the thermal stability (Ghosh & Dill, 2010). These simple models take into account the fact that thermostable proteins are often smaller and more rigid compared to their mesostable homologs.
The exact relation used in deriving T m is linear in N and simply reads as: with the parameters c 5 and c 6 optimized in cross-validation to minimize σ for the set of 166 proteins. Note that a similar prediction method was built from the computation of ΔASA using an analogous expression: However, no substantial difference in the prediction performances has been observed using this equation instead of (10).

Method 8: deriving T m from B-factor values
The residual flexibility of the folded protein structure was suggested to be related to the thermal stability of the protein (Zavodszky, Kardos, Svingor, & Petsko, 1998), even though no precise quantitative proofs are found in the literature. In general proteins with a higher thermal stability tend to be more rigid and show small conformational fluctuations with respect to their mesophilic counterparts.
In this method, we used the values of the atomic displacement parameters (B-factors) obtained from the X-ray structures, which provide indications about the degree of structural flexibility, to predict the protein melting temperature. We defined the B-factor B i of a residue i as the mean value of the B-factors of all its atoms. To prevent the B-factor of a protein to be dominated by outliers, we compute for each residue, following (Smith et al., 2003), the value of M i defined as: with hBi the average B-factor of the protein. The outliers i were defined as the residues having M i > 3.5. The mean [MB], variance [SB], and kurtosis [KB] of the B-factor distributions for the 45 proteins in our data set, from which outlier residues were excluded, were then computed. Using the simple functional form: we predicted the values of the melting temperature of a protein from its experimental B-factor distribution.

Combined method
The last method that we analyze is a combination of some of the best performing individual methods. Based on the performance of the eight above-described T m -prediction methods (see Results section), we selected Methods 1, 5, 6, and 7 to construct a new and better performing prediction tool. We exclude Methods 3, 4, and 8 because they do not perform sufficiently well, and Method 2 because it is essentially a variant of Method 1. Note that several other combinations have been tested but led to lower prediction scores. We also tried to limit the number of terms in the combination to keep the number of parameters to be identified as low as possible, since the data set on which we train the model is not very large and can thus easily lead to overfitting problems. The basic features that we use in the final combined method are thus the following: the environmental temperature of the host organism T env , the temperaturedependent statistical potentials, the number of residues N, and the fraction of charged residues Q.
More precisely, the melting temperature was predicted by the linear relation: The five potentials considered in this prediction are ΔW 1 , ΔW 4 , ΔW 5 , ΔW 7 , and ΔW 8 , as specified in Equation (4). Note that the combination used is slightly different from the one considered in Method 5 and presented previously in . Furthermore, the number of residues does not appear as an additive term (as in Method 7), but rather as a normalization factor of the potentials. The eight coefficients (b α , α = 1 … 8) that appear Equation (14) were identified by an ordinary least square regression analysis aimed at minimizing the root mean square deviation σ between predicted and experimental T m -values; this procedure is performed with a rigorous cross-validation using a leave-one-out method.

Results
We show here the results of the methods described in the previous section in an attempt to understand the reasons of their different levels of performance. We also discuss the main sources of error and the way in which they can affect the computational results, and briefly comment on the possible further improvements of our prediction tool. The values of the root mean square deviation σ between the computed and predicted T m computed on the whole set of 45 proteins, obtained with the eight individual methods and with the combined method, are listed in Table 1. In addition, Table 2 shows the σ-values obtained for each of the 11 protein families separately.
The worst performing method, with σ ≈ 25°C, is the one based on sequence similarity (Method 3). Note that the score of this method can be expected to improve with the increase of the number of proteins inside the families, and thus by the presence of proteins more similar to the target protein. However, we do not observe a significant correlation between the prediction performances of this method and the sequence identity (SI) inside the families, as seen in Table 2. We are thus led to conclude that, although the score of this method will probably increase with the number of proteins in each family, it will nevertheless remain limited, and that sequence similarity cannot be used as a reliable parameter in thermal stability analyses. Other works in the literature reach analogous conclusions. For example, it has been shown that two sequences that differ by few residues can have completely different thermal stability behaviors (see for example the cold shock proteins Bs-Csp (from Bacillus subtilis, mesostable) and Bc-Csp (from Bacillus caldolyticus, thermostable) (Perl, Mueller, Heinemann, & Schmid, 2000).
Another method that does not perform very well is the one based on the amino acid composition (Method 6), which yields a σ-value of about 16°C. Different attempts have been performed in the literature to predict the melting temperature just from the primary sequence. Statistical analyses attempted to understand whether the abundance of certain types of amino acids can be related to the thermal stability mostly by comparing protein sequences belonging to mesophilic to thermophilic organisms. Unfortunately, the results that we found confirmed the earlier findings (Jaenicke & Böhm, 1998;Ponnuswamy et al., 1982;Zhou et al., 2008), indicating that no specific relations can be inferred between the composition of a protein and its thermostability properties. Note, however, that the method does not perform so poorly when the 10% outliers are removed (σ ≈ 13°C).
Method 7, which is based on the number of residues, does not perform particularly well either, with a root mean square deviation of about 17°C. This result seems at a first sight a bit counterintuitive: some arguments in Table 1. Performance and statistical significance of the different Tm-prediction methods. In the third and fourth columns, we report the root mean square deviation σ and the Pearson correlation coefficient r between predicted and experimental Tm-values in cross-validation (leave-one-out procedure). The 10% worst predicted elements (5 proteins) are excluded from the computation of σ* (fifth column). Statistical t-test for each prediction method, and Hotelling t-test between each method and the combined method, are shown in the sixth and seventh columns, respectively. the literature seem to suggest that the compactness of a protein structure influences its thermal stability Robertson & Murphy, 1997). The smaller and the more compact the proteins, the more stable they are at high temperature and vice versa. However, we found that the T m -correlations with the number of residues (and also with the change in solvent accessible surface area upon folding ΔASA, data not shown) are low and thus the resulting predictions do not show very good performances. Note that the number of residues of the members of the same family is approximately equal, which explains our findings. Method 8 is derived from the B-factor analysis and does not show a good performance either, with a root mean square deviation of about 16°C that reduces to 12°C when the 10% outliers are removed. In principle to perform a correct comparison between B-factor values belonging to different proteins one should consider the different temperatures at which the protein crystallization experiments were performed. Unfortunately, these data are not always available and this probably makes the method not so accurate. Despite this, in some of the families, i.e., acylphosphatase, lysozyme, myoglobin, and cold shock protein, the correlation between the computed average value of the B-factor distribution [MB] and the melting temperature is very high, while in others, these two quantities seem instead completely uncorrelated (as in the case of α-amylase). A more detailed and complete analysis is thus needed to further deepen the interesting relation between the thermal stability, the flexibility of the protein structure, and the B-factor distribution.
The method derived from the standard statistical potentials (Method 4) shows a high root mean square deviation of about 18°C. The principal reason for this poor value can be explained by the fact that the thermodynamic and thermal stabilities are not perfectly correlated.
As described in the introduction, different strategies are indeed used by the proteins to enhance their thermal resistance, and the increase in the thermodynamic stability is just one of them. The limited score of this method justifies our choice of designing methods that directly predict T m , without detour through ΔG°prediction. Note, moreover, that the prediction of ΔG°and thus of T m is strongly dependent on the choice of the reference state (see Equation (2)). The reference state dependence is instead almost smoothed out when differences in folding free energy are considered as in Method 5.
The other statistical potential-based method (Method 5), which is based on temperature-dependent potentials, performs better than the standard one. It indeed shows a root mean square deviation of about 13°C, which falls down to 11°C when 10% outliers are removed. The T mpredictions obtained with this method on all the 45 proteins of the set are plotted in Figure 2(a). It is actually not surprising that this modified version of the statistical potentials works better than the standard one. Indeed, we have explicitly built these potentials to take into account more properly the temperature dependence of the amino acid interactions, since only proteins with given thermal properties are considered in the data set from which they are derived.
The environmental temperature is one of the best performing features (Methods 1 and 2). The results of the T m -predictions obtained with Method 1 for all considered proteins are plotted in Figure 2(b). The root mean square deviation between this prediction and the experimental T m value is about 10.5°C. When we derive the T env -T m relation inside each family (Method 2), the prediction is slightly worse (σ ≈ 13°C) and this is probably due to the small number of members in the families. However, when the 10% outliers are removed, the scores of Methods 1 and 2 are comparable. These good Table 2. Root mean square deviations σ between predicted and experimental T m -values in cross-validation for each of the 11 protein families obtained with the eight methods and the combined method, average pairwise sequence identity inside the families computed by the FASTA program (Pearson & Lipman, 1988), and number of proteins n in each family. performances are explained by the fact that T env and T m are necessarily correlated since thermophilic organisms host thermostable proteins even if it is absolutely not true that mesophilic organisms host exclusively mesostable proteins. Not surprisingly therefore, if we divide the test set into two subsets that contain only proteins from thermophilic (T env > 40°C) and mesophilic (T env < 40°C) organisms, respectively, the predictions are quite accurate for the thermostable subset with σ = 4.2°C (Method 1) and σ = 4.7°C (Method 2), and much less for the mesophilic subset with σ = 11.1°C (Method 1) and σ = 14.6°C (Method 2). However, the prediction performance of this type of methods is very sensitive to the data set of target proteins; we expect that the inclusion of additional entries will not improve it, and may even decrease it. Indeed, if we consider the T m -prediction of a set of proteins belonging to the same organism, such methods cannot clearly give satisfactory predictions by themselves, even though they give some positive indications in a first approximation. One has also to take into account that the environmental temperature is not always well defined or that sometimes the error on its determination is quite high. For example, the strong dependence of the optimal growth temperatures on the pH and other environmental conditions for the different microorganisms considered in this analysis can be expected to have a non negligible impact on the accuracy of the T m -predictions.
With the aim of improving the accuracy of the melting temperature predictions, we combine some of the best performing features into a new method. The selected features are the temperature-dependent statistical potentials, the environmental temperature, the number of residues, and the fraction of charged residues. Indeed, though the residue composition does better than the fraction of charged residues when considered individually (Method 6), the latter is preferred in combination with the other features. The experimental T m -values and those predicted with this final method are plotted in Figure 2(c). The root mean square deviation between the experimental and predicted T m -values is found to be equal to 7°C in cross-validation and drops to 5°C when 10% outliers are removed. The corresponding linear correlation coefficients are quite high and equal to .91 and .95, respectively.
In order to test the significance of our results, we performed a series of statistical analyses whose results are reported in Table 1. The statistical significance of each individual method compared to a random predictor is estimated by the t-test on the Pearson coefficient r between the predicted and measured melting temperatures. Methods 1, 2, 5, 6, 7 as well as the combined method show P-values that are lower than .05. These predictions can thus be considered as statistical significant. Furthermore, the combined method was compared to the individual methods using the Hotelling t-test to check the significance of its improved performance. As seen in Table 1, the P-values are all lower than or equal to .05, which indicates that the score of our newly developed method is significantly higher than that of the individual methods.
An interesting observation is that the performance of the combined method weakly anticorrelates with the average sequence identity (SI) inside the families, as shown in Table 2 (Pearson coefficient between σ and SI of −.36). Thus, despite the fact that Method 3 that is based only on sequence similarity is the worst performing method, some information on homologous proteins yet contributes to the performance of our final method. Note that it would obviously be of great interest to be able to perform T m -predictions without needing data about homologous proteins, but this leads to a drop in the performances.
Finally, we would like to stress that the score of our best T m -predictor is quite satisfactory given the errors and limitations that we had to manage. These are: • The small size of our data set. Obviously, not enough proteins with experimentally determined 3D structure and melting temperature are available to build an accurate predictor. • In some of the analyzed structures, the presence of ligands, such as the heme in the myoglobin and cytochrome P450 families, which contribute substantially to the protein stabilization and unfortunately cannot be taken into account in the statistical potential formalism, introduces errors in the folding free energy estimations and thus even larger errors in the melting temperature determinations. Note that the stabilizing effect of ligands is implicitly taken into account in the T env -based methods. • The experimental errors can have also an impact on our predictions. These include not only the true experimental errors that are usually of the order of 1°C, but also the fact that the experimental conditions (pH, ionic strength, etc.) at which the measurements are performed are not always the standard ones.
To analyze the robustness of our T m -prediction method in relation with the latter point, in other words, whether measurement errors or different environmental conditions significantly affect the performance of our tool, we made some tests by introducing by hand some errors in the T mvalues of the proteins of our data set. In particular, we made a first test in which we introduced 10°C error on 10% randomly chosen proteins from our set, and a second test where we made 3°C error on 20% of the proteins. The average impact on the performances of our method was found to amount to about 5-10%; note that these errors can either increase or decrease the score. This test confirms the non-trivial but limited impact of errors related to the non-homogeneity of the experimental conditions, and thus the robustness of our model.

Conclusion
In this article, we made significant progress toward the design of an optimal melting temperature predictor.
Several features that have been described as influencing the temperature resistance have been considered and their performance have been tested on a common data set of 45 proteins with experimentally determined structure and T m , which belong to 11 different homologous families. The best performing features have then been combined to optimize the prediction, and lead to a substantially increased score. These features involve the environmental temperature of the host organism, the folding free energy computed at different temperatures, the number of residues, and the fraction of charged residues. The root mean square deviation between the computed and experimental T m -values is equal to 7.1°C in cross-validation with this combined method, and drops to 5.2°C when the 10% worst predicted entries are excluded. We would like to emphasize that this is by far the best T m predictor described in the literature.
Since the thermal stability issue is still far from being fully understood, it is of primary importance to get more insight and to deepen its comprehension. Although we have undoubtedly advanced compared to previous approaches, there is still room for further optimization.
An obvious improvement of the method relies on an increase of available experimental data. More proteins of known structure and T m will yield more accurate statistical potentials. They will also allow considering other types of statistical potentials describing more complex secondary and tertiary interactions. Moreover, more proteins inside each homologous family would improve the identification of the parameters of the model.
Other improvements could come from defining different functional combinations between the factors considered or from including new factors not considered in this article, such as the electric dipole moment (Perl et al., 2000) or additional information about protein dynamics (Wunderlich & Schmid, 2006;Cilia et al., 2013). It would also be interesting to quantify the impact of the error of the T env determination on the final results and possibly reduce it.
Finally, it would be quite useful to add an estimation of the contribution of ligands to the thermoresistance, since in some of the families analyzed their presence probably modify the thermal stability characteristics.

Disclosure statement
No potential conflict of interest was reported by the authors.