Molecular modelling of the mass density of single proteins

Using molecular dynamics (MD) simulations, the density of single proteins and its temperature dependence was modelled starting from the experimentally determined protein structure and a generic, transferable force field, without the need of prior parameterization. Although all proteins consist of the same 20 amino acids, their density in aqueous solution varies up to 10% and the thermal expansion coefficient up to twofold. To model the protein density, systematic MD simulations were carried out for 10 proteins with a broad range of densities (1.32–1.43 g/cm3) and molecular weights (7–97 kDa). The simulated densities deviated by less than 1.4% from their experimental values that were available for four proteins. Further analyses of protein density showed that it can be essentially described as a consequence of amino acid composition. For five proteins, the density was simulated at different temperatures. The simulated thermal expansion coefficients ranged between 4.3 and 7.1 × 10−4 K−1 and were similar to the experimentally determined values of ribonuclease-A and lysozyme (deviations of 2.4 and 14.6%, respectively). Further analyses indicated that the thermal expansion coefficient is linked to the temperature dependence of atomic fluctuations: proteins with a high thermal expansion coefficient show a low increase in flexibility at increasing temperature. A low increase in atomic fluctuations with temperature has been previously described as a possible mechanism of thermostability. Thus, a high thermal expansion coefficient might contribute to protein thermostability.


Introduction
Proteins are versatile molecules with interesting material properties. Based on sequence information only, they fold into a well-defined structure with high mechanical (Gosline, Demont, & Denny, 1986) and thermal stability (Singleton & Amelunxen, 1973), and have an amazing degree of flexibility (Huber, 1979). A delicate balance between structure and flexibility is essential for their biological function. All these material properties are coded in a linear sequence of a few 100 amino acids. Since the first description of sequence (Sanger & Tuppy, 1951a, 1951b and structure (Kendrew et al., 1958) of proteins, understanding the properties of proteins on the basis of their sequence and structure is of utmost interest to biochemists and biotechnologists.
The density of a single protein (protein mass divided by protein volume) is a material property that since long has attracted attention, not only because it can be easily and precisely measured, but also because protein density and packing of amino acid side chains are the basis of many biophysical and biochemical properties such as stability towards heat (Goldstein, 2007) or organic solvents (Kawata & Ogino, 2010), flexibility and dynamics (Haliloglu, Bahar, & Erman, 1997;Halle, 2002) and aggregation (Meersman & Dobson, 2006). The densities of many proteins have been experimentally determined in dry state and solvated in water. At 20-25°C, the majority of proteins have a density of about 1.35 g/cm 3 in aqueous solution (Fischer, Polikarpov, & Craievich, 2004) and slightly less in dry state (Berlin & Pallansch, 1968;Brill, Siegel, & Olin, 1962), but some proteins have been identified with densities that are increased by almost up to 10% (Squire & Himmel, 1979).
To understand the molecular basis of the observed dependency of density on sequence and structure, different models have been suggested. For proteins with experimentally determined structure (Berman et al., 2000), the density has been evaluated from their volume. However, the volume sensitively depends on the method and parameters used, and therefore the absolute values of protein density evaluated by different methods might differ considerably (Andersson & Hovmoller, 1998;Quillin & Matthews, 2000;Tsai, Taylor, Chothia, & Gerstein, 1999). For example, the protein density values of pyruvate kinase determined by different volume calculation methods ranged from 1.18 g/cm 3 (Andersson & Hovmoller, 1998) to 1.47 g/cm 3 (Quillin & Matthews, 2000). In a second approach, the densities of proteins were related to their molecular weight. By systematically comparing the densities of proteins with different molecular weights, it was observed that the densities are similar for all proteins with a molecular weight above 20 kDa (1.35 g/cm 3 ) and increase gradually for low-molecular weight proteins (Fischer et al., 2004). Therefore, a molecular-weight-dependent function with an exponentially increasing protein density for decreasing molecular weights was proposed. However, this function cannot explain the observed large variations of density for proteins with similar low molecular weight. In a third approach, the density of proteins was correlated with their sequence and the amino acid composition of proteins was analysed. While for proteins in their dry state, the protein volumes could not be calculated from the sum of the volumes of its amino acids (Berlin & Pallansch, 1968), the experimentally determined density of proteins in solution could be well reproduced from their amino acid composition (Cohn & Edsall, 1943;Iqbal & Verrall, 1988;Kharakoz, 1997;Makhatadze, Medvedkin, & Privalov, 1990;Zamyatnin, 1972Zamyatnin, , 1984. While the densities of folded proteins vary by less than 10%, their experimentally determined thermal expansion coefficients vary by a factor of almost two (Chalikian, Totrov, Abagyan, & Breslauer, 1996). To account for thermal effects, the temperature dependence of the density of amino acid residues was determined (Amend & Helgeson, 2000;Hedwig & Hinz, 2003;Lee, Tikhomirova, Shalvardjian, & Chalikian, 2008;Makhatadze et al., 1990) and combined to model the temperature dependence of protein densities. However, the reproducibility of the experimental values using this method is limited as this strategy is primarily designed for unfolded proteins (Hedwig & Hinz, 2003). Furthermore, some approximations are made for the calculations. Therefore, the values determined over a wide temperature range are less reliable than those calculated for a temperature of 25°C (Hackel, Hinz, & Hedwig, 1999). Thus, a predictive model of protein density should describe properly the dependency of density not only on sequence and structure of the protein, but also on temperature. Therefore, we have developed a general molecular model to predict the density of a protein and its temperature dependence by using molecular dynamics (MD) simulations of protein-solvent systems. The method uses only the experimentally determined structure of a protein and a generic, transferable force field to evaluate the absolute value of protein density, and no prior parameterization is needed.

Materials and methods
Setup of the simulations Ten proteins were selected which cover a wide range of molecular weights from 7 to 97 kDa (Table 1) setup of the simulations, crystal water molecules and ligands were removed from the protein structure. For ubiquitin, two series of simulations with and without crystal water were performed to estimate the effect of crystal water on the simulated protein density. Subsequently, each protein was solvated in four water boxes of different size, according to protein concentrations of 8, 16, 24 and 32 mg/cm 3 . The box sizes were in a range of 365-20,062 nm 3 depending on the molecular weight of the simulated protein and the protein concentration. For Candida antarctica lipase B, a series of simulations was additionally performed with one and eight proteins in a box at concentrations of 8, 12, 16, 20, 24, 28 and 32 mg/cm 3 . Moreover, the density of pure water was determined by simulation of a large box of 46,657 water molecules. The negatively charged proteins at pH 7 (Pseudomonas glumae lipase (À1), Candida antarctica lipase B (À1), RP2 lipase (À1), malate dehydrogenase (À4), glycogen phosphorylase B (À7)) were neutralized by protonating the respective number of histidines (the catalytic histidine or histidines at the protein surface). The positively charged proteins (ribonuclease-A (+4), lysozyme (+8), adenylate kinase (+2)) were neutralized by adding chloride ions. For these three proteins, control simulations were carried out with water boxes of corresponding size containing water and the respective number of ions.
MD simulations were performed using the software package groningen machine for chemical simulation (GROMACS), version 4.0.7 (Van Der Spoel et al., 2005). The simulations were performed at 298 K using the optimized potentials for liquid simulations (OPLS) all-atom force field (Jorgensen & Tirado-Rives, 1988) and the extended simple point charge (SPC/E) water model (Berendsen, Grigera, & Straatsma, 1987). For ribonuclease-A and adenylate kinase, the simulations were performed at 293 K, the temperature at which the density was determined experimentally.
For an analysis of the temperature dependence of the protein density, additional simulations of scorpion toxin variant-3, ribonuclease-A, lysozyme, P. glumae lipase and C. antarctica lipase B were done at 288 K and at 308 K.

Energy minimization, equilibration and production phase
The systems were minimized using the steepest descent algorithm with a maximum number of steps of 100,000. Subsequently, all systems were equilibrated. The equilibration was started at an initial temperature of 10 K. Within 200 ps, the systems were heated up to 100 K, followed by 100 ps simulation at constant temperature. Within the following 450 ps, the temperature was increased to 288, 293, 298 or 308 K and equilibrated for another 250 ps at the respective temperature. Temperature coupling was done using velocity rescaling with a stochastic term with a temperature coupling time constant of .1 ps. The pressure was kept constant at 1 bar with a pressure coupling time constant of .5 ps (NPT-model). A time step of 1 fs was used.
After equilibration, the simulations were continued for 10 ns with a time step of 2 fs. Temperature coupling was performed using the Berendsen thermostat (Berendsen, Postma, van Gunsteren, DiNola, & Haak, 1984) with a temperature coupling time constant of .1 ps and a reference temperature of 298 K. Pressure coupling was done using the Berendsen barostat (Berendsen et al., 1984) with a pressure coupling time constant of 2 ps and a constant pressure of 1 bar (NPT-model). Simulation of 10 ns was sufficient to allow for complete relaxation of the solution density, which was constant during the last 5 ns of the simulation.

Sensitivity to simulation parameters
Four additional simulations of lysozyme were performed with a different thermostat, barostat, water model or protein force field: (1) temperature coupling by the velocityrescaling thermostat (Bussi, Donadio, & Parrinello, 2007) was used instead of the Berendsen thermostat; (2) pressure coupling by the Parrinello-Rahman barostat (Parrinello & Rahman, 1981) was used instead of the Berendsen barostat; (3) the water model TIP4P (Jorgensen & Madura, 1985) was used instead of the water model SPC/E; (4) the protein force field GROMOS96 43a1 (Christen et al., 2005) was used instead of the OPLS all-atom force field.

Evaluation of protein density by MD simulation
The density of the protein solution ρ total consisting of protein with mass m prot and volume V prot and water with mass m wat and volume V wat is calculated by By inserting Equation (2) into Equation (1), the solution density ρ total depends linearly on the protein concentration c prot (Kupke, 1973): For each protein, simulations were performed for at least four different protein concentrations c prot . For each simulation, the box volume was averaged over the last 5 ns of the simulation using the GROMACS tool g_energy, and the solution density ρ total , the protein concentration c prot and their standard deviations were calculated. By plotting the densities of pure water and protein-water systems for at least four protein concentrations against the protein concentration, the protein density ρ prot was calculated from the slope, and the water density ρ wat from the intercept with the y-axis.
Thus, the calculation of protein density does not require an evaluation of protein surface or volume, and reflects amino acid packing and thermal motion.

Evaluation of protein density by amino acid composition
As an alternative method, the density was calculated based on the amino acid composition. For the calculation of the protein densities by amino acid composition, an additive scheme for the apparent specific volume of proteins according to Cohn and Edsall (1943) was used ( Table SI).

Calculation of thermal expansion coefficients
For the analysis of the temperature dependence of the protein density, protein density values were calculated at 288, 298 and 308 K, plotted against the corresponding temperature, fitted by a straight line, and the thermal expansion coefficient was defined as the negative ratio of the slope of the straight line to the protein density at 298 K.

Protein structure and flexibility
For analysis of protein structure, secondary structure elements were assigned according to dictionary of secondary structure of proteins (Kabsch & Sander, 1983). To analyse the flexibility of the simulated proteins, the root mean square fluctuation (RMSF) was calculated over the last 3 ns for all protein atoms using the GRO-MACS tool g_rmsf. For each protein atom, the RMSF values were converted in the corresponding temperature factor B using the equation For each simulation, the median of the B-factor values of all protein atoms was calculated, and the median values were averaged over the four simulations at different protein concentrations. To analyse the variation of flexibility with temperature, median B-factors were calculated at 288, 298 and 308 K and plotted against the corresponding temperature. The correlation was fitted by a straight line, and the slope of the straight line was used as a measure of the increase in protein flexibility with temperature.

Quality of the protein simulations
To confirm the quality of the water model for different temperatures, the density of a water box consisting of 46,657 water molecules at 298 K was determined by MD simulations. The calculations resulted in a value of .9960 ± .0005 g/cm 3 , which deviates by only .1% from the experimental value (.9970 g/cm 3 ; Lide, 2009). Also at temperatures of 288 and 308 K, the simulations led to density values that differed by only .1 and .3% from the experimental values, respectively. Thus, the SPC/E parameters are sufficiently accurate, and the system size and simulation time are large enough to reproduce experimental data.
To evaluate the quality of the protein simulations, the density of the 10 protein-water systems was followed as a function of simulation time. For all concentrations of all proteins, the density quickly relaxed to its equilibrium value during the first nanosecond of the simulation ( Figure S1). Reproducible values with standard deviations of less than .1% could be obtained by averaging the solution density over a time interval of 5 ns. Thus, a simulation time of 10 ns was sufficient to determine the solution density. Furthermore, one simulation (the simulation of eight proteins of Candida antarctica lipase B at a concentration of 20 mg/cm 3 ) was carried out for 20 ns instead of only 10 ns. Calculation of the solution density of this box over the time range of 5-10 and 15-20 ns led to identical values.
To evaluate the effect of simulating multiple proteins in a box as compared to a single protein, two series of simulations were carried out for C. antarctica lipase B, a single protein in a box and a system of eight proteins in a box ( Figure S2). The protein density calculated by the two series deviated by .07% (Table 1). Thus, simulating a single protein in a box was sufficiently accurate, and protein-protein interactions in a realistic protein solution had no influence to the density. However, the observed relative standard deviations of the box volumes with one protein are slightly increased due to the smaller number of particles in the small boxes.
To evaluate whether variations in the density occur between the monomeric and dimeric form of a protein, simulations of malate dehydrogenase were done for the monomer as well as for the dimer ( Figure S8). The resulting densities of the monomeric and dimeric malate dehydrogenase differed by .04% (Table 1). Therefore monomers and dimers of the same protein do not show different protein densities.
To evaluate the number of simulations at different protein concentrations that are necessary to reliably define a straight line, the simulations with C. antarctica lipase B were done for three additional protein concentrations. The resulting data points lay exactly on the straight line defined by four simulations at different concentration ( Figure S2), thus it was concluded that four protein concentrations plus the value for pure water were sufficient.
To evaluate the effect of crystal water to the evaluation of protein density, two series of simulations of ubiquitin (with and without crystal water) were compared ( Figure S3). For both series, the protein densities differed by .1% (Table 1), thus the absence or presence of crystal water molecules does not influence the equilibration time of the system nor the resulting protein density.
For three positively charged proteins, chloride counterions were added to neutralize the protein system. To examine the influence of the ions on the solution density, boxes of corresponding sizes containing water and ions but without protein were simulated ( Figure S5). The effect was analysed for lysozyme, for which eight chloride ions were added. Calculations of the protein density with and without consideration of the effect of ions on the solution density led to a protein density of 1.3809 and 1.4029 g/cm 3 , respectively. Thus, the influence of counterions should not be neglected upon calculation of the protein density.
To study whether the modelled density is sensitive towards changes in the thermostat, barostat, water model or force field, control simulations of lysozyme were performed. Replacing the SPC/E water with TIP4P water increased the simulated density by .17% (Table SII). Using the GROMOS force field instead of the OPLS all-atom force field increased the simulated density by .41%. Changes in the temperature coupling and pressure coupling decreased the simulated density by 1.2%. Using the Berendsen thermostat and barostat, the SPC/E water model and the protein OPLS all-atom force field, the simulated density of lysozyme deviated by À1.4% from the experimental value. Thus, when the v-rescale thermostat or the Parrinello-Rahman barostat were used, the experimental density was underestimated by 2.6%, with the TIP4P water model by 1.2%, and with the GROMOS force field by .95%.

Modelling the protein density by MD simulations
The densities of aqueous solutions of 10 proteins with molecular weights ranging from 7 to 97 kDa were determined by simulations of protein-water systems at four protein concentrations, 8, 16, 24 and 32 mg/cm 3 for each protein (Figure 1 and Figures S2-S8). The simulations were performed at 293 or 298 K, corresponding to the temperature of the experimental density measurement. From the slope of these graphs, the density of the 10 proteins was determined (Table 1). Four proteins (C. antarctica lipase B, glycogen phosphorylase B, P. glumae lipase and malate dehydrogenase) had a low protein density (1.3257, 1.3259, 1.3330 and 1.3418 g/cm 3 , respectively), slightly lower than the average (1.35 g/cm 3 ) of the majority of proteins. Four proteins (ubiquitin, RP2 lipase, adenylate kinase and lysozyme) had a medium protein density (1.3603, 1.3623, 1.3664 and 1.3809 g/cm 3 , respectively). Two proteins (ribonuclease-A and scorpion toxin variant-3) had a high density (1.4050 and 1.4290 g/cm 3 , respectively). As described previously (Fischer et al., 2004), the simulated density of a protein is related to its molecular weight: the molecular weight of the four proteins with low density is in the range 33-97 kDa, of the four proteins with medium density in the range 8.6-48 kDa, and of the two proteins with high density in the range 7-14 kDa (Figure 2). However, large differences in the protein densities are observed for the two small proteins, scorpion toxin variant-3 and ubiquitin.  Figure 1. Plot of the solution density against the protein concentration for scorpion toxin variant-3 (♦; R 2 = .9999), RP2 lipase (▪; R 2 = 1) and glycogen phosphorylase B (N; R 2 = 1) each at 298 K.  Figure 2. Plot of protein densities against the molecular weight; protein density values calculated by Tsai et al. (1999) (□), values measured experimentally (Squire & Himmel, 1979) (♦) and protein densities determined by MD simulations (4); numbering of the proteins with simulated protein density according to Table 1. For four proteins, experimental data of protein density in solution have been published previously. The simulated densities were in excellent agreement with their experimental values: for ribonuclease-A, the simulated density deviated by À1.2% from the experimental value (1.422 g/cm 3 at 20°C [Richards 1971]); for lysozyme, the simulated density deviated by À1.4% from the experimental value (1.40 g/cm 3 at 25°C) determined using a precision density meter (Gekko & Noguchi, 1979). This deviation is considerably smaller than a previous evaluation of the density of lysozyme (1.43 g/cm 3 ) (Tsai et al., 1999), corresponding to a deviation of +2.1% from the experimental value. For adenylate kinase, the simulated density was slightly larger (+1.2%) than the experimental value (1.35 g/cm 3 at 20°C) as determined by pycnometry (Schirmer, Schirmer, Schulz, & Thuma, 1970). For malate dehydrogenase, the simulated and the experimental density (1.348 g/cm 3 at 25°C (Gerding & Wolfe, 1969) deviated by less than .5%.

Evaluation of protein density by amino acid composition
To investigate the molecular reason for the variation in protein density by about 10%, the amino acid composition of the proteins was analysed, and the density was calculated as the sum of the experimentally determined specific volumes of the amino acids (Table SI) (Cohn & Edsall, 1943). According to the differences in density of the amino acids, proteins with a high content of hydrophobic residues have a lower density than proteins with a high content of hydrophilic residues (Figure 3). These calculations resulted in values for the protein densities that differed by only .7 and 1.3% on average from the values determined experimentally and by MD simulations, respectively (Table 1). Thus, the amino acid composition is the dominant factor that describes the protein density.

Analysing thermal expansion of proteins
To evaluate the temperature dependence of the simulated density, simulations of five proteins (scorpion toxin variant-3, ribonuclease-A, lysozyme, P. glumae lipase and C. antarctica lipase B) were performed at three different temperatures (288, 298 and 308 K). In a temperature range of 20°, the density depends linearly on the temperature (the R 2 values for the linear fit of protein density and temperature are .9788 for scorpion toxin variant-3, .8005 for ribonuclease-A, .8462 for lysozyme, .9968 for P. glumae lipase and .9891 for C. antarctica lipase B), and the thermal expansion coefficient of the four proteins varies nearly twofold between 4.3 and 7.1 × 10 À4 K À1 (Table 2). For ribonuclease-A and lysozyme, the two proteins for which the thermal expansion coefficient was determined experimentally (Chalikian et al., 1996), the simulated and experimental values (deviations of 2.4 and 14.6%, respectively) agree well.

Modelled protein density
To assess the quality of evaluating the protein density by MD simulations, the effects of simulation time, multiple proteins in a box, monomeric and dimeric forms, number of protein concentrations, crystal water and counterions were analysed, and the simulated densities were compared to the experimental values. The analysis revealed that simulation of 10 ns at four different protein concentrations is sufficient for a reliable determination of protein density. The number of proteins in a box, monomeric or dimeric forms of a protein and crystal water had no effect on the simulation result, whereas the effect of counterions on the solution density has to be considered in the calculation of protein density. The results of the MD simulation are in good agreement with the experiments. For four of the 10 simulated proteins (ribonuclease-A, lysozyme, adenylate kinase and malate dehydrogenase), experimental data on protein density are available. These four proteins cover a wide range of molecular weights (14-73 kDa) and  Figure 3. Plot of the protein density against the content of hydrophobic amino acids (numbering of the proteins according to Table 1.   Table 2. Thermal expansion coefficients of scorpion toxin variant-3, ribonuclease-A, lysozyme, P. glumae lipase, and C. antarctica lipase B determined by MD simulation and their experimental values (Chalikian et al., 1996).

Protein
Simulated thermal expansion coefficient (10 À4 K À1 ) (experimental value) Scorpion toxin variant-3 7.1 Bovine ribonuclease-A 4.3 (4.2) Hen egg white lysozyme 4.7 (4.1) P. glumae lipase 6.1 C. antarctica lipase B 5.8 densities (1.348-1.422 g/cm 3 ). For these four proteins, the deviations between the simulated and the experimentally determined densities were less than 1.4%, thus we consider the strategy to model protein density by simulation of protein solutions at different concentrations as a reliable method to model protein density at high accuracy. Furthermore, modelling of solvated proteins by MD simulations is a highly generic strategy, because protein force fields are based on a small set of force field parameters which are assigned by amino acid type and are independent of the environment of the amino acid. Thus, the same set of force field parameters applies to all proteins and a high transferability is guaranteed. With the exception of an initial choice of a force field, no further parameterization is needed to model the physical properties of molecular systems such as density. Modelling of protein density by MD simulation was insensitive towards the choice of the protein force field and the water model (<.5% deviation) and depended only slightly on the coupling method (1.2% for a different thermostat or barostat). This is an advantage and in contrast to previously suggested methods to calculate the density. In widely used methods to determine the protein density, the protein volume is calculated by defining the protein surface. Using the GROMACS tool g_sas (Van Der Spoel et al., 2005), the density of hen egg white lysozyme varied between 1.05 and 1.43 g/cm 3 upon changing the probe radius from 1.4 to .7 Å. Similarly, for the same set of representative proteins, two different algorithms resulted in different values for the average protein density: 1.43 and 1.47 g/cm 3 as calculated by the Connolly and the Voronoi algorithm, respectively (Quillin & Matthews, 2000). These values deviate by 6 and 9%, respectively, from the average protein density of 1.35 g/cm 3 determined experimentally (Fischer et al., 2004).

Molecular reasons for the variation of protein density
While all proteins consist essentially of the same 20 amino acids, their densities deviate by up to almost 10%. It was observed that proteins with a molecular weight of less than 20 kDa tend to have an increased density (Fischer et al., 2004). This tendency was essentially confirmed by analysing experimentally determined protein densities (Squire & Himmel, 1979), values calculated by Tsai et al. (1999) and by MD simulations as a function of molecular weight (Figure 2). However, for proteins of similar molecular weight, large deviations of the protein density may occur. The two proteins, scorpion toxin variant-3 and ubiquitin, have a similar molecular weight below the 20 kDa threshold (7.1 and 8.6 kDa, respectively), but have significantly different simulated protein densities (1.43 and 1.36 g/cm 3 , respectively). The density of ubiquitin is similar to the considerably larger RP2 lipase (48 kDa). Therefore, a simple dependence of the protein density on the molecular weight is not adequate to describe the molecular basis of protein density.
To explain the observed differences in protein densities, it was suggested that the amino acid composition determines protein density (Cohn & Edsall, 1943;Iqbal & Verrall, 1988;Kharakoz, 1997;Makhatadze et al., 1990;Zamyatnin, 1972Zamyatnin, , 1984. Indeed, proteins with high density have a low percentage of hydrophobic amino acids (scorpion toxin variant-3: ρ = 1.4290 g/cm 3 , 17% hydrophobic residues), while proteins with low density have a high percentage of hydrophobic amino acids (ubiquitin: ρ = 1.3603 g/cm 3 , 33% hydrophobic residues, RP2 lipase: ρ = 1.3623 g/cm 3 , 32% hydrophobic residues). This observation is in accordance with the low density and high specific volume υ of hydrophobic amino acids as compared to hydrophilic amino acids (Ile: υ = .90 cm 3 /g, Asp: υ = .60 cm 3 /g [Cohn & Edsall, 1943]). Thus, the observation that small proteins in general have higher densities than large proteins is a consequence of their low content of hydrophobic amino acids. As small proteins have higher surface to volume ratios than large proteins and as the surface is hydrophilic, small proteins have a higher content of hydrophilic amino acids and thus a higher density. Therefore, the density of a protein can be predicted from its sequence (Iqbal & Verrall, 1988;Kharakoz, 1997;Makhatadze et al., 1990;Zamyatnin, 1984), and the results obtained by MD simulations are in good agreement with the calculations from amino acid composition (maximum deviation of 2.7%).

Temperature dependence of protein density
Simulations of scorpion toxin variant-3, ribonuclease-A, lysozyme, P. glumae lipase and C. antarctica lipase B in a temperature range of 288-308 K were performed to determine the thermal expansion coefficients of these five proteins. The thermal expansion coefficients of ribonuclease-A and lysozyme determined by MD simulation are in good agreement with the experimental values (deviations of 2.4 and 14.6%, respectively), thus modelling of protein density by MD simulations is valid for a range of temperatures.
In contrast to protein densities which vary by less than 10%, the thermal expansion coefficients of the five proteins investigated here vary nearly twofold (4.3-7.1 × 10 À4 K À1 ). This high variation of the thermal expansion coefficients was confirmed by experimental data on proteins such as hen egg white lysozyme (4.1 × 10 À4 K À1 ) and bovine serum albumin (7.1 × 10 À4 K À1 ) (Chalikian et al., 1996). Considering the fact that proteins consist of the same material, the variation is surprisingly large. Therefore, it would be interesting to understand the molecular basis of the thermal expansion of proteins. Interestingly, there is no correlation between the thermal expansion of a protein and its protein density at 298 K ( Figure S9). The two proteins with a low density (P. glumae lipase and C. antarctica lipase B) have a similar thermal expansion coefficient (6.1 and 5.8 × 10 À4 K À1 , respectively), while the two proteins with high density (ribonuclease-A and scorpion toxin variant-3) have different thermal expansion coefficients (4.3 and 7.1 × 10 À4 K À1 , respectively). Lysozyme has a medium density and a thermal expansion coefficient (4.7 × 10 À4 K À1 ) similar to that of ribonuclease-A.
As another approach to analyse molecular determinants of thermal expansion, correlations between thermal expansion and secondary structure composition and its variation with temperature were analysed. However, no correlations of thermal expansion and secondary structure composition could be shown ( Figure S10) and no change in secondary structure composition in the analysed temperature range was observed ( Figure S11).
Because the thermal fluctuations increase with temperature, the correlation between protein flexibility and density at different temperatures was investigated. Therefore, the median B-factor at 298 K and the variation of protein flexibility with temperature were analysed. As the median equals to the maximum RMSF of the 50% most rigid protein atoms, it measures the flexibility of the core of the protein and therefore, in contrast to the average value, is independent of individual, highly flexible loop regions. Intuitively, we expected a positive correlation between the thermal expansion coefficient and the temperature dependence of the protein flexibility. In contrast, we found a negative correlation between thermal expansion coefficient and flexibility, and between thermal expansion coefficient and temperature dependence of protein flexibility: proteins with a high thermal expansion coefficient have a low median B-factor at 298 K ( Figure S12) and a low increase in protein flexibility with temperature ( Figure 4). This surprising observation could be explained by the difference between RMSF and config-urational entropy (Missimer et al., 2007). It has been shown that the configurational entropy increases with temperature (Missimer et al., 2007), and the configurational entropy of glasses can be expressed in terms of molar thermal expansion (Casalini, Capaccioli, Lucchesi, Rolla, & Corezzi, 2001). However, the RMSF does not strictly follow the configurational entropy: while the RMSF increased strongly with temperature for a thermolabile protein, it increased less for a more thermostable protein complex, indicating that the intramolecular interactions in thermostable proteins restrict the increase in fluctuations at increasing temperatures despite an increase in configurational entropy (Missimer et al., 2007). A broad variation in the temperature dependence of the protein flexibility was also observed for the five proteins investigated here: for two proteins (scorpion toxin variant-3 and C. antarctica lipase B), the protein flexibility varied only slightly with temperature (.07 Å 2 /K), while the atomic fluctuations of ribonuclease-A increased strongly with temperature (.27 Å 2 / K). Thus, a high thermal expansion coefficient results in a slow increase in atomic fluctuations at increasing temperature. Because thermal fluctuations decrease the kinetic barrier to unfolding, a high thermal expansion coefficient is expected to contribute to protein stabilization. Increasing the thermal expansion coefficient would then be an appropriate design strategy for creating thermostable proteins. However, it is not yet clear how the thermal expansion coefficient is encoded in the sequence and structure of a protein, but it would be highly interesting to study in more detail the molecular basis of protein expansion.

Conclusion
Using MD simulations, protein density and its temperature dependence were modelled based only on the sequence and structure of a protein and a generic, transferable force field without the need of prior parameterization. The method was applied to evaluate two basic material properties of proteins, density and thermal expansion, which were in good agreement with experimental values. The protein density is determined by the amino acid composition of the protein and is independent of molecular weight. The thermal expansion coefficient is linked to the temperature-dependent increase in protein flexibility: proteins with a high thermal expansion coefficient show a small increase in flexibility at increasing temperature, indicating that a high thermal expansion coefficient might contribute to protein thermostability.