Differences between solvated and generalised Born calculations in GPU CUDA and multi-CPU simulations of an antifreeze protein with AMBER

Abstract While there has been an increase in the number of biomolecular computational studies employing graphics processing units (GPUs), results describing their use with the CUDA implementation of the molecular dynamics package AMBER are scarce. No information is available comparing the MD methodologies available in AMBER (pmemd.cuda, pmemd.mpi and sander.mpi) for generalised Born (GB) simulations or for solvated systems. As part of our current studies with antifreeze proteins (AFPs), and for the previous reasons, we present details of our experience comparing the performance of MD simulations at varied temperatures between multi-CPU runs using sander.mpi and pmemd.mpi and single-GPU runs using pmemd.cuda, with the AFP from the fish ocean pout (1KDF). We found extremely small differences in total energies between the multi-CPU and GPU CUDA implementations of AMBER12 in 1 ns production simulations of the solvated system using the TIP3P water model. Additionally, GPU computations achieved typical one-order-of-magnitude speedups when using mixed precision, but were similar to CPU speeds when computing with double precision. However, we found that GB calculations were highly sensitive to the choice of initial GB parametrisation regardless of the methodology used, with substantial differences in total energies.


Introduction
The availability of computational resources using graphics processing units (GPUs) has opened up many opportunities to speed up biomolecular computations, since many simulations in this area are still limited by the availability of computational resources. Traditionally, MD simulations rely on serial or parallel computations executed only on CPUs, but GPUs can now be used to extend, accelerate or devise novel computational approaches for biomolecular and other studies using single or multiple GPUs. [1][2][3][4] An increasing number of simulation packages have been ported to run on GPUs, for example LAMMPS, [5] GROMACS, [6] NAMD, [7] BINDSURF [8] and also AMBER. [9,10] Independent tests comparing GPU and multi-CPU runs on the same machine have been performed with independent or in-house developed software, [11] as well as by the software developers when releasing their own GPU-capable implementations. However, there is still a need for more published results comparing GPU and non-GPU calculations conducted by software end users. Our motivation was to provide additional information to the reports made by the package developers after the implementation and release of their GPU versions. For the tests presented in this paper, we used the PDB structure of an antifreeze protein (AFP), 1KDF, from the fish ocean pout, which we are currently analysing in an extended and separate study. Our particular aim was to verify hardware and software performance and to determine the effectiveness of using the CUDA implementation of the molecular dynamics package AMBER [12,13] for our particular combination of hardware and protein under study. This implementation replaces the already established basic energy minimiser and molecular dynamics program sander within AMBER.
Antifreeze glycoproteins and AFPs comprise molecules with varied structures that share the ability to inhibit the growth of ice. They have been identified in numerous organisms including fish, [14] and appear capable of depressing the freezing point of water, in some cases by several degrees at low concentration (10^−3 M or lower). MD simulation methods have been used to study AFPs in a variety of situations, including Ca2+-dependent structure, [15] to predict ice-binding capability in different environmental conditions [16] or to study ice and water properties around them. [17] Here we show that calculations comparing multiprocessor runs with sander.mpi, with pmemd.mpi and with pmemd.cuda in AMBER12 on a single GPU are essentially identical. We found extremely small differences in total energies in 1 ns NPT production simulations of the solvated protein 1KDF using the TIP3P water model. As expected, GPU computations achieved typical one-order-of-magnitude speedups when using mixed precision but were similar to CPU speeds when computing with double precision. To demonstrate additional capabilities of the CUDA implementation of AMBER for calculations with this type of molecule, results of 0.1-μs production accelerated molecular dynamics (aMD) calculations are also presented here.

Structure preparation
For 1KDF (shown in Figure 1(a)), the initial folded structure was obtained from the protein data bank (rcsb.org), while the extended residue sequence, shown in Figure 1(b), was generated using the tleap program in AmberTools 13. Using the 1KDF PDB file as the input structure, protons were added according to the calculated ionisation states of the chemical groups at pH 7.0 using the H++3.0 server at http://biophysics.cs.vt.edu/H++. [18] The corresponding coordinate and topology files were generated using the AMBER99SB force field. The molecule was solvated with a 10 Å cubic box of water using the TIP3P water model, and the final solvated structure contained a total of 12,491 atoms. Input files for the generation of the extended sequence with tleap and additional information about structure preparation are included in Supplementary Material sections S2 and S3.
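As an illustration of this preparation step, a minimal tleap input sketch is shown below. The file names are placeholders and the values are only representative; the actual inputs are those given in Supplementary Material sections S2 and S3.

source leaprc.ff99SB                        # AMBER99SB force field
mol = loadpdb 1kdf_protonated.pdb           # PDB with protonation states assigned by H++ at pH 7.0
solvateBox mol TIP3PBOX 10.0                # TIP3P water box with a 10 A buffer around the protein
saveamberparm mol 1kdf_solv.prmtop 1kdf_solv.inpcrd
quit

A script of this kind is executed with, for example, tleap -f leap_solvate.in, producing the topology (prmtop) and coordinate (inpcrd) files used in the simulations.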

Computational methods
All classical molecular dynamics simulations described in this work were conducted with AMBER12.

Execution environment
Computations were performed with an Asus Quantum TXR-500-032R (EXXACT Corp.) workstation, with an Intel Core i7-3930K processor (12 MB cache, 3.20 GHz) and 32 GB DDR3 1600 MHz memory. The workstation had an NVIDIA GeForce GTX 780 available for GPU computations, with version 5 of the CUDA toolkit installed. The OS was CentOS 6.5 and MD simulations were run with AMBER 12, compiled using gcc-4.4.7. At the start of our work we had access only to AMBER12, which was upgraded to AMBER14 by the developers during the period in which we were running our calculations. Additional computations for generalised Born (GB) runs were performed on a second computer, referred to from now on as the HP workstation. All computations were performed in the execution environment shown in Table 1.
AMBER was compiled (a) to run the parallel version of sander (sander.mpi), (b) to use pmemd (pmemd.mpi), (c) to use the CUDA version pmemd.cuda with mixed precision (SPDP) and (d) to use the CUDA version pmemd.cuda with double precision (DPDP). These versions were used depending on the simulation type and molecule, as described in Table 2.
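For reference, typical command lines for the three methodologies are sketched below. Input and output file names and the thread count are illustrative placeholders; note that the precision model of the CUDA code is fixed when AMBER is configured and compiled (in AMBER12 the default mixed-precision build provides pmemd.cuda_SPDP and the double-precision build pmemd.cuda_DPDP), not at run time.

# multi-CPU runs with sander or pmemd
mpirun -np 12 sander.MPI -O -i prod.in -p 1kdf_solv.prmtop -c heat.rst -o prod_sander.out -r prod_sander.rst -x prod_sander.nc
mpirun -np 12 pmemd.MPI  -O -i prod.in -p 1kdf_solv.prmtop -c heat.rst -o prod_pmemd.out -r prod_pmemd.rst -x prod_pmemd.nc
# single-GPU run (SPDP or DPDP executable, depending on how the code was built)
pmemd.cuda -O -i prod.in -p 1kdf_solv.prmtop -c heat.rst -o prod_cuda.out -r prod_cuda.rst -x prod_cuda.nc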
Five different sets of simulations were run: the first to compare GPU performance vs. multi-CPUs using the crystal structure of the molecule, both solvated and in gas phase; the second to check GPU reproducibility and precision models; the third to observe the effect of GB parametrisation in calculations with the GPU vs. CPUs using the extended sequence generated with tleap; the fourth to run a short accelerated MD on the GPU; and the fifth to compare parameters extracted from trajectories of GPU and CPU runs at varied temperatures (248, 268, 298, 310, 343 and 373 K) for the 1KDF molecule. The 1KDF molecule was used for all the tests reported, while the topology and coordinate files of the Joint Amber Charmm benchmark (JAC, containing the dihydrofolate reductase (DHFR) structure with 23,558 atoms) were used only as a background benchmark for the performance and reproducibility tests. The short 100-ns aMD calculations were run with 1KDF at six temperatures (248, 268, 298, 310, 343 and 373 K) with the TIP3P water model and the NPT ensemble. For the molecules used in GPU runs for performance maximisation, as described in Table 3, we followed the technical considerations suggested by the AMBER developers, as described and commented on in section S1 of the Supplementary Material.

Here we briefly mention the general considerations for the methodologies used in these simulation sets; detailed descriptions and input files for these calculations are included in Supplementary Material section S4. As a general procedure, solvated or gas-phase structures were first minimised for 1000 steps, with the initial 500 steps using the steepest descent algorithm and the final 500 steps using conjugate gradient energy minimisation, with constraints applied to the protein residues. This was followed by a second minimisation of 5000 steps, with the last 2000 steps using the conjugate gradient method without any restraints on the system. The system was then heated to the specified temperature over 100 ps, and production trajectories of 0.9 ns were obtained in the isothermal-isobaric ensemble (1 atm), for a total run of 1 ns. Except where noted for some GB simulations, in which the Berendsen thermostat [19] was used, all simulations used Langevin dynamics, with SHAKE bond-length constraints applied to all bonds involving H atoms and with the particle mesh Ewald method used to calculate long-range electrostatic interactions as implemented in AMBER. Depending on the simulation, time steps were 1.0 or 2.0 fs. A formatted restart file and default values for all other inputs were used for the program pmemd. Where applicable, simulations were performed with a 10-Å cut-off for non-bonded interactions. For the tests on the effect of GB parametrisation in calculations with the GPU vs. CPUs, the extended-sequence molecules were prepared with tleap using (a) the default parameters in AMBER for GB simulations [20] (referred to as mbondi) and (b) the modified parameters for GB simulations reported by Onufriev [21] (mbondi2). For the DHFR JAC benchmark, the regular simulation script included in the downloaded AMBER12 benchmark suite was used.
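As a concrete illustration of these settings, a production input for the solvated system at 298 K might look like the sketch below. The step counts and output frequencies are representative values rather than the exact ones used; the actual inputs are those in Supplementary Material section S4.

 &cntrl
   imin = 0, irest = 1, ntx = 5,             ! continue from the heated restart file
   nstlim = 450000, dt = 0.002,              ! 0.9 ns of production at a 2 fs time step
   ntc = 2, ntf = 2,                         ! SHAKE on bonds involving hydrogen
   cut = 10.0,                               ! 10 A non-bonded cut-off; PME handles long-range electrostatics
   ntb = 2, ntp = 1, pres0 = 1.0, taup = 2.0, ! isothermal-isobaric ensemble at 1 atm
   ntt = 3, gamma_ln = 2.0, temp0 = 298.0,   ! Langevin dynamics
   ntpr = 1000, ntwx = 1000, ntwr = 50000,   ! output frequencies (kept identical for solvated and GB runs)
 /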
After each simulation, the total energy, root mean square deviation (RMSD) and root mean square atomic fluctuation (RMSF) were analysed. For the tests on the effect of GB parametrisation in calculations with GPU vs. CPU, changes in dihedrals and bond angles were also analysed. For RMSD and RMSF, data were extracted from the trajectories using the AmberTools 13 and Amber 14 versions of CPPTRAJ. [22] The RMSD and RMSF were calculated by least-squares fitting the structure to either the first frame of the reference structure or the average energy-minimised structure, as indicated in each case. Molecular graphics were generated using VMD version 1.9.1 [23] and plots were generated using the utility Grace version 5.1.22 in CentOS or the software package Origin Pro.
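A minimal CPPTRAJ input of the kind used for this analysis is sketched below; the mask and file names are illustrative placeholders.

parm 1kdf_solv.prmtop
trajin prod_cuda.nc
rms first @C,CA,N out rmsd_backbone.dat             # backbone RMSD with respect to the first frame
atomicfluct out rmsf_byres.dat :5-60@C,CA,N byres   # per-residue backbone RMSF for residues 5-60
run

Such a script is run with cpptraj -i analysis.in, and the resulting data files were plotted with Grace or Origin Pro as described above.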

Reproducibility tests
We found excellent reproducibility in solvated calculations performed with the GPU and with multithread sander, but not with multithread pmemd. MD simulations on GPUs are deterministic, and multiple simulations on the same hardware should produce bitwise-identical results in the absence of hardware errors. As a background check, we first performed reproducibility tests at three different time periods for our GPU using the standard GPU tests available on the Amber website. These were run with pmemd.cuda using the JAC benchmark for 1 ns at 300 K with two different precision models, DPDP (n = 5) and SPDP (n = 20). As shown in Table 4, the NVIDIA GeForce GTX 780 installed in our Asus Quantum TXR-500-032R workstation provided bitwise-identical results in standard tests repeated over several months: after months of GPU use, there were no observed differences in the final total energies calculated after running 20 cycles per test (SPDP precision) or 5 cycles (DPDP precision). Relative variations along the 1 ns calculation for the total energies or densities, shown in Supplemental Figure S2, remained small when comparing the SPDP and DPDP precision models. However, we found that RMSD values for DPDP precision remained consistently above those for SPDP (Supplementary Material Figure S2, C), although well within reasonable values for the expected displacements in the JAC benchmark production calculation. Presumably, these small deviations are due to the different order of execution of the floating point operations in the CPU and GPU implementations (or to the SPDP/DPDP implementation), as reasoned in Ref. [24]. There was a final negligible difference of 5 kcal/mol in total energies between the two precision arithmetics (first and second rows in Table 4). For our molecule of interest, 1KDF, the tests indicated the same GPU reproducibility (last row in Table 4).

As noted, GPU simulations in AMBER can use three different types of precision arithmetic: (a) single precision for the entire process; (b) a combination of single precision for the calculation and double precision for the accumulation (SPDP, the default for GPU runs); and (c) double precision for the entire process (DPDP, as is done in CPU runs). [24] To determine whether there were differences in reproducibility between multithread sander, multithread pmemd and the single-GPU pmemd, we repeated the same simulation five times with each methodology. These runs indicated a small difference (0.3%) in total energies calculated using the SPDP CUDA GPU and the DPDP sander.mpi (Table 5, rows one and two). There was no reproducibility in the CPU pmemd.mpi runs. Since the Ewald error estimate is not calculated when running on a GPU, we verified that the Ewald error estimate was reasonable using the sander runs, which were identical in all five cycles with a value of 0.9745E−04.
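A check of this kind can be scripted in a few lines. The sketch below, with placeholder file names taken to stand for the JAC benchmark inputs, repeats the same pmemd.cuda run and compares the printed energies, which should be bitwise identical on the same GPU in the absence of hardware errors.

#!/bin/bash
# run the same benchmark input five times on the GPU
for i in 1 2 3 4 5; do
  pmemd.cuda -O -i mdin.GPU -p prmtop -c inpcrd -o jac_run${i}.out -r jac_run${i}.rst -x jac_run${i}.nc
done
# deterministic GPU execution: every energy record should match run 1 exactly
for i in 2 3 4 5; do
  diff <(grep "Etot" jac_run1.out) <(grep "Etot" jac_run${i}.out) \
    && echo "run ${i}: energies identical to run 1"
done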

Performance and speedup ratios GPU/CPU for explicit and GB simulations for 1KDF
As expected, we observed substantial increases in performance (ns/day) in simulations run with the GPU vs. CPUs. Table 6A demonstrates that GPU pmemd.cuda runs much faster than the CPU runs (9× or 17×) in solvated simulations of 1KDF. Table 6B shows performance with the Amber GPU Benchmark Suite as a background test. Our GPU scaled fairly well, as indicated by comparing our GTX780 JAC NPT run of ~80 ns/day with the GTX780 JAC NPT benchmark of ~77 ns/day reported on the Amber website for AMBER12. These results are also in line with the speedups reported by other groups; for example, Suhartanto et al. [25] showed GPU vs. CPU speedup ratios of about 10 for 500-ps runs of the DHFR simulation using AMBER 11. All these results are for mixed precision runs (SPDP). When the GPU benchmark is run using full double precision (DPDP), however, the performance is similar to a pmemd run on the CPU (Table 6B). This issue has been discussed by Le Grand et al. [24], but is not normally evaluated in published reports, presumably because most researchers use the default AMBER precision arithmetic during compilation for both CPU and GPU. We note that our result is similar to the GPU DPDP/SPDP speedup previously reported for a GTX680 (8.7×, Ref. [24]); we obtained 8.6× with the GTX780, as indicated in Table 6B.

Since in our work with AFPs we are interested in studying possible effects of the final simulation temperature on the stability and energetics of the molecules, we also looked for changes in performance depending on the final production temperature for the three methodologies, including the GB simulations. Table 7 shows the performance breakdown of simulations run at different final production temperatures for both GB and explicit water for 1KDF. It should be noted that the ntpr, ntwx and ntwr parameters in the input files were the same for both solvated and GB simulations. Overall, this indicated ~17× faster runs for the solvated molecule using pmemd.cuda vs. sander.mpi and ~9× faster vs. pmemd.mpi.

There is less information in the literature comparing speedup ratios for solvated vs. GB simulations of the same molecule. However, GB calculations ran only ~1.5× to 2× faster than the solvated molecule for the sander.mpi and pmemd.mpi runs, but ~40× faster for pmemd.cuda. These results are visualised in Figure 2, where the average performance over all temperatures is shown. Similar results are reported on the Amber12 GPU benchmarks website for implicit solvent simulations of myoglobin or the nucleosome, with GTX780 vs. 16-CPU speedup ratios of ~29× for the nucleosome and ~23× for myoglobin.

Energies, RMSD, RMSF and additional analysis for CPU and GPU simulations at varied temperatures
Figure 3 presents energy values from calculations with sander.mpi, pmemd.mpi and pmemd.cuda at six different temperatures for the solvated and unsolvated molecule. From the point of view of the energy values, the stability of all runs with pmemd.cuda is comparable to the CPU sander.mpi and CPU pmemd.mpi implementations of AMBER12 in these 1 ns production simulations of the solvated protein (Figure 3(a)) and in the GB calculations in the gas phase (Figure 3(b)) at all six temperatures. For each temperature, differences in average total energy were only between 2 and 10 kcal/mol, for total energies in, for example, the −35,000 kcal/mol range at 248 K, representing 5% or less of the corresponding RMS fluctuations of each calculation. The stability of these results is also confirmed by comparing the final densities, as shown in Figure 3(c).

While total energies remained stable over the 1 ns runs regardless of the methodology, analysis of the backbone RMSD, shown in Figure 4, indicated that all GPU and CPU runs for the solvated molecule were affected only by the final temperature (Figure 4(A), first row), with higher RMSD values at higher temperatures. For the calculations in gas phase, however, the backbone RMSD showed much higher fluctuations, and these varied with simulation type and temperature (Figure 4(A), second row). This variability is illustrated in Figure 4(B), with details of the RMSD at 248 and 298 K. While GB runs were expected to show higher fluctuations, we did not expect the fluctuations to vary randomly with temperature between pmemd.cuda, pmemd.mpi and sander.mpi. This aspect is also highlighted in Figure 1 of the Supplementary Material, where the backbone root mean square fluctuations (RMSFs) of residues 5 to 60 with respect to the crystal structure at varied temperatures for the 1 ns trajectories are shown. While for the solvated molecule the differences in RMSF were small between the three methodologies, in the GB runs the higher temperatures showed increased atomic fluctuations for pmemd.cuda and pmemd.mpi but not for sander.mpi.

Selection of GB parameters affects energy results, independently of simulation methodology
For this particular analysis, only the extended conformation generated with tleap was used, not the crystal structure. GB simulations in AMBER12 allow the selection of the type of effective Born radii when the molecules are prepared, as described in the AMBER12 manual. Since we are considering using our GPU for GB calculations of the 1KDF molecule in the future, we wanted to study whether there were differences in the calculations when using the GPU (pmemd.cuda) vs. the CPU (pmemd.mpi), by preparing the molecule with tleap with two different parameter sets, labelled mbondi and mbondi2 in AMBER (the default set [20] and the modified set of Onufriev et al. [21], respectively). In this particular case, we ran 50-ns production simulations at two temperatures, 268 and 298 K; the results are shown in Figure 5. Initially, we found a substantial increase in energies for pmemd.cuda when going from mbondi to mbondi2. To verify that this difference was not related to our computer system or to the heating procedure, we then ran the same simulation on the CPU (pmemd.mpi) on both the Exxact and HP machines (see Table 1). We also changed the heating procedure, using a Berendsen thermostat instead of the Langevin dynamics used in all other calculations. We note that all GB runs were done with the igb variable set to 1 while changing the parameter set in tleap from the default mbondi to mbondi2; using igb = 2 produced the same output as igb = 1 for pmemd.cuda (data not shown). The coordinate files generated were identical and the topology files had equal charges assigned; the differences between the topology files were in the dihedral force constants, dihedral phases and the mbondi radii. The plot of RMSDs with respect to the first frame of the extended molecule (Figure 5(b)) shows a substantial difference in values between thermostats, but not between the GPU and CPU calculations, indicating that the parametrisation with different types of effective Born radii affects essentially the energetics of the trajectories.

Figure 5. Comparison of total energies (top row) and RMSDs with respect to the first frame (second row) for the extended conformation (from tleap) of 1KDF when using different parameters for the Bondi radii, labelled mbondi and mbondi2, for 50-ns production calculations in gas phase at 268 and 298 K. Runs labelled pmemd ran on the CPU using pmemd.mpi and runs labelled CUDA ran on the GPU. All runs used Langevin dynamics except the one labelled Berendsen, which used a Berendsen thermostat with the parameters described in the Supplemental data.

This was additionally explored by analysing the bonded, non-bonded and dihedral energies, shown in Figure 6 for the 268 K run. Except for the dihedral energies, all bonded and non-bonded interactions remained unaffected, so the differences in energies shown in Figure 5 are due to the changes in dihedral energies derived from the implementation of the two Born radii models, and these results were independent of the procedure or machine used for the calculations. These results indicate that while the GPU-based and CPU-based calculations will always provide equivalent energy values, the choice of the model of effective Born radii may have a substantial impact on the energies along the trajectory.

Figure 6. Comparison of energies from dihedral (A), non-bonded (B) and bonded (C) terms for the extended conformation (from tleap) of 1KDF when using different parameters for the Bondi radii, for 50-ns production calculations in gas phase at 268 K. Runs labelled pmemd ran on the CPU using pmemd.mpi and runs labelled CUDA ran on the GPU. All runs used Langevin dynamics except the one labelled Berendsen, which used a Berendsen thermostat with the parameters described in the Supplemental data.
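To make the comparison concrete, a sketch of the two preparations and of the GB production input is given below. File names, the source of the extended structure and the run length are placeholders (the full inputs are in the Supplementary Material); only the radius set written into the topology changes between the two preparations.

# tleap fragment: only the PBRadii line differs between the two topologies
source leaprc.ff99SB
mol = loadpdb 1kdf_extended.pdb        # extended sequence built previously with tleap
set default PBRadii mbondi2            # or "mbondi" for the default radius set
saveamberparm mol 1kdf_ext_mbondi2.prmtop 1kdf_ext_mbondi2.inpcrd
quit

The GB production input is then identical for both topologies, with igb = 1 in each case:

 &cntrl
   imin = 0, irest = 1, ntx = 5,
   igb = 1, ntb = 0, cut = 999.0,       ! generalised Born, no periodic box
   ntt = 3, gamma_ln = 1.0, temp0 = 298.0,
   nstlim = 25000000, dt = 0.002, ntc = 2, ntf = 2,   ! 50 ns at a 2 fs time step
 /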

Accelerated MD: comparison of aMD vs. non-aMD runs when using 1KDF crystal solvated structure
Recently, the possibility of using special methodologies to accelerate the study of biomolecular processes with AMBER was reported, [9,10] termed accelerated MD (aMD). In these methodologies, a bias potential is introduced with the objective of reducing the height of local energy barriers, allowing the calculation to evolve much faster than it otherwise would. Adding the aMD potential modifies the relation between energy differences. To assess this aspect, and how the methodology could be implemented for our particular system using the GPU, we conducted a preliminary calculation using the standard parametrisation reported in the AMBER12 manual, following the procedure described in Ref. [10]. In our case, we implemented a boost to the whole potential with an extra boost to the torsions (using iamd = 3 in the input file). We ran these computations for 100 ns for the solvated 1KDF crystal structure, using the values of the parameters EthreshD (Ed), alphaD (αD), EthreshP (Ep) and alphaP (αP) for each temperature described in Supplementary Material section S5. Our intention here was to assess, in this simple form, how using pmemd.cuda affected the stability of the energy and RMSD values in the boosted calculation relative to the values of the original 1 ns run. This is shown in Figure 8(a), where the total energies for the initial 1 ns production runs, reported above, are followed by a 5-ns run without the additional potential and then switched to the aMD run. For the energies, the only effect taking place was an automated increase in the total energy corresponding to the boost potential values added in the input file of the simulation. Figure 8(b) shows how the potential boost affected RMSD values, and Figure 8(c) shows how it affected RMSF values at each temperature over the 100-ns calculation, in comparison with the initial 1 ns run. As was somewhat expected, the potential boost substantially increased the RMSDs at the higher temperatures; backbone RMSDs also increased for the 298 and 310 K runs. The same result was expected for the atomic fluctuations, but interestingly the technique allowed us to observe how the boost affected atomic fluctuations broadly across all residues (for 373 K), selectively for some specific residues (residues 28 to 40 for 343 K, or residue 19 for 310 K), or had little effect on the atomic fluctuations for the 248, 268 and 298 K runs.

Figure 8. Accelerated MD: comparison of aMD vs. non-aMD runs for the solvated 1KDF crystal structure. In (A), the total energies for the initial 1-ns production runs (dotted lines) are followed by a 5-ns run without the additional boost potential and then switched to the aMD run; the increments correspond to the boost potential values added in the input file of the simulation (total time shown truncated at 10 ns). (B) Comparison of the RMSD at all temperatures for the initial 1 ns and the boosted 100-ns aMD. (C) Same as (B), but for the atomic fluctuations.
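A sketch of the corresponding aMD input is shown below. The boost parameters are illustrative placeholders only; the actual per-temperature values of EthreshD, alphaD, EthreshP and alphaP are those listed in Supplementary Material section S5.

 &cntrl
   imin = 0, irest = 1, ntx = 5,
   ntb = 2, ntp = 1, pres0 = 1.0,             ! solvated crystal structure, NPT
   ntt = 3, gamma_ln = 2.0, temp0 = 298.0,
   ntc = 2, ntf = 2, cut = 10.0,
   nstlim = 50000000, dt = 0.002,             ! 100 ns at a 2 fs time step
   iamd = 3,                                  ! boost the whole potential plus an extra torsion boost
   ethreshp = -33000.0, alphap = 600.0,       ! total-potential boost threshold and alpha (placeholder values)
   ethreshd = 500.0,    alphad = 100.0,       ! dihedral boost threshold and alpha (placeholder values)
 /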

Folding extended sequences: GPU CUDA and CPU pmemd comparison
One of the reasons to use GPUs for biomolecular simulations is the obvious shortening of the time required to achieve similar results without compromising accuracy. [24] In the results above, we have shown this to be the case for our particular system and type of calculations. A secondary objective in our simulation set-up for the gas phase was to attempt to observe the possible correct folding of our protein of interest, 1KDF, using the GPU CUDA code, albeit without exploring additional conformational space by changing the initial minimisation and heating procedures. Since we already had the system set up to compare the energetics, we decided to take the lowest energy snapshots from each trajectory of the 50-ns GB runs of the extended sequence. We used the methodologies described above, that is, pmemd.cuda and pmemd.mpi with the described Langevin dynamics, and pmemd.mpi with the Berendsen thermostat, at both 268 and 298 K. After 50 ns we did not observe any structure close to the crystal one, and the same was true when extending the run to 250 ns. The snapshots are shown in Figure 7. To the best of the author's knowledge, there is not a single computational report in the literature describing the correct folding of AFPs, and it is not known at which temperatures the folding can take place or how long it takes to observe it. Although we did not sample all of configurational space, based on this initial 250-ns run we speculate that this protein will fold in the millisecond range, in accordance with the opinion that only 10% or less of proteins may fold in less than 1 ms. [26] As seen in the snapshots in Figure 7, there appears to be an increased number of beta sheets when using the Berendsen thermostat protocol vs. Langevin dynamics, a feature more evident in the 298 K runs; at 268 K, formation of secondary structure is slowed considerably. We are currently extending the GPU simulation (pmemd.cuda) towards the millisecond range to further study the folding of the molecule.

Conclusion
For our particular type of MD calculations with small, solvated AFP systems, GPU AMBER CUDA is a suitable method that gives exact, reproducible results comparable to those of the more established CPU methodologies, while reducing computational times by at least one order of magnitude. GPU computations achieved typical one-order-of-magnitude speedups when using mixed precision but were similar to CPU speeds when computing with double precision. We also showed that the aMD GPU calculations were stable over at least 100 ns and that the pmemd.cuda code could provide a faster methodology to observe the folding of this AFP. We conclude that, for our particular studies, results derived from GPU calculations with AMBER appear to be equal to those from CPU runs, but are obtained faster.