Preorientation of protein and RNA just before contacting

Protein and RNA molecules interact and form complexes in many biological processes. However, it is still unclear how they can find the correct docking direction before forming complex. In this paper, we study preorientation of RNA and protein separated at a distance of 5–7 Å just before they form contacts and interact with each other only through pure electrostatic interaction when neglecting the influence of other molecules and complicated environment. Since geometric complementary has no meaning at such a distance, this is not a docking problem and so the conventional docking methods, like FTDock, are inapplicable. However, like the usual docking problem, we need to sample all the positions and orientations of RNA surrounding the protein to find the lowest energy orientations between RNA and protein. Therefore, we propose a long-range electrostatic docking-like method using Fast Fourier Transform-based sampling, LEDock, to study this problem. Our results show that the electrostatically induced orientations between RNA and protein at a distance of 5–7 Å are very different from the random ones and are much closer to those in their native complexes. Meanwhile, electrostatic funnels are found around the RNA-binding sites of the proteins in 62 out of 78 bound protein–RNA complexes. We also tried to use LEDock to find RNA-binding residues and it seems to perform slightly better than BindN Server for 23 unbound protein–RNA complexes.

On the other hand, since electrostatic interaction is the only force at large separation (e.g. atom-atom distance > 5 Å) if neglecting the influence of other molecules and ions, it may play a key role in the process of protein-protein/RNA recognition before they become an encounter complex (Drozdov-Tikhomirov et al., 2003;Drozdov-Tikhomirov, Linde, Poroikov, Alexandrov, & Skurida, 2001;Kovalev, Drozdov-Tikhomirov, Poroikov, & Alexandrov, 2004;Qin & Zhou, 2009;Schreiber & Fersht, 1996;Schreiber, Haran, & Zhou, 2009;Schreiber, Shaul, & Gottschalk, 2006;Sinha & Smith-Gill, 2002). For examples, previous studies showed that the measured rates of protein-protein complex formation are much higher than those predicted by Brown collision (Alsallaq & Zhou, 2008;Drozdov-Tikhomirov et al., 2001Kovalev et al., 2004;Schreiber et al., 2009;Schreiber & Fersht, 1996). The barnase-barstar complex is a classic example (Camacho, Weng, Vajda, & DeLisi, 1999;Janin, 1997;Neumann & Gottschalk, 2009;Schreiber et al., 2009;Spaar, Dammer, Gabdoulline, Wade, & Helms, 2006) and its association rate constant is about 10 3 times higher than diffusion-controlled association rate constant (Gabdoulline & Wade, 1997). This may be explained by long-range electrostatic interactions. Camacho et al. (1999) calculated the electrostatic energy at 5 Å surfaceto-surface separations for 12 protein-protein complexes and found that electrostatic interaction exerts strong effect in preorientation of subunits in the formation of the complexes with strong opposite charges. But their sampling was done only on a particular plane and was insufficient. Kovalev et al. (2004) calculated the electrostatic energy on a homodimer (PDBid:1RGE) at 100 Å separation of two centers of subunits and also found that electrostatic interactions can achieve the effective preorientation of subunits, although they studied only one protein-protein complex. We hope to study this problem for protein-RNA complexes in our paper.
On account of the negative charges from phosphate groups, it is generally believed that electrostatic interactions are much more dominant in protein-RNA recognition before they form contacts with each other than in protein-protein complex formation. Since full spatial sampling of electrostatic interactions at a distance is similar to what docking methods do, it is better to use the efficient Fast Fourier Transform-based (FFT) sampling algorithm, which has been used in protein-protein docking programs, such as GRAMM (Vakser, 1997), FTDOCK (Gabb, Jackson, & Sternberg, 1997), ZDOCK (Chen & Weng, 2002;Chen, Li, & Weng, 2003), ASP-Dock (Li, Guo, Huang, Liu, & Xiao, 2011), and SDOCK (Zhang & Lai, 2011). However, these FFT-based sampling methods do not consider a sampling decoy whose subunits are separated 5 Å away, as shown in Figure 1. In fact, some degree of surface overlap is tolerated to account for conformational change when a preequilibrium encounter complex relaxes to the final specific com-plex. Therefore, in this paper we propose a long-range electrostatic docking-like method (LEDock) based on FFT algorithm to explore long-range recognitions between the subunits of 78 bound protein-RNA complexes and the electrostatic funnel of the protein-RNA interaction. We study the role of electrostatic interaction in pre-orientation of RNA and protein separated at a distance of 5-7 Å, just before they form contacts and interact with each other only through pure electrostatic interaction when neglecting the influence of other molecules and complicated environment.

Materials and methods Sampling
Since FFT algorithm was first applied in geometric surface recognition method by Katchalski-Katzir et al. (1992), it has been used in many docking protocol (Chen et al., 2003;Chen & Weng, 2002;Gabb et al., 1997;Li et al., 2011;Vakser, 1997;Zhang & Lai, 2011) due to its high efficiency in full space sampling. In our calculations, we also use the FFT-based sampling method to perform the full space sampling. In order to further enhance the efficiency of sampling, we also use the Lattman's method (Gabb et al., 1997;Lattman, 1972) to remove the redundant angles. The grid step and angle step are set as 2.0 Å and 15°, respectively. Different grid sizes for different molecular sizes are used to make sure that sampling structures can be accommodated. The half of grid size is the sum of protein radius and RNA diameter plus 10 Å. The size of the grids ranged from 100 to 256.
Based on FFT-based sampling algorithm, we construct LEDock algorithm (Figure 2), which can be divided into two parts: geometric constraint (GC) and electrostatic interaction.

Geometric constraint
This is to ensure that the distance between the closest atoms of protein and RNA is at 5-7 Å. First, the protein (denoted as P) and the RNA (denoted as R) are mapping onto a three-dimensional grid of N Â N Â N which is assigned as follows: GC p ðl; m; nÞ ¼ À1: surface of P 100 : core of P 0 : others Here, ðl; m; nÞ are the coordinates of the gridðl; m; n ¼ 1; . . . ; N Þ. P and R are treated differently. First, any grid node within 7 Å from any atom in P is assigned À1. Second, we assign 100 to all grid nodes within 5 Å from Figure 1. Schematic drawing of the difference between FFTbased docking method and LEDock. FFT-based sampling methods do not consider a sampling structure whose subunits are separated 5 Å away. On the other hand, our LEDock just considers a sampling structure whose subunits are separated.
any atom in P, which are considered as the core of P. So, the remaining grid nodes with À1 are considered as surface of P. Any grid node within 1.8 Å from any atom in R is considered as the inside of R. Others are assigned as 0. The value of the core of P is set much larger than others just to make sure that the penalty is large enough when two subunits are too close to each other. We also changed the value of the core of P to 50 and 200, but the results have no significant changes. The score of GC can be calculated using the correlation function of GC F and GC R : where N 3 is the number of grid points and a, b, and c are the translation grid steps of RNA R relative to protein P.

Electrostatic interactions
The electrostatic potential between RNA atom i and protein is given by Here, we adopted the approach by Gabb et al. (1997), except q i is the charge on RNA atom i which is adopted from the partial charge of atom in the AMBER parm03 force field (Meyer et al., 2010) in combination with the pdb2pqr program (Dolinsky, Nielsen, McCammon, & Baker, 2004). And / i is the protein electric field strength calculated by the Adaptive Poisson Boltzmann Solver (APBS) (Baker, Sept, Joseph, Holst, & McCammon, 2001). So, the electrostatic energy of each grid node S LE ða; b; cÞ can be expressed as follows: After each rotation, the electrostatic energy S LE ða; b; cÞ is saved only if the corresponding S GC ða; b; cÞ is less than zero and the S LE ða; b; cÞ is lower than the previous one.

Evaluation of the closeness between sampling and native orientations
We evaluate the closeness of the orientations between RNA and protein subunits of sampling structures to that in their native structures as follows: the geometric center of the protein subunit in native protein-RNA structure is considered as the origin of coordinate system. The y-axis is along the line connecting the geometric centers of the   (Continued) protein and its partner RNA in the native structure and z-axis is chosen arbitrarily. h angle is calculated from the projection of the geometric center of the sampling RNA structure with respect to the z-axis and / is calculated from the projection of the geometric center of the sampling RNA structure onto the xy plane with respect to the y-axis. Then, the protein subunit in the sampling protein-RNA structure is overlapped with that in the native structure and thus, the angles h and / of RNA in the sampling structure can be used to calculate the angle a from the projection of the geometric center of the sampling RNA structure with respect to the y-axis as follows: cos a ¼ cosð90 À hÞ Á cos /: We use this angle a to indicate the closeness of the orientation between RNA and protein subunits of sampling structures to that in their native structures. Since two monomers of a sampling protein-RNA decoy are separated 5 Å away, normal evaluation criterions such as root-mean-square deviation (RMSD) and fraction of native contacts cannot be used. To see how many sampling RNA structures have orientations less than a particular value of a relative to yaxis, we define the closeness rate of LEDock sampling as follows: where N a is the number of structures less than a particular value a. N max is the number of total structures in data-set and equals to 78 in this study. For comparison, we also calculated the closeness rate of random sampling without electrostatic interactions by the formula given below: Random closeness rateðaÞ ¼ ð1 À cos aÞ=2 Representation of electrostatic energy landscape in spherical coordinates For each protein-RNA complex, the sampling structures of the top 5000 lowest electrostatic energies are retained to plot the distribution of their poses by using the similar method in the paper by Fernandez-Recio, Totrov, and Abagyan (2004) as shown in Figures 5 and 6. The angles h and / are used to represent the sampling poses in a two-dimensional sinusoidal projection. The poses are colored from black to yellow according to the electrostatic energies from lower to higher. The 60°and 90°l ines of cone angle a are shown in figures correspondingly. The dot with the lowest electrostatic energy is marked as a yellow cross.

RNA-binding residues prediction and performance evaluation
A residue is defined as a RNA-binding residue if the distance between any heavy atom of the protein and any heavy atom of the RNA is less than 4.5 Å. For the prediction of RNA-binding residues, each residue gets a score, which is calculated as follows: where minðd k ij Þ is the closest distance between heavy atom j of residue i and any heavy atom k of RNA. N is the number of heavy atoms in residue i. Here, we used 100 sampling structures of lowest electrostatic energies. And then the scores of residues were normalized as follows: Score final ðiÞ ¼ 1 À ðScoreðiÞ À Score min Þ=ðScore max À Score min Þ where Score min and Score max are minimum and maximum values of scores, respectively. Finally, we chose top 30 residues with highest score values as the RNAbinding site residues.
Some widely used measures are adopted in this study, such as sensitivity SN = TP/(TP + FN) and specificity SP = TN/(TN + FP). We calculate the Matthews correlation coefficient (MCC) as follows: Here TP, TN, FP, and FN represent the true positives, true negatives, false positives, and false negatives, respectively.

Data-set
It is known that the tertiary structures of RNA molecules are the basis to understand their dynamics and functions (Gong & Xiao, 2010;Gong, Zhao, Chen, & Xiao, 2011;Zhang, Bian, Lin, & Wang, 2012;Zhang et al., 2009;Zhang, Zhang, & Wang, 2011;. To better study the association process, the structures of both complex and corresponding unbound subunits should be used, but at present few of them are found simultaneously in Protein Data Bank (PDB) (Berman et al., 2006). So we use protein-RNA complex structures and their bound subunits here for elaborate analysis and a relatively small unbound data-set for RNA-binding residues prediction. The native protein-RNA complex structures are extracted from the PDB with resolution better than 3.0 Å and sequence identity less than 30%. Missed heavy atoms and hydrogen atoms are added through tleap of AMBER Version 11 program (Case et al., 2010). Finally, we construct a data-set with 78 bound protein-RNA complex structures and they are listed in Table 1. Twenty-three unbound protein-RNA structures listed in Table 2 are selected from data-set adopted in the paper by Li, Cao, Su, Yang, and Wang (2012). The selection criteria are the integrality and proper size of the protein and RNA subunits. As we submitted our paper, two data-sets of protein-RNA are published (Barik, Nithin, Manasa, & Bahadur, 2012;Perez-Cano, Jimenez-Garcia, & Fernandez-Recio, 2012) recently. The sizes of the new data-sets are 45 and 71, respectively. And the latter is almost the same size as ours.

Comparing electrostatic and random preorientations
Preorientation of subunits is checked before they become an encounter complex by closeness rates of LEDock and random samplings. Closeness rate relative to the cone angle used here is borrowed from docking problem and just describe how many RNA molecules have their orientation smaller than a particular angle. The higher the closeness rate, the more subunits are closer to their native binding orientations. As described in Figure 3, the closeness rate of LEDock sampling is much higher than random sampling, especially when cone angle a is smaller (<90°). The largest difference between them is 48.1% when the cone angle a = 60°. Preorientation of subunits ða\60 Þ is found in 57 out of 78 protein-RNA complex subunits (73.1%). This indicates that the orientations of RNA and protein molecules with lowest electrostatic energies are much closer to those in their native complexes than random orientations.

Distributions of docking sampling poses for three types of protein-RNA complexes
All the protein-RNA complexes can be divided into three categories according to the type of RNA: paired RNA type, unpaired RNA type, and tRNA type. The complex consisted of protein and tRNA is considered as tRNA type. If no base pairing exists in RNA subunit, the complex is defined as unpaired RNA type. The others are considered as paired RNA. The closeness rates of these three types of protein-RNA complexes are shown in Figure 4. Protein-paired RNA complexes have a highest closeness rate while protein-unpaired RNA complexes have a lowest closeness rate. Furthermore, for almost all the cone angles, the closeness rates of all three types of protein-RNA complexes are significantly higher than that of random samplings. These suggest that the preorientation of the paired RNA type of complexes may be mainly determined by electrostatic interaction. Figures 5 and 6 show some typical distributions of the positions of the sampling structures of the first 5000 lowest electrostatic energies for the two types of the protein-RNA complexes. Since tRNA type is just a special case of paired RNA type, their distributions of sampling are similar. Figure 5 shows the LEDock sampling distributions of two protein-RNA complexes of paired RNA type. Figure 5(B) shows a spherical coordinates projection of the positions of geometric centers of the RNA of the first 5000 lowest electrostatic energies for human LGP2 C-terminal domain and a double-stranded RNA complex (PDB 3EQT) (see 'Materials and methods' section for details). The yellow cross represents the position of the geometric center of the RNA with the lowest electrostatic energy. In this case, the angle a of the structure with the lowest electrostatic energy is 6.7°, which means the orientation of the RNA relative to protein is nearly the  Figure 5(C) and (D) show the same representation for elongation factor SelB and SECIS RNA complex (PDB 2PLY). The angle a of the structure with the lowest electrostatic energy is 76.9°(>60°), i.e. the orientation of the RNA is very different from that in native complex structure. However, the position of RNA is still at the same side of the binding interface because the angle a is smaller than 90°.

Complexes of paired RNA type
Complex of unpaired RNA type Figure 6(A) and (B) describes the complex of capmodifying and 5′-capped mRNA (PDB 1AV6). The cone angle a of the RNA with the lowest electrostatic energy is 140.2°, and the sampling and bound RNA appear having a mirror symmetry as shown in Figure 6(A). Even though most of the sampling structures deviate from the center, the top 5000 sampling structures still include some spots around the center as shown in Figure 6(B).
Figure 6(C)-(F) shows the FBF-2/FBE complex (PDB 3K5Q) and complex of HC-J4 polymerase and short RNA template strand (PDB 1NB7), respectively. In these cases, it seems that pure electrostatic interaction cannot lead to a proper orientation of the protein and RNA subunits on account of interaction between bases of RNA and residues of protein (Figure 6(C) and (D)) and RNA buried in protein (Figure 6(E) and (F)). Stacking interactions between bases and amino acids are common and strong short-range interactions between protein and RNA (Allers & Shamoo, 2001;Koh et al., 2011;Morozova et al., 2006), which cannot be identified using electrostatic interaction. Embedding is also a problem, which is hardly sampled by electrostatic interaction. In most protein-RNA complex of this type, the protein presents in its closed state, in which the binding pocket of protein shrinks to enclose the RNA. Long-range interactions occur when the protein stays in its open state, so it is hard to find the correct orientation of this type of protein-RNA complex using bound structures. Electrostatic energy funnel Experimental evidence suggests that the association rate constant is usually higher than diffusion-controlled association when opposite charges are distributed on the binding partners, such as barnase and barstar (Schreiber & Fersht, 1996) and DNase E9 and Im9 (Wallis, Moore, James, & Kleanthous, 1995). Electrostatic energy funnel at the protein surface could more easily capture the partner (RNA), thereby enhancing the binding efficiency. Here, the distribution of electrostatic energies of the full spatial sampling structures relative to the cone angle α is shown in Figure 7 (those of all 78 protein-RNA complexes are given in Supporting Information). Most of the complexes (62 out of 78 protein-RNA complexes) have electrostatic funnels around the RNA-binding sites of proteins. The cone angle with the minimal average energy is less than 10°in 49 complexes (62.8%) and 60°i n 62 complexes (79.5%) out of all 78 protein-RNA complexes.

RNA-binding site prediction
To evaluate the ability of our program in RNA-protein recognition, we also predicted RNA-binding residues using 100 sampling structures of the lowest electrostatic energies (see 'Materials and methods' section for details). The performance of the methodology is evaluated on 23 unbound protein-RNA complexes and compared with the BindN server (Wang & Brown, 2006). Figure 8 shows a representative example of the binding site prediction. The predicted binding sites for all cases are listed in Table S1. The ROC curve is shown in Figure 9 and the area under the ROC curve (AUC) is 0.80. We achieved 48% in sensitivity, 88% in specificity, and 0.33 for the MCC value and the BindN server (setting the expected specificity to 80%) achieved 35% in sensitivity, 88% in specificity, and 0.21 for the MCC value. Since our method uses only electrostatic energy without any training method or adjustable parameter, it may be more reliable and physically meaningful than BindN.

Protein-RNA association at different salt concentrations
The results above indicate that electrostatic interaction would be one of the key factors in preorientation of RNA and protein before they form complex structure. To   further show this, we use LEDock to study protein-RNA association in vitro at different salt concentrations. Here, we consider a protein-RNA complex composed of Carnation Italian ringspot virus (CIRV) p19-H and siRNA (PDBid:1RPU) (Vargason, Szittya, Burgyan, & Hall, 2003). The experimental data of the dissociation constants for CIRV p9-H:siRNA binding under different salt concentrations from 0.1 to 1.1 M are shown in Table 3, which are extracted from the reference (Koukiekolo, Sagan, & Pezacki, 2007). We use APBS software to calculate the lowest electrostatic energies of the sampling structure under different ion concentrations and the results are listed in Table 3. We find that the lowest electrostatic energy and the dissociation constant change consistently under different ionic strengths, i.e. the lower the ion concentration, the smaller the dissociation constant and the lower the electrostatic energy. Sampling structures of the lowest electrostatic energy under different ionic strengths are shown in Figure 10. RNA structures colored in red, blue, and yellow represent sampling structures from lower ion concentration to higher one. We find that the lower the ion concentration, the closer to the native structure is the sampling RNA structure with the lowest electrostatic energy. These results indeed show that electrostatic interaction is important to the preorientation of CIRV p19-H and siRNA. Under higher ion concentration the strength of electrostatic interaction between RNA and protein is reduced and RNA and protein cannot find their correct orientation. It also shows that LEDock can find proper orientation between the RNA and protein subunits before they form complex.

Conclusions
Here, we analyzed the effect of electrostatic interaction on protein-RNA orientation using an FFT-based sampling method. The electrostatic energy algorithm adopted here is Coulomb's formula and no any other artificial parameter is used, which make the results more reasonable. A large data-set (78 protein-RNA complexes) is also used to get a more convincing conclusion.
The result shows that electrostatic interaction may be one of the important factors in preorientation of protein/ RNA subunits at a distance of 5-7 Å. The orientations of RNA and protein molecules with lowest electrostatic energies are very different from the random ones and are much closer to those in their native complexes. Electrostatic funnels are found around the protein at 5-7 Å separation, and most of them are located around RNAbinding sites. These funnels could make proteins to \capture RNAs much easily. We also used our method to predict RNA-binding residues and it performs slightly better than BindN (a sequence-based RNA-binding site prediction method).
The role of electrostatic interaction in preorientation is not significant in two classes of protein-RNA complexes. The first class includes single chain RNA without  base pairing. In this case, the stacking interaction but not electrostatic interaction between bases and amino acids is the dominating interaction. The second is when RNA is buried by its receptor protein. In this case, the protein subunits in this class seem to get their closed states through larger conformational change. The electrostatic fields are different between the closed state and its corresponding open state.