4D-QSAR investigation and pharmacophore identification of pyrrolo[2,1-c][1,4]benzodiazepines using electron conformational–genetic algorithm method

Abstract In this paper, we present the results of pharmacophore identification and bioactivity prediction for pyrrolo[2,1-c][1,4]benzodiazepine derivatives using the electron conformational–genetic algorithm (EC–GA) method as 4D-QSAR analysis. Using the data obtained from quantum chemical calculations at PM3/HF level, the electron conformational matrices of congruity (ECMC) were constructed by EMRE software. The ECMC of the lowest energy conformer of the compound with the highest activity was chosen as the template and compared with the ECMCs of the lowest energy conformer of the other compounds within given tolerances to reveal the electron conformational submatrix of activity (ECSA, i.e. pharmacophore) by ECSP software. A descriptor pool was generated taking into account the obtained pharmacophore. To predict the theoretical activity and select the best subset of variables affecting bioactivities, the nonlinear least square regression method and genetic algorithm were performed. For four types of activity including the GI50, TGI, LC50 and IC50 of the pyrrolo[2,1-c][1,4] benzodiazepine series, the r2train, r2test and q2 values were 0.858, 0.810, 0.771; 0.853, 0.848, 0.787; 0.703, 0.787, 0.600; and 0.776, 0.722, 0.687, respectively.


Introduction
Many of the clinically potent anticancer agents directly target DNA to show their antitumour effects [1]. In recent years, there has been an increasing interest in DNA interactive ligands which can bind to DNA to achieve the required sequence selectivity. As a gene-targeted ligand, naturally occurring pyrrolo [2,1-c] [1,4]benzodiazepines (PBDs) showing antibiotic and antitumour effects are derived from the fermentation broth of various Streptomyces species, well-known members of which include anthramycin, tomaymycin, sibiromycin and DC-81 [2][3][4].
As the most reliable and cited approach, quantitative structure-activity relationships (QSARs) have been utilized to correlate the biological activities of a compound library and conformations knowing that a number of low-energy conformations are available at room temperature for a molecule and each low-energy conformer produces a considerable effect on biological activity and contributes to model power. In this method, the biological activity prediction and pharmacophore identification are performed as a function of physicochemical and structural descriptors for a set of low-energy conformers of each compound, instead of a single lowest energy conformation. To establish a meaningful and predictive QSAR model, it is crucial to select the best subset of molecular descriptors in the optimum number.
Here the GA optimization technique is used for descriptor selection. The final model is cross-validated by the leave-one-out cross-validation (Loo-Cv) method. As a promising 4D-QSAR approach the EC-GA method, which provides pharmacophore detection, variable selection and quantitative bioactivity prediction, was performed employing C2-aryl PBD derivatives for four types of biological activities.

Materials and methods
C2-aryl PBD derivatives were analysed by the EC-GA method to distinguish the pharmacophore group and to derive a relationship between biological activities and selected molecular parameters. Detailed information about the methodology can be found in the literature [26][27][28][29][30]. The GI 50 , TGI and LC 50 activity values for compound 17 and 38 are not given in Table  1 since they were not determined experimentally.
The structure of relevant compounds and their experimental biological activities including the GI 50 , TGI, LC 50 and IC 50 values obtained from the literature are given in Table 1. The concentrations which are in μM were converted to a negative logarithmic scale which allows us to better handle the numbers. Spartan 10 [31] software was used for the construction of the 3D structures of compounds, conformational analysis and quantum chemical calculations at Hartree Fock 3-21 G* level. Even though the more complicated basis sets give more accurate results, they expend a great deal of computation time. In case of a large number of compounds and conformations, as in this study, the required computation time increases due to much larger basis sets. Accordingly, we have considered the basis set 3-21G*, which is faster and sufficiently small without compromising the required level of accuracy. Water was used as solvent since it is the most similar solvent to biological systems. Following the conformational search of each molecule, conformers with Boltzmann distribution under 1/10000 were excluded, and higher ones were kept.
Mulliken charges and bond orders/interatomic distances were utilized to generate the electron conformational matrix of congruity (ECMCs) for individual conformations of the entire compound set and placed in diagonal and non-diagonal positions, respectively. Nondiagonal elements are of two types: bond orders for chemically bonded atom pairs and interatomic distances for non-bonded atom pairs [32]. An example of ECMC is illustrated for the lowest energy conformer of compound 63 as the template in Figure 1. For 87 analogues of C2-aryl 1, 4PBD derivatives, 997 ECMCs were created to be used in the comparison of the ECMCs by EMRE software [26][27][28][29][30] after eliminating the conformers which overlap and have lower Boltzmann distribution.
of all the conformers of individual compounds, the lowest energy conformer of the most active one was chosen as template. The compounds were categorized as active and inactive by indicating a proper activity threshold value which is based on the data of the activity range for each type of activity. Up to a specified tolerance value, by adjusting the tolerance Table 1. chemical structures, substituents and experimental pGi 50 , ptGi, plc 50 and pic 50 activity values for c2-aryl pyrrolo [2,1-c] [1,4] 1. Ecmc of the lowest energy conformer of the most active template molecule (compound 63) in the data set. the diagonal members correspond to the mulliken charges whereas the non-diagonal elements refer to the bond orders for chemically bonded atom pairs and interatomic distances for nonbonded pairs. hydrogen atoms attached to carbon atoms are omitted in the Ecmc for clarity.
limit steadily all matrix elements of the ECMC of the template compound were compared with that of other ECMCs. Through the comparison of ECMCs, we obtained several electron conformational submatrices of activity (ECSA). Each ECSA was evaluated according to two commonly used criteria (P α and α a ) given in Equations 1 and 2 below [33]: where n 1 and n 2 are the numbers of molecules including and not including pharmacophore atoms (ECSA) in the class of highly active compounds, respectively, whereas n 3 and n 4 have similar meaning for weakly active compounds; m 1 and m 2 are the numbers of molecules in the class of highly active and weakly active compounds, respectively; m3 = n1 + n3; m4 = n2 + n4 [34]. Herein the first term P α is related with the possibility of pharmacophore presence in active compounds while the second one is related with the possibility of pharmacophore presence in inactive/low active compounds.
To make clear how additional groups affect biological activity besides the pharmacophore, auxiliary groups (AG) and anti-pharmacophore shielding groups (APS) [23] were determined. AG and APS groups are distinguished by their opposite effects on biological activity. While the AG group promotes biological activity, the APS group shows a reducing effect. The out-of-pharmacophore groups are described by the following S function [35]: where a (j) ni is the parameter depicting the jth kind of feature in the jth conformation of the nth compound, N is the number of selected parameters and κ j is the relative weight of different parameters. Each parameter has a different and constant κ j value.
In the equation below [23], biological activity is expressed as a function of molecular descriptors, its energy and temperature considering the Boltzmann weighting of the individual conformations of each compound as follows: where An and Al are the activity values of the nth compound and the reference compound, respectively. El i is the relative energy of the ith conformation of the reference compound (in kcal mol −1 ). En i is the relative energy of the ith conformation of the nth compound (in kcal mol −1 ), R (kcal mol −1 K −1 ) is the gas constant and T is the temperature in Kelvin. δ is a kind of Dirac δ function which takes two values based on pharmacophore presence. The value equals 1 if the pharmacophore is present and 0 if not. The same equation was also used to calculate the κ j , variational constants. To implement and solve the weighted least squares fitting problem for the κ j values of the parameters, the lsqnonlin function of the optimization toolbox in Matlab [36] is used. The weighted nonlinear least-squares analysis combined with GA can (1) P = (n 1 + 1)∕(n 1 + n 3 + 2) be efficiently utilized with parameter selection and any kind of nonlinear optimizations. In addition, a GA and the method including iteration of the lsqnonlin function combined with initial values generated stochastically within a wide parameter range are employed to explore the best parameter subset. The numbers "κ j " = 1, 2,…, N, obtained in this way characterize the weights of each kind of the ani (j) parameters in the overall APS/AG influence [23]. Another significant point is the preparation and the selection of descriptors. Hereby, 1331 molecular descriptors based upon four main classes (quantum chemical, thermodynamic, electrostatic and geometrical) regarding the pharmacophore group were generated for each conformer of PBD derivatives by EMRE software [21,[26][27][28][29][30]. To eliminate the irrelevant and unnecessary descriptors and to increase model accuracy, the descriptor pool was reduced to a small subset of parameters. For this purpose, the most important parameters, a ni (j) in Equation 4, were selected by the GA technique [37,38] since it is a fast and efficient method. The GA procedure starts with a randomly generated initial population comprising N individuals, each of which corresponds to a different parameter subset randomly selected from the descriptor pool. The populations are mainly composed of integer units defining model parameters (κ j indices) as genetic codes. To calculate the κ j values of the model parameters, each parent is subjected to the lsqnonlin function. The initial selected population according to the fitness values is subjected to genetic operators named selection, mutation and crossover to yield the new generation. Thus, some part of the next generation is constituted from the mutation procedure and the other part from crossover. Repeating this procedure, a number of models giving different parameter subsets are obtained until they converge or the prespecified size of generation is reached. Here, we run the GA with the following parameters: number of generation: 400; population size: 400; number of iterations: 150; crossover fraction: 85%; mutation rate: 1.5%.
Through Loo-Cv, the fitness value of each chromosome was calculated by the predictive residual sum of squares (PRESS) as the fitness function. The formula of PRESS which measures the distribution of the calculations obtained from Loo-cross-validated values is given by: where A exp n is the experimental activity of the nth molecule in the experimental activity data, A pred n is the predicted value of activity of the nth molecule in the training set by Loo-Cv, and N is the total number of compounds in the training set.
In this study, the quality of the each of the obtained models was assessed internally by the Loo-Cv method and externally by an analogous test set. In the internal validation of the models, only the training set compounds were considered. Each compound is precluded one by one to determine the biological activity with remaining compounds. Therefore, the contribution of each molecule to the robustness of the model is evaluated. For internal validation of the models, the value of q 2 was found by the following formula: where N indicates the total number of compounds in the training set. Ā exp n is the mean value of experimental activity of all the molecules in the training set. A exp n is the experimental activity of the nth molecule in the training set. SYY expresses the sum of squared deviations of experimental activity from the mean (Ā exp n ). So as to verify the reliability and predictivity of the models on the new compounds which are not used in the model development, the data set is split into training and test sets. The model developed by training compounds is applied to the test compounds to confirm the prediction power. In order to calculate the q 2 , two expressions of external validation were proposed by Schüürmann et al. [39] and are based on the average values involving the training set and test set means in the denominator and the sum of squares of the external set in the numerator. These equations are given by the following formulas [39]: where N is the number of molecules to be tested. A exp n test and A pred n test are the experimental and the predicted activities of the nth compound in the test set. Ā exp n training and Ā exp n test are the arithmetic means of the experimental activities of the training and test sets, respectively.
Another external validation measure called Q 2 F3 was introduced by Consonni et al. [40] for the purpose of discussing the predictive ability of QSAR models with external assessment described in Schüürmann et al. 's study [40]. The external prediction capability given by Consonni is calculated by the following equation: where N test and N training are the number of test and training molecules, respectively. Whereas A exp n test and A Pred n test refer to the experimental and the predicted activity values of the nth test compound, A exp n training is the experimental activity of the nth compound in the training set. Ā exp n training is equal to the mean of the experimental activities of the training compounds. In Equation  9, the sum of squares in the denominator is related with the training set while that in the numerator is related with the external prediction set. In addition to the external evaluation criteria given above, Chirico and Gramatica proposed a different and simpler alternative which gives more cautious and restrictive results in proportion to other compared measures. The rearranged version of the concordance correlation coefficient (CCC) is given by following equation [41]: where A exp i and A pred i correspond to the experimental and predicted values of the activity, respectively. Similarly, Ā exp and Ā pred correspond to the averages of the experimental and predicted activity values. In the formula, by using both training and test sets, the reliability of the model was developed. The CCC, which has a value greater than 0.85, confirms the excellent precision and accuracy of the model. For QSAR model development, different external evaluation functions which have advantages or drawbacks with regard to each other were introduced by different researchers. Among those expressions, Equations 7-9 were used to appraise model consistency in previous papers by us. We also made use of the last two external validation formulas (Equations 10-12) aforementioned for the first time in this 4D-QSAR EC-GA study.
At the end of the model development stage, evaluating the prediction abilities of all the models considering the r 2 , q 2 , q 2 ext1 , q 2 ext2 , q 2 ext3 and CCC criteria by the Loo-Cv technique, the best parameter subset and related best model were determined. Using the best parameter subset and corresponding κ j values, we calculated the activity values of the compounds with unknown activity with Equation 4.
In the best parameter subset, one or several parameters make more contribution to the biological activity. To estimate which parameter/parameters in the subset is predominant, the E-statistics technique is used [42]. The statistical E value is calculated by the following formula as the ratio of the predictive sum of squares: where A exp n and A pred n refer to the experimental and predicted activity in the Loo-Cv procedure. In this situation only a small number of parameters (N = 9-11 in this study) were used to construct the model. The value of the E defines the impact of the parameters. The greatest the increase in the E value, the lowest the contribution made by the parameter. In parallel with the high value of E, omission of the parameter reduces the model's performance.

Results and discussion
The chemical structures of the C2-aryl PBD derivatives with substituents and experimental pGI 50 , pTGI, pLC 50 and pIC 50 values are given in Table 1 in the previous section. The data comprising atomic charges, Cartesian coordinates, bond orders and interatomic distances from the conformational analysis and quantum chemical calculations at Hartree Fock 3-21G* level were assigned to build the ECMCs of the 997 conformers of 87 compounds by the EMRE programme (see Figure 1 for sample matrix of the lowest energy conformer of reference compound). To describe the pharmacophore for GI 50 activity, the value pGI 50 = 8.3010 was regarded as the activity threshold. In total, 46 compounds with pGI 50 ≥ 8.3010 were categorized as high-activity compounds, 37 were classed as low-activity compounds and four compounds had unknown activity.
For GI 50 activity, the comparison procedure of the ECMCs defined in the materials and methods section resulted in a pharmacophore group comprising the o1, o2, C9, o3, N1, N2, C14 and C17 atoms with an optimum P α = 0.9737 and α a = 0.7849 values (i.e. with the highest P α and α a values). The final ECSA and relevant tolerance values for both active and inactive compounds including the compounds with unknown activity are reported in Table 2, in which pharmacophore atoms are shown in yellow. Table 2 contains six submatrices. The first submatrix corresponds to pharmacophore atoms for the lowest energy conformer of the template compound. The second and third ones are the tolerance submatrices for 46 compounds with high activity and 37 compounds with low activity, respectively. The fourth submatrix represents tolerance values for the overall conformers (997) of 87 compounds without tolerance limitation. As seen in (b) and (c) of Table 2, the atomic charge tolerances of the o1 atom are ±0.024 and ±0.093 and the tolerances of the distance between the N1 and N2 atoms are ±0.028 and ±0.189 for high and low active compounds, respectively. Table  2 proves that, in general, compounds with high activity possess lower tolerance values than those with low activity.
After careful analysis of the pharmacophore atoms, the o3, N1, o1 and o2 atoms present in the benzodiazephine ring are identified among the key pharmacophoric elements as hydrogen-bond acceptors. The C14 and C17 atoms located in the imidazole and quinoline ring, respectively, comprise the hydrophobic regions. Most of the pharmacophore atoms are placed on a rigid plane since the structure contains condensed heterocyclic units showing very little conformational flexibility. The o1, o2, N1, o3 and N2 atoms are defined as negatively charged atoms while the C9 atom is positively charged. The C14 and C17 atoms show lower negative charges than the others. The highest tolerance value of interatomic distances for high-activity compounds pertains to the C17-o2 distance which shows the flexibility of the position whereas the N2-o3 distance has minimum tolerance due to a rigid plane.
In the first step of bioactivity prediction, four data sets associated with pGI 50 , pTGI, pLC 50 and pIC 50 values were randomly divided into three data sets: the training set, test set and unknown set. The compounds in the training, test and unknown set were randomly selected from the entire data set by GA. For each activity type, the generated models were evaluated both internally and externally. These subsets for GI 50 activity included 55 training, 27 test and five unknown compounds. Likewise, the pTGI, pLC 50 and pIC 50 datasets were classified as training, test and unknown sets (55, 27, 5; 55, 27, 5; and 48, 24, 15, respectively).
The main goal of descriptor selection is to develop a robust model by employing the minimum number of variables. As the optimal number of parameters is not known formerly, it is essential to run a number of models to explore the relationship between prediction power (q 2 ) and the number of parameters in the subset. First the compounds were randomly selected; then they were kept stable and we scanned the number of parameters from 1 to 15 to detect the optimum number of parameters. The number of parameters was plotted versus r 2 (for training and test set), q 2 , q 2 ext1 , q 2 ext2 , q 2 ext3 and CCC of pGI 50 activity as shown in Figure 2. As seen in Figure 2, even if increasing the number of parameters causes a rise in r 2 and q 2 up to 11 descriptors, after 11 descriptors the model gains stability and a higher number of descriptors does not enhance the model performance very much. As a general rule, the ratio of the number of parameters to the number of compounds in the model should not be higher than 1:5 to avoid potential overfitting risk [43].
The plots showing the optimum number of parameters for pTGI, pLC 50 and pIC 50 values are also given in Figures S1-S3 as supporting information (available via the Supplementary Content tab on the article's online page). The pTGI activity values of C2-aryl PBD derivatives resulted in an optimum of 11 parameters for 55 training and 27 test compounds. In Figure  S1, the statistical parameters exhibit an increase until 11 parameters. At 11 parameters, the model reaches a steady state and does not need any extra parameters. Thus, the model for pTGI was found as a function of the best 11 parameters. In the same way, the optimum numbers of parameters for pLC 50 and pIC 50 activities are determined in Figures S2 and S3   For pGI 50 , a brief definition of the best 11 descriptors selected with GA and the related κ j values are listed in Table 3. The analysis of Table 3 shows that geometrical and electronic parameters have more impact on the GI 50 activity of C2-aryl PBD derivatives. a (1) , a (2) , a (3) , a (4) , Molecular parameters κ j a (1) orthogonal distance from c8 atom to the o1 n1 o3 plane (Å) 0.102 a (2) orthogonal distance from o3 atom to the n1 n2 c14 plane (Å) -0.128 a (3) orthogonal distance from c4 atom to the n1 n2 c14 plane (Å) + van der Waals radius (Å) 0.297 a (4) orthogonal distance from c8 atom to the c17 c14 n1 plane (Å) + van der Waals radius (Å) -0.061 a (5) orthogonal distance from c15 atom to the o1 o2 c17 plane (Å) -0.141 a (6) orthogonal distance from c11 atom to the n4 c12 o3 plane (Å) 0.064 a (7) Angle between o3 c9 n2 plane and the line of c14-c23 0.103 a (8) Electrostatic charge of n2 atom -0.498 a (9) nucleophilic atomic frontier electron density of o3 atom -2.193 a (10) nucleophilic atomic frontier electron density of n2 atom -1.801 a (11) Fukui atomic electrophilic reactivity index of c17 atom -30.654  , q 2 ext3 and ccc for pGi 50 activity values. a (5) , a (6) and a (7) are the geometrical parameters involving mostly pharmacophore atoms. The parameters a (1) , a (2) , a (5) and a (6) are orthogonal distances. a (3) and a (4) are the orthogonal distances plus van der Waals radius (Å). The remaining four parameters represent the electronic features of the pharmacophoric atoms. a (8) is the electrostatic charge of the N2 atom placed in the imidazole ring. a (9) and a (10) are the nucleophilic atomic frontier electron density index values [44] of the o3 and N2 atoms, respectively. The last parameter, a (11) , in Table 3 is the Fukui atomic electrophilic reactivity index value [45] of the C17 atom. The presentation of parameter a (2) and a (3) is shown in Figure 3.
The best descriptors and related κ j values corresponding to pTGI, pLC 50 and pIC 50 values are given in Tables S1-S3 (available online). In Table S1 for TGI activity, it is seen that the first eight parameters (a (1 )-a (8) ) are geometrical parameters including the orthogonal distance, orthogonal distance + van der Waals radius and the angle between the line and plane of atoms, whereas a (9) and a (10) symbolize the Fukui atomic electrophilic reactivity index values of the o1 and C17 atoms. a (11) is log P, which is the partition coefficient related with the compound's hydrophobicity. A similar situation is seen in Table S2 and Table S3 for LC 50 and IC 50 activities. For both types of activity, geometrical parameters are predominant. The parameter list of LC 50 activity gave a (1) -a (7) as geometrical parameters which are mainly composed of orthogonal distance and orthogonal distance + van der Waals radius. The other four parameters (a (8 )-a (11) ) are the nucleophilic atomic frontier electron density index value of the o3 atom [46], the Fukui atomic electrophilic reactivity index value of the C17 atom, the HoMo and log P. The best parameters' list for pIC 50 values (see Table S3) includes nine parameters of which a (1) is the orthogonal distance + van der Waals radius, a (2) is the orthogonal distance, a (3) is the angle between the C16 C17 C20 plane and the C14-C18 line, a (4) and a (5) are the electrophilic atomic frontier electron density index values of the C17 and C16 atoms [46] and a (6) -a (9) are the dihedral angles.
To determine the AG and APS groups which contribute positively or negatively to the activity, the product of κ j and the parameter value was taken into account. If the result of the product is positive then the related parameter is regarded as an AG parameter, otherwise it is an APS parameter. Accordingly, within the 11 optimal parameters in Table 3 for GI 50 activity a (2) , a (4) , a (5) , a (9) , a (10) and a (11) are AG parameters while a (1) , a (3) , a (6) , a (7) and a (8) are APS parameters. In the same way for TGI activity, a (1) , a (3) , a (6) , a (9) , a (10) and a (11) were determined as AG parameters and a (2) , a (4) , a (5) , a (7) and a (8) as APS parameters. Among the parameters in Table S2 of LC 50 activity, a (2) , a (4) , a (6) , a (8) , a (9) and a (11) are AG parameters while a (1) , a (3) , a (5) , a (7) and a (10) are APS parameters. Finally a (4) , a (6) , a (8) and a (9) are AG parameters and a (1) , a (2) , a (3) , a (5) and a (7) are APS parameters for IC 50 activity.
In consideration of previous explanations, among the several models for pGI 50 , pTGI, pLC 50 and pIC 50 activity values, the experimental and predicted activity values, r 2 , standard error and both internal and external q 2 values for the best models of each activity type are listed in Table 4. As seen in Table 4, the data set of pGI 50 was divided into a training set of 55 compounds and a test set of 27 compounds in order to get an exact robust model through a validation procedure with test compounds. The compounds marked with "a" correspond to test compounds while those marked with an asterisk are unknown compounds. The number of training, test and unknown sets for pTGI, pLC 50 and pIC 50 datasets are 55, 27, 5; 55, 27, 5 and 48, 24, 15, respectively.
As a general rule, if the q 2 values of the cross-validated models are higher than 0.5, the predictive ability of the model should be acceptable [47]. Based on internal validation, the  =0.791). In addition, the difference between the experimental and predicted activity values is less than 1. The usefulness of the obtained models for future activity prediction of new PBD analogues can be seen from the high quality of the statistical results of the models. It is seen that the TGI activity results also showed very good predictive capability with internal and external validation criteria. For the best model of TGI activity with an optimum 11 of parameters, the r 2 and q 2 values of the training set were found as 0.848 and 0.787. In addition, the external validation results of the test set (r 2 = 0.848, q 2 ext1 = 0.743 and q 2 ext2 = 0.731), which is the real indicator of the prediction capacity of a model, are also highly predictive and acceptable. The models a test compounds; *compounds with unknown activity.  values indicate that the model is less capable of correctly predicting. The plot of experimental vs. predicted pGI 50 values of training and test sets obtained by 11 descriptors is shown in Figure 4. Consequently, taking into account all the conformers of the 87 compounds, both the training and test sets gave acceptable statistical results with an optimal 11 descriptors. The model generated with the EC-GA method produced a good prediction power (see Table 4, Figure 4, Figures S4-S6 (available online)). The pTGI, pLC 50 and pIC 50 corresponding plots are given in Figures S4-S6.
All calculations related to bioactivity prediction and statistical analysis were carried out in two ways: the first examined all the conformers and the second examined only the lowest energy conformer for each compound. The statistical results for pGI 50 regarding both only one conformer and all conformers are presented in Figure 5. Regarding only the lowest energy conformer of each compound, we obtained the q 2 , When we considered only the lowest energy conformer we achieved the following results: for TGI activity q 2 = 0.720, r 2 training = 0.816, r 2 test = 0.660, q 2 ext1 = 0.404, q 2 ext2 = 0.378, q 2 ext3 = 0.221, con1 = 0.902, con2 = 0.570, con3 = 0.785; for pLC 50 , q 2 = 0.541, r 2 training = 0.681, r 2 test = 0.753, q 2 ext1 = 0.743, q 2 ext2 = 0.736, q 2 ext3 = 0.685, con1 = 0.813, con2 = 0.836, con3 = 0.822; for pIC 50 , q 2 = 0.490, r 2 training = 0.684, r 2 test = 0.729, q 2 ext1 = 0.568, q 2 ext2 = 0.532, q 2 ext3 = 0.387, con1 = 0.823, con2 = 0.757, con3 = 0.793. With all the statistical results for four data types, it was seen that taking into account all reasonable conformers gave higher internal and external validation values. The statistical results of TGI, LC 50 and IC 50 activities containing the experimental and predicted activity values, r 2 , standard error and both internal and external q 2 values for the best model obtained by the optimum number of descriptors are given in Table 4.
The best parameter subsets including 9-11 parameters which yielded the best models for the pGI 50 , pTGI, pLC 50 and pIC 50 of C2-aryl PBD derivatives are the parameters suggested as contributing most to the activity. However, the contribution of each parameter is not equal. The E-statistic technique was used to analyse the individual effect of each parameter on the biological activity. In turn, each parameter was excluded and the model was established with other parameters. Consequently, neglecting the related parameter, the differentiation in the model performance was observed over the E, r 2 training , se training , r 2 test , se test , q 2 , q 2 ext1 , q 2 ext2 , q 2 ext3 , con1, con2 and con3 values that are represented in Table 5 for GI 50 activity.   (11) parameter, which corresponds to the Fukui atomic electrophilic reactivity index value of the C17 atom, has maximal impact on the activity. The negative correlation between pGI 50 activity values and the Fukui atomic electrophilic reactivity index also has the lowest E value. Hence omission of a (11) leads to a deterioration in the model performance. The angle between the o3 C9 N2 plane and the line of C14-C23, a (7) , which has the highest E value, does not much affect the model's performance. a (9) , a (4) and a (1) are the most potent second, third and fourth parameters; ignoring them gives a reasonable E value and noticeably low q 2 values compared with a (11) . Considering the statistical values in Table 5, the contribution of parameters to the model quality is, respectively, as follows: a (11) , a (9) , a (4) , a (1) , a (6) , a (5) , a (3) , a (10) , a (2) , a (8) and a (7) .
The E-statistic results to determine which parameters contribute most to the pTGI, pLC 50 and pIC 50 activity values are listed in Tables S4-S6 (available online). Whereas the q 2 and r 2 training values of the model with the optimum 11 descriptors based on pTGI activity values are 0.853 and 0.787, respectively, it is clearly seen neglecting the a (10) parameter, which is the Fukui atomic electrophilic reactivity index value (ev) of the C17 atom, obviously results in decreased q 2 (-0.245) and r 2 training (0.638) values (see Table S4 online). In addition, remarkable negative q 2 (-0.245), q 2 ext1 (-0.369), q 2 ext2 (-0.429) and q 2 ext3 (-0.790) values and the lowest E value (0.171) reveal how influential the a (10) parameter is on the activity and how essential it is for the model development as the most important contributor. The a (3) parameter (orthogonal distance from C6 atom to the C10 N2 o3 plane (Å)) whose E value (0.994) is the highest has very little effect on the model. This means that omitting the effect of the a (3) parameter on the activity gives an acceptable model without any loss of model performance. The orthogonal distance from the C14 atom to the N2 C9 o3 plane (Å), a (5) , is the second most potent parameter as a geometrical parameter. Neglecting a (5) also gives negative q 2 ext1 , q 2 ext2 and q 2 ext3 values, which affirm its impact on the activity. The descending contribution of the parameters to the biological activity is as follows: a (10) , a (5) , a (2) , a (9) , a (11) , a (8) , a (4) , a (6) , a (1) , a (7) and a (3) .
In consideration of pLC 50 activity values, the q 2 value of the developed model with 11 parameters is 0.600. As seen from Table S5 (available online), the two most influential parameters with the lowest E and q 2 values are the Fukui atomic electrophilic reactivity index value of the C17 atom (a (9) ) and the nucleophilic atomic frontier electron density of the o3 atom (a (8) ). Exclusion of the a (9) parameter decreases the q 2 value from 0.600 to -17.770. With the lowest value of E (0.021), a (9) has the maximal impact. Moreover r 2 training , r 2 test , q 2 ext1 , q 2 ext2 , q 2 ext3 , con1, con2 and con3 exhibit the lowest values for the situation of a (9) . Neglecting a (1) , the orthogonal distance from the C17 atom to the o1 o2 o3 plane +van der Waals radius (Å), we obtained relatively high statistical values of r 2 training , r 2 test , q 2 ext1 , q 2 ext2 , q 2 ext3 , con1, con2 and con3, which indicates that it can be ignored. The a (9) , a (8) , a (2) , a (6) , a (11) , a (4) , a (7) , a (5) , a (3) , a (10) and a (1) parameters show their contribution to activity in the given order.
For pIC 50 activity values (Table S6, available online), the best nine parameters were taken into account. According to E-statistic results, the importance of the variables can be given as follows: a (2) , a (6) , a (1) , a (3) , a (7) , a (8) , a (9) , a (4) and a (5) . The accuracy of the model was influenced by a (2) more than by the others. The orthogonal distance from the C11 atom to the N4 C12 o3 plane displays its effect by lowering all the statistical values, especially q 2 ext1 , q 2 ext2 and q 2 ext3 negatively. We cannot eliminate this parameter without loss of accuracy. The variables whose effects are most negligible are a (4) and a (5) . Their effects are equal to each other.
As a result, considering four types of activity it was seen that the Fukui atomic electrophilic reactivity index value (ev) of the C17 atom is the most important and essential parameter for GI 50 , TGI and LC 50 activities. For IC 50 activity, the orthogonal distance is the dominant parameter.

Conclusion
In this study a mathematical model was developed for pharmacophore identification and antitumour activity prediction of 87 C2-aryl PBD derivatives by the extensive 4D-QSAR EC-GA method. For both stages of the study, a conformational ensemble of the compounds presenting molecular flexibility was used related to Boltzmann distribution. The defined pharmacophore, which is mainly located in benzodiazepine and imidazole rings, consists of eight atoms, namely the o1, o2, N1, o3, N2, C9, C14 and C17 atoms. By dividing the original data set into training and test sets, the generated QSAR models with Loo-cross-validated r 2 and q 2 values varying between 0.56 and 0.80 showed high internal and external accuracy for four types of activity and proved their robustness. The models were also applied and tested on the compounds with unknown activity to guide the employment of new bioactive benzodiazepines.
The final models and their validation results for all GI 50 , TGI, LC 50 and IC 50 activities indicate that the geometrical and electrostatic descriptors used in this study are influential on the biological activity. The resulting EC-GA models and their internal and external validation for all of the dataset of pGI 50 , pTGI, pLC 50 and pIC 50 activity values showed that the goodness of fit between experimental and predicted activities was over 0.700. The prediction power represented by q 2 , q 2 ext1 and q 2 ext2 values for both training and test sets was greater than 0.6. only for pIC 50 activity values, the q 2 ext1 and q 2 ext2 values were lower than 0.6. Thus, the QSAR model of C2-aryl PBD derivatives created by the EC-GA method is a promising tool for the future design of novel benzodiazepine derivatives as antitumour agents.