Development of pharmacophore similarity-based quantitative activity hypothesis and its applicability domain: applied on a diverse data-set of HIV-1 integrase inhibitors

Quantitative pharmacophore hypothesis combines the 3D spatial arrangement of pharmacophore features with biological activities of the ligand data-set and predicts the activities of geometrically and/or pharmacophoric similar ligands. Most pharmacophore discovery programs face difficulties in conformational flexibility, molecular alignment, pharmacophore features sampling, and feature selection to score models if the data-set constitutes diverse ligands. Towards this focus, we describe a ligand-based computational procedure to introduce flexibility in aligning the small molecules and generating a pharmacophore hypothesis without geometrical constraints to define pharmacophore space, enriched with chemical features necessary to elucidate common pharmacophore hypotheses (CPHs). Maximal common substructure (MCS)-based alignment method was adopted to guide the alignment of carbon molecules, deciphered the MCS atom connectivity to cluster molecules in bins and subsequently, calculated the pharmacophore similarity matrix with the bin-specific reference molecules. After alignment, the carbon molecules were enriched with original atoms in their respective positions and conventional pharmacophore features were perceived. Distance-based pharmacophoric descriptors were enumerated by computing the interdistance between perceived features and MCS-aligned ‘centroid’ position. The descriptor set and biological activities were used to develop support vector machine models to predict the activities of the external test set. Finally, fitness score was estimated based on pharmacophore similarity with its bin-specific reference molecules to recognize the best and poor alignments and, also with each reference molecule to predict outliers of the quantitative hypothesis model. We applied this procedure to a diverse data-set of 40 HIV-1 integrase inhibitors and discussed its effectiveness with the reported CPH model.


Introduction
Three-dimensional pharmacophore models provide an intuitive representation that helps medicinal chemists understand small-molecule binding properties and interpret the chemical-functional molecular characteristics necessary to trigger a biological response (Steindl, Laggner, & Langer, 2005). They are widely applied in drug design: they provide valuable information for studying structure-activity relationships (SARs) and help uncover the mechanism of ligand-target interactions by deducing the nature of functional groups and non-covalent bonding patterns (Van Drie, 2003; Wermuth & Langer, 1993). They are also applied to the discovery and development of novel molecules with a desired biological activity (Dror, Shulman-Peleg, Nussinov, & Wolfson, 2004). A pharmacophore can be derived on the basis of protein receptor structural information (structure-based approach) or purely from the structures of bioactive and potential ligands (ligand-based approach). Most popular commercial packages use the ligand-based approach to propose pharmacophore models and also facilitate predictions of biological activities by developing quantitative hypotheses (Patel, Gillet, Bravi, & Leach, 2002). Ligand-based pharmacophore modeling develops hypotheses based on 3D molecular alignment guided by chemical structure superpositioning, pharmacophore-based overlay, matching shape characteristics, field-based features, etc. (Clement & Mehl, 2000; Cramer, Patterson, & Bunce, 1988; Dixon et al., 2006; Guner, 2000; Haigh, Pickup, Grant, & Nicholls, 2005; Li, Sutter, & Hoffmann, 2000).
The official IUPAC definition of the term 'pharmacophore' states that 'a pharmacophore is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response' (Wermuth, Ganellin, Lindberg, & Mitscher, 1998). A pharmacophore pattern or configuration can be defined as a set of relative locations of pharmacophore features in 3D space. A pharmacophore hypothesis is a set of common 3D pharmacophores amongst active compounds in the training set (Dror et al., 2004; Wermuth et al., 1998). Such hypotheses are also known as common pharmacophore hypothesis (CPH) models, with the modification that each hypothesis is scored according to various geometric and heuristic criteria (Dixon et al., 2006). A finite set of pharmacophores is called pharmacophore space (Dror et al., 2004). Numerous pharmacophore modeling programs have extended the meaning of the hypothesis by bridging the geometrical 'qualitative' hypothesis, described by angles, distances, and planes, with experimental biological activities to give 'quantitative' pharmacophore models with predicted activities (Lemmen & Lengauer, 2000). To address flexibility in diverse data-sets, several programs calibrate the hypothesis development phase by incorporating various conditions, including predefined minimum and maximum numbers of mapped feature points (Dixon et al., 2006), the number of molecules used to develop the hypothesis (actives, moderately actives, and less than half of the inactives; Li et al., 2000), the selection of training molecules spanning 4 orders of magnitude in activity (Li et al., 2000), the selection of a set of reference molecules (Schneidman-Duhovny, Dror, Inbar, Nussinov, & Wolfson, 2008), feature interdistance flexibilities (VLifeMDS, 2010), etc. The Phase program recommends features-based matching for hypothesis generation in the case of diverse and/or flexible molecules (Dixon et al., 2006).
Diverse molecules with high chemical complexities may face several problems in pharmacophore modeling for the following reasons, (i) the conformations of the molecules are substantially larger due to the high number of rotatable groups (Kirchmair, Laggner, Wolber, & Langer, 2005), (ii) a simple common substructure in the data-set complicates the alignment or overlaying process (Wolber, Dornhofer, & Langer, 2006), (iii) the greater counts of pharmacophore features in each molecule and sampling common features present in entire or partial (cover most and moderately actives) training molecules are exhaustive (Li et al., 2000) and typically constrained by geometric or feature mapping requirements governed by the number of features, tolerance, and distance predefined parameters (Dixon et al., 2006). Thus, mapping pharmacophore space will be very useful for diverse data-sets instead of CPH models to develop quantitative models.
With ongoing efforts on development of computational strategies on medicinal plants-based drug discovery, structural bioinformatics, and related methods (Kumar, Jasrai, Pandya, George, & Patel, 2013;Kumar, Jasrai, Pandya, & Rawal, 2013;Kumar, Pandya, Desai, & Jasrai, 2014), we present here a ligand-based computational procedure for quantitative pharmacophore hypothesis development using pharmacophore similarity matrix calculated on the basis of molecular topology-based alignment technique. The applicability domain of the developed quantitative models was then discussed. Finally, the effectiveness of sampling pharmacophore space was addressed by comparing the quantitative models developed on 40 HIV-1 integrase (HIV-1 IN) inhibitors with its published CPH model (Mustata, Brigob, & Briggsa, 2004). We selected HIV-1 IN inhibitors as HIV-1 IN enzyme is one among the attractive targets in antiviral drug design and expected to complement the therapeutic use of HIV protease and transcriptase inhibitors which have flexible binding affinities (Goldgur et al., 1999;Purohit et al., 2008;Purohit, Rajendran, & Sethumadhavan, 2011;Purohit & Sethumadhavan, 2009).

Materials and methods
Ligand data-set
The ligand data-set consisted of 40 HIV-1 IN inhibitors covering a wide activity range, from 0.04 μM (compound 11) to 1000 μM (compound 15), with structural diversity. A 3D pharmacophore model based on this diverse data-set, built using the Catalyst/HypoGen program (Catalyst, 2000), was published by Mustata et al. (2004). The biological activity data, reported as half maximal inhibitory concentration (IC50) and spanning 5 orders of magnitude, were converted to the logarithmic scale (pIC50) and used as the dependent variable for quantitative pharmacophore modeling. Mustata et al. (2004) selected at least three compounds in each activity order, including the most active and inactive ones, as the training set. The data-set was partitioned into training (26 molecules) and test (14 molecules) sets for hypothesis generation.
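The conversion from IC50 to pIC50 mentioned above can be illustrated with a minimal sketch. It assumes the usual definition pIC50 = -log10(IC50 in mol/L) with IC50 supplied in μM; the function name is hypothetical and is not taken from the original workflow.

```python
import math

def pic50_from_ic50_um(ic50_um: float) -> float:
    """Convert an IC50 given in micromolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um * 1e-6)  # 1 uM = 1e-6 mol/L

# The two extremes of the activity range quoted in the text.
for ic50 in (0.04, 1000.0):
    print(f"IC50 = {ic50} uM -> pIC50 = {pic50_from_ic50_um(ic50):.3f}")
```

Under this assumed definition, the quoted activity range of 0.04 to 1000 μM corresponds to pIC50 values of roughly 7.4 down to 3.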

Maximal common substructure identification
The maximal common substructure (MCS) present in the molecules was identified using 'Multiple MCS' program (Tripod, NIH). The program utilizes fragment-based approach that implements fragments as seeds to the C-clique alignment algorithm, to reduce the search space and recognize multiple MCSs present in the molecules (Multiple MCS program, 2010). The entire data-set was specified as input in a single structure data format file and the multiple MCSs were returned.

Molecular topology-based alignment
The molecular data-set was geometrically optimized using the AMBER03 force field (Duan et al., 2003) in YASARA Structure software (academic license) (Krieger, Darden, Nabuurs, Finkelstein, & Vriend, 2004). Molecular topology-based alignment was performed on the ligand data-set (consisting of 'carbon' molecules; described in Results and discussion) using the VLife 'Align molecules' module (VLifeMDS, 2010). We used the template-based alignment method, in which the atom specifications corresponding to the chosen MCS (present in all training molecules) act as the template in the first step, followed by alignment of the template-connected chemical groups to obtain an optimal alignment (VLifeMDS, 2010). The alignment can be examined graphically and assessed statistically using the root-mean-squared deviation (RMSD).
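As an illustration of the RMSD criterion, the following sketch computes the RMSD between two already-superposed sets of matched atom coordinates. It is a generic calculation, not the VLife implementation; the function name and the assumption that atoms are supplied in matched order are ours.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of matched (x, y, z) positions.

    Assumes the two molecules are already aligned and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Two matched atoms, each displaced by 1 angstrom along z.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
```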

Pharmacophore perception and its descriptors
The pharmacophore features present in the molecules were predicted using 'MolSign' module of VLife MDS v3.5 software (academic license) (VLifeMDS, 2010). MolSign module utilizes a distance geometry method to 'qualitatively' predict CPHs. The pharmacophore feature detection procedure was customized to predict the maximum number of features with no tolerance and maximum distance coverage. The process of pharmacophore feature definition, abstraction, and representation in VLife MDS are similar to the existing pharmacophore modeling softwares (VLifeMDS, 2010). The predicted features along with the atomic coordinates for each molecule were saved in XYZ coordinate (.xyz format) file.
A set of distance-based pharmacophoric descriptors were determined by calculating the distance between the feature positions (Cartesian coordinates) and the centroid of the chosen MCS position in the molecule. It should be noted that this descriptor calculation requires molecules in alignment mode. The predicted feature positions saved in XYZ files can be easily recognized by atom labels viz. HAc (H-bond acceptor), HDr (H-bond donor), AroC (aromatic group), AlaC (aliphatic group), PosC (positive ionizable center), and NegC (negative ionizable center). AroC and AlaC are the hydrophobic features.
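The descriptor calculation described above can be sketched as follows. The sketch assumes that each feature is stored as a line of the form `label x y z` using the labels listed in the text, and that the MCS atom coordinates are available separately; the parsing format and function names are illustrative assumptions, not the MolSign file specification.

```python
import math

FEATURE_LABELS = {"HAc", "HDr", "AroC", "AlaC", "PosC", "NegC"}

def centroid(points):
    """Arithmetic mean position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def feature_distances(xyz_lines, mcs_coords):
    """Return (label, distance to the MCS centroid) for every feature line."""
    center = centroid(mcs_coords)
    descriptors = []
    for line in xyz_lines:
        parts = line.split()
        # Skip ordinary atom lines; keep only pharmacophore feature labels.
        if len(parts) == 4 and parts[0] in FEATURE_LABELS:
            position = tuple(float(v) for v in parts[1:])
            descriptors.append((parts[0], math.dist(position, center)))
    return descriptors

# Hypothetical coordinates, for illustration only.
mcs = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
lines = ["C 0.0 0.0 0.0", "HAc 3.0 4.0 0.0", "AroC 0.0 0.0 2.0"]
print(feature_distances(lines, mcs))
```

Because every distance is measured from the same aligned centroid, the descriptors are only comparable across molecules once the alignment step has been completed.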

Descriptor selection and machine learning
Since we used flexible parameters to detect pharmacophore features in the molecules, the number of features overlaid in the perception phase summed to 58. Inconsistent pharmacophoric descriptors were identified by inspecting descriptor columns enriched with zero values. Removal of the inconsistent descriptors reduced the descriptor set to 9 (3 HAc, 2 HDr, 2 AroC, and 2 AlaC), which were used as independent variables in machine learning. Support vector machines (SVMs) are a widely used 'supervised' machine learning method for data classification, regression, and other learning tasks (Hsu, Chang, & Lin, 2003). We used SVMs to develop a statistical relationship between the biological activities and the pharmacophoric descriptors of the small molecules. The original ligand data-set (40 molecules) was divided into 21 training, 5 validation, and 14 external test molecules. It should be noted that only the training set (26 molecules) used in the alignment and pharmacophore perception phases was sub-divided, into 21 training and 5 validation molecules, for SVM model training and cross-validation, while the test set used in all these phases was employed directly as an external test set to assess the external predictivity of the SVM model. The use of a validation set for cross-validation can help prevent model over-fitting (Hsu et al., 2003); we selected one compound from each activity order (i.e. pIC50 ≈ 3, 4, 5, 6, and 7) to ensure activity and structural diversity in the validation set.
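The removal of zero-enriched descriptor columns can be sketched as below. The text does not state the cut-off that defines a column as 'enriched with value zero', so the `max_zero_fraction` parameter (default 0.5) is purely an assumption for illustration, as are the function and variable names.

```python
def drop_sparse_descriptors(matrix, names, max_zero_fraction=0.5):
    """Drop descriptor columns whose fraction of zeros exceeds the cut-off.

    matrix: list of rows (one row per molecule), names: column labels.
    The 0.5 default is an assumed threshold, not a value from the paper.
    """
    n_rows = len(matrix)
    keep = [
        j for j in range(len(names))
        if sum(1 for row in matrix if row[j] == 0) / n_rows <= max_zero_fraction
    ]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [names[j] for j in keep]

# Hypothetical descriptor table: the second column is mostly zero.
table = [[2.1, 0.0, 3.3], [1.8, 0.0, 0.0], [2.4, 4.0, 3.1], [2.0, 0.0, 2.9]]
print(drop_sparse_descriptors(table, ["HAc1", "PosC1", "AroC1"]))
```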
The LIBSVM (a library of SVMs) package was used to develop the quantitative pharmacophore model, using the computed descriptor set as class attributes (x vectors) and the pIC50 values as class labels (target values). The aim of SVM is to generate a model (based on training data) that predicts the activities of the test set from the test data descriptors by mapping the training x vectors into a higher dimensional space and finding a linear separating hyperplane that achieves maximum-margin separation between the selected classes (Chang & Lin, 2001; Hsu et al., 2003). Various kernels, including linear, polynomial, radial basis function, sigmoid, and anova, were tested to obtain an optimal SVM model. A grid search was carried out to monitor various SVM parameters, viz. C, mean squared error (MSE), ε, γ, k_d, and r² (coefficient of determination), across models. Based on the recommendations of LIBSVM (Hsu et al., 2003), optimal models were selected on the basis of SVM parameters, cross-validation techniques, and residuals in activities. Statistical analysis of the data was performed using SPSS software (SPSS Inc., 2007).

Results and discussion
Pharmacophore similarity-based quantitative pharmacophore hypothesis
We describe a ligand-based computational procedure for quantitative pharmacophore hypothesis development that introduces geometric flexibility into pharmacophore model development. Figure 1 gives an outline of the major tasks involved in this procedure, and its stages are discussed in detail in the following sections. The major goal is to map the pharmacophore space necessary to study ligand pharmacophore requirements hidden in highly complex molecules, rather than to elucidate CPH models, which may differ substantially depending on the alignment, pattern search, or heuristic methods used (Wolber, Seidel, Bendix, & Langer, 2008). A quantitative pharmacophore hypothesis attempts to develop a statistical relationship between CPH models and biological activities, which may cover only the common pharmacophore space shared by aligned chemical groups in the molecular data-set (Wolber et al., 2006). Flexible mapping of pharmacophore space is preferable to CPH models for four reasons: (i) molecules with large chemical diversity may bind at different binding sites of a receptor, leading to an incorrect pharmacophore model (Dror et al., 2004); (ii) pharmacophore models are mostly dependent upon 3D alignments (overlays) that are carried out by geometric and/or feature matching algorithms (Wolber et al., 2008), so choosing a set of features, or neglecting one or more features a priori, for CPH model generation is questionable in the case of diverse molecules; (iii) CPH models developed on a large chemical space do not necessarily indicate the complete ligand-binding requirements for a specific receptor (Dror et al., 2004); and (iv) pharmacophore space may encompass many CPH models, and mapping this space will be more valuable than selecting top-scored CPH models because it enhances the dimensionality of ligand-binding modes (Cato, 2000; Patel et al., 2002) and contains various chemical groups that are not selected in statistically significant CPH models.

MCS in the data-set
We selected 40 structurally diverse HIV-1 IN ligands with experimental biological activities (pIC50) spanning 5 orders of magnitude. Most commercial packages use MCS-based alignment to superpose molecules in pharmacophore modeling (Dror et al., 2004; Wolber et al., 2008). It should be noted that MCS-based and topology-guided alignments (fuzzy search) are equivalent in performance to clique-detection methods in pharmacophore modeling (Bonachéra & Horvath, 2008; Gardiner, 2013). The MCS present in the ligand data-set was identified using the Multiple MCS program (Tripod, NIH) and used as the template in the template-based method (VLifeMDS, 2010) to obtain an optimal topological alignment. We observed six MCSs in the training set (Figure 2), with a heterocycle as the common framework. The six-membered ring was the most common MCS, shared by all training molecules, and was selected as the template. Fused heterocycles and heterocycles with connected carbon linkers were less common MCSs, present in at least three or four molecules, respectively. We chose molecular topology-based alignment by creating homogeneous molecules, i.e. all non-carbon atoms in the molecules are substituted with carbon atoms to give what we call 'carbon' molecules, for the following two reasons: (i) homogeneous carbon molecules (resembling carbon skeletons) can be used to achieve maximum topology-based alignment and (ii) the selected MCS group can be used as a seed for the systematic exhaustive alignment method (VLifeMDS, 2010) to ensure optimal overlay. This alignment method also reduces the conformational space required by the atom-to-atom matching technique (Supplementary Figure 1) (Dror et al., 2004; Wolber et al., 2006). Besides homogeneity of carbon topology, we introduced additional conditions for creating carbon molecules to reduce chemical complexity.
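The atom substitution that produces a 'carbon' molecule can be sketched in a few lines. The sketch assumes a molecule is held as a list of `(element, x, y, z)` tuples and that hydrogens are left untouched; the treatment of hydrogens and the data layout are our assumptions, and the additional scaffold-trimming conditions are not covered.

```python
def to_carbon_molecule(atoms):
    """Replace every non-hydrogen atom by carbon, keeping coordinates unchanged.

    atoms: list of (element, x, y, z). Leaving hydrogens as they are is an
    assumption made here for illustration.
    """
    return [("H" if element == "H" else "C", x, y, z) for element, x, y, z in atoms]

# Hypothetical fragment: N, H, and O become C, H, and C.
fragment = [("N", 0.0, 0.0, 0.0), ("H", 1.0, 0.0, 0.0), ("O", 0.0, 1.2, 0.0)]
print(to_carbon_molecule(fragment))
```

Keeping the original coordinates is what later allows the aligned carbon skeleton to be enriched with the original atoms in their respective positions.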

Creating carbon molecules for molecular topology based alignment
Besides the replacement of non-carbon atoms by carbon atoms, the chemical complexities in the carbon molecules can be reduced by retaining the core structure or scaffold and deleting (i) fragments attached to scaffold except connected to another core group(s) or heterocycle (s) and (ii) highly branched carbon chains emanating from scaffold or even from an aliphatic chain(s). The carbon training molecules constituted seven MCSs ( Figure 2) with structurally variable structures in comparison to the original MCS set. Noticeably, these conditions substantially reduced the chemical space and enhanced the share of MCS in the molecules. For example, fused heterocycle (4-hydroxy-octahydro-2H-1-benzopyran-2-one) from the original MCS set was observed in only four training molecules, whereas its related fused heterocycle MCS (decahydronaphthalene) devoid of non-carbon atoms are present in nine training molecules thereby assuring optimal molecular topology alignment. Similarly, six-membered ring MCS was noticed in all 14 test molecules. The structures of training and test set carbon molecules are shown in Figures 3 and 4, respectively. The user may retain the non-carbon atoms (e.g. B, N, O, P, S, etc.) only if the MCS share such atomic configuration in the entire training set.
The alignment revealed that chemical complexities are large around the aligned six-membered MCS ( Figure 5) and span the pharmacophore space. Molecules with greater chemical groups were found to be overlaid with the MCS-connected groups, while terminal groups which lack alignment were externally placed. It can be well distinguished that regions densely populated with chemical groups may encompass pharmacophore space constituting various CPH configurations (Batten et al., 1999) that can be applied to develop pharmacophore similarity-based quantitative activity hypothesis. The overall overlay is acceptable to majority of compounds which secured RMSD below 2 Å with few exceptions (Supplementary Table 1).

Enumeration of atom connectivity of aligned MCS
The atom connectivity of the aligned MCS was enumerated to cluster molecules in bins, which assists in calculating the pharmacophore similarity matrix. The atoms of the six-membered MCS ring were first identified and labeled conventionally as C1, C2, …, C6. The chemical groups connected to the MCS group can be clustered based on the atom connectivity details (Table 1), viz. bin 1, bin 2, …, bin 6. Notably, a connected group can be a fragment (F; connected to one atom of the MCS) and/or another fused heterocycle (H; connected to two atoms of the MCS). The molecules were clustered based on atom pair connectivity data (Table 2) to increase the chemical complexity of each bin, which in turn increases the count of perceived pharmacophore features. A molecule with an aliphatic chain (fragment) connected at the C3 position of the MCS group can be clustered in the C3-C4 atom pair bin. Similarly, a molecule with a heterocycle fused at the C3 and C4 positions of the MCS group can also be grouped under the C3-C4 bin (Table 2). Care should be taken to ensure that each bin contains at least one active molecule together with a variable number of moderately active and inactive molecules. The most active molecule in each bin (the molecule with the highest pIC50: Tr10 for Bin 5,6; Tr3 for Bin 6,1; Tr11 for Bin 1,2; Tr8 for Bin 3,4; and Tr14 for Bin 4,5) was selected as the 'reference' molecule to compute the pharmacophore similarity matrix and for other computations.
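Selecting the reference molecule of each bin as its most active member can be sketched as follows. The bin contents and pIC50 values in the example are hypothetical placeholders rather than data from the paper, and the function name is ours.

```python
def pick_reference_molecules(bins, pic50):
    """Return, for each bin, the member with the highest pIC50."""
    references = {}
    for bin_name, members in bins.items():
        if not members:
            raise ValueError(f"bin {bin_name!r} is empty")
        references[bin_name] = max(members, key=lambda name: pic50[name])
    return references

# Hypothetical bins and activities, for illustration only.
example_bins = {"C3-C4": ["mol_a", "mol_b"], "C5-C6": ["mol_c", "mol_d", "mol_e"]}
example_pic50 = {"mol_a": 4.2, "mol_b": 6.1, "mol_c": 5.0, "mol_d": 3.3, "mol_e": 5.9}
print(pick_reference_molecules(example_bins, example_pic50))
```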

Enrichment of carbon molecules
The carbon molecules in aligned coordinate positions were enriched by substituting chemical atoms and groups present in the original data-set which represents the mirror image of aligned carbon molecules ( Figure 6) and used for further analysis. The chemical groups are abundant around MCS centroid group, whereas terminal groups are externally aligned due to its highly branched nature of molecules.

Pharmacophore feature perception on aligned data-set
The pharmacophore features present in the molecules were detected using VLife MolSign module (VLifeMDS, 2010) without affecting the alignment (Supplementary Table 2). This perception step graphically depicted the pharmacophore space of the MCS-based aligned data-set ( Figure 6). The shared MCS contributes AroC (aromatic) feature spanned by densely populated variable counts of the H-bond acceptor (HAc) and donor (HDr) features which are further extended by AroC (MCS connected heterocycle) and AlaC (MCS linked aliphatic chains), respectively. It can be studied that molecules with high chemical diversities increased the counts of encoded features. Te3 has 9 HAcs, 14 HDrs, 1 AroC, 23 AlaCs and 3 PosCs summed to 50 perceived features (Supplementary Table 2).

Development of pharmacophore similarity matrix
Various pharmacophore features present in the data-set were categorized based on its type viz. HAc, HDr, PosC, NegC, AroC, and AlaC, respectively, to develop pharmacophore similarity matrix. We have chosen pharmacophore feature-based Tanimoto coefficient as a similarity measure (Barnard, Downs, & Willett, 1998) and used reference molecule of each bin to calculate the similarity with the rest of the clustered molecule. The similarity measure was calculated for six different pharmacophore feature types as follows.
Pharmacophore feature-based Tanimoto coefficient:

$$T_i = \frac{F_{\mathrm{common},i}}{F_{\mathrm{ref},i} + F_{\mathrm{mol},i} - F_{\mathrm{common},i}} \qquad (2)$$

where F denotes features, F_ref the features of the reference molecule, F_mol the features of the non-reference molecule, F_common the features that are common to the reference and non-reference molecules, and i the feature type. This similarity measure was estimated in two different ways: with respect to the bin reference molecule only (Table 3) and across the reference set (Table 4). The pharmacophore similarity with the bin reference molecule signifies the feature similarities across feature types. Pharmacophorically similar molecules can be identified by the total pharmacophore feature-based Tanimoto coefficient, or total pharmacophore similarity (TPS) metric (high TPS = most similar molecules). For example, Tr12 of Bin 1,2 (TPS = 3.667, Table 3) is pharmacophorically similar to Tr11 (reference molecule of Bin 1,2) and also shares an identical carbon skeleton (Figure 3). The TPS computed across the reference set (Table 4) represents the feature contribution rate of each clustered molecule with respect to the reference set. In this case, molecules with a higher TPS show strong pharmacophore similarity with the reference set and possess prominent features that are encoded by the most active molecules. For example, Tr21 (Table 4) is very similar to Tr10 and Tr3 and partially similar to Tr11, Tr8, and Tr14, as it shares similar hydroxyl and double-bonded oxygen groups. Further, the TPS calculated with the bin reference and with the reference set was used to calculate a fitness score to assist pharmacophore model evaluation.
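The per-type similarity and the TPS can be sketched as below, assuming the standard Tanimoto form (common features divided by the union of reference and non-reference features) and assuming that the TPS is the sum of the per-type coefficients over the six feature types. Treating a feature type that is absent from both molecules as contributing 0 is also our assumption; the feature counts in the example are hypothetical.

```python
FEATURE_TYPES = ("HAc", "HDr", "PosC", "NegC", "AroC", "AlaC")

def tanimoto(n_ref, n_mol, n_common):
    """Tanimoto coefficient from feature counts of a single feature type."""
    union = n_ref + n_mol - n_common
    # Assumption: a type absent from both molecules contributes nothing.
    return n_common / union if union > 0 else 0.0

def total_pharmacophore_similarity(ref, mol, common):
    """Sum of per-type Tanimoto coefficients (maximum 6 for six types)."""
    return sum(
        tanimoto(ref.get(t, 0), mol.get(t, 0), common.get(t, 0))
        for t in FEATURE_TYPES
    )

# Hypothetical feature counts, for illustration only.
ref_counts = {"HAc": 3, "HDr": 2}
mol_counts = {"HAc": 2, "HDr": 2}
common_counts = {"HAc": 2, "HDr": 2}
print(round(total_pharmacophore_similarity(ref_counts, mol_counts, common_counts), 3))
```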
The hierarchical clustering of training and test molecules is discussed in Supplementary text and shown graphically in Supplementary Figure 2.

Consensus features and relative index for reference molecules
Fitness calculations, which seek to distinguish accurate from inaccurate molecular alignments on the basis of the fitness score, describe the dimension of the mapped pharmacophore space and identify molecules that are aligned solely on the basis of the MCS centroid group. Initially, the consensus features encoded by the reference set were recognized by counting the minimum and maximum features of each type (Table 5). For example, the reference molecules, viz. Tr10, Tr3, Tr11, Tr8, and Tr14, have HAc feature counts of 6, 5, 3, 1, and 2, respectively, among which 2 can be selected as the consensus HAc feature count due to its majority. Additionally, a feature type can be discarded from the consensus feature set if it is absent in the majority of the compounds, with few exceptions. The NegC feature is present only in Tr8 and Tr14 among the reference set and was therefore discarded from the calculations. This discarding criterion should be withheld if the most active molecule in the data-set possesses the respective feature type, because the chemical group mapped by this feature will govern the biological response (Dror et al., 2004). The most active molecules, Tr12 (pIC50 = 6.432) and Tr13 (6.222), do not possess the NegC feature type, so NegC was discarded from the consensus feature set. A relative measure called the 'relative index', based on the feature counts of the consensus set and the reference molecule, was computed to account for the extent to which a specific feature type is available in a reference molecule. For example, the relative index of Tr10 for AroC is 1.667, as it has 5 AroC features as opposed to 3 consensus AroC features (a relative index > 1 indicates the availability of additional features compared to the consensus) (Table 5), which shows the possibility of mapping more than 3 AroC features in pharmacophore space.
The reference molecules of bin are shown in bold face.
Figure 6. The enrichment of aligned carbon (training and test) data-set by original molecules with its perceived pharmacophore features (shown as spheres) constituting pharmacophore space.
The relative index of reference set across each feature type was summed together to interpret the complexities in chemical groups and feature counts in reference molecules. This parameter was called 'Relative score (RS).' In our case, Tr10 (RS = 15.667) and Tr8 (2.5) constitute large and small feature complexities (Table 5), respectively. It is also used in the calculations of fitness score.
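The relative index and relative score can be sketched as follows, taking the relative index as the ratio of a reference molecule's feature count to the consensus count for that type, and the RS as the sum of relative indices over the retained feature types. The example uses only the Tr10 HAc and AroC counts quoted in the text (with consensus counts of 2 and 3), so the printed sum is a partial score and not the full RS from Table 5; the function names are ours.

```python
def relative_index(feature_count, consensus_count):
    """Ratio of a molecule's feature count to the consensus count for one type."""
    if consensus_count <= 0:
        raise ValueError("consensus count must be positive")
    return feature_count / consensus_count

def relative_score(feature_counts, consensus_counts):
    """Sum of relative indices over the feature types kept in the consensus set."""
    return sum(
        relative_index(feature_counts.get(t, 0), c)
        for t, c in consensus_counts.items()
    )

print(round(relative_index(5, 3), 3))  # 1.667, the Tr10 AroC example from the text

# Partial score over the two feature types whose counts are quoted in the text.
tr10_counts = {"HAc": 6, "AroC": 5}
consensus = {"HAc": 2, "AroC": 3}
print(round(relative_score(tr10_counts, consensus), 3))
```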

Fitness score
Similar to TPS metric, the fitness score was estimated by considering bin-specific reference molecule and reference set as separate entities. Fitness score based on bin-specific reference molecule considers molecular TPS calculated with respect to bin-reference molecule and its reference molecule RS, to identify the alignment of accurate and inaccurate molecular pairs (Supplementary Table 3). An overlay of the reference set with its predicted pharmacophore features is shown in Figure 7. The linear chain of seven aromatic moieties is the core component for alignment spread by HDr features from hydroxylation pattern.
$$\text{Fitness score for clustered molecules} = \frac{\text{Molecule TPS}}{\text{Bin-specific reference molecule RS}}$$

The fitness score for clustered molecules was assessed to examine the extent of low and high pharmacophore features engaged in alignment and to identify molecular pairs having low and high TPS and RS, respectively (Figure 8). Tr25 secured low TPS and Tr12 constituted high TPS, whereas Tr8 has a low RS and Tr10 possesses high RS. Noticeably, a molecule with low TPS has fewer features and vice versa and molecules with low RS will possess only essential features (due to less chemical complexity) and vice versa. A trade-off between the perceived features and its chemical complexities can be studied using this fitness score. The overlay of low TPS and RS (Tr25:Tr8) revealed that the MCS group and its features were solely matched and aligned (Figure 8(A)). A better topological alignment was observed for low TPS and high RS pairs (Tr25:Tr10) (Figure 8(B)). Tr25 was aligned with one of the Tr10 terminal fused heterocycle groups. The analyses of molecular pairs with high TPS showed that the alignment with low RS reference molecule (Tr8) (Figure 8(C)) exhibited inaccurate overlay; whereas alignment with high RS molecule (Tr10) (Figure 8(D)) showed the best possible alignment among the data-set. It is evident that optimal molecular pair alignment can be obtained by optimizing the two parameters, TPS and RS above the average of the fitness score. This fitness score also facilitates in the identification of accurate and inaccurate alignments by witnessing its low and high scores, respectively.
Similarly, the fitness score calculated against the reference molecular set in an alignment independent manner (Supplementary Table 4) can be used to map features in pharmacophore space and provides list of molecules that do not embed in the pharmacophore similarity space defined by training molecules. A histogram depicting the distribution of fitness scores (Supplementary Figure 3) was used to examine the external test molecules which do not get placed in the pharmacophore space and achieved poor alignment with the reference molecule under consideration. The lower and upper bounds were defined by the minimum and maximum fitness scores of training molecules against each reference molecule. The minimum and maximum fitness scores computed against Tr10 reference molecule were Tr25 (fitness score = .025) and Tr9 (.168) (Supplementary Table 4) and defined as lower and upper bounds for external test set pursuing alignment with Tr10 as reference molecule. Noticeably, the molecular overlays of poorly aligned pairs, viz. Te7: Tr8, Te8:Tr8, Te10:Tr8, Te13:Tr8, Te7:Tr14, and Te8: Tr14, are graphically illustrated in Figure 9. The MCS was only shared in the alignment of bisdistamycin Te7 with Tr7 (Figure 9(A)) and Tr8 (Figure 9(B)) molecules. A similar mode of alignment was observed for the other bisdistamycin, Te8 with Tr8 (Figure 9(C)) and Tr14 (Figure 9(D)), respectively. Te10 (Figure 9(E)) and Te13 (Figure 9(F)) only shared MCS groups in common with the reference molecule Tr8. These fitness scores cannot be used to predict outliers of the quantitative hypothesis as the trained SVM model samples the pharmacophore space without 3D geometric placement of features and consider orientation independent distance-based pharmacophoric descriptors.
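The bound check described above can be sketched as follows, taking the fitness score as TPS divided by the reference molecule's RS. The training bounds in the example are the Tr25 and Tr9 scores against Tr10 quoted in the text; the test molecule names and scores are hypothetical placeholders, as are the function names.

```python
def fitness_score(tps, reference_rs):
    """Fitness score = molecule TPS / reference molecule RS."""
    if reference_rs <= 0:
        raise ValueError("reference RS must be positive")
    return tps / reference_rs

def outside_training_bounds(test_scores, training_scores):
    """Names of test molecules whose score falls outside the training range."""
    lower = min(training_scores.values())
    upper = max(training_scores.values())
    return sorted(
        name for name, score in test_scores.items()
        if score < lower or score > upper
    )

# Bounds against Tr10 as quoted in the text; test scores are hypothetical.
training_vs_tr10 = {"Tr25": 0.025, "Tr9": 0.168}
test_vs_tr10 = {"test_x": 0.010, "test_y": 0.090, "test_z": 0.200}
print(outside_training_bounds(test_vs_tr10, training_vs_tr10))
```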

Applicability domain
It is essential to select data-sets that are structurally diverse and span a range of activities (Dror et al., 2004; Guner, 2000; Li et al., 2000), and to ensure similarity between the training and test sets in pharmacophore modeling (Zhang, Golbraikh, Oloff, Kohn, & Tropsha, 2006). We used an applicability domain to determine this similarity threshold, based on a modified version of the Euclidean distance-based domain (Zhang et al., 2006), where Z is an empirical cut-off factor with a default value of 0.5. In addition, an upper bound for the pharmacophore similarity space was defined by selecting the most active molecule Tr22 (molecule with high TPS, 1.817). External test molecules whose TPS fell within this domain were considered suitable for modeling; otherwise, they were treated as outliers of the respective bins, whose reference molecule may not give a better alignment (Figure 10). An external test molecule can align well and share pharmacophore similarity with more than one bin-reference molecule because of its flexible chemical structure, and is therefore plotted outside the domain threshold. Te1 (2-benzyl-9,10-dihydroanthracene), Te10 ((cyclopenta-2,4-dien-1-ylmethyl)benzene), and Te14 (benzene) have simple chemical frameworks and share pharmacophore similarity and alignment with more than one reference molecule, i.e. these molecules can be placed in other bins too. Te3 and Te4 have complex structures and possess excess features in addition to the consensus features, and therefore participate as members in other bins.
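A generic Euclidean distance-based applicability domain can be sketched as below. The sketch assumes a commonly used form in which the threshold equals the mean nearest-neighbour distance within the training set plus Z times its standard deviation; the modification applied by the authors is not specified in the text, so the use of a single nearest neighbour, the population standard deviation, and all function names are assumptions.

```python
import math
import statistics

def domain_threshold(training, z=0.5):
    """Mean nearest-neighbour distance in the training set plus z standard deviations."""
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(training) if j != i)
        for i, p in enumerate(training)
    ]
    return statistics.mean(nearest) + z * statistics.pstdev(nearest)

def in_domain(query, training, threshold):
    """True if the query lies within the threshold of its nearest training point."""
    return min(math.dist(query, p) for p in training) <= threshold

# Hypothetical two-dimensional descriptor vectors, for illustration only.
train = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
limit = domain_threshold(train)
print(limit, in_domain((3.5, 0.0), train, limit), in_domain((10.0, 0.0), train, limit))
```

A larger Z widens the domain and accepts more external molecules, at the cost of extrapolating further from the training set.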

Quantitative pharmacophore hypothesis and activity predictions
Distance-based pharmacophoric descriptors were determined by calculating the distance between the perceived feature positions and the MCS-aligned centroid position. Calculations on the HIV-1 IN data-set yielded 58 descriptor variables, from which inconsistent descriptors were removed. Finally, a descriptor set of nine features (HAcs - 3, HDrs - 2, AroCs - 2, and AlaCs - 2) was chosen as the independent variables for quantitative SVM modeling (Figure 11; Supplementary material). We divided the original data-set of 26 training molecules into 21 training and 5 validation molecules. The external test set of 14 molecules was used to predict biological activities from the trained quantitative SVM model. A grid search was used to identify two optimal anova kernel-based models (Table 6), which secured significant r² and q² values. The predictions for the external test set from the trained models yielded an acceptable range of activities, as reflected by the r²ext parameter (Table 7). Mustata et al. (2004) used an error value for compound activities that represents the ratio between the upper range of biological activity for the compound and the actual activity (Guner, 2000). Applying this criterion to the SVM models, compounds Te12 (residual = 2.894 as per SVM model 1) and Te14 (3.031) should have a partial match with the perceived features geometrically mapped in the pharmacophore space.
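The distance-based descriptors used in this section can be illustrated with a minimal sketch. It assumes that the MCS-aligned centroid is the arithmetic mean of the MCS atom coordinates and that each descriptor is the Euclidean distance (in Å) from a perceived feature position to that centroid. The coordinates below are hypothetical, and the function names are not taken from the original procedure; only the feature type labels (HAc, HDr, AroC) follow the text.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Feature = Tuple[str, Point]  # (feature type, 3D position)


def centroid(points: List[Point]) -> Point:
    """Arithmetic mean of a list of 3D coordinates."""
    n = len(points)
    return (sum(p[0] for p in points) / n,
            sum(p[1] for p in points) / n,
            sum(p[2] for p in points) / n)


def distance_descriptors(features: List[Feature],
                         mcs_atoms: List[Point]) -> Dict[str, List[float]]:
    """Map each feature type to its sorted distances from the MCS centroid."""
    centre = centroid(mcs_atoms)
    grouped: Dict[str, List[float]] = defaultdict(list)
    for feature_type, position in features:
        grouped[feature_type].append(math.dist(position, centre))
    return {ftype: sorted(dists) for ftype, dists in grouped.items()}


# Hypothetical MCS atom coordinates and perceived features (Å).
mcs_atoms = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
features = [
    ("HAc", (4.0, 0.0, 0.0)),
    ("HAc", (1.0, 0.0, 5.0)),
    ("HDr", (1.0, 4.0, 0.0)),
    ("AroC", (1.0, 0.0, 0.0)),
]

descriptors = distance_descriptors(features, mcs_atoms)
print(descriptors)
```

Because every distance is measured to a single centroid rather than between pairs of features, the descriptor values do not depend on how the aligned molecule is oriented in space.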
Examination of r²ext and the activity residuals suggested that SVM model 1 was superior to model 2 in its performance and predictions. Notably, the activity residual (error as per the Catalyst/HypoGen program (Catalyst, 2000)) for the training set in the CPH model (Mustata et al., 2004) is erratic in comparison with the significant r² and q² values obtained with our SVM models. Thus, hypothesis generation depends on the geometric fit of pharmacophore features, on feature selection, and on the relative feature count. These limitations can be overcome by incorporating partial match criteria to define a pharmacophore with consistent descriptors.

Figure 9. The poorly aligned molecular pairs predicted by the fitness score of external test molecules computed against all reference molecules. This fitness score enables graphical understanding of insignificant alignments between training and external test sets. Molecules with less complexity are highlighted in yellow stick model.

Benefits of quantifying pharmacophore space
The chemical complexities and the abundance of encoded pharmacophore features of a diverse data-set pave the way for problems associated with conformational sampling, molecular alignment, common feature elucidation, and modeling (Dror et al., 2004; Van Drie, 2003; Wermuth & Langer, 1993). The advantages of pharmacophore space over CPH models are described in the Supplementary text with an illustration (Supplementary Figure 4). Assuming that a low energy conformer is bioactive, an MCS-based alignment can be performed effectively by selecting the common MCS shared by the training set. The carbon-substituted method can also be used to reduce the complexity of a molecule by carefully selecting the MCS and its connected groups, which increases the effectiveness of optimal topology alignment. In addition, distance-based pharmacophoric descriptors ease the distance flexibilities among the feature set, because they take the MCS-aligned position as the centroid and calculate distances to consistent features instead of individual feature interdistances. The fitness score and the applicability domain provide an overview of geometric alignments and possible outliers. A new feature can be incorporated effectively into the proposed pharmacophore space, whereas adding a new feature to a CPH model requires developing a new CPH. The approach also avoids prioritizing only the top-scoring CPHs. Finally, the non-linear relationship between biological activities and the pharmacophoric descriptor set obtained with SVM modeling is statistically more appealing than the CPH model for making predictions.

Molecules that occupy the domain beyond the pharmacophore descriptor space are known as flexible molecules (labelled) owing to their ability to align with other reference molecule(s) and get clustered.

Figure 11. The pharmacophore hypothesis of the selected HIV-1 IN inhibitor data-set containing consistent features in pharmacophore space. The interfeature distances (minimum and maximum distances, expressed in Å) are calculated by considering the MCS as the centroid position with respect to each mapped feature.