Implementation of ensemble methods on QSAR Study of NS3 inhibitor activity as anti-dengue agent

ABSTRACT Dengue fever is a disease transmitted by infected mosquitoes. This disease spreads in several countries, especially those with a tropical climate. To date, there is no specific drug that can be used to treat dengue. Use of clinically investigated drugs, such as Balapiravir, is still not effective in inhibiting the activity of virus replication. The design of a drug candidate can be performed by using the non-structural protein 3 (NS3) as target. This study aimed to develop QSAR models to predict the inhibitory activity class of NS3 inhibitors. The classification was performed by using feature importance analysis for selecting the descriptors and three ensemble methods, i.e. random forest (RF), adaptive boosting (AdaBoost), and extremely randomized trees (ERT), for model design and prediction. Hyperparameter tuning was performed to improve the performance of the models. Based on the results, we found that model 9, developed from ERT produced the best performance with values of accuracy and AUC equal to 0.73 and 0.82, respectively. Use of y-scrambling method allowed us to confirm that the model was not related to the chance correlation.


Introduction
Dengue virus is transmitted by infected mosquitoes and can cause dengue fever. The infection can develop into dengue haemorrhagic fever or potentially deadly dengue shock syndrome [1][2][3][4][5]. Currently, dengue fever is spreading in several countries. Severe dengue has caused death mostly in Asia and South America countries [6]. In 2012, a study by Brady et al. [7] estimated that 3.97 billion people in 128 countries could be risked by the dengue virus.
To date, there is no specific drug that can be used to treat dengue. In general, dengue fever is only treated by giving paracetamol along with pain medical treatments. Therefore, it is necessary to find a new anti-dengue drug with a better healing impact. So far, the most successful approach for dengue treatment is the inhibition of enzymes in the virus [8]. One of the proteins that takes a role in the life cycle of the dengue virus is the nonstructural protein 3 (NS3) [8][9][10]. Several compounds that can be used as inhibitors have been successfully identified on NS3 protein together with C, NS5 and NS4B proteins [9].
The NS3 protein is part of the dengue virus structure. It is a multifunctional protein that can perform various enzymatic activities. NS3 complex with the NS3B cofactor performs the function of viral protease [11]. This complex is in charge of the cleavage at 8 of the 13 polyprotein cleavage sites that are responsible for the maturation of the viral particle [12]. The C-terminal ATPase/helicase domain in NS3 is also known to be a vital factor for RNA replication. Therefore, NS3 protein is a promising target for the development of drugs because of the important role of the protein in the replication of dengue virus [13].
The testing of the inhibitory activity of hit compounds in the drug design process requires a long time and is extremely costly. Hence, quantitative structure-activity relationship (QSAR) models can be used to overcome this problem. QSAR modelling is considered as a promising method to predict and classify the biological activity of new compounds [14]. QSARs have been used for predicting inhibitors against certain dengue virus targets and other targets as well [13][14][15][16][17].
In 2015, Hariono et al. conducted a QSAR study of type 2 dengue virus (DENV-2) NS2B-NS3 from the quinoline scaffold using genetic function approximation (GFA). According to the results, they obtained a valid model with r 2 and q 2 values of 0.861 and 0.677, respectively [18]. In 2017, Anusuya and Gromiha conducted a QSAR study on 33 flavonoid compounds as antiviral agents by using multiple linear regression (MLR). They found a novel model that predicts several compounds as strong inhibitors against dengue polymerase [19]. In 2018, Sarwar and co-workers conducted a SAR and QSAR study on more than 100 flavonoid compounds with NS2B-NS3 protein as a target. Based on these results, they managed to find ten flavonoid compounds that can be used as NS2B-NS3 protein inhibitors through docking analysis [20].
This study aims to build a QSAR model to predict the inhibitory activity class of NS3 inhibitors by using three ensemble methods [21], i.e. random forest (RF) [22], adaptive boosting (AdaBoost), and extremely randomized trees (ERT). The classification was performed in two steps, (i) feature selection by using feature importance analysis and (ii) prediction model development by using ensemble methods. The feature importance was used to determine the features with the most contribution to the model [19,20]. Meanwhile, the ensemble methods were utilized because of their ability to improve the performance compared to the single classifier model [23].

Data preparation
The SMILES strings and bioactivity (IC 50 ) of the NS3 inhibitors were taken from the chemical and bioactivity database, namely ChEMBL [24]. The bioactivity of all inhibitors was measured by using a similar bioassay protocol against a similar strain of dengue virus type 2. The inhibitory activity was measured by using Vero Cytotoxicity Assay which is a cell-based secondary assay for exploring the cytotoxicity of the compounds by inhibiting the Flavivirus Genomic Capping Enzyme.
The experimental target data is the half-maximal inhibitory concentration (IC 50 ) value of 845 compounds obtained from the CHEMBL database. In this study, IC 50 values were grouped into two classes, namely 1 and 0, according to the defined threshold value. Class 1 represents an active compound with an IC 50 value < 20 µM, while the class 2 represents an inactive compound with an IC 50 value > 70 µM. Hence, we removed the compounds with IC 50 values between 20 µM and 70 µM. The number of compounds classified as active and inactive were 145 and 127, respectively. Both numbers are quite similar so we did not have a problem regarding imbalanced data sets. Finally, the data set was randomly divided into a training set and a test set with a ratio of 4:1. The list of SMILES notations and class labels for both the training and test sets are provided in Table S1 (see Supplementary Information).
Then, we prepared the structure of each compound to calculate the molecular descriptors. The 3D structures of the compounds were obtained by converting the SMILES notations into SDF structure files using Open Babel [25]. Then, molecular descriptors were calculated by using Mordred [26] and PaDEL [27] programs to produce 1613 and 1875 descriptors, respectively.

Molecular descriptor selection
From the total of 2603 unique molecular descriptors, we reduced their number by using several methods to obtain the more appropriate one. Firstly, the selection of descriptors was carried out by using statistical analysis. The analysis began with the removal of molecular descriptors with variance equal to zero, and a standard deviation of less than 0.95. Then, molecular descriptors that have the same value less than 50% were retained. After that, Pearson correlation analysis was performed to remove molecular descriptors that have similar information to other molecular descriptors. Molecular descriptors with weak correlation values (Pearson correlation values < 0.1) to targets and strong correlations (Pearson correlation values > 0.9) to other molecular descriptors were removed.
Selected molecular descriptors were further analysed by using feature importance to obtain the rank of molecular descriptors. The feature importance was calculated based on mean decrease impurity with the Gini index as the splitting criterion [28], as shown in Eq. 1.
where c and p j represent the number of classification classes and the ratio of samples for class j, respectively. The selected molecular descriptor was sorted according to the value of feature importance calculated by each method. We selected the best seven molecular descriptors that were sorted from the highest value of feature importance. The list of selected molecular descriptors is provided in Table S2.

Ensemble methods
Three ensemble methods [21], i.e. random forest (RF) [22], adaptive boosting (AdaBoost), and extremely randomized tree (ERT), were used to predict the NS3 inhibitory classes. RF is an ensemble-based method that is built from a combination of several decision trees. This algorithm was published by Breiman in 2001 as a robust algorithm for data noise with a high classification speed [29]. RF uses the concept of bootstrap and it is suited for classification of large amounts of data. In carrying out the classification process, RF samples data and randomly constructs decision trees to avoid the possibility of overfitting in training data [30]. The bagging method is used in this algorithm to take the winner from the formed tree based on the most vote results.
AdaBoost is an ensemble classifiers-based algorithm that combines several weak classifiers to produce a robust classifier. AdaBoost works by adaptively adjust the weights for each cycle of group weak classifiers. AdaBoost can give better results because of the diversity among weak classifiers based on performance in each classifier [31]. Meanwhile, ERT is an ensemble decision tree-based algorithm that does not follow the classic topdown procedures. Unlike the other ensemble methods, this algorithm splits the node by choosing the point of cleavage completely random and does not use the concept of bootstrapping to grow trees [32].

Hyperparameter tuning
Hyperparameter tuning was performed to improve the performance of the models. The parameter scanning in the tuning process was performed by using grid search crossvalidation (grid search CV). A list of parameters and the ranges of values for RF, AdaBoost, and ERT methods are presented in Table 1. RF and ERT have the same optimized parameter types, namely n_estimators, min_samples_leaf, and min_samples_split. The difference between the two methods relies in the values and ranges of the n_estimators parameter. As regards AdaBoost, the optimized parameter types include n_estimators and learning_rate. With AdaBoost, SVC is used as the base_estimator to improve the accuracy with the SAMME.R algorithm [33].

Prediction models
The prediction models were developed by using the optimized parameters obtained from the hyperparameter tuning process. Nine models are presented that were developed by using the three ensemble methods. Models 1, 2, 3 that contain five, six, and seven molecular descriptors, respectively, were developed by using RF. Models 4, 5, and 6 that include five, six, and seven molecular descriptors, respectively, were developed by using AdaBoost. Models 7, 8, and 9 are models that contain five, six, and seven molecular descriptors, respectively, were developed by using ERT.

Model validation
The models were validated using several metrics derived from a confusion matrix. In this study, true positive (TP) represents the number of active compounds that are correctly predicted, while true negative (TN) represents the number of inactive compounds that are correctly predicted. From the confusion matrix, several validation parameters were calculated, such as accuracy (Q), precision (PR), sensitivity (SE), specificity (SP), F-score, Receiver Operating Characteristic (ROC), Area Under Curve (AUC) and Matthews Correlation Coefficient (MCC). The ROC curve is constructed by scanning the value of TPR and FPR along with the increasing of the sample proportion. Then, the AUC value is reflected by the area under the ROC curve. MCC value is used to measure the quality of the overall binary classification [34]. The formula of those parameters is presented in Eq. (2) to (7).
MCC ¼ TPxTN ð Þ À FPxFN ð Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Furthermore, y-scrambling analysis was also carried out on models 7 and 9 to check the probability of chance correlation on both models. The y-scrambling procedure is made by shuffling the target value while keeping the value of molecular descriptors. Then, the shuffled data set is splitting by using a similar splitting procedure to unshuffled data. Both models are re-fit by using a shuffled training set and used to predict the shuffled test set. The process was repeated ten times to check the consistency.

Molecular descriptor selection
By using the basic statistical analysis (variance and standard deviation), the number of molecular descriptors was reduced from 2603 to 950. Then, the number of descriptors was reduced to 10 after the implementation of the threshold value of correlation coefficient (i) between descriptor and target, and (ii) amongst descriptors. Afterwards, the selected descriptors were further investigated by using feature importance analysis.
The rank of the best seven feature importance calculated by RF, AdaBoost, and ERT are presented in Figures 1-3, respectively. From these figures, we found that the feature importance of SM1_Dzs is the highest for all methods. This indicates the best contribution of SM1_Dzs in all methods. From a chemical viewpoint, SM1_Dzs descriptor is derived from the Barysz distance matrix that considers the presence of heteroatoms and multiple bonds in a molecule. This descriptor represents the structure of the compound and reflected the possible interaction between the compound and NS3 target. Hence, this might be the reason for the presence of the descriptor in all descriptor combinations.  Two other descriptors also appear in all descriptor combinations, i.e. ATSC2m and C2SP2.1. ATSC2m descriptor is based on the centred Broto-Moreau autocorrelation that represents the autocorrelation of topological structure. Meanwhile, C2SP2 descriptor represents the type of carbon atom with sp2 hybridization attached to two other carbon atoms. Both descriptors encode the structure of the compound and are highly correlated to the inhibitory activity. The description of all selected molecular descriptors is presented in Table 2. By using the best seven descriptors, we defined three models for each method consisting of five, six, and seven molecular descriptors, respectively. A list of molecular descriptors involved in each model is presented in Table 3.
The distribution of normalized values of the selected molecular descriptors is presented in Figures 4-6, respectively. In general, we found that almost all selected molecular descriptors in every model have a large enough range of values. This can be seen from the distribution of almost all selected descriptors that have long whiskers, which indicate varied values of the descriptors. However, there are several descriptors that do not have whiskers.
The distribution of selected descriptors seems not symmetrical according to the distribution of their values. Besides, almost all selected descriptors have outliers, which  Averaged moreau-broto autocorrelation of lag 1 weighted by valence electrons AATS2dv Averaged moreau-broto autocorrelation of lag 2 weighted by valence electrons AATSC0v Averaged and centred moreau-broto autocorrelation of lag 0 weighted by vdw volume ATSC2m Centred moreau-broto autocorrelation of lag 2 weighted by mass C1SP2. means that the descriptors have far values from the data set. For example, C1SP2.1 fulfils several criteria as symmetrical data by having a median in the centre of the box, complete whiskers and no outlier. However, the descriptors cannot be said to be symmetrical because the length of the whiskers is not similar. Matrices that represent the correlation amongst molecular descriptors and target are presented in Figures 7-9, respectively. As for RF correlation matrix, we found that AATS2dv shows the best correlation on the target with the highest absolute value of the correlation of 0.23. The pair of SMR_VSA5 and SM1_Dzs are found to be the best correlation amongst descriptors with the smallest absolute value of the correlation of 0.041. The pair of AATS1dv and AATS2dv are found to have a strong correlation with the absolute value of the correlation of 0.9, which means that they have similar information.
As for AdaBoost correlation matrix, we found that C1SP2.1 has the best correlation to the target with the highest absolute value of the correlation of 0.25. The pair of AATSC0v  and C2SP2.1 show best correlation amongst descriptors with the smallest absolute value of the correlation of 0.02. Similar to RF correlation matrix, we also found a strong correlation between AATS1dv and AATS2dv in Adaboost correlation matrix. As for ERT correlation matrix, C1SP2.1 shows the best correlation to the target as also found in AdaBoost correlation matrix. The pair of SRW07 with SM1_Dzs has the best correlation amongst descriptors with the smallest absolute value of the correlation 0.0017.

Hyperparameter tuning
Hyperparameter tuning was carried out to obtain the best parameter for the nine prediction models. The result of the tuning processes is presented in Table 4. Also, we provided the list of all parameters used in models 1-9 in Table S3. In the case of the RF model, we found that all models produced the best tuning result with the number of min_sample-s_leaf of 1. However, the parameter of min_samples_split and n_estimators of all RF models is different. In the case of AdaBoost, all models produced the best tuning result with similar learning_rate value, which is 0.01. However, the value of the n_estimators parameter of all models is different. In the case of ERT, we found that all of the tuning parameters for all model are different.  To ensure that the tuning process can improve the accuracy, we present a comparison of the accuracy between non-tuned and tuned models, as shown in Figure 10. The results show that the accuracy of tuned models is slightly higher than in non-tuned models. However, we found that the accuracy increasing is not too significant. The biggest increase of accuracy found in model 2 with the accuracy increase by 6%. Also, we found that the tuning process does not give any impact on the accuracy of models 1 and 5.

Model validation
The results of the model validation for the training set and test set are presented in Table 5. As for the training set, we found that the best accuracy is obtained from models 2 and 3, while the worst accuracy is obtained from model 4. This points out that models 2 and 3 can accurately predict the targets in the train set compared to other models.
As for the test set, we found that the best accuracy obtained from models 7 and 9, while the worst accuracy obtained from model 2. This is contradictory to the validation   Figure 10. Comparison of the accuracy of non-tuned and tuned models. result of the training set. The fact that the training accuracy of model 2 is far better than the test accuracy indicates the overfitting situation, even the tuning parameter has been performed. This problem might be caused by a large number of descriptors that lead to too complex model.
To determine the best model, we consider the value of accuracy as the main criterion. We found that models 7 and 9 present a similar value of accuracy. Interestingly, both models also produced similar values for other parameters, except AUC. Here, AUC was calculated by using the ROC curve shown in Figure 11. We found that the AUC value of model 9 (0.82) is slightly better than that of model 7 (0.81). This confirms that the prediction ability of model 9 is slightly better than model 7. Regarding y-scrambling analysis, we found that all MCC values of both models obtained from ten shuffling trials are always lower than 0.1, while the MCC value for non-randomized data was 0.49 for both models, as shown in Figure 12. This proved that the results obtained from models 7 and 9 are not related to chance correlation.
The fact that models 9 and 7 give the best performance indicates that the ERT method outperforms other methods. Regarding the basic principle, the randomness level of ERT is higher than RF and AdaBoost. Generally, the randomization process leads to the increase of bias and variance of individual trees. However, the part of the variance due to the randomization can be vanished by taking the average over an adequate number of large ensemble trees. Hence, the number of ensemble trees is important to obtain the optimum variance number. The results indicate that the number of ensemble trees of ERT is sufficient enough to cancel out the variance. Also, once the randomness level of ERT is correctly adjusting, the variance can be vanished while the increase of the bias is negligible [32].

Future prospects
According to the validation results, we found that models 7 and 9 are valid and acceptable. Hence, those models can be used to predict new NS3 inhibitors. Since the data set comprised of large space of the chemical structure, those models can be used to filter a bulk of inhibitor candidates with a specific function to inhibit NS3 protein. The compound classified by the models as active can be used as a candidate for a new anti-dengue drug. However, concerning the applicability, the model has a limitation regarding the compound that has inhibitory activity against a different target of the dengue virus. This implied that our model is only applicable to the compound that is known to have the potential to inhibit NS3 protein.
To overcome the limitation, we consider developing a model by utilizing more dengue targets besides NS3 protein, especially other non-structural proteins such as NS1 and NS5. The utilization of other non-structural proteins will also increase the chemical space of the inhibitor so that the applicability of the model against a new compound will increase as well. Regarding the method, we also consider using more advanced methods, such as support vector machine or deep learning, to obtain more satisfying models.

Conclusion
Based on the results, QSAR models to predict the inhibitory activity class of the NS3 protein was successfully developed by using ensemble methods. Feature selection by using feature importance successfully selected several appropriate molecular descriptors. The improvement of the model performance by using a hyperparameter tuning procedure is not too significant. According to the validation analysis, model 9, that was developed by extremely randomized trees (ERT) and that contains seven molecular descriptors, was found to be the best one. This model produces the best results with values of accuracy and AUC equal to 0.73 and 0.82, respectively. The y-scrambling analysis also proves that the results obtained from model are not related to chance correlation.

Disclosure statement
No potential conflict of interest was reported by the authors.