Modelling enzyme inhibition toxicity of ionic liquid from molecular structure via convolutional neural network model

ABSTRACT Deep learning (DL) methods further promote the development of quantitative structure–activity/property relationship (QSAR/QSPR) models by handling complex relationships in data. An acetylcholinesterase inhibitory toxicity model of ionic liquids (ILs) was established using a convolutional neural network (CNN) combined with support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP) models. A CNN model was proposed for feature self-learning and extraction for ILs. Compared with models built through feature engineering (FE), the regression results of the models based on CNN feature extraction were substantially improved. The results showed that all six models (FE-SVM, FE-RF, FE-MLP, CNN-SVM, CNN-RF, and CNN-MLP) had good prediction accuracy, with the CNN-based models performing better. The hyperparameters of the six models were optimized by grid search with 10-fold cross-validation. Compared with existing models in the literature, the model performance was further improved. The model can be used as an intelligent tool to guide the design or screening of low-toxicity ILs.


Introduction
With the aggravation of pollution and the deterioration of the environment, the design of environmentally friendly solvents has become increasingly important to industrial production processes. At present, researchers are looking for sustainable green products to be used in chemical processes. Ionic liquids (ILs) are generally considered safe and environmentally friendly solvents because of their low vapour pressure, thermal stability, and high decomposition temperature [1]. ILs have become a popular choice for material synthesis, separation processes [2,3], and the biomedical industry [4,5] in recent years. Owing to the designability of their cations and anions, ILs span a vast chemical space. However, it is inevitable that ILs will pose potential threats to organisms and the environment. ILs produce a variety of toxicological effects when released into the environment [6]. Meanwhile, with the development of IL applications, the number of ILs with different structures increases exponentially. As the number of ILs grows, it is necessary to study their toxicity. In the future, it will be important to use green or less toxic ILs in chemical engineering to promote clean, green, and less toxic chemical processes.
However, experimental research on ILs cannot meet production needs because of material limitations. It is necessary to evaluate the biotoxicity of ILs through modelling [7]. At present, research on predicting solvent toxicity focuses on organic solvents and drugs. Brito-Sanchez et al. [8] established a quantitative structure–activity relationship (QSAR) model for the toxic modes of 221 phenolic compounds on Tetrahymena ciliae using four methods: K-nearest neighbour, support vector machine (SVM), classification tree, and artificial neural network. Mamy et al. [9] used QSAR modelling to predict the fate of organic compounds in the environment based on molecular properties. They also found that combining different classes of descriptors (structural, topological, quantum chemical) improved QSAR model performance. Tian et al. [10] summarized various methods for drug-likeness assessment to screen out compounds with undesirable properties, including simple rules/filters based on molecular properties/structures and quantitative prediction models based on complex machine learning (ML) methods. By using a drug-likeness model, some unpromising candidates can be eliminated. Moreover, the advantages and disadvantages of these methods were briefly summarized, the latest progress in the field was comprehensively reviewed, and the similarity between natural products and traditional Chinese medicine was discussed. The above studies proved that it is feasible to predict biotoxicity through descriptors, and they offer guidance for building and improving the performance of QSAR models.
Many researchers have tried various methods to study the toxicity mechanisms of ILs based on these data [11]. Kumar et al. [12] carried out a detailed quantitative structure–toxicity relationship analysis of 229 ILs. Based on the Monte Carlo optimization method built into the Coral software, the close relationship between the toxicity and the structure of ILs was determined. This study clarifies the relationship between molecular structure and toxicity and provides good inspiration for analysing that relationship. Grzonkowska et al. [13] developed a statistical model for predicting the toxicity of ILs to Vibrio fischeri (a bioluminescent marine bacterium). The model was based on multiple linear regression with three predictors: the Gutman molecular topological index, the lopping centric information index, and the number of oxygen atoms. The model had high predictive power and robustness. The polar functional groups of the cations and the anion structures had little effect on the toxicity of ILs. The results provided insight into the toxicity mechanism of ILs. This study offers useful information for assessing the potential ecological risks of ILs and suggests ideas and directions for analysing different toxicity mechanisms.
With the improvement of computing power, ML has been widely used for the rapid prediction of IL toxicity because of its powerful fitting ability and high accuracy [14]. Zhang et al. [15] used interpretable deep neural networks to establish quantitative structure–property relationship (QSPR) models and achieved accurate prediction of octanol–water partition coefficients, contributing greatly to the prediction of properties of structural isomers and to the interpretability of deep learning (DL). The substructures of lipophilic molecules were identified by a Monte Carlo tree search method. Cho et al. [16] developed a comprehensive toxicity prediction model with a unified linear free energy relationship descriptor. Based on the calculated descriptors, the toxicity of ILs to Spodoptera exigua, earthworm, Caenorhabditis elegans, and zebrafish was accurately predicted by a linear regression model in SigmaPlot. Wang et al. [17] developed ML models based on SVM and a feedforward neural network to predict the cytotoxicity of ILs toward a rat leukaemia cell line from their structures. The results of five-fold cross-validation showed that both models had high prediction accuracy and good generalization ability. These studies demonstrated the strongly nonlinear relationship between the toxicity and the structure of ILs and the superiority of ML in handling complex fitting problems.
As an important branch of ML, DL has a strong feature-learning ability, and the learned features give a more essential representation of the data [18]. Deng et al. [19] used a convolutional neural network (CNN) and other DL models to accurately predict the critical properties of ILs from their molecular structures. The accuracy of the model was significantly higher than that of thermodynamic models (PR, SRK) and ML models (XGBoost). This conclusion is instructive for property prediction with combined models. Wang et al. [20] proposed a QSPR model that combined a tree-structured long short-term memory network with the signature descriptor for automatic feature selection, and then used a back-propagation network to accurately predict the octanol–water partition coefficients of organic compounds. The model showed excellent prediction accuracy and discriminative power for different types of isomers. Chen et al. [21] developed a neural network model to predict infinite-dilution activity coefficients. Each IL and solute was mapped by an entity-embedding neural network, and the nonlinear interactions between ILs and solutes were processed by neural collaborative filtering. On this basis, a comprehensive database covering 215 ILs and 112 solutes was established. The resulting model could accurately obtain the activity coefficients of IL–solute systems and enabled a wide-ranging extension of the UNIFAC model. Compared with established ML models, DL models can accurately extract the structural characteristics of ILs, avoiding the uncertainty of manual feature selection [22]. Studying how different methods characterize IL features clarifies the relationship between prediction accuracy and the completeness of the structural characterization.
The acetylcholinesterase inhibition test is an important index for the toxicity evaluation of ILs and is commonly used to study the effects of pollutants on the biological nervous system. Huang et al. [23] reported a new approach for the synthesis of novel pyrrolidine-based ILs, and bioactivity testing revealed that two of them have some acetylcholinesterase inhibitory activity. Although experimental testing is highly persuasive and trustworthy, applying it to large numbers of ILs can be prohibitively expensive.
In this study, a reliable DL model based on CNN and ML was established. To apply the feature information learned by the CNN model to regression prediction, the CNN was coupled with three typical ML models, multilayer perceptron (MLP), random forest (RF), and SVM, and the prediction of IL toxicity was realized. The established models can accurately predict the acetylcholinesterase inhibitory toxicity of ILs. Toxicity data for 266 ILs were collected, and 208 molecular descriptors were calculated to characterize their molecular structures. All toxicity data were obtained from the published literature, and the molecular descriptors were calculated using the open-source RDKit cheminformatics package. A CNN based on the computed descriptors was proposed to extract and learn the structural features of the ILs. Finally, based on the features extracted by the CNN, the acetylcholinesterase inhibitory toxicity of the ILs was accurately predicted using RF, SVM, and MLP. Compared with the models using feature engineering (FE), the results were substantially improved. The robustness of the models was assessed by cross-validation, and comparison with existing models in the literature demonstrates the improvement the proposed approach brings to QSAR model performance.

Datasets
The acetylcholinesterase inhibition toxicity data were collected from the literature [24-27]. All data shared the same toxicity endpoint: the concentration of a chemical inducing a specified response in 50% of the tested population after a specified exposure duration (EC50). In this study, the logarithms of these concentrations were used as the experimental toxicity values, with the concentration in μM (log EC50 [μM]). The collected data set included 266 different ILs composed of 115 cations and 41 anions. The Simplified Molecular Input Line Entry Specification (SMILES) of each IL was obtained from PubChem as the molecular representation. SMILES is a string encoding of molecular structure that uses different symbols to represent different structural features; for example, hydrogen cyanide, which contains a triple bond, is represented as C#N. The detailed data and corresponding SMILES strings for all ILs involved are listed in Table S1.
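As an illustration (the IL shown is a hypothetical example, not necessarily from the study's data set), an IL SMILES such as 1-ethyl-3-methylimidazolium chloride can be parsed with RDKit; the cation and anion are joined by the '.' separator mentioned below:

```python
from rdkit import Chem

# Hypothetical example IL: 1-ethyl-3-methylimidazolium chloride.
# The cation SMILES and anion SMILES are joined by a '.' separator.
smiles = "CCn1cc[n+](C)c1.[Cl-]"

mol = Chem.MolFromSmiles(smiles)   # parse the full ion pair
print(Chem.MolToSmiles(mol))       # canonical SMILES string
print(mol.GetNumAtoms())           # number of heavy atoms (here 9)
```

Any IL from Table S1 could be substituted for the example string, provided both ions are present in the SMILES.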
To transform the IL structures into numerical information that an ML model can recognize, the RDKit package in Python was used to convert the SMILES of the ILs into molecular descriptors characterizing their structures [28]. The SMILES of an IL consists of the anionic SMILES and the cationic SMILES connected by a '.' symbol. The RDKit package directly recognized and converted the overall SMILES of each IL, transforming it into 208 descriptors. Details of the descriptors are listed in Table S3 in the Supporting Information. Based on the generated descriptors, a data set characterizing the structures of the ILs was established. To avoid inaccurate results caused by large differences in magnitude between features, all features of the data set were rescaled using Min-Max normalization, performed with the scikit-learn library.
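A minimal sketch of this pipeline, assuming the full RDKit descriptor set (the example SMILES are illustrative; the study's actual descriptor list is in Table S3):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.preprocessing import MinMaxScaler

def descriptors_from_smiles(smiles):
    """Compute the full RDKit 2D descriptor set (~200 values) for one IL."""
    mol = Chem.MolFromSmiles(smiles)  # anion and cation joined by '.'
    return [fn(mol) for _, fn in Descriptors.descList]

# Hypothetical example ILs, for illustration only.
ils = ["CCn1cc[n+](C)c1.[Cl-]", "CCCCn1cc[n+](C)c1.[Br-]"]
X = np.array([descriptors_from_smiles(s) for s in ils])

# Min-Max scaling rescales every descriptor column to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)
```

Note that the exact number of descriptors in `Descriptors.descList` varies with the RDKit version, so the 208 descriptors of the paper correspond to the version used by the authors.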

Model training and evaluation index
First, the data set was randomly divided by the train_test_split function in the scikit-learn package in Python. Eighty percent of the data was used as the training set for model training and hyperparameter tuning, and 20% was used as the test set. Based on the calculated IL descriptors, a CNN was used to automate feature extraction and pre-learning of these descriptors to obtain the key feature information of the ILs. After extracting features with the CNN model and processing the data with FE, three commonly used ML algorithms (MLP, SVM, and RF) were used to establish accurate prediction models of acetylcholinesterase inhibition toxicity, giving six models in total. Ten-fold cross-validation was employed to improve the stability of the model results: the training set was divided equally into 10 parts, nine of which were used to train the model and one to test it; this process was repeated 10 times and the test results were averaged. In addition, the coefficient of determination (r2), mean squared error (MSE), and mean absolute error (MAE) were used to evaluate the models. The closer r2 is to 1 and the smaller MSE and MAE are, the more effective the model's predictive power. The detailed calculations are shown in Equations (1)-(3):

r^2 = 1 - \frac{\sum_{i=1}^{n} (y_{exp,i} - y_{cal,i})^2}{\sum_{i=1}^{n} (y_{exp,i} - \bar{y}_{exp})^2}  (1)

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{exp,i} - y_{cal,i})^2  (2)

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_{exp,i} - y_{cal,i}|  (3)

where y_cal and y_exp denote the predicted and experimental values of the acetylcholinesterase inhibition toxicity of the ILs, \bar{y}_{exp} denotes the mean experimental value, and n is the number of ILs.
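The split, the 10-fold cross-validation, and the three evaluation metrics can be sketched with scikit-learn as follows (the data here are synthetic stand-ins, used purely for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(266, 208))           # stand-in for the descriptor matrix
y = X[:, 0] + 0.1 * rng.normal(size=266)  # stand-in toxicity values

# 80/20 random split, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# 10-fold cross-validation on the training set.
cv_r2 = cross_val_score(model, X_tr, y_tr, cv=10, scoring="r2").mean()

# Fit on the full training set, evaluate on the held-out test set.
model.fit(X_tr, y_tr)
y_cal = model.predict(X_te)
r2 = r2_score(y_te, y_cal)
mse = mean_squared_error(y_te, y_cal)
mae = mean_absolute_error(y_te, y_cal)
```

The same three metric functions apply unchanged to the SVM and MLP models.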

Feature engineering (FE)
FE refers to filtering raw data through a series of criteria to obtain better feature vectors. It is an indispensable part of ML and occupies a very important position in the field. For supervised learning, the predictive ability of a model depends on the quality of the input data, so it is necessary to preprocess the descriptors. The FE used in this work was based on a feature selector (https://github.com/WillKoehrsen/featureselector#feature-selector-simple-feature-selection-in-python). With this selector, features with a pairwise linear correlation below 0.90, a missing-value rate below 0.6, and more than one unique value were retained. In the end, FE filtered out 112 features and retained 96 descriptors. Three models (FE-MLP, FE-RF, and FE-SVM) were established using these retained descriptors.
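The same kind of filtering can be sketched with pandas (the thresholds are those stated above; the helper function is an illustrative re-implementation, not the feature-selector package itself):

```python
import numpy as np
import pandas as pd

def filter_features(df, corr_thresh=0.90, missing_thresh=0.6):
    """Drop features that are mostly missing, single-valued,
    or highly correlated with an earlier feature."""
    # 1. Missing-value rate must be below the threshold.
    df = df.loc[:, df.isna().mean() < missing_thresh]
    # 2. Drop features with only one unique value.
    df = df.loc[:, df.nunique(dropna=True) > 1]
    # 3. Drop one feature from each highly correlated pair.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] >= corr_thresh).any()]
    return df.drop(columns=drop)

# Tiny illustrative frame: 'b' duplicates 'a', 'c' is constant.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [2.0, 4.0, 6.0, 8.0],
                   "c": [1.0, 1.0, 1.0, 1.0],
                   "d": [4.0, 1.0, 3.0, 2.0]})
kept = filter_features(df)   # keeps 'a' and 'd'
```

Applied to the 208-descriptor data set, this style of filter yields the reduced descriptor table used for the FE models.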

Convolutional neural network (CNN)
CNN is a deep neural network with a convolutional structure [29,30] built on three key ideas: local receptive fields, weight sharing, and pooling layers. The nature of the convolution kernels and pooling computations helps the extracted features resist overfitting, and translation invariance eliminates the need for sample reconstruction. More accurate feature extraction characterizes the structural information of the ILs more faithfully, thereby improving the accuracy of the regression model. A CNN can control the fitting ability of the whole model through the convolution and pooling operations and the size of the final output feature vector; in the case of overfitting or underfitting, the dimensions of the convolution layers, pooling layers, and feature vector can be adjusted to obtain a more appropriate representation. Therefore, compared with other feature extraction methods, CNN offers higher flexibility and accuracy. The CNN model was implemented with the Keras toolkit (https://github.com/fchollet/keras) in Python. The structure of the CNN model is shown in Figure 1.
To improve the feature extraction ability of the CNN and obtain the most accurate characterization of the IL structures, its key parameters must be optimized [31], such as the optimizer, activation function, learning rate, and the number of nodes in the feature extraction layer. The optimization was performed by grid search [32]. Finally, Adam was selected as the optimizer; it is computationally simple and efficient, has low memory requirements [33], its parameter updates are not affected by rescaling of the gradient, and it is suitable for unstable optimization objectives. The feature extraction layer has 32 nodes, and the activation function is the hyperbolic tangent, calculated as shown in Equation (4):

tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}  (4)

Compared with other activation functions, the hyperbolic tangent reduces the vanishing-gradient problem and has better fault tolerance.
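A minimal Keras sketch of such a network follows. Only the 32-node tanh extraction layer and the Adam optimizer come from the text; the convolution and pooling sizes are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_desc = 208  # number of molecular descriptors per IL

# 1-D CNN over the descriptor vector; the 32-node tanh Dense layer
# is the feature-extraction layer described in the text.
inputs = keras.Input(shape=(n_desc, 1))
x = layers.Conv1D(16, kernel_size=3, activation="tanh")(inputs)  # illustrative size
x = layers.MaxPooling1D(pool_size=2)(x)
x = layers.Flatten()(x)
features = layers.Dense(32, activation="tanh", name="features")(x)
outputs = layers.Dense(1)(features)  # predicted log EC50

model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")

# After training, the 32-dimensional feature vectors fed to RF/SVM/MLP
# are read from the intermediate "features" layer.
extractor = keras.Model(inputs, model.get_layer("features").output)
X = np.random.rand(4, n_desc, 1).astype("float32")
feats = extractor.predict(X, verbose=0)  # shape (4, 32)
```

The `extractor` sub-model is what connects this section to the downstream ML regressors: it maps each descriptor vector to a 32-dimensional learned feature vector.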

Hyperparameter optimization of the models
Based on the preprocessed data set, which better represents the structural information of the ILs, pre-learning and pre-extraction of features were realized through the convolutional kernels of the CNN. The IL structural information in the training set was processed by the CNN and transformed into 32-dimensional feature vectors. The trained CNN model was then used to generate feature vectors for the ILs in the test set, and the obtained vectors were used for the subsequent regression model fitting and evaluation.
For the RF algorithm [34], N_estimators (the number of trees) and Max_depth (the maximum tree depth) are generally adjusted to control the bagging error and determine the optimal model [35]. Different hyperparameter values give the model different prediction accuracies. By optimizing N_estimators and Max_depth through grid search, the model attains its best predictive capability [36]; all other parameters were kept at their scikit-learn defaults. The MSE of RF models with different hyperparameter combinations in the cross-validation is shown in Figure 2. As shown in Figure 2, the MSE of the model decreases significantly as N_estimators increases, whereas the effect of Max_depth on the MSE is less clear-cut. SVM is a commonly used ML algorithm for data analysis and nonlinear modelling [37]. After the radial basis function was selected as the kernel of the SVM model [39], the most important parameters, the regularization parameter C and the influence-range parameter Gamma (γ), were optimized by grid search [38]. The data distribution in the new feature space is implicitly affected by γ: the larger γ is, the fewer support vectors there are, and the smaller γ is, the more support vectors there are. The number of support vectors influences the speed of training and prediction. As shown in Figure 3, changes in C and γ strongly affected the model error. The hyperparameter optimization results of the FE-SVM and CNN-SVM models are shown in Figure 3(a,b), respectively. The optimal values of C and γ were 45 and 0.15 for the FE-SVM model, and 5 and 0.05 for the CNN-SVM model.
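Grid search with cross-validation over these hyperparameters can be sketched as follows (the candidate grids and the data are illustrative, not the study's full search space):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 32))   # stand-in for 32-dim CNN feature vectors
y = X[:, :3].sum(axis=1)         # stand-in toxicity values

# SVR with an RBF kernel: tune the regularization parameter C
# and the kernel-width parameter gamma.
svr_grid = GridSearchCV(SVR(kernel="rbf"),
                        {"C": [1, 5, 45], "gamma": [0.05, 0.15, 0.5]},
                        cv=10, scoring="neg_mean_squared_error")
svr_grid.fit(X, y)

# RF: tune the number of trees and the maximum tree depth.
rf_grid = GridSearchCV(RandomForestRegressor(random_state=0),
                       {"n_estimators": [35, 85], "max_depth": [10, 80]},
                       cv=10, scoring="neg_mean_squared_error")
rf_grid.fit(X, y)

print(svr_grid.best_params_)
print(rf_grid.best_params_)
```

`GridSearchCV` refits the best combination on the full training data, so `svr_grid` and `rf_grid` can be used directly as the tuned regressors.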
MLP is the simplest artificial neural network, consisting of an input layer, hidden layers, and an output layer [40,41]. The layers are connected through nodes, with the output of one layer serving as the input of the next, layer by layer until the output layer produces the result. The relationships between nodes are transformed by activation functions and other parameters, allowing more complex information to be represented; therefore, deep neural network models handle nonlinear relationships in data better than linear models. The most commonly used activation functions are ReLU, sigmoid, and tanh. Training proceeds by forward propagation from the input layer to the output layer; after the error between the predicted and experimental values is obtained, it is backpropagated to update the node weights of each layer. After repeated forward and backward propagation, training of the final model is complete when the error reaches its minimum and no longer decreases. The MLP model used in this work is a typical feedforward neural network that can partition complex spaces and fit strongly nonlinear data by combining multiple perceptrons [42]. For the MLP, the stability of the model can be controlled, and overfitting or underfitting reduced, by adjusting the number of hidden layers and their nodes [43]. In addition, an activation function transforms each hidden layer of the MLP to produce the input of the next layer. The structure of the MLP model is shown in Figure 4.
After grid search, the activation function of the MLP model was selected [44], and two hidden layers were used. The optimization process is shown in Figure 5: Figure 5(a) shows the hyperparameter optimization results of the FE-MLP model, and Figure 5(b) shows those of the CNN-MLP model. The numbers of neurons in the hidden layers were optimized. The optimal hidden layer sizes were (85, 45) for the FE-MLP model and (90, 75) for the CNN-MLP model.
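The hidden-layer search can be sketched with scikit-learn's MLPRegressor; the two candidate sizes are the optima reported above, and the data are a synthetic stand-in:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 32))   # stand-in for 32-dim CNN feature vectors
y = X[:, 0] - X[:, 1]            # stand-in toxicity values

# Two hidden layers; grid search over their sizes.
grid = GridSearchCV(
    MLPRegressor(max_iter=500, random_state=0),
    {"hidden_layer_sizes": [(85, 45), (90, 75)]},
    cv=5, scoring="r2")
grid.fit(X, y)
print(grid.best_params_["hidden_layer_sizes"])
```

In practice the grid would also cover the activation function, mirroring the selection described in the text.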

Performance of models
The IL descriptor information was input into the RF, SVM, and MLP models in vector format for training. After the CNN-ML models were established, the test set was used to evaluate them. The distributions of the predicted values of the FE-MLP, FE-RF, FE-SVM, CNN-MLP, CNN-RF, and CNN-SVM models for all data points are shown in Figure 6. As can be seen, all six models obtained good results on the data set and accurately predicted most of the data points. After 10-fold cross-validation, each model showed very stable predictive power. On the training set, the r2 values of the FE-MLP, FE-RF, FE-SVM, CNN-MLP, CNN-RF, and CNN-SVM models were 0.904, 0.955, 0.965, 0.968, 0.987, and 0.968, respectively; on the test set they were 0.815, 0.897, 0.754, 0.945, 0.933, and 0.947, respectively. The feature extraction of the CNN model thus enhanced model performance. To evaluate the models more comprehensively, the prediction error (PE) of the FE-MLP, FE-RF, FE-SVM, CNN-MLP, CNN-RF, and CNN-SVM models at each data point was calculated. As shown in Figure 7, the PE of most data points in the FE-ML and CNN-ML models was between −1 and 1; however, the FE-SVM model showed noticeably larger deviations on the test set. For the CNN-ML models, the PE of each model was essentially within ±0.6. Compared with the FE-ML models, the CNN-ML models had higher accuracy and a narrower error range.

Comparison with previous models
For the prediction of the acetylcholinesterase inhibitory toxicity of ILs, Torrecilla et al. [45] established an MLP model based on component descriptors, and Yan et al. [46] established a multiple linear regression model based on topological indices of ILs. Those models had good prediction ability. The prediction results of the previously established models and of this work were compared, and the detailed r2 statistics for the training and test sets are shown in Table 1. The CNN-SVM and CNN-MLP models had the highest accuracy on the data set as well as the best performance on the test set, outperforming the literature models in terms of r2, MAE, and MSE. Compared with feature selection and manual feature selection, extracting the local relationships of the molecular descriptors through the CNN model better characterizes the feature relationships of the molecules, which greatly helps the final regression and improves model performance. The results showed that DL-based feature extraction captures the acetylcholinesterase inhibitory toxicity of ILs more accurately than manually selected descriptors. The high precision of the model is valuable for the screening of low-toxicity ILs.

Conclusion
In this work, acetylcholinesterase inhibitory toxicity data for 266 ILs were collected to establish a data set. Based on the SMILES of the IL structures, 208 descriptors were calculated. Six models (FE-MLP, FE-RF, FE-SVM, CNN-MLP, CNN-RF, and CNN-SVM) were constructed using FE and CNN and optimized by grid search. After the best structural parameters were determined, the training set was used to train the models and the test set to evaluate them. The results showed that all six models can predict the toxicity of ILs satisfactorily, with the CNN-ML models achieving the highest accuracy. This showed that feature extraction through the CNN model was more accurate than manual selection or FE. Furthermore, the rapid screening of low-toxicity ILs by this model is of great significance for the application of ILs.
Unlike traditional property prediction models, key features can be extracted using the powerful feature-extraction capability of the CNN model. In this process, feature extraction and learning are realized automatically, and errors caused by manual feature selection are avoided. The code has been uploaded to GitHub (https://github.com/Blueone123/cnn-ml.git). This demonstrates that DL is a promising intelligent method for the toxicity prediction of ILs.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The work is supported by the National Natural Science Foundation of China (No. 22078166).

Figure 1 .
Figure 1. Graphical interpretation of the CNN model.
Figure 2(a) presented the hyperparameter optimization results of the FE-RF model, and Figure 2(b) presented those of the CNN-RF model. The optimal values of N_estimators and Max_depth were 35 and 80 for the FE-RF model, and 85 and 10 for the CNN-RF model.

Figure 4 .
Figure 4. Graphical interpretation of the MLP model.

Figure 6 .
Figure 6. Comparison of predicted and experimental values for all data points in the FE-ML and CNN-ML models. (a) FE-MLP model; (b) FE-RF model; (c) FE-SVM model; (d) CNN-MLP model; (e) CNN-RF model; (f) CNN-SVM model.

Table 1 .
r2, MAE, and MSE of the six established models (FE-MLP, FE-RF, FE-SVM, CNN-MLP, CNN-RF, and CNN-SVM) compared with models from the literature.