In silico QSAR modeling to predict the safe use of antibiotics during pregnancy

Abstract The use of medicines during pregnancy is a growing public health concern due to the risk of developmental toxicity. Healthcare providers heavily rely on the FDA pregnancy risk categories (A, B, C, D, and X). Antibiotics are among the most prescribed drugs during pregnancy and are often listed under category B or C. However, the risk-benefit assessment may be lacking due to challenges in the clinical toxicology studies on pregnant women, such as ethical concerns. The primary focus of this study is to generate a model that predicts the safe use of antibiotics during pregnancy by using in silico approaches. Thus, a QSAR model was created to assess the FDA pregnancy category (B or C) of antibiotics. The dataset consisted of 97 antibiotics obtained from the FDA. A total of 6420 molecular descriptors were determined via multiple software and various machine learning algorithms were utilized. The performance of the models was measured using internal and external validation. The accuracy (ACC) values of the most successful model were 83.82% for the internal and 94.11% for the external validation. Sensitivity (SE), specificity (SP), MCC, and ROC values were 0.878, 0.778, 0.68, and 0.892 for the internal validation and 0.9, 1, 0.887, and 0.936 for the external validation, respectively. Kappa statistics also indicate that there was a substantial agreement for internal validation with 0.6765 and an almost perfect agreement for external validation with 0.8811. In conclusion, our model can be used as an initial step before pre-clinical and clinical studies to predict the safe use of antibiotics in pregnancy.


Introduction
The safe use of medication has the highest priority in the treatment of pregnant women, as any adverse effect on normal fetal development can lead to lifelong negative consequences (Dathe and Schaefer 2019). Therefore, to avoid the risk of developmental toxicity, the US Food and Drug Administration (FDA) has divided pharmaceuticals into five categories (A, B, C, D, and X) for drug use during pregnancy. While A is considered the safest category and showed no risk in human studies, category X is contraindicated in pregnancy. The drug molecules listed in category B showed no developmental toxicity in animal studies, but there are no adequate clinical studies on pregnant women. Although category B is slightly riskier than A, both categories are approved safe to use during pregnancy. In contrast to these two categories, there is evidence of teratogenicity in animal studies of drugs in category C. Thus, it is not recommended to use category C drugs during pregnancy, except in obligatory situations. Human studies show that group D drugs may pose a risk to the fetus. Furthermore, pharmaceuticals in category X have been proven to cause human birth defects. Consequently, determining the category of medicines used during pregnancy is crucial to avoid any adverse effects on the fetus and newborn (Salim 2014).
Antibiotics are one of the most prescribed drug groups during pregnancy (Haas et al. 2018). Among the five pregnancy categories, most antibiotics are listed in categories B or C, and an extremely small proportion in categories D or X. For example, tetracyclines are listed in pregnancy category D, many fluoroquinolones are in category C, and cephalosporins and penicillins are mostly in category B. On the other hand, there are several antibiotics whose developmental toxicology tests have not yet been completed (hence the pregnancy category has not been determined) (RxMediaPharma V R 2022). In addition, new molecules with antibiotic effects continue to be synthesized (Hutchings et al. 2019). Drug safety and efficacy are determined by in vitro and in vivo tests in pharmacology and pharmaceutical toxicology studies. Compared with experimental approaches, which are costly, time-consuming, and ethically problematic, computational methods offer great advantages: they are fast, inexpensive, and accurate when validated appropriately. Therefore, non-animal-based tests and prediction strategies can be used as an initial step before pre-clinical and clinical studies to predict the safe use of antibiotics whose pregnancy category has not yet been determined. Invasive clinical studies are scarce in the literature because they pose ethical problems for pregnant women. One of the main goals of our study is to help fill this gap in the literature. Moreover, this approach contributes to the 'Reduction' and 'Replacement' principles of the 4 R rule, both preventing unnecessary animal use in scientific studies and providing economic benefits (Jiang et al. 2019).
Quantitative structure-activity relationship (QSAR), one of the well-known in silico studies, is a promising approach to replace the animal-based test models (Madden et al. 2020). It is one of the widely used computational or mathematical modeling methods in predictive toxicology that correlates the chemical structure of a molecule with the toxicity endpoint. The method is based on the hypothesis that similar molecules exhibit similar biological activity. For this reason, finding the most critical molecular descriptors is important in developing a functional predictive QSAR model (Balekundri et al. 2015).
In predictive toxicology, machine learning methods form the basis of in silico methods (Wang et al. 2021). Machine learning utilizes different algorithms that take and evaluate input data to predict output values (Sarker 2021). QSAR modeling relies on machine learning classification algorithms, such as k-Nearest Neighbors (kNN), Logistic Regression, Support Vector Machines (SVMs), Decision Trees (J48), Naive Bayes (NB), random forest (RF), non-nested generalization exemplars (NNge), and KStar (Wang et al. 2021).
The main purpose of this study is to generate a QSAR model to assess the FDA pregnancy category (B or C) of antibiotics. An extensive number of antibiotics from categories B and C were collected to develop a machine learning model. Molecular descriptors of the collected antibiotics with similar chemical structures are used to build binary classification models. A total of 15 QSAR models are established by applying five different machine learning algorithms, namely Bayesian Network Classifier (BayesNet), SVM, k-Nearest Neighborhood (kNN), C4.5 Decision Tree (J48), and RF, to the descriptors selected by WEKA from three descriptor packages (PaDEL, T.E.S.T., and alvaDesc). Ten-fold cross-validation is used as internal validation to evaluate the performance of the models. The predictive ability of the generated models is evaluated on the test set as external validation. All models are compared using commonly used evaluation criteria, and the most successful model is chosen accordingly.

Materials and methods
In this study, our goal is to create an effective QSAR model that can predict the safe use of antibiotics during pregnancy. We specifically concentrated on the binary classification of the FDA pregnancy risk category of antibiotics, namely B and C. Since our study was binary classification and we focused on whether antibiotics could be used during pregnancy, we based our model plan on these categories. Category D and X antibiotics were excluded from the study so as not to jeopardize our chemical structure diversity since there is a very small number of them. FDA-approved antibiotics have been collected as a dataset and used for model development by machine learning techniques. In the following subsections, detailed information on the collection of the dataset, calculation of molecular descriptors, and implementation of machine learning techniques are given.

Data collection
A dataset of 172 FDA-approved antibiotics from different pharmacological groups, each labeled as pregnancy risk category B or C, was collected from the official database of the FDA (https://www.fda.gov/). The chemical structures and associated data files of all these compounds were downloaded in two-dimensional structure data file (2D SDF) format from the publicly available chemistry database PubChem (https://pubchem.ncbi.nlm.nih.gov/) (Kim et al. 2021).
The quality of the collected data is the most critical factor in machine learning model building (Hongbin et al. 2018). Therefore, before employing machine learning methods, we cleaned the data and verified its quality. First, we identified and removed corrupted and irrelevant data from the dataset. Then, the inorganic compounds, salts, and aromaticity in the molecule were removed, followed by the removal of duplicated compounds. After the data cleaning process, 97 antibiotics remained as QSAR-ready structures. The remaining 97 antibiotics were then divided into the training and external validation sets by using an unsupervised filter in the machine learning software Waikato Environment for Knowledge Analysis (WEKA) version 3.9.5 (Frank et al. 2016). After splitting the dataset in an 8:2 ratio, the training set contained 80 compounds and the external validation set contained 17 compounds (Table 1). After dividing the dataset, we verified that the training set compounds span the entire chemical space of all the dataset compounds. This is an important requirement for the QSAR models to be meaningful.
The antibiotic name, pharmacological group, FDA pregnancy category (B or C), Chemical Abstracts Service (CAS) number, molecular weight (MW), molecular formula, and canonical SMILES string of each antibiotic in the dataset are presented in Supplementary File 1, Table S1.

Molecular descriptors
The calculation of molecular descriptors is an important task in QSAR modeling since they are the mathematical representation of the compounds (Mauri 2020). With the assistance of identifiers and fingerprints of chemicals, a successful QSAR model can be created. In this study, molecular descriptors were calculated with three software packages, i.e., PaDEL-Descriptor (Yap 2011), Toxicity Estimation Software Tool (T.E.S.T. 5.1.1) (Todd 2020), and alvaDesc (alvaDesc version 1.0.14, Lecco, Italy) (Alvascience-Srl 2019), to characterize the compounds. PaDEL (PaDEL-Descriptor version 2.21) is open-source software that calculated 1444 2D-descriptors (Yap 2011). T.E.S.T. 5.1 is also open-source software and calculated a total of 797 2D-descriptors from each SDF file. alvaDesc (Alvascience Srl 2019) calculated 4179 2D-molecular descriptors. Using the three software programs, we generated a total of 6420 2D-molecular descriptors. For more information, the interested reader is referred to Yap (2011), Mauri (2020), and Todd (2020). The calculated descriptors had varying scales, so before modeling we normalized the dataset to bring the numeric values to a single scale without distorting differences in the range of values. An unsupervised filter of WEKA version 3.9.5 (Frank et al. 2016) was used to normalize the descriptor values. The software is available at https://www.cs.waikato.ac.nz/ml/weka. We used the default settings of WEKA, so that after normalization all descriptor values were on a comparable scale ranging from 0 to 1.
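The rescaling to the 0 to 1 range described above corresponds to min-max normalization. The following is a minimal Python sketch of that idea, not the WEKA filter itself; the function name `min_max_normalize`, the example values, and the choice to map constant columns to 0.0 are assumptions made for illustration.

```python
def min_max_normalize(columns):
    """Scale each descriptor column to the range [0, 1].

    `columns` is a list of lists: one inner list per descriptor, one entry
    per compound. Constant columns are mapped to 0.0 (an assumption made
    here; the text does not say how such columns were handled).
    """
    normalized = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        normalized.append([(v - lo) / span if span else 0.0 for v in col])
    return normalized


# Hypothetical descriptor values for three compounds (two descriptors).
descriptors = [[102.0, 450.5, 1753.6], [-18.0, 0.5, 2.2]]
scaled = min_max_normalize(descriptors)
```

After this step the smallest value of each descriptor becomes 0 and the largest becomes 1, so descriptors with very different native ranges contribute on a comparable scale to distance-based methods such as kNN.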

Machine learning methods
In this study, five machine learning algorithms were applied to each of the three molecular descriptor sets calculated with PaDEL, T.E.S.T., and alvaDesc. WEKA was used to implement the algorithms, which are BayesNet, SVM, kNN, J48, and RF.

Bayesian network
BayesNet is a machine learning algorithm used for data modeling in computational science (Friedman et al. 1997). Bayesian networks are statistical networks in which the edges that connect the nodes are chosen based on statistical considerations. They are directed acyclic graphs (DAGs), with each node representing a different variable. BayesNet can also characterize the ordering of these random variables.
BayesNet is a statistical graphical model based on Bayes' Theorem (Hogg and Elliot 1997). A brief statement of the Theorem is as follows.
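Bayes' Theorem: The conditional probability of $E_j$, given $A$, is calculated from the probabilities of $E_1, E_2, \ldots, E_k$ and the conditional probabilities of $A$ given $E_i$, $i = 1, 2, \ldots, k$, by

$$P(E_j \mid A) = \frac{P(E_j)\,P(A \mid E_j)}{\sum_{i=1}^{k} P(E_i)\,P(A \mid E_i)} \qquad (1)$$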

Support vector machine
SVM is among the most widely used machine learning algorithms for classification problems (Cortes et al. 1995). SVM works by locating an optimal hyperplane that divides the data into two groups. The hyperplane is constructed in such a way that it only takes into account the training instances that are closest to the boundary that best separates the groups.
In this study, we used the Sequential Minimal Optimization algorithm (Platt 1998), which is the specific optimization algorithm used to train the SVM classifier and is implemented by SMO in WEKA.

k-Nearest neighborhood
The kNN algorithm uses a distance measure to locate the k most similar samples in the training data for each sample in the test set. kNN only uses the raw training data to predict the class of the samples in the test set, which means the algorithm does not create an explicit model (Altman 1992).
k is the number of neighbors that the algorithm searches to determine the class of the new data. In general, k is chosen as a small positive number, though the selection of k for the best prediction is dependent on the size of the data. In this study, we set the parameter k to 3.
WEKA provides different distance measuring functions, such as Chebyshev, Euclidean, Filtered, Manhattan, and Minkowski, for nearest neighbor search. We used the Euclidean distance to measure the distance between instances. The Euclidean distance between two points x and y is (Deza and Deza 2009)

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)$$

where n is the number of descriptors.
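The kNN procedure with k = 3 and the Euclidean distance can be illustrated with a short Python sketch. This is not the WEKA implementation used in the study; the function names and the toy training vectors are hypothetical and serve only to show the voting logic.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train, query, k=3):
    """Return the majority class among the k training samples nearest to query.

    `train` is a list of (descriptor_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical normalized descriptor vectors labeled with category B or C.
train = [
    ([0.10, 0.20], "B"),
    ([0.20, 0.10], "B"),
    ([0.15, 0.25], "B"),
    ([0.90, 0.80], "C"),
    ([0.80, 0.90], "C"),
]
pred = knn_predict(train, [0.2, 0.2])
```

With two classes and an odd k, the vote cannot end in a tie, which is one practical reason for choosing a value such as 3.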

C4.5 decision tree
The Decision Tree (DT) algorithm is another important machine learning technique in modeling, and it has numerous applications. C4.5 is one of the DT algorithms (Quinlan 1993) and is implemented by J48 in WEKA. It is a technique for generating a decision based on a sample of data. During the J48 process, the information in the training data is modeled as a tree. A J48 tree resembles a flowchart, with each internal node representing a test on an input attribute, each branch representing the outcome of the test, and each leaf node representing a class. During model building, the training set is divided into subsets based on the values of the attributes, and this process is repeated recursively. The iteration is repeated until further splitting has no effect on the prediction.

Random forest
RF is a well-known machine learning approach based on the ensemble learning principle. The strategy seeks to enhance the model's performance by running numerous DTs on different subsets of the provided dataset and taking the average.

Implementing the machine learning algorithms in WEKA
The QSAR models were validated by both internal and external validation. Ten-fold cross-validation was applied to all the machine learning models for the internal validation. k-fold cross-validation is a technique used to ensure that machine learning models perform well on unseen data. k is a chosen number that splits the data sample into k groups with the same number of samples. In each cycle, k − 1 of the k groups were used as training data, while the remaining group was kept as a test set to validate the model. Once one testing cycle was completed, new training and test sets were created and the same procedure was performed on them. The process was completed after the cycle had been repeated k times. The success of the model is the sum of the performances of all k cycles divided by k. In our study, k was set to 10.
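The fold rotation described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the folds are taken in order without shuffling or class stratification (how the folds were randomized in WEKA is not described here), and the function names are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Fold sizes differ by at most one sample when n_samples is not a
    multiple of k.
    """
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validated_score(fold_scores):
    """Average the per-fold performance values."""
    return sum(fold_scores) / len(fold_scores)


# Example: an 80-compound training set split into 10 folds.
folds = list(k_fold_indices(80, k=10))
```

Each compound appears in exactly one test fold, so every sample is used for validation once and for training k − 1 times.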
On the other hand, we also used external validation, which involves the use of independently derived datasets (true external set), to validate the performance of a model that was trained on initial input data. The training set was used to train or learn the model, and the test set was used to validate the model. The predictive ability of the models on the test set provided external validation.

Attribute selection
Attribute selection aids in determining which attributes are most important when developing a model. We performed attribute selection using WEKA's information-based (InfoGainAttributeEval) and correlation-based (CfsSubsetEval) attribute evaluators. InfoGainAttributeEval calculates the quality of an attribute by measuring the information gained with respect to the class. It is used with the Ranker search method, which ranks attributes by their individual evaluations. CfsSubsetEval calculates the quality of a subset of attributes by considering each feature's individual predictive ability as well as the degree of redundancy between them. It was used with the BestFirst search method, which searches the space of attribute subsets (Hall 1998). The attribute selection for the generated QSAR models is shown in Table 2.
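The information gain criterion can be illustrated with a small Python sketch. It assumes the attribute has already been discretized into a few levels (the discretization of numeric descriptors is not covered here), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """H(class) - H(class | attribute) for a discrete attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy example: four compounds, two per pregnancy category.
labels = ["B", "B", "C", "C"]
gain_informative = information_gain(["low", "low", "high", "high"], labels)
gain_uninformative = information_gain(["low", "high", "low", "high"], labels)
```

An attribute that separates the two categories perfectly receives the maximum gain of 1 bit, whereas an attribute that is independent of the category receives a gain of 0, so ranking by this score pushes the most class-relevant descriptors to the top.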

Model performance measurement
The performance of the models was evaluated with internal and external validation. We performed a ten-fold cross-validation technique on the training set to assess the models for internal validation. The external validation was given by the results when the test set was tested by the trained models. The predictability and reliability of the models were evaluated using model accuracy (ACC), sensitivity (SE), specificity (SP), Matthews correlation coefficient (MCC) parameters, kappa statistic, error rates, and receiver operating characteristics (ROCs) analysis.
All the parameters were calculated based on the confusion matrix, which consists of the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts.
ACC is the percentage of correctly classified instances in the dataset and is calculated using the following equation,

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (3)$$

SE is the TP rate, which is the proportion of TPs among all the instances that are truly positive. The best SE value is 1.0 and the worst is 0.0, and SE is calculated as follows,

$$\mathrm{SE} = \frac{TP}{TP + FN} \qquad (4)$$

SP is the TN rate, which is the proportion of TNs among all the instances that are truly negative. The best SP value is 1.0 and the worst is 0.0, and SP is calculated as follows,

$$\mathrm{SP} = \frac{TN}{TN + FP} \qquad (5)$$

MCC assesses the quality of the binary classification by taking the imbalance of positive and negative cases into account. MCC ranges from −1.0 to 1.0, where 1.0 indicates perfect prediction, 0.0 indicates a prediction no better than random, and −1.0 indicates total disagreement. MCC is calculated as follows,

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (6)$$

The kappa statistic is a measure that compares an observed ACC with an expected ACC. Kappa values of 0 or below indicate no agreement, 0-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1.0 almost perfect agreement. The kappa statistic is calculated as follows,

$$\kappa = \frac{p_0 - p_e}{1 - p_e} \qquad (7)$$

where $p_0$ is the proportion of observed agreement and $p_e$ is the proportion of expected agreement. The mean absolute error is a metric used to assess how close predictions are to the actual values. The root mean squared error is the square root of the mean of the squared errors.
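As a worked illustration of these measures, the Python sketch below computes them from the four confusion matrix counts. The counts in the example (TP = 9, TN = 7, FP = 0, FN = 1) are hypothetical: they were chosen for a 17-compound set so that the results are consistent with the external validation values reported for the best model in the Abstract, and they are not taken from the study's tables. The function name is also an assumption.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, SE, SP, MCC, and kappa from confusion matrix counts."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Observed agreement equals the accuracy; expected agreement is the
    # chance agreement implied by the row and column totals.
    p_o = acc
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e)
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc, "kappa": kappa}


# Hypothetical counts for a 17-compound external set.
m = classification_metrics(tp=9, tn=7, fp=0, fn=1)
```

For these counts the sketch gives an accuracy of 16/17 (about 94.1%), SE = 0.9, SP = 1.0, MCC of about 0.887, and kappa of about 0.881, which falls in the "almost perfect agreement" band of the kappa scale.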
In addition to the parameters listed above, ROC analysis was used to measure the ACC of the models. The ROC curve value represents the relationship between two parameters: SE and SP. According to the area under the ROC curve, the model results between 0.5 and 0.6 are considered worthless, 0.6-0.7 are considered poor, 0.7-0.8 are considered acceptable, 0.8-0.9 are considered good and more than 0.9 is considered excellent (Mandrekar 2010).
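The area under the ROC curve can be read as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The Python sketch below uses this pairwise interpretation; it is an illustration only, and the function name, scores, and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` uses 1 for the positive class and 0 for the negative class.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted scores for five compounds.
auc = roc_auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0])
```

In this toy case five of the six positive-negative pairs are ordered correctly, giving an area of 5/6 (about 0.83).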
The Topliss ratio, i.e., the ratio between the number of molecules in the training set and the number of descriptors utilized in the model, must be at least 5:1 to ensure model validity (OECD 2007;Cherkasov et al. 2014). To check model validity, the Topliss ratio was determined for each constructed QSAR model in this study.
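The Topliss check is a simple division, as the Python sketch below shows. The example uses 97 chemicals and 13 descriptors, the pair of numbers quoted for this ratio later in the text; the function names are hypothetical.

```python
def topliss_ratio(n_compounds, n_descriptors):
    """Number of compounds per descriptor used in the model."""
    return n_compounds / n_descriptors


def satisfies_topliss(n_compounds, n_descriptors, minimum=5.0):
    """True when the compound-to-descriptor ratio is at least `minimum`."""
    return topliss_ratio(n_compounds, n_descriptors) >= minimum


ratio = topliss_ratio(97, 13)
ok = satisfies_topliss(97, 13)
```

Here the ratio is about 7.46, which is above the 5:1 threshold.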

Applicability domain
A defined applicability domain of a QSAR model is a necessary principle for the validation of QSAR models according to the OECD's guidelines (OECD 2007). The applicability domain refers to the need to define the scope and boundaries of a model based on the information in the QSAR model generated from the training set. The purpose of the applicability domain is to determine the applied range of models and check whether the chemical to be tested is suitable for the model. In this study, we determined the applicability domain of the models by chemical diversity. We employed chemical space and the Tanimoto similarity index to examine chemical diversity to build a reliable predictive QSAR model.

Results
We created fifteen QSAR models from three descriptor packages, PaDEL, T.E.S.T., and alvaDesc, using five different machine learning algorithms: BayesNet, SVM, kNN, J48, and RF. Internal and external validations were used to assess the models' reliability and predictability. Ten-fold cross-validation was used as internal validation, and the predictive ability of the trained models on the test set provided external validation. In addition, for each constructed QSAR model, the Topliss ratio was determined to ensure model validity. Before comparing the performance of the models, the chemical diversity of the datasets is discussed.

Datasets and chemical diversity analysis
The dataset in our study contained 97 FDA-approved antibiotics from different pharmacological groups labeled as pregnancy risk category B or C. WEKA was used to separate the 97 chemicals into two sets: a training set and an external validation set. After splitting the dataset, the training set had 80 compounds, while the external validation set contained 17 compounds (Table 1). Chemical diversity in the dataset is important for building a reliable predictive classification model, because a reliable prediction can only be generated within the chemical descriptor space spanned by the training set compounds. We employed chemical space and the Tanimoto similarity index to investigate chemical diversity.
The chemical space distribution of the training set and the external validation set, plotted with MATLAB_R2021a, is shown in Figure 1. The chemical space was defined by the MW and Ghose-Crippen LogKow (AlogP) of each chemical in the datasets. The MWs of the compounds ranged between 102.043 and 1753.64, and AlogP ranged between −18.03 and 2.23 (Figure 1). These values indicated that the data in the training and external validation sets are scattered in the same chemical space.
A quantitative interpretation of similarity can be obtained by using ISIS molecular keys and molecular proximity parameters such as the Tanimoto coefficient (OECD 2007). The Tanimoto coefficient, which ranges from 0 to 1, was used to assess chemical diversity. A high Tanimoto similarity score means that the compounds are more similar, whereas a low Tanimoto similarity score means that the compounds are more diverse. In our dataset, the average Tanimoto similarity scores were 0.121 for the training set and 0.138 for the external validation set. These findings revealed that the datasets had a wide range of chemical diversity. Together, the chemical space distribution and the Tanimoto similarity scores indicated that the models have a reasonable applicability domain.
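For binary fingerprints, the Tanimoto coefficient is the number of shared "on" bits divided by the number of bits set in either fingerprint. The Python sketch below illustrates this and the averaging over all compound pairs; it represents fingerprints as sets of bit positions, which is an assumption made for brevity, and the fingerprints and function names are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits.

    Two empty fingerprints are treated as identical (a convention assumed
    here).
    """
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto coefficient over all distinct pairs of compounds."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical fingerprints for three compounds.
fp1, fp2, fp3 = {1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}
mean_similarity = mean_pairwise_tanimoto([fp1, fp2, fp3])
```

A low average, as in this toy example (1/9), means that most pairs of compounds share few structural keys, which is the sense in which low scores indicate a diverse set.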

The selected descriptors for the QSAR models
We used three software packages, i.e., PaDEL, alvaDesc, and T.E.S.T., for descriptor computation. While some descriptor classes are shared by all software packages, others are distinct, and not all descriptor classes are available in each package. The disparity in the number of descriptors generated by the software demonstrates this. T.E.S.T. provides 797 descriptors in 13 different classes. alvaDesc gives 4179 descriptors divided into 33 classes. PaDEL provides 1444 descriptors in over 60 classes. We developed our machine learning models with the descriptors that were selected for each software package. We selected only 16 of the 4179 alvaDesc descriptors using WEKA's correlation-based attribute selection method. For PaDEL, we selected 13 of the 1444 descriptors using WEKA's information gain attribute selection method. Again using WEKA's information gain attribute selection method, we selected 12 of the 797 descriptors for T.E.S.T. None of the selected descriptors are shared between packages, except Gmax and MAXDP, which were selected in both PaDEL and T.E.S.T.

Model validation
A multitude of QSAR models was generated based on five machine learning methods, including BayesNet, SVM, kNN, J48, and RF, with three molecular descriptor packages: PaDEL, T.E.S.T., and alvaDesc. We performed attribute selection for each descriptor package before developing the QSAR models, using the InfoGainAttributeEval and CfsSubsetEval evaluators. After this step, the training and test sets were used to generate the models. Internal and external validations were performed to verify that the results of the models were accurate and meaningful. The internal validation was given by ten-fold cross-validation. The external validation was given by the predictive ability of the models on the test set. We used a known set of safe independent compounds to validate the performance of the models that were trained. All the performance results of the models were calculated in WEKA and given as weighted averages.

Performance of training dataset and external set
In our study, the top three models of reproductive toxicity were selected from the developed models. The selection was made according to the ACC, SE, SP, MCC, and ROC values of the models.
Based on the results of the internal and external validations, the top three QSAR models are TEST_kNN, PaDEL_RF, and alva_SVM. The information-based attribute evaluator (InfoGainAttributeEval) and the Ranker search method were used for the models TEST_kNN and PaDEL_RF. CfsSubsetEval and the BestFirst search method were used for alvaDesc (Table 3).
The performance of the top three classification models for the internal validation set was determined as ACC > 0.80, SE > 0.82, SP > 0.74, MCC > 0.598, and ROC > 0.79 (Table 4). According to the statistical results in Table 4, TEST_kNN was the best model for ten-fold cross-validation with an ACC of 83.82%. The alvaDesc model (alva_SVM) had the lowest ACC among the three models.
The performance of the top three models for the external validation set was determined as ACC > 0.88, SE ≥ 0.8, SP = 1, MCC > 0.78, and ROC ≥ 0.9 (Table 5). The statistical results suggested that TEST_kNN was the best among the three models, outperforming the other two with its high success rate of 94.11%. Another interesting finding is that the values of the models PaDEL_RF and alva_SVM are the same except for the ROC values (Table 5). This is important because the ROC values are used to determine the performance of the methods. The closer the ROC value is to 1, the better the model (Mandrekar 2010). Therefore, we can say that PaDEL_RF (0.929) is better than alva_SVM (0.900).

The descriptors in the top model
Attribute selection is a critical step for choosing optimal and significant features in machine learning modeling (Demisse et al. 2017). In our study, the most critical descriptors for the T.E.S.T. software were selected by the information-based attribute evaluator of WEKA. The descriptors selected by WEKA for our top model TEST_kNN are shown in Table 6. The descriptors selected from the PaDEL and alvaDesc software are shown in Supplementary File 2. Molecular descriptors are defined as mathematical representations of the properties of molecules created by algorithms. Their numerical values are utilized to quantitatively describe the chemical and physical information of the molecules (Chandrasekaran et al. 2018). The appropriate descriptors chosen for each model may differ depending on the structure of the molecules and the expected toxicological effect. For example, it has been shown that, among the prominent descriptors adopted in toxicity prediction applications, 'molecular property', 'connectivity', and 'topological' are the three most essential (Li 2020). Our study also included descriptors from the Chi Connectivity Indices, Electrotopological State Indices (E-State), Constitutional Descriptors, and Molecular Fragments sections, which contributed to creating the most successful model.
The two selected connectivity descriptors xch4 ($^4\chi_{ch}$) and xvch4 ($^4\chi^v_{ch}$) are in the Chi Connectivity Indices class. Leegwater showed a remarkably good correlation between the acute toxicity of some industrial pollutants and the molecular connectivity indices (Leegwater 1989).
The other descriptors in our study are SsssCH, SdO_acnt, SHsNH2, Gmax, BEHm5, and MAXDP, which are in the E-State Indices class. Among all the descriptors, the topological descriptors form a base class that encodes important structural parts that govern the toxicity or property/activity data of molecules. Kier and Hall established the E-state indices in the early 1990s to better reflect the critical topological aspects and chemical fragments that mediate a certain response. Because of their capacity to capture the electronic environment and topology of molecular components, E-state indices have long been used in QSAR/QSPR/QSTR research (Roy and Mitra 2012).
The identifiers nO and nR04 selected for our model are in the Constitutional group of the descriptor classes. It has been stated in the literature that these descriptors are quite suitable for toxicological prediction studies (Toppur and Jaims 2021).
In our study, the aliphatic nitrogen structure appeared as a structural alert. Jiang et al. argue that the existence of a double bond adjacent to nitrogen poses a risk of teratogenicity (Jiang et al. 2019). It was determined that nitrogen-containing aliphatic groups, aromatic groups, or sometimes both were found in the molecular structures of category C compounds. Among these structures, which have been shown to carry a toxicological risk, two different types of aliphatic nitrogen groups were among the descriptors selected in the model with which we achieved the highest success. These two significant descriptors were dCjgOkd_anitrogen__aliphatic_attachb and dNe_aaliphatic_attachb. The descriptors under the Molecular Fragments heading of the Molecular Descriptors Guide are descriptors with nitrogen-containing aliphatic bonds in their structures (https://www.epa.gov/). Many amine groups stimulate respiration, hypertensive response, and heart rate. Aliphatic and aromatic nitrogen compounds have been reported as a risky group in carcinogenicity, acute toxicity, sub-chronic toxicity, and chronic toxicity studies, and especially in reproductive and developmental toxicity studies (Kennedy and Dabt 2012).
The QSAR Modeling workflow is shown in Figure 2.

Comparison of different model performance
The results of the 15 binary classification models built with various molecular descriptors and different machine learning methods showed that the classification methods differed in their power to classify antibiotics into FDA pregnancy category B or C. The models with the highest prediction percentages were TEST_kNN, PaDEL_RF, and alva_SVM, in that order. The statistical results of PaDEL_RF and alva_SVM were the same except for the ROC values, so we compared these two models according to their ROC values. Since the ROC value of PaDEL_RF (0.929) was slightly greater than that of alva_SVM (0.900), PaDEL_RF had a better predictive ability than alva_SVM. TEST_kNN, in turn, had the best values among the three models. To compare the performance of the different models, the kappa values were also examined. As the kappa values in Table 7 show, there was an almost perfect agreement for the external validation set of the TEST_kNN model, while a substantial agreement was observed for the others. Moreover, TEST_kNN was the top model and showed the best predictive ability, with a success rate of 94.11%. Therefore, the TEST_kNN model was the most suitable for predicting whether antibiotics belong to category B or C. The MCC is essentially a correlation coefficient between the observed and predicted binary classifications. It has the same significance as the ROC. Considering the ACC, SE, SP, MCC, and ROC values, the TEST_kNN model is the most suitable for predicting safe use in pregnancy. Error rates for the training and external validation sets are shown in Table 8. The kNN classifier was powerful for our study. The objective of kNN is to predict the class of any new data point using the closest training instances in the feature space.
Overall, TEST_kNN is recommended for developing classification models to predict the pregnancy risk category of antibiotics.
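The statistics used in the comparison above (ACC, SE, SP, MCC, and kappa) can all be derived from a 2x2 confusion matrix, as the sketch below shows. The confusion-matrix counts are not given in the text. The values used here (a 17-compound external set with 9 true positives, 1 false negative, 7 true negatives, and 0 false positives) are an assumption, chosen only because they are numerically consistent with the external validation figures reported in the abstract. Which category is treated as the "positive" class is likewise an assumption, and the function name `binary_metrics` is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute ACC, SE, SP, MCC, and Cohen's kappa from confusion-matrix counts."""
    n = tp + fn + tn + fp
    acc = (tp + tn) / n
    se = tp / (tp + fn)  # sensitivity: fraction of positives recovered
    sp = tn / (tn + fp)  # specificity: fraction of negatives recovered
    # MCC: correlation between observed and predicted binary labels.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (acc - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    return {"ACC": acc, "SE": se, "SP": sp, "MCC": mcc, "kappa": kappa}


# Assumed counts (not stated in the text), see the note above.
external = binary_metrics(tp=9, fn=1, tn=7, fp=0)
```

With these assumed counts, the function gives an MCC of about 0.887 and a kappa of about 0.881, close to the reported external validation values. This shows how a single misclassified compound in a small external set shapes every statistic at once.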
Our three successful models, built on T.E.S.T., PaDEL, and alvaDesc descriptors, fulfill the first criterion for model validity stated previously: the Topliss ratio of 7.4 (97 chemicals/13 descriptors) is greater than five.
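The Topliss ratio check is simple arithmetic: the number of compounds divided by the number of descriptors, compared with a threshold of five. A minimal sketch follows, using the counts quoted above; the function name `topliss_ratio` is hypothetical.

```python
def topliss_ratio(n_compounds, n_descriptors):
    """Ratio of compounds to descriptors used in the model."""
    if n_descriptors <= 0:
        raise ValueError("n_descriptors must be positive")
    return n_compounds / n_descriptors


# Counts quoted in the text: 97 chemicals and 13 descriptors.
ratio = topliss_ratio(97, 13)  # about 7.46, reported in the text as 7.4
meets_criterion = ratio > 5    # threshold of five, as stated in the text
```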

Advantages/disadvantages of our mathematical models
The strongest aspect that distinguishes our work from other studies is its focus on a specific pharmacological group, antibiotics. The main issue evaluated by our model is the safe use of antibiotics during pregnancy rather than developmental toxicity. Related studies have focused on evaluating the reproductive/developmental toxicity of various molecules using large or small datasets (Arena et al. 2004; Jiang et al. 2019). In contrast to these mathematical models of reproductive/developmental toxicity in the literature, our model directly predicts whether members of a specific pharmacological group, antibiotics, can be used in pregnancy.
Moreover, unlike other studies that use a single descriptor source (Jiang et al. 2019), this research collected data from three different descriptor sources, which gave our study a very broad scope. These three sources provided a great variety of descriptor data and descriptor classes, so we were able to try different combinations in pursuit of high success. Collecting molecular fingerprints from three different sources also allowed us to compare the predictive power of the resulting machine learning models.
The dataset we created in our study includes all antibiotics in FDA pregnancy categories B and C that are currently in use. Although our dataset is not large, it is not possible to expand it further. Since all antibiotics in the specified categories were loaded into the system and learned by the model, we claim that our model has a very strong predictive ability for antibiotics whose pregnancy category has not yet been defined.
In this study, inorganic compounds, salts, and aromaticity in the molecule were excluded, so our models cannot give predictions for them. Moreover, those among these substances that have antibiotic effects and are currently used as active drug ingredients could not be evaluated.
Another limitation of our study is that we excluded antibiotics in categories D and X. Since these categories contain extremely few active substances and no diverse pharmacological groups, including them would degrade the balance of chemical structure diversity in the dataset and reduce model efficiency. For the model to learn molecular structures and thus make meaningful predictions, the number of molecules in the different categories must be comparable. Since there are very few antibiotics in categories D or X, these categories cannot be adapted into the model. The priority of our study was to evaluate whether an antibiotic can be used during pregnancy; therefore, the absence of these categories is not of primary importance. When planning our model, we focused on categories B and C, which contain the largest number of antibiotics on the market. To understand whether an antibiotic (not yet assigned an FDA pregnancy category) can be used in pregnancy, the pharmacological group of the drug must first be determined. Our model will work successfully if the determined pharmacological group contains drugs in categories B or C. In future studies, predictive strategies will be developed to address these limitations.
The fundamental clinical importance of our current model is that it gives an extremely quick answer on the use during pregnancy of antibiotic active ingredients for which developmental toxicology studies have not yet been performed. Our model will also speed up experimental research on an antibiotic's active molecule, as well as the clarification of its pregnancy category. As a result, questions about safe antibiotic use during pregnancy can be resolved more quickly.

Conclusion
All predictive QSAR theories assume that structurally similar compounds display similar biological activities. This assumption provides clues about toxicity before new molecules are tested in in vivo systems. In conclusion, this study developed robust and reliable prediction models. These promising results contribute to important issues such as medicine use during pregnancy and assist drug screening in early drug discovery. Our model is a first step that should be applied before animal experiments and invasive clinical studies in predicting the safe use of antibiotics in pregnancy. The most crucial point of our study is the presentation of a new approach as an alternative to invasive human studies, which raise ethical problems in pregnant women. In future studies, the improvement of non-animal-based in silico models and the development of new models will be critical for medical safety evaluation in terms of animal and human rights and ethical considerations.