A multiple linear regression approach to the estimation of carboxylic acid ester and lactone alkaline hydrolysis rate constants

ABSTRACT Pesticides, pharmaceuticals, and other organic contaminants often undergo hydrolysis when released into the environment; therefore, measured or estimated hydrolysis rates are needed to assess their environmental persistence. An intuitive multiple linear regression (MLR) approach was used to develop robust QSARs for predicting base-catalyzed rate constants of carboxylic acid esters (CAEs) and lactones. We explored various combinations of independent descriptors, resulting in four primary models (two for lactones and two for CAEs), with a total of 15 and 11 parameters included in the CAE and lactone QSAR models, respectively. The most significant descriptors include pK a, electronegativity, charge density, and steric parameters. Model performance is assessed using Drug Theoretics and Cheminformatics Laboratory’s DTC-QSAR tool, demonstrating high accuracy for both internal validation (r 2 = 0.93 and RMSE = 0.41–0.43 for CAEs; r 2 = 0.90–0.93 and RMSE = 0.38–0.46 for lactones) and external validation (r 2 = 0.93 and RMSE = 0.43–0.45 for CAEs; r 2 = 0.94–0.98 and RMSE = 0.33–0.41 for lactones). The developed models require only low-cost computational resources and have substantially improved performance compared to existing hydrolysis rate prediction models (HYDROWIN and SPARC).


Introduction
Regulatory organizations such as the U.S. Environmental Protection Agency (EPA) and the European Chemical Agency (ECHA) assess the potential environmental persistence of manufactured chemicals based on their physicochemical properties and transformation rates. Hydrolysis, broadly defined as a chemical reaction in which a molecule of water breaks one or more chemical bonds, can be an important environmental transformation process for dissolved organic contaminants. During the chemical registration process, manufacturers may supply data on the hydrolysis rates and products of a chemical under consideration. When measured rate constants are not available, predictive models can be used to estimate the hydrolysis rate constant based on molecular structure. Currently, there are two primary software applications with the capability to estimate hydrolysis reaction rates: HYDROWIN [1][2][3] and SPARC [4]. Both programs utilize the linear-free energy relations (LFER) approach for prediction of the reaction rates. SPARC is a commercial software program that uses chemical structure theory and perturbation theory to compute physicochemical properties and reaction rates. SPARC currently has the capability to predict rates of carboxylic acid ester and organophosphorus ester hydrolysis reactions. HYDROWIN calculates hydrolysis rates for a variety of hydrolyzable functional groups [2,3]. It is freely available as a module of EPA's Estimation Program Interface (EPI Suite TM ) [5,6], and predictive capabilities are based on a database of Hammett and Taft constants. If the fragment is not available for a particular substructure, similarity analysis must be employed to retrieve the most similar fragment available. Unfortunately, this kind of approach is expected to reach limits in predictive capabilities as molecule similarity diminishes. Also, HYDROWIN currently lacks the capability of: 1) predicting hydrolysis rates for compounds with cyclic hydrolyzable functional groups such as lactones and 2) adjusting hydrolysis rate predictions for the presence of ortho substituents on phenyl rings.
In addition to HYDROWIN and SPARC, several other QSARs have been developed that apply to carboxylic acid ester and/or lactone hydrolysis. Two recent papers report the development of general QSAR models to predict hydrolysis half-lives based on twodimensional molecular descriptors [7,8]. These models were developed to address various hydrolyzable functional groups including lactone and carboxylic acid ester. Khan et al. used a small number of variables (number of descriptors ranged from 4 to 6) for six models (pH 4, pH 7, and pH 9 each at 25°C and 50°C). The datasets contained 45, 63, and 68 organic compounds for pH 4, 7, and 9, respectively, at 25°C; and 27, 34, and 36 organic compounds for pH 4, 7, and 9, respectively, at 50°C [7]. Using a Monte Carlo method, Toropov et al. developed general hydrolysis function models using 58 compounds [8]. These general models were developed for broad application to molecules that are susceptible to hydrolysis; therefore, their accuracy would be expected to be inferior compared to chemical-class specific models.
In the case of lactones, there are no chemical-class specific QSARs applicable to predicting hydrolysis rate constants, to our best knowledge. However, there are a few chemical-class specific hydrolysis QSARs applicable to carboxylic acid esters in addition to HYDROWIN and SPARC. One recent paper reported a model with r 2 of 0.88 that was developed for predicting hydrolysis of carboxylic acid esters. The model dataset was composed of 130 compounds (taken from the SPARC dataset) and used an autoencoder method (deep learning) which requires SMILES and partial charges as an input for development [9]. Other QSAR models require quantum chemical codes for calculating descriptors. For example, Chaudry and Popelier developed a QSAR based on calculations done at five levels of theory (including AM1 for geometry optimization and four singlepoint calculations at higher levels for four descriptors: electron density, Laplacian of electron density, ellipticity, and local kinetic energy density) using 40 datapoints [10]. Another example is a paraben hydrolysis QSAR developed based on five descriptors, three of which are quantum chemical, including energy of highest occupied molecular orbitals [E HOMO ], molecular polarizability, and chemical potential [11]. Unfortunately, these QSARs require proprietary software and/or would require too much computational time for our intended purpose. This paper reports on the development of QSAR models for predicting hydrolysis reaction rates of carboxylic acid esters and lactones to be incorporated into the Chemical Transformation Simulator (CTS) [12], a web-based tool developed by EPA to provide predicted physicochemical properties, transformation pathways, and environmental transformation half-lives for organic chemicals. These QSAR models have an intuitive multiple-linear regression approach with descriptors associated with protonation (pK a ), charge (including electronegativity), Hückel analysis (charge density), and steric (topological and geometrical) parameters. In comparison with HYDROWIN and SPARC, improvements are made in modelling conjugated double and triple bonded alkyl, cyclic, alkyl aromatic, and t-butyl groups. We also present the first lactone-class specific QSARs for predicting hydrolysis rates. The end-goal for implementation is for users to be able to easily access and execute the QSARs with only chemical identifiers such as SMILES, CAS number, a chemical drawing tool, etc.

Hydrolysis mechanism background
The primary mechanism for alkaline hydrolysis (also known as base-catalyzed hydrolysis) of open and closed-chain esters is B AC 2, as illustrated in Scheme 4. This mechanism is similar to the S N 2 reaction resulting in a carboxylic acid and an alcohol after hydroxide ion attack on the carbonyl carbon. The resulting product(s) for carboxylic acid ester and lactone hydrolysis are shown in Scheme 1 and Scheme 2, respectively. Scheme 1 results in two products: an alcohol and a carboxylic acid. Scheme 2 results in a hydroxy carboxylic acid. The atom connected to the carbonyl carbon is required to be carbon to be considered a carboxylic acid ester. The other mechanisms include B AC 1, B AL 1, and B AL 2, which stand for unimolecular acyl-oxygen fission, unimolecular alkyl-oxygen fission, and bimolecular alkyl-oxygen fission, respectively [4]. Because it is sometimes difficult to distinguish between neutral and base-catalyzed hydrolysis, it should be noted that the neutral hydrolysis mechanisms shown in Scheme 3 may also apply (especially in case of lactones) [13]. Because B AC 1 and B AL 1 are not catalyzed, these mechanisms may likely There are very few cases where B AL 2 are well known [14]. Some of the best examples are hindered aromatic esters: methyl 2,4,6-tritertbutylbenzoate and ethyl 2-methyI-4,6-dit-butylbenzoate. These two compounds are cleaved primarily by alkyl-oxygen fission (B AL 2) rather than the usual acyl-oxygen fission (B AC 2) [15].
Cyclic or closed-chain esters are also known as lactones. Lactones have altered stereochemistry conformations. Unless the ring contains more than 8 to 10 atoms, the lactone compounds are usually in sterically hindered cis conformation [16]. In comparison to lactones, open esters are usually in a more stable trans conformation and are therefore generally less reactive towards hydrolysis, resulting in slower reaction rates [17,18]. In addition, the cyclic structure also limits the movement of alkyl groups and leads to entropic differences [18,19]. Besides steric strain restriction of small cyclic ring esters to cis configuration, and thermodynamic differences, the accelerated hydrolysis rates of lactones relative to straight-chain esters are also attributed to distortion of the ester group (loss of coplanarity) and change in the rate limiting step (or change in the mechanism) [19].
An example of change in the mechanism of alkaline hydrolysis is when an ionizable alkyl α-proton adjacent to a carboxyl is present on the cyclic ring leading to a ketene intermediate [20]. When the ionizable hydrogen is present, an elimination-addition (E1cB) mechanism process may take place instead of addition-elimination. There are various other examples of this mechanism for both cyclic and acyclic esters in the literature [21][22][23]. It has been reported that this mechanism enhances the rate [23].
Normally, steric strain in six-membered rings is less than in five-membered rings. However, the trend is reversed in the presence of exo double bonds [24]. Fivemembered (gamma-) and six-membered (delta-) lactones are relatively more stable and more common in nature than other lactones [25].
In aqueous environments, lactones may interconvert to the open-chain hydroxycarboxylic acid depending on pH. Because alkaline hydrolysis of lactones is commonly pseudo-first order with respect to the hydroxide ion, they tend to produce indistinguishable kinetic behaviour compared to neutral conditions. Both neutral and basic conditions lead to ring-opening hydrolysis. Acid-catalyzed reactions usually reverse the hydrolysis reaction to lactonization (lactone cyclization) [25]. Hydrolysis may also occur at similar hydrolyzable functional groups such as reduced forms of a lactone (sugars/polysaccharides). For example, it was reported that under acidic conditions, mutarotation mechanism (interconversion of anomers) is generally assumed to be the reason for opening of cyclic rings [26,27]. This hydrolysis mechanism usually occurs at the glycosidic linkage. Basecatalyzed hydrolysis was also shown to be possible as well at these glycosidic linkages with pH-dependent variations in hydrolysis mechanisms [28]. These mechanisms deviate from the traditional ring opening hydrolysis reactions. Because of all these possible variations that may interfere with the common mechanisms, compounds with glycosidic functions were not considered in our model development. However, glycosidic hydrolysis mechanisms may be considered in future developments.
It is not clear if lactones such as ellagitannins can be hydrolyzed by both acids and bases. Though both have been acknowledged, some reports conclude that they are stable under acidic conditions [29,30], although a multitude of transformation products have been reported under acidic and elevated temperature conditions due to hydrolysis and lactonization [31]. Conflicting reports on hydrolysis of alkaloids from other studies also indicate hydrolysis and lactonization are competitive reactions that are highly sensitive to experimental conditions [32]. Under basic conditions, numerous transformations can occur including hydrolysis and oxidation [29,30]. Reported hydrolysis rates under alkaline conditions may also include oxidation reactions that can take place on acidic lactones if only looking at the disappearance of the parent compound (products are not identified). For example, L-ascorbic acid undergoes oxidation at hydroxy groups during the degradation process [33]. It was reported for L-ascorbic acid that the rate of oxidation and hydrolysis are very similar [34]. It may be difficult to clearly identify the rate limiting step if other acidic lactones hydrolyze with similar reaction kinetics. Gibberellic acid decomposes to allogibberic acid via gibberellenic acid in solutions of pH 2-8 indicating that additional reactions may take place even after hydrolysis [35]. For lactones with stereocenters, the epimerization process has been demonstrated to occur before hydrolysis, which proceeds more rapidly under basic conditions [36][37][38]. Stereoisomers with trans-lactone configurations inhibit hydrolysis rates. Epimerization converts the structures into cis-lactones resulting in more rapid reaction rates [36].
Because multiple reports indicate lactonization is possible along with acid hydrolysis or mutarotation, we primarily included data from basic conditions (> pH 7). In this study, we focused on base hydrolysis rate constants because it is normally the dominant reaction pathway for both CAEs and lactones (has the fastest reaction rate) [12]. In fact, base hydrolysis is commonly used as a detoxification process for toxic lactones [39,40].

Data and screening
Experimental data were first collected from literature and curated. The following data was screened out: pretest data (screening level or preliminary testing from unverified sources), data points where testing conditions were insufficiently described (such as 1 st order rates without any specifications on pH or hydroxide ion concentrations), data points that may have a more dominant hydrolyzable functional groups such as glycosides (based on observations from our initial compiled data and previous report) [12], and data measured with potassium hydroxide (KOH) as a base which were found to have lower rates in general compared to those with sodium hydroxide (NaOH) [41,42]. For many chemicals in the CAE and lactone dataset, multiple reported rate constants were available, which helped in verifying rate data. However, our datasets included only a single rate constant for each chemical. For the CAE dataset, a single reported rate was selected for inclusion based on the most ideal experimental conditions. Conversely, a mean of the rate constant data was used for some of the lactone compounds with minimal to substantial deviation among the available rate data for each compound. Most of these datapoints were within one order of magnitude (only the datapoints for Warfarin spanned over one order of magnitude). The mean was taken as a precaution to avoid overfitting to lower quality experimental data. The rate constant data for carboxylic acid esters (CAEs) and lactones are provided in the Supporting Information (Tables S1 and S2).
OECD testing guidelines state that abiotic hydrolysis of chemicals in aquatic environments occur at pH 4-9 and that most temperatures encountered in the field are 10 to 70°C [43]. For this work, the most ideal experimental conditions for selected datapoints were set to 25°C, pH = 9, and minimal solvent concentrations. Observation of the full compilation of our data indicated the overall rate of hydrolysis were highest for alkaline conditions (thus pH 9 is the most relevant). The differences in the rates of lactone datapoints at around pH 7 and pH 9 indicated that for most lactones, the rate increases only minimally with increase in pH level. A previous compilation of data supports our observations [12]. Data was not corrected for the effects of temperature if conditions were in the temperature range of 20-30°C. For both esters and lactones, if temperatures were out of this range, corrections (to 25°C) were applied based on available activation parameters using the Arrhenius equation similar to the previous report [12]. If activation parameters were not available for lactones, an activation energy of 20 kcal/mol was used based on previous use from an EPA Report [44]. The value is based on the assumption that rates vary by a factor of 10 for each 20°C change in temperature [44].
While most of the rate data was measured in solutions that contained less than 10% solvent, due to the limited availability of rate data for lactones, the maximum percentage of solvent allowed was raised to 20%. Analysis of various solvent effects on hydrolysis rates showed that the rate is not significantly impacted within the limits set [45][46][47][48][49].
Disconnected structures including salts were stripped to achieve QSAR ready structures (only one case observed). Carboxylate ions were neutralized before performing charge density descriptor calculations. Previous QSAR studies have conducted similar curation strategies [50].

Descriptors
Molecular descriptors were selected to represent polar and steric effects on hydrolysis reaction rates. Reactivity of alkyl-substituted esters are primarily controlled by steric effects [51]; therefore, it is important to include steric parameters as descriptors in QSARs for the estimation of carboxylic acid ester hydrolysis rate constants. All descriptor values were determined using Calculator Plugins within modules of Chemaxon's Marvin and JChem cheminformatics platforms (Chemaxon LLC, Cambridge, MA, U.S.A.).
The connection between transition state (including geometries and activation energies) and rate prediction models has previously been established [9,52]. If transition state structure could quickly and easily be identified, and activation energy could be calculated with high degree of accuracy, that would enable us to efficiently develop rate prediction models. The electronic properties of the leaving groups are closely associated with the rate-determining transition state of many hydrolysis reactions [52]. Therefore, several descriptor types (pK a , electronic, topological, and geometrical) were estimated for both the starting compound and the final products.
Site specific (local) descriptors such as pK a , charge, charge density, steric hindrance (SH), and steric effect index (SEI) are key components to development of QSARs for estimation of hydrolysis rate constants. Additionally, global descriptors such as aromatic atom count and ring size-based descriptors were involved in development of both CAE and lactone rate prediction models. Other global descriptors included molecular polarizability, sum of orbital electronegativity (χ), energy (E), and dipole moment (μ).
In quantum mechanics, it is well understood that the wave function contains all the information that describes an atom or molecule(s), and that bond cleavage or formation is associated with changes in the electron density [53]. Based on these principles, quantum chemical descriptors are useful for characterizing the reactivity of any given chemical. The use of quantum chemical descriptors has previously been utilized in analyzing reaction rate constants between organic compounds [4,52,54,55]. Charge, charge density, the sum of electronegativity, and Merck Molecular Force Field (MMFF94) energy can all be considered quantum chemical descriptors. Partial charges and polarizability were previously observed to have correlation to pK a [56,57]. In other work, the active site (O=C-O) has previously been identified as a reactive centre and used for developing QSARs based on quantum chemical descriptors [4,10].
The active site identified from previous reports was thoroughly explored for finding optimal performing descriptors in this study. The examination of the mechanisms of reaction sites shown in schemes 3 and 4 support reason to study the active site (O=C-O). Each atom within the active site of carboxylic acid ester and lactone was studied by calculation of various descriptors available in Chemaxon PlugIn Calculators. These local descriptor calculations apply to the parent molecule. For the product(s), we also ran calculations on the carbonyl carbon and carbonyl oxygen of carboxylic acid group (-COOH) and the oxygen of the hydroxy group (-OH). All the local descriptors of the parent and product(s) are shown in Figure 1.

Model development
After screening and assembling datasets of measured rate constants and descriptor values, a stepwise multiple linear regression approach was used in building QSARs. Statistical data was obtained by regression in the Data Analysis Tool in Microsoft Excel, version 2202. By use of the bidirectional elimination approach, variables added to the regression were required to have p-values <0.05 as criteria to be considered for initial CAE and lactone models. At each step in the variable selection process, p-values were used for screening out descriptors with least significance. A second CAE model was derived from a dataset with six long-chain esters (ranging from C15 to C18) removed. It is expected that the long-chain esters may have higher degree of prediction errors due to experimental pressure conditions in addition to long alkyl chain lengths [58,59]. Therefore, long-chain ester hydrolysis experiments conducted under various constant pressure conditions were screened out to get CAE QSAR2 for more consistency in the experimental conditions of the dataset used for development. Data points with a deviation between the measured and estimated log (k) values greater than one were rechecked at the original source for any possible errors. For example, one report presented the incorrect name leading to error in prediction for cyclohexenyl acetate datapoint ID no. 14 of Table S1; reference no. 2 of the supporting information provided name 'cyclohexyl acetate' in hydrolysis rate data from the original source, but a different name was provided for the activation energy. The reason for rechecking for inconsistencies is based on the idea that QSAR models can sometimes make predictions which are more accurate than their training data [60]. This approach was used to avoid model overfitting due to low-quality data.
After the initial QSAR models were developed, the data from these models were retrained using the DTC-QSAR modelling tool (DTC-QSAR v1.0.6) to determine internal and external validation metrics and the applicability domain (AD), which was determined using a simple standardization approach [61]. The AD in this approach has been defined as the response and chemical structure space, characterized by the properties of the compounds in the training set. The DTC-QSAR models were developed after data splitting (80% training and 20% test) using the Kennard-Stone sampling approach, and the use of genetic algorithms. The Kennard-Stone sampling process enables the selection of samples with a uniform distribution over the predictor space [62]. Because of the small data sets (especially for the lactones), the Kennard-Stone sampling process was selected to maximize representativeness of sampling so that test sets are diverse.

Model performance evaluation
Model performance was quantified through correlation-based metrics, validation metrics, and error-based metrics. Calibration performance was assessed by correlation (r), the coefficient of determination (r 2 ) and adjusted r 2 (r a 2 ). Predictive performance was assessed by predictive squared correlation coefficient (Q 2 ). These include internal and external validation metrics. The internal validation metric for checking robustness was measured by leave the one out technique, Q 2 (LOO) . Some external validation metrics include Q 2 (F1) , Q 2 (F2) , and the concordance correlation coefficient (CCC) [63].
Values of Q 2 (F1) and Q 2 (F2) that are closer to being equal indicate that the test and training set means are close. Furthermore, the closer to equal implies that the test set used encapsulates the entire response domain of the model. CCC is a measure of correlation between two sets of data. It combines measures of accuracy and precision to determine the degree of deviation of the regression line from the perfect concordance line and the observation's distance from the perfect concordance line. Error-based metrics included root mean squared error/deviation (RMSE), standard error (s or SEE), mne (maximum negative error or largest underestimation), and mpe (maximum positive error or largest overestimation).

Results and discussion
Because of the many factors that can affect hydrolysis rates of long-chain esters, two separate models were developed for carboxylic acid esters: CAE QSAR1 and CAE QSAR2 shown in Table 1 (DTC QSARs shown only for model comparison). CAE QSAR1 was developed with an additional six long-chain ester datapoints of C15 to C18 (see datapoints 219-223 of Table S1). CAE QSAR1 has 15 total variables/descriptors including three that were not used in CAE QSAR2 -sum of pi electronegativity of the parent compound, molecular polarizability of the parent compound, and polarizability of the alcohol product. CAE QSAR2 only has 12 descriptors because the additional three descriptors become insignificant based on the p-value screening criteria when the six long-chain esters are omitted from the dataset. Data from the long-chain ester studies were conducted under variable pressure resulting in two forms of films: liquid expanded and condensed [58,59]. Interfacial reactions may vary from homogeneous solutions (expanded state) to condensed state by possible control of the true steric factor [58]. Some of the factors that Table 1. Values of the regression coefficient (Coeff.), standard error (±), significance level of the regression (p-value), and correlation (r) for the linear-regression analysis of carboxylic acid ester hydrolysis rate constants against various physicochemical and molecular descriptors for relevant models.

Variable
Coeff.  affect the rate of hydrolysis of long-chain esters include orientation and dispersion [58,59]. Multi-basic (dibasic and tribasic) esters orient at the interface, but monobasic esters behave slightly different. Pressure applied to monobasic esters has the tendency to force the side with the shorter chain length (less than five carbon atoms) underneath the surface resulting in protection of the ester group from hydroxide ion attack due to less surface area [59]. Generally, the increase in chain length increases the conformational degree of freedom and molecular complexity. As chain length increases, variability in the ability to predict hydrolysis rate is expected due to this increase in conformational degrees of freedom. This is because the more conformations lead to various transition states and activation energies [64]. The high degree of correlation between activation energies and base-catalyzed hydrolysis rates of CAEs has previously been reported [9]. A recent study has demonstrated how alkyl conformations can affect hydrolysis rates [65]. The Taft's 'steric hindrance of motions' or Tillett's 'entropic strain' descriptors seemed to explain the effect of alkyl conformations on hydrolysis rate [65]. The effect of long-chain alkyl groups can also be observed in Lactone QSAR2 rate dataset. Unlike CAE experimental data, there is no variable pressure effect, and the trend is clear. Experimental rate data from this dataset showed that after carbon chain length of C8, rates begin decreasing from slightly to more significantly as chain length increases from C10 to C14 (see ID no. 102-107 of Table S2).
Two models were also developed for the data set of lactone hydrolysis rates (Table 2). Lactone QSAR1 and Lactone QSAR2 (DTC QSARs shown only for model comparison) were developed primarily from six-member ring and five-member ring lactones, respectively. This is because most of the existing single hydrolyzable lactone function hydrolysis data is limited mainly to these two ring sizes. Three descriptors are included in both Lactone QSAR1 and Lactone QSAR2: pK a on the oxygen of the carbonyl of the parent compound (pK a OC=O), pK a on the carboxylic acid product compound (pK a COOH), and steric effect index on sp 3 alpha oxygen connected to the carbonyl of the parent compound (SEI *OC=O). Some of the other descriptors in the two models are similar. For example, the descriptor pair of largest ring size (LRS in Lactone QSAR1) and ring system count of size (RSCS in Lactone QSAR2) both consider some form of ring size. Another pair of descriptors take the charge on the alpha oxygen attached to the carbonyl of the parent compound (π q *OC=O of Lactone QSAR1 and q *OC=O of Lactone QSAR2) into account. The only difference in these two descriptors is that one of the descriptors takes account of both pi and sigma charge contributions rather than only pi.
For lactones, the main difference between the QSAR1 and QSAR2 models in Table 2 are that QSAR1 models have three descriptors not used in QSAR2: pK a on the alcohol product (pK a OH), lowest acidic pK a on the parent compound (apK a 1) and charge density on the oxygen of the carbonyl of the parent compound (ρ C=O*). QSAR2 models have one descriptor not used in QSAR1: Steric effect index on the carbon of the carbonyl of the parent compound (SEI *C=O). Table S4 of the supporting information indicate a strong correlation between pK a OH and π q *OC=O. Six member-ring lactones are capable of being aromatic, and five membered rings are not. This may explain the need for the additional descriptors in QSAR1 models, especially charge density which is highly dependent on conjugation of the carbonyl group. By comparing p-values of identical and similar pairs of descriptors, it is observed that the charge density and the pK a OC=O descriptors are most significant in Lactone QSAR1 and QSAR2, respectively. Contrarily, SEI *OC=O contributes the least of all descriptors in both Lactone QSAR1 and Lactone QSAR2. Table 2. Values of the regression coefficient (Coeff.), standard error (±), significance level of the regression (p-value), and correlation (r) for the linear-regression analysis of lactone hydrolysis rate constants against various physicochemical and molecular descriptors for related models.
p-value r

Variable
Coeff. (±) Coeff. The descriptor sets include calculated pK a values on the oxygen of the carbonyl of the initial parent compound and the sp 3 oxygen (-OH) on the two final products (leaving groups): carboxylic acid and alcohol. Correlation of the rate constants to each descriptor is shown in Tables 1 and 2. The models for both carboxylic acid esters (CAEs) and lactones show that the pK a of parent compound and/or product(s) consistently have one of the three highest-magnitude correlation values with the rate constant datasets (see Tables 1 and 2).
For both CAE QSAR models (1 and 2) in Table 1, the top four descriptors in order of highest to lowest significance based on p-values are pK a of the alcohol product, pK a of the carboxylic acid product, steric hindrance on the alcohol product, and steric hindrance at the carbon of the carboxylic acid product. Table S9 shows the experimental and model prediction rate constants for compounds with various alkyl chain lengths. The deviation in prediction data indicates that CAE QSAR1 performs significantly better than CAE QSAR2 on compounds containing chain length C9 or more (for chain lengths > C8), and CAE QSAR2 performs slightly better on compounds of chain length C1 to C8. Therefore, we recommend CAE QSAR1 for compounds that include long alkyl chains (of chain length > 8). For compounds with chain length 8 or less, CAE QSAR2 is recommended to give additional improvement with three fewer descriptors to calculate. Lactone QSAR1 applies only to six-and four-membered ring lactones, and Lactone QSAR2 applies to five-membered ring and all other lactones. Most of the correlation coefficients for pairs of descriptors for the carboxylic acid esters and lactone datasets indicate a low degree of correlation (see supporting information Tables S3-S5).
A total of 15 and 12 descriptors were used overall in the development of (a) CAE QSAR1 and (b) CAE QSAR2, respectively. Eight and six descriptors were used for developing (a) Lactone QSAR1 and (b) Lactone QSAR2, respectively. The diagram in Figure 1 summarizes the different descriptors used in all four models. Descriptors in red apply only to lactones and purple apply only to carboxylic acid esters.

Statistical performance
Subset models were appropriately developed and selected for combination (see discussion below). There are distinctions that should be noted in the datasets. CAE QSAR1 and CAE QSAR2 have overlapping experimental datapoints. The CAE QSAR1 model used include long-chain alkyl esters of C15 to C18 (n = 6) in addition to datapoints from QSAR2. The CAE QSAR2 model includes only esters with alkyl chain lengths of less than ten carbon atoms (n = 217). CAE QSAR* (combination of subset of the models) include subset of the long-chain alkyl esters of C15 to C18 (n = 6) from CAE QSAR1 and all other esters datapoints from CAE QSAR2 (n = 217). Contrary to CAE QSAR1 and QSAR2, Lactone QSAR1 and QSAR2 do not have any identical datapoints. Lactone QSAR* is simply a combination of Lactone QSAR1 and Lactone QSAR2. The four original models that have been discussed to this point were developed without splitting the data into training and test sets. Previous to the development of Lactone QSAR1 and QSAR2, a full dataset was used to develop Lactone QSAR-fulldata (results are shown in Table 3 only for comparison). After using DTC-QSAR to split the data into training and test sets (for validation), we get five additional recalibrated QSARs (CAE DTC-QSAR1, CAE DTC-QSAR2, Lactone DTC-QSAR1, Lactone DTC-QSAR2, and Lactone-fulldata DTC-QSAR). Combining subsets of QSAR1 and QSAR2 of CAE and lactone results in four QSAR models with combined datasets. These variations result in a total of 14 models. Statistical performance of only these four models with combined datasets and two models developed from full dataset are presented in Table 3. All other preliminary models from development can be found in Table S7. A flowchart is shown in Figure 2 to visualize how the different datasets were assembled and the relative size of each. Two preliminary models of Lactone-fulldata QSARs are not included here. Table 3 gives the statistical performance metrics of r 2 , MAE, RMSE, mne, and mpe for QSAR models with combined datasets, and comparable models. Overall, all the newly developed models presented here have a high level of correlation and low errors. Statistical performance data for the four QSAR models were provided based on QSAR1/QSAR2 combinations. Combining QSAR1 and QSAR2 subsets to a total set (combined) minimizes prediction errors. CAE QSAR1 was developed using 223 datapoints, and CAE DTC-QSAR1 was developed using 179 datapoints respectively, including long-chain alkyl esters of C15 to C18 (see ID no. 218-223 of Table S1). CAE QSAR2 and CAE DTC-QSAR2 were developed without longchain alkyl ester data of C15 to C18 (see ID no. 1-217). Predictions from CAE QSAR1 for longchain esters (C15 to C18) were combined with predictions from CAE QSAR2 for shorter chain esters (C1 to C8) to give CAE-QSAR. Similarly, long-chain predictions from CAE DTC-QSAR1 were combined with predictions from CAE DTC-QSAR2 for short-chain esters to give CAE DTC-QSAR. Overall, Table 3 shows slight statistical improvements with the combined CAE QSAR models (CAE QSAR* and CAE DTC-QSAR**) compared to separate models from Table  S7 (CAE QSAR1, CAE QSAR2, CAE DTC-QSAR1, and CAE DTC-QSAR2).  Lactone QSAR-fulldata (see Table 3) was initially developed using the full dataset, whereas Lactone QSAR1 and Lactone QSAR2 were developed after splitting the data into two sets based on ring size. Kaiser and Kézdy previously reported variations in lactone hydrolysis rates based on ring size [19]. Another report showed similar findings for lactams which further supports this trend [66]. Based on these papers, it was determined that splitting the data would give improved performance. Table 3 confirms that predictive performance is improved by combining estimated rate constants from models based on split datasets (Lactone QSAR* and Lactone DTC-QSAR**) compared to Lactone QSARfulldata (without splitting the dataset).
The QSAR models for carboxylic acid esters also show improvements compared to SPARC and HYDROWIN for our datasets with and without long-chain alkyl esters of C15 to C18 (see Table S7). The r 2 of our QSAR models are higher than SPARC and HYDROWIN, and the error metrics (MAE, RMSE, mne, and mpe) are all lower. SPARC and HYDROWIN also have the largest discrepancies between predicted and observed rate constants, with deviations over two orders of magnitude for both overestimates and underestimates. In comparison, our CAE QSAR models' largest discrepancies are within 1.5 orders of magnitude of the measured values. The largest discrepancies (between calculated and observed) identified for the CAE QSARs were for phenyl 3-phenylpropanoate and (2Z)-4-ethoxy-4-oxobut-2-enoate. These fall under chemical classes of alkyl aromatics and O − anion of Table 5. The largest discrepancies for Lactone QSAR* were less significant (within one order of magnitude).
Long-chain alkyl esters are considered to have carbon chain lengths of at least C8 [67]. Table 4 further shows CAE QSAR1 and CAE QSAR2 performance in comparison to HYDROWIN and SPARC for only long-chain alkyl esters (8 data points). CAE QSAR1 performs best in all the metrics presented. CAE QSAR1 and SPARC had the highest coefficients of determination (r 2 = 0.84 and 0.83, respectively), followed HYDROWIN (0.69), and CAE QSAR2 (0.62). The largest discrepancies for the eight long-chain alkyl esters appear to be overestimation by over an order of magnitude in all the models. Although SPARC has the 2 nd highest correlation, the MAE and RMSE values show lower predictive performance compared to the CAE QSARs. The CAE QSAR1 model developed shows great potential to be applied to these complex molecules (long-chain alkyl esters under pressure) despite the high degree of variability.
To examine the predictive performance of the CAE QSAR2, HYDROWIN, and SPARC for particular chemical classes of carboxylic acid esters, the dataset was divided into various Table 4. Performance statistics for long-chain alkyl esters (n = 8) for various models. The data points range from chain-lengths of C8 to C18. Bold font denotes the best performing model based on each metric. See Table S9 for reference on chemicals used for this analysis.   categories and performance metrics (correlation coefficients, MAE, and RMSE) were calculated separately for each chemical class ( Table 5). The various chemical categories of CAEs include formates, t-butyl groups, alkyl halides, alkenes, conjugated alkynes, cyclic compounds, aromatic R-groups, carboxylic anions (COO − ), parabens, and amines. Some of the groups were further divided into more specific subgroups: R1-aromatic groups which are conjugated with carbonyl (*C=O) of the ester and R2-aromatic groups, alpha-and beta-alkyl halides, and conjugated R1-alkenes. Some of the subgroups were further divided into even more specific secondary subgroups: R1-aromatics with nitro group substituents, R1-aromatics with only para substituents, R2-aromatics with ortho, meta, and para substituents, and alkyl halides substituted at alpha-and beta-positions. The results show that the correlation coefficients of the CAE QSAR2 was highest for almost all the chemical classes except for alpha alkyl halide esters and ortho and meta substituted R2-aromatics. Furthermore, a very high degree of correlation is observed for all the cases with r values ranging from 0.90 to 1.00. CAE QSAR2ʹs statistical errors including MAE and RMSE have the lowest values for all chemical categories except for formates, alkyl halides (beta-), meta-substituted R2-aromatic groups, and parabens (Table 5). Tables 4 and 5 indicate the CAE QSARs have better overall performance compared to other models in handling various types of chemical classes. Relative to HYDROWIN and SPARC, the most significant improvements are observed for cyclic CAEs, t-butyl CAEs, beta-alkyl halide, conjugated *C=O (includes R1-aromatic, R1-alkenes, and R1-alkynes), ortho-substituted R2-aromatic, COO − , paraben, and amine compounds (includes alkyl and aromatic amines).
It should be noted that alkyl halides are susceptible to hydrolytic dehalogenation reactions; therefore, molecules that include both a carboxylic acid ester and alkyl halide groups may form several hydrolysis transformation products. A previous evaluation of observed hydrolysis rates in the environment determined that carboxylic acid esters generally hydrolyze faster than alkyl halides under basic conditions, with reported hydrolysis rates on the order of days for carboxylic acid esters and months to years for alkyl halides [12]. Because the hydrolysis reaction is expected to be faster at the carboxylic acid ester site, the rate of hydrolytic transformation should reflect carboxylic acid ester hydrolysis. As is shown by the performance statistics in Table 5, HYDROWIN and SPARC generally underpredict the rate of hydrolysis of the three beta-alkyl halide esters in the dataset (based on negative correlation). In particular, for one highly fluorinated molecule (ID no. 184 of Table S1), SPARC and HYDROWIN underestimated the value of log (k b ) by factors of 4.12 and 2.75, respectively. In contrast, the CAE QSAR* model predicted the log (k b ) of molecule 184 within 0.12 log units, likely by better accounting for electronegativity effects, inductive effects, and steric contributions.
Conjugation can play an important role in hydrolysis rates. For example, conjugation with the carbonyl group of the carboxylic acid ester slows the reaction. This is evident in the CAE dataset. More energy is required to break these carbonyl bonds to form the tetrahedral intermediate from the hydroxide ion attack because of the increased stability resulting from conjugation. Substituents in conjugation with the reaction centre will have the most impact on reaction rates [68]. Specifically, it was shown that a nitro group introduced to the para position of the phenyl ring of 2-phenyl-4 H-3,l-benzoxazin-4-one accelerates the hydrolysis rate by a factor of 4.5. In the same position, introduction of a methoxy group slows it down by a factor of 1.3. The experimental data provided in the supporting information confirms these patterns for some similar systems (compare ID no. 103 to 126 and 133 to 134 of Table S1). For example, the log (k) values of methyl 1-naphthoate (−1.51; ID no. 103) is much lower than that of ethyl 4-nitrobenzoate (−0.13; ID no. 126). A similar pattern was also observed going from single nitro substituent (4nitrophenyl benzoate, ID no. 133) to two nitro substituents (2,4-dinitrophenyl benzoate, ID no. 124). Conjugated nitro groups in a molecule may possibly undergo hydrolysis at nitro groups under extreme alkaline conditions [69][70][71][72][73]. However, none of the models showed any significant deficiencies from possible interfering mechanisms probably because alkaline conditions were not extreme. In the papers reporting hydrolysis at nitro groups under extreme alkaline conditions (pH >9), most of the compounds had multiple nitro groups and none had carboxylic acid ester groups.
There are far fewer datapoints available for conjugated aromatics (R1-aromatics) especially for ortho-and meta-substituents. With 13 observed datapoints for parasubstituted R1-aromatics, Table 5 shows that the CAE QSAR* outperforms HYDROWIN and SPARC in R, MAE, and RMSE metrics. The six observed paraben datapoints can essentially be considered a subset of para-substituted R1-aromatics. The para-hydroxyl group of these datapoints can exist as either one of two microspecies (O − or OH) between pH of 7-9, according to the Chemaxon pK a model. The data shows that the CAE QSAR* has a higher value of r than HYDROWIN and SPARC, but SPARC has slightly lower values of MAE and RMSE than the CAE QSAR*.
Internal and external validation metrics were used to measure the predictive ability of the DTC-QSAR models in Table 6. The validation statistics in Table 3 (for the combined training and test set) show that the modelled predictions from DTC-QSAR** closely fit to the regression line with identical r 2 value compared to the Excel-based MLR models (CAE QSAR* and Lactone QSAR*). In Table 6, adjusted r 2 values are relatively identical to r 2 values indicating that the additional descriptors/variables do not skew r 2 measurements and that the models are precise and reliable.  Table S2 for more info on outliers). Outliers identified by standardization approach in DTC-QSAR may not be reliable since it does not consider inter-correlation between descriptors and does not consider relative contribution of descriptors (descriptors with lower contributions to model should not be as influential in determining the AD) [61]. The outliers that were identified by this approach have predicted value deviations of less than one order of magnitude relative to observed values. Internal and external error metrics across all QSARs in Table 6 display low values. SEE range from 0.41 to 0.52. Internal MAE based on leave-one-out approach ranged from 0.341 to 0.484. Internal RMSE ranged from 0.375 to 0.457. External MAE and RMSE ranged from 0.281 to 0.362 and 0.326 to 0.454, respectively. According to standard OECD principles for validation of QSARs, our models show an overall excellent performance [74]. Two test sets from DTC-QSARs were used to compare against HYDROWIN and SPARC as an additional external validation metric in Table 7. It is shown that both DTC-QSAR models outperform HYDROWIN and SPARC for both sets. It should be noted that only 23% of these datapoints are identical between both the test sets. These results further validate the external performance of our models. Figures 3 and 4 show regression plots of QSARs for carboxylic acid esters and lactones. Figures 3(a) and 4(a) plot logarithm of the rate constants [log (k b )] calculated using CAE QSAR* and Lactone QSAR*, respectively, versus the logarithm of observed rate constants from various experimental studies. Figures 3(b) and 4(b) plot log (k b ) calculated using CAE DTC-QSAR** and Lactone DTC-QSAR**, respectively, versus the same observed log (k b ) from various experimental studies. These DTC-QSAR models are derived based on recalibration with separate training and test sets. The data splitting process was done by DTC-QSAR using Kennard-Stone process (80% training and 20% test set). Overall, the models show data points are quite evenly distributed along the regression line. The logarithm of rate constants appears to range between −4.5 and 4.5 for both CAEs and lactones. Even when comparing the distribution of training and test sets, both appear to be fairly even in distribution across the model domain. The six long-chain alkyl esters' experimentally    observed log (k b ) values range from −0.56 to −4, while the other carboxylic acid esters' experimental log (k b ) values range from −4.09 to 4.11. Lactone QSAR1 covers a wider range of log (k b ) than Lactone QSAR2. The range of log (k b ) values for Lactone QSAR 1 is −3.70 to 3.53, whereas Lactone QSAR2 ranges for log (k b ) are approximately between −2 and 3.5 based on Figure 4(a). Although experimental data is limited in the lower range, the predicted rate constants closely fit to the regression line. Figure 5 shows a plot of calculated versus observed log (k b ) for carboxylic acid esters estimated by HYDROWIN, SPARC, and CAE QSAR*. Among the three models, the CAE QSAR* predicted rate constants are the most closely fitted to the perfect concordance line, and the slope of the CAE QSAR* regression line is closest to one. Figure 6 shows a plot of residual error against observed log (k b ) of carboxylic acid esters for HYDROWIN, SPARC, and CAE QSAR*. This plot shows that the HYDROWIN predicted log (k b ) values are overestimated in the lower range of observed log (k b ). The CAE QSAR* model also shows an overestimation of hydrolysis rates in the lower range, but to a lesser degree. SPARC rate predictions do not show bias but have the highest amount of deviation (positive and negative) within the midrange of rate constants. Figure 7 shows plots of residual error against observed log (k b ) of lactones by the Lactone QSAR1 and Lactone QSAR2 models. There is no evidence of bias across the range of observed log (k b ) values for lactones.

Conclusions
QSAR models have been developed and tested based on datasets of 223 carboxylic acid esters and 108 lactone compounds covering a log (k b ) range of −4.09 to 4.11 and −3.70 to 3.53, respectively. The stepwise regression approach has been demonstrated to be an effective method for building robust QSARs. Based on the p-values of the regression coefficients, the most significant descriptors for the CAE QSARs were pK a and steric hindrance. For the lactone QSARs, charge and charge density are more important for six-and four-membered rings, while other lactones show pK a and charge are most significant. Most of the predicted rate constants from CAE QSARs and all of the predicted values from the lactone QSARs are within one order of magnitude of the measured rate constants. The largest discrepancies identified for the CAE QSARs were for alkyl aromatics and parabens, but even for these molecules, the deviations between predicted and measured log k b values were less than 1.5. The analysis of the CAE QSAR* model performance demonstrates improved predictive capabilities over existing software tools for the estimation of CAE hydrolysis rates, particularly for long-chain alkyl, cyclic, t-butyl, beta-alkyl halide, ortho substituted R2-aromatic, conjugated *C = O, COO − , paraben, and amine compounds.
Additional measured rate constants for slowly hydrolyzable CAEs and lactones could be useful for improving model performance in the lower range. Many sources tend to report compounds as stable to hydrolysis after running hydrolysis experiments for 30 days. This makes rate data in the lower range harder to find. Exploration of additional molecular descriptors from other cheminformatic software could also improve model predictive performance.
This work has demonstrated a framework for developing hydrolysis rate constant QSARs, including models for cyclic functional groups which are currently not available. As a further extension to this work, we plan to use these models for addressing multi-ester compounds including phthalate esters (PAEs). Many PAEs are highly produced and are chemicals of toxicological concern. The capability of incorporating the effects of ortho substituents is important for this class of chemicals, especially since experimental determination of k b is difficult [75]. Additionally, we plan to use this framework to develop QSARs for other hydrolyzable functional groups to be implemented in the CTS application. We also intend to test the performance on compounds with multiple hydrolyzable functional groups of identical and non-identical classes.

Funding
The author(s) reported there is no funding associated with the work featured in this article.

Data availability statement
The hydrolysis rate constant data used to develop and evaluate the developed QSAR models has been made available in the Supporting Information and will also be posted on https://data.gov/. The DTC-QSAR tool was used to validate the models described herein. DTC-QSAR was developed by the Drug Theoretic and Cheminformatics (DTC) Laboratory within the Department of Pharmaceutical Technology of Jadavpur University, and the software tool is available at https:// dtclab.webs.com/software-tools.