Prediction of clearing temperatures of bent-core liquid crystals using decision trees and multivariate adaptive regression splines

ABSTRACT Accurate prediction of transition temperature is very helpful for the design of new liquid crystals (LCs) because even small changes in structure can dramatically alter the transition temperature, and therefore the synthesis of LCs should not be governed only by chemical intuition. A quantitative structure–property relationship (QSPR) study was performed on 243 five-ring bent-core LCs in order to predict their clearing temperatures using molecular descriptors. Decision tree and multivariate adaptive regression splines (MARS), techniques well suited for high-dimensional data analysis, were applied to select important descriptors (dimension reduction) and to generate nonlinear models. These techniques were applied both on two-dimensional (2D) descriptors only and on the pool of 2D and 3D descriptors (2&3D). The obtained QSPR models were tested using 15% of available data, and their performance and ability to generalise were analysed using multiple statistical metrics. The best results for the external test set were obtained using the MARS model created with 2&3D descriptors, with a high correlation coefficient of r = 0.95 and a root mean squared error of 7.41 K. All metrics suggest that the proposed QSPR model, generated by MARS, is a robust and satisfactorily accurate approach for the prediction of clearing temperatures of bent-core LCs. GRAPHICAL ABSTRACT


Introduction
In the field of new supramolecular functional materials, bent-core liquid crystals (LCs) have attracted much attention. The formation of new types of mesophases, the induction of supramolecular chirality using achiral molecules and the optical, ferroelectric and anti-ferroelectric responses of these materials among others, enable their promising applications. [1][2][3] To harness their application, these materials need to exhibit an LC behaviour at lower temperatures, thus the transition temperature has paramount importance and represents a target property for optimisation.
The estimation of transition temperatures of LCs has been frequently studied using different approaches in terms of chemical structure representation. For example, Schröder et al. [4] estimated the transition temperature of smectic liquid crystals directly from their structure using artificial neural networks (ANNs). They divided the molecule into different fragments and presented the chemical structure as a binary pattern obtained by encoding all fragments in a molecule. This modified group contribution approach has also been applied to the prediction of the clearing temperature of nematic liquid crystals. [5] An approach based on molecular dynamics simulation has been proposed and successfully applied by Berardi et al. [6] for the prediction of the transition temperature of nematic liquid crystals.
However, the prediction of transition temperature of LCs based on molecular descriptors using quantitative structure-property relationship (QSPR) methodology remains the most widely used approach. [7][8][9][10][11][12][13] Molecular descriptors code chemical information into mathematical form and a variety of them (constitutional, electrostatic, topological, geometrical and quantum chemical) have been used in combination with statistical techniques in order to establish structure-property relationships.
Many statistical techniques have been used in QSPR studies for the prediction of transition temperature of LCs, nevertheless decision tree (DT) and multivariate adaptive regression splines (MARS) have not been frequently applied even though they are simple, flexible and capable to deal with high-dimensional data. [14,15] 'Moreover, to our knowledge, MARS models have not been applied for the prediction of temperatures of any sort (melting, boiling, etc.), while Carrera and Aires-de-Sousa [16] showed that DTs are able to develop good predictive models for the melting points of ionic liquids.
In this study, the development and evaluation/assessment of QSPR models for the prediction of clearing temperatures of five-ring bent-core LCs (Figure 1) are described. These models were constructed separately for only two-dimensional (2D) descriptors and for a pool of 2D and 3D molecular descriptors calculated for 243 compounds. Dimension reduction was performed together with the model development using DTs and MARS algorithms. The performance of QSPR models was analysed using multiple statistical metrics.

Data processing
The bent-core compounds and their clearing temperatures (Table S1, Supplemental Material) used in this study were taken from literature (references S1-S18 presented in Supplemental Material). The dataset consists of 243 five-ring aromatic compounds with different linkage groups and their orientation, with a variety of substituents on the central and outer rings, and with different types and lengths of terminal chains. The set of these compounds has a wide clearing temperature range of 352.15-458.15 K. The molecular structures were optimised and the geometries of the minimum energy conformations were obtained using the MMFF94 optimisation routine (ChemAxon, Marvin [17]). Following this step, the calculation of the theoretical descriptors was performed using PaDEL-Descriptor software, [18] and a variety of constitutional, topological, geometric, electrostatic and hybrid descriptors were obtained. To reduce redundant and useless information, the descriptors with constant or near constant values were excluded and the reduced pool of 501 descriptors (360 2D and 141 3D) were used further on. DT and MARS models were created separately using 2D descriptors and the combination of 2D and 3D descriptors (2&3D), since 2D descriptor-based models may demonstrate better performance than the models containing 3D ones. [19][20][21] In addition, it should be considered that for the generation of 3D descriptors an extensive computational effort is required, especially in the processing of large datasets of compounds.

Decision tree
DTs are fast and efficient predictive tools for high-dimensional data. Introduced by Breiman et al. [14], it is a nonparametric statistical procedure for which no assumptions are necessary about the relationship between the input and output variables. The main advantage of DTs over classical linear techniques is their ability to manage a large number of variables. In addition, the graphical representation of the DT is easy to interpret, and it provides an insight into the influence of the defined inputs.
Among the several available DT algorithms, the Classification and Regression Tree (CART) methodology, suitable for either categorical or continuous variables, was employed in this study. CART is a method for the prediction of output variables based on recursively partitioning data into successively smaller groups using binary splitting procedure. [22] The partitioning procedure searches through all values of input variables to find the input that provides the best partition, i.e. that maximises the homogeneity of the two resulting groups with respect to the output. The final result is a tree diagram whose terminal nodes contain the mean output value. The input variables that appear in the upper nodes are more important and have more significant influence on the output than the input variables located at the lower nodes of the tree, which still need to be considered for a more accurate prediction of the output. [23] In general, CART analysis consists of three steps: (1) creating the maximal tree, (2) pruning the overfitted tree and (3) selecting the optimal tree. For creating the maximal tree, the splitting procedure is repeated until no further split can be performed, i.e. until all terminal nodes are homogeneous. Since the maximal tree is by definition oversized and overfitting, and therefore has inherent poor generalisation ability, a tree pruning procedure is necessary. This procedure is based on a cost-complexity parameter, which takes into account the accuracy and complexity of the tree. The result of the pruning procedure is a set of subtrees, which are obtained by successively eliminating the branches of the maximal tree. In the final step, the selection of optimal tree from the generated set of subtrees is often performed using cross validation.
In this study, the least-squared deviation (LSD) as a measure of homogeneity of terminal node was used [Equation (1) where N w (t) is the weighted number of cases in the node t, w i is the value of weighting variable for case i, f i is the value of frequency variable, y i is the value of output variable and y(t) is the mean value for the node t.

Multivariate adaptive regression splines
MARS, developed by Friedman [15], is an effective non-parametric regression technique suited for highdimensional problems. Similar to the CART method, MARS uses recursive partitioning, but it generalises the piecewise constant functions of CART to continuous functions by fitting multivariate splines. [24] In that respect, MARS avoids the CART's lack of continuity and tends to provide models with higher accuracy. [25] MARS develops flexible nonlinear models using separate linear regression slopes in distinct intervals of the independent variable space where the slope is changed from one point (knot) to the other. [26] A knot separates the truncated spline function into the left-sided [Equation (2)] and right-sided [Equation (3)] segment: Generally, MARS analysis consists of three steps: (1) forward stepwise addition of spline functions, (2) backward stepwise elimination of unnecessary spline functions and (3) the selection of the optimal MARS model.
During the forward stepwise addition procedure, the model selects the knot and its corresponding pair of spline functions (also called basis functions, BF) that give the greatest decrease in the residual sum of squares. The addition of BF proceeds until the performance of the MARS model improves. In addition, the MARS model takes into account interactions between the variables, and a higher order of interactions are considered if they enhance the model accuracy.
During the backward-pruning procedure, the BF that contribute least to the overall fit of the model are eliminated. This pruning procedure follows the generalised cross-validation (GCV) methodology, which takes into account the model accuracy and complexity, as is shown in the following equation: where N is the number of observations, andC M ð Þ is a complexity penalty function, which is defined as where C(M) is the number of linearly independent BF, M is the number of knots selected in the forward procedure, and d is the penalty for each basis function included in the model. Larger values of d decrease the model complexity (fewer BFs) and provide smoother function. In practice, d is increased during the pruning procedure in order to obtain smaller models, the best MARS model being the one with the lowest GCV value. Finally, the selection of the optimal MARS model is the third step, and it is based on an evaluation of the prediction abilities of different models. It is usually performed using cross validation or a new independent test set. A general MARS model can be represented using the following equation: where a 0 is a constant; a m is the coefficient of the mth BF; M is the number of BF; K m is the number of knots; x v(k,m) is the value of particular input variable; t km indicates the knot location; s km takes on values of either 1 or -1; [. . .] + denotes that only positive truncated spline functions are considered [Equations (2) and (3)].

Performance evaluation
The performance of models created in this study was evaluated using correlation coefficients (r), adjusted R 2 and Fisher (F)-test value. Four commonly used performance metrics, root mean squared error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and the percentage of predictions within a factor 1.05 (FA1.05) of the observed values, were also employed. The calculation of these metrics is performed using the following equations: where T P and T O are the predicted and observed temperatures, respectively. Because only a several descriptors are retained in a QSPR model, there is a possibility that a selected combination of descriptors gives an excellent fit without meaningful correlation between the input descriptors and QSPR response. To provide an estimate of chance correlation, the Y-randomisation test [27,28] has been used in this study. In order to randomise the output values without altering the variance, [29], the measured clearing temperatures were shuffled by 100 random exchanges in their positions with respect to the descriptors matrix, which remained unaltered. The resulting models obtained on the training set with the randomised T values should have significantly lower R 2 values than the real one because the relationship between the structure and response is broken. [30] Therefore, the risk of chance correlation was quantified by the value of R p 2 parameter computed from Equation (11) that reflects the deviation in the values of R 2 of the randomised (R r 2 ) and real QSPR models. For a QSAR model having R p 2 > 0.5, it might be considered that the model is not obtained by chance alone [31].
3. Results and discussion

Model building and performance
In this study, the DT and MARS models were created using the Statistica C&RT and MARSplines routines, respectively. [32] The final DT models were obtained using the fivefold cross-validation method, while the final MARS models were selected based on the GCV value. The dataset was randomly divided into the training set (≈85% of data) and external test set (≈15% of data). The DT methodology was applied separately to the pool of 2D and 2&3D descriptors, and, as can be seen in Figure 2, the two DT models differ only in five terminal nodes. Both DT models have 12 parent and 13 terminal nodes distributed over five levels. The list of 13 descriptors used in both DT models and their short descriptions are presented in Table 1. In order to determine the optimal order of interaction, the MARS models with different order of interactions (from 1 to 4) were created and compared according to the GCV value ( Figure 3).
All created MARS models had a restricted number of BF to the maximum of 30 and the GCV penalty of 2. For the 2D-MARS model, the second order of interaction was selected as optimal, while in the case of 2&3D-MARS, the first order of interaction had the lowest GCV value. The 2D-MARS model consists of 20 BFs that are generated using 14 different descriptors, while the 2&3D-MARS model uses the same number of descriptors (12 2D and two 3D) and has 19 BF. The list of BF with corresponding coefficients for each of the two MARS models is shown in Table 2, while the list of descriptors used is presented in Table 3. The values of performance metrics for both DT and MARS models are shown in Table 4.   Both DT models demonstrate very similar performance, as it could be anticipated, since they differ only in five terminal nodes and knowing that the descriptors in the lower nodes are less important than descriptors located at the upper nodes of the tree. Nevertheless, it should be noted that the 2&3D-DT model has slightly better performance in terms of RMSE, MAE and MAPE than the 2D model, while the FA1.05 has the same value in both cases ( Table 4).
As presented in Table 4, the MARS models are superior to the DT models in terms of all performance metrics, while the 2&3D-MARS model outperforms 2D-MARS. This indicates that MARS overall, has better generalisation ability than DT, and also points out the importance of molecular shape, which is presented with 3D descriptors, for an accurate determination of the clearing point of large molecules. Additional validation was performed with Y-randomisation ( Figure 4). For all models, the R p 2 value was higher than 0.5 (Table 4), which indicates the robustness of the developed QSPR model. The lower R p 2 values were obtained for the DT models, since from a sequence of pruned randomised DT models only the one with highest R 2 (i.e. created without pruning) was used for the calculation of R p 2 , regarding that predictive accuracy of randomised models is not important and to ensure that maximal chance correlation is captured.
The quality of predictions obtained using created models can be further analysed using Figure 5. It can be   seen that two predictions, in the case of both DT models, have relative errors higher than 5%, while the majority of predictions (50-56%) have relative errors in the range of 1-5%. These two compounds, namely 40 and 223 (see Table S1 in Supplemental Material) with relative errors higher than 5% can be considered as outliers. Compound 40 and structurally similar compounds 39 and 42 were placed in the same terminal node, and therefore for all of them the same value of clearing temperature (428.9 K) is predicted ( Figure 5). Considering that compound 40 has the lowest clearing temperature in the terminal node, it has exceeded 5% error limit.
In the case of compound 223, bulky iodine substituents at the peripheral rings hinder molecular packing, which results in a lower clearing temperature in comparison with corresponding fluorine substituent compounds. This effect was not captured by DT models, in contrast to the 2&3D-MARS model, which has predicted the clearing temperature of this compound with an absolute error of only 1.05 K. The 2&3D-MARS model was able to make accurate prediction because it uses gravitational index (GRAV-4), which is a measure of the bulk cohesiveness of the molecules. [33] For the MARS models, 97.2% and 100% (FA1.05) of the predictions have a relative error less than 5%, in the case of 2D and 2&3D model, respectively, and both models have given the majority of predictions (53%) with a relative error of less than 1%. The 2D-MARS model has made only one prediction with the relative error higher than 5% (in the case of compound 50) that can be considered as an outlier. As can be observed in Table S1 (Supplemental Material), this compound exhibits an unexpectedly high clearing temperature in comparison with its homologues from the same series, which caused the higher relative error.
Up to now, several models for the prediction of transition temperatures of two-and three-ring LC compounds have been reported in literature (Table 5). In general, the RMSE increases with the increase of the number of rings in the LC molecule, thus achieving a model with a similar RMSE for more complex (five ring) LCs seems to be a challenging task. Compared to the ANN model [7] used for the prediction of clearing temperatures of three-ring LCs, the proposed 2&3D-MARS model exhibits similar, or even improved performance.

Discussion of the selected descriptors
In both DT models, the root node with 207 molecules and an average temperature of 412.7 K was first divided into lower and higher temperature groups depending on whether the SC-3 value is greater or less than 2.56, respectively. The simple chi cluster of order 3 is a topological descriptor that has been frequently selected as an important descriptor in studies related to the prediction of melting, boiling and flash point, [34][35][36][37] since it provides a quantitative assessment of molecular branching, which presents results in the amplification of entropy and therefore in the reduction of clearing point.
Further splitting was performed using the 'Burden -CAS -University of Texas eigenvalues (BCUT) descriptors, whose significance is evident considering their appearance in the three splitting rules. The BCUT descriptors are the eigenvalues of the modified Burden matrix, and consider three classes of matrices whose diagonal elements correspond to atomic charges, atomic polarisabilities and atomic H-bond abilities. [38] They convey information about the connectivity and atomic properties, which are relevant for intermolecular interactions, and in this case they provide significant structural information needed for the prediction of clearing point. The BCUT descriptors have already been recognised as important descriptors for the prediction of boiling point, and it was shown that they are able to segregate the dataset according to the structure class. [39] Another group of topological descriptors used in DT models are those calculated from a Barysz matrix (VR2_Dze, VR2_Dzs and VE3_Dzm). The Barysz matrix is a weighted distance matrix that counts for the presence of heteroatoms and multiple bonds in a molecule. [38] One of these three descriptors located at the upper part of the tree (VR2_Dze) is weighted by Sanderson electronegativities, and therefore it encodes information about the distribution of electronic charge within the molecule. The other two are weighted by I-state (VR2_Dzs) and mass (VE3_Dzm), and in this case they are less important since they split the terminal nodes.
An additional topological descriptor used in both DT models is the molecular distance edge (MDE) descriptor, which encodes information about the structure in terms of the interaction between atoms in a molecule. Specifically, MDEO-12 describes the interaction between all primary and secondary oxygen atoms in the molecule. The MDE descriptors were found to be highly correlated with boiling points in a set of alkanes. [40] The path count descriptors (R_TpiPCTPC and WTPT-3), [41,42] which characterise molecular branching, also partition the nodes at the third level of the tree. R_TpiPCTPC is the total conventional bond order (up to the order 10) divided by the total path count (up to the order 10), while WTPT-3 represents the sum of path lengths starting from heteroatoms, and both of them contribute significantly to the prediction of the clearing point of bent-core liquid crystals. The nodes at the third level were further split by GGI5 and by the previously mentioned BCUT descriptors, with the node with 22 compounds (average temperature = 441.3 K) being the terminal node in the case of the 2&3D model, while in the case of the 2D model it was further split by the Mi descriptor ( Figure 2). The GGI5 descriptor, proposed by Galvez et al. [43], is a topological charge index of order 5, and it represents the total amount of charge transfer in the molecules, while the mean first ionisation potential (Mi) of a molecule is the amount of energy required to remove an electron from the molecule in the gaseous state. [38] The final (fifth) level is quite different for the 2D and 2&3D models, which share only two same terminal nodes obtained by splitting with the Barysz matrix descriptor (VR2_Dzs). Another split for the 2D model was performed by one additional Barysz matrix descriptor (VE3_Dzm). In the 2&3D model, at this level, 3D descriptors were introduced (E2m and L1u). E2m and L1u are WHIM descriptors related to the molecular accessibility and molecular size, respectively. The WHIM descriptors were developed in order to capture 3D chemical information (size, shape, symmetry and atom distribution), and therefore they are able to distinguish different molecular conformations. [44] They have been successful in the prediction of melting, boiling and flash points of halogenated aromatic hydrocarbons. [45]  The introduction of 3D descriptors in the final (fifth) level shows that 2D descriptors have a decisive influence for the prediction of clearing point, while 3D models contribute to 'fine tuning', thus enhancing somewhat the model accuracy. This is reflected in a slightly enhanced performance of the 2&3D model in terms of RMSE, MAE and MAPE values.
In the MARS models, the influence of selected descriptors on the clearing point cannot be unambiguously determined. The significance of each descriptor is more difficult to interpret owing to the complexity of the MARS equation, which consists of BF in the form of max(0; BF n ). Most of the selected descriptors belong to the descriptor groups used in DT models and they are previously described, while for the others more information can be found in associated literature. [38] Conclusion This paper presented the development and evaluation of QSPR models for the prediction of clearing temperatures of five-ring bent-core LCs. The dataset consisted of 243 LC compounds with a wide clearing temperature range of 352.15-458.15 K. The developed QSPR models were based on molecular descriptors obtained from either 2D chemical structures or both 2D and conformation-dependent 3D structures, for which the optimisation of geometry was performed. DT and MARS were employed as modelling techniques, and it was demonstrated that they are suitable for an easy screening of a large number of descriptors and for fast model development. The MARS model appeared to be more accurate than the DT models, wherein 3D descriptors contributed to the model improvement, thereby indicating the importance of molecular shape. The best 2&3D-MARS model is able to predict the clearing temperature with a RMSE of 7.41 K for the external test set, which classifies it as a generally applicable model. Compared to the model for the prediction of transition temperatures of three-ring LC compounds reported in literature, this model exhibits improved performance, despite the challenging task related to the use of much larger molecules with five aromatic rings.
In conclusion, the proposed models enable a simple, fast and practical method for the prediction of clearing temperatures of LCs that have not been synthesised yet, and achieve that by using widely available software tools.

Disclosure statement
No potential conflict of interest was reported by the authors.