Assessment of applicability domain for multivariate counter-propagation artiﬁcial neural network predictive models by minimum Euclidean distance space analysis: A case study (cid:2)

(cid:2) The concept of applicability domain (AD) in QSAR modeling is discussed. (cid:2) The AD assessment method for non-linear neural network predictive models is proposed. (cid:2) The counter-propagation artiﬁcial neural network (CP-ANN) was applied for modeling. (cid:2) Minimal Euclidean distance space (MEDS) of CP-ANN model was deﬁned and analyzed. (cid:2) The resulting outliers coincide with those from linear models (leverage based AD).


Introduction
In the past 40 years, the interest in the quantitative structure-activity relationship (QSAR) as a concept is undoubtly increased. Just in the period between 1975 and 2011, an incredible number of approximately 11,000 published articles observing the problem of QSAR can be found (Web of Knowledge), which data clearly indicate the emergence and usefulness of the developed QSAR models for solving a variety of chemical problems [1,2]. Beside these facts, the quality as well as reliability of the QSAR models must be also taken into account [3]. Therefore, the OECD principles for validation of the QSAR models for regulatory purposes, clearly stated that a model should be used within the boundaries of its applicability domain (AD) [4]. According to the Setubal's Workshop guidance and acceptability criteria for application of QSAR models for chemical management purposes, the AD of a QSAR model is defined as a "physico-chemical, structural, or biological space, knowledge or information on which the training set of the model has been developed, and for which it is applicable to make predictions for new compounds. The AD of a QSAR should be described in terms of the most relevant parameters, i.e., usually those that are descriptors of the model" [5].
Depending on the modeling strategy utilized, several approaches for assessing the AD were developed [6] that can be applied to a particular type of QSAR model (The Report and Recommendations of ECVAM Workshop 52) [7]. A comprehensive description of these approaches can be found in the review articles by Jaworska and Nikolova-Jeliazkova [8,9]. As described there, the methodologies for assessment of the AD of QSAR models are classified in four major categories: (1) range-based methods, (2) geometric methods, (3) distance-based methods, and (4) probability density distribution methods.
Distance (similarity) metrics are very frequently used in QSAR for solving different chemical problems [10][11][12] and their practical utilization in the model's AD evaluation is recently described in the literature [13,14]. Among the various distance metrics for model's AD assessment (e.g., Euclidean, Mahalanobis, Manhattan, Hotelling T 2 , and Leverage), the Euclidean distance (ED) metric is one of the most commonly utilized. In its simplest form, it can be defined as an "ordinary distance", i.e., a line connecting two points A and B defined by their two-dimensional coordinates A(x 1 , x 2 ) and B(y 1 , y 2 ). This definition is mainly related to two-dimensional space, i.e., Euclidean plane [15]. However, the majority of the statistical methods used in the QSAR modeling (including artificial neural networks) deal with the problem of modeling multidimensional data and therefore the ED between a query point (e.g., a training/test/external validation set compound) and the centroid of the model can be determined within the framework of a more complex high-dimensional Euclidean space [16,17]. The query object n i (e.g., a compound from the training/test/external validation set) is defined as a vector of independent or input variables (e.g., molecular descriptors), whereas the model's centroid is usually defined as a multidimensional averaged vector obtained by averaging the input vectors for all objects representing the compounds in the dataset.
In non-linear modeling methods such as artificial neural networks (ANNs), there are two general strategies for determining the activation signal of a neuron, by either calculating the dot product (vector scalar product) between the input vector and the weight vector (synaptic efficacy) for each neuron, or by determining similarity (Euclidean distance) between the input vector and the neuron in the Kohonen type of neural networks. In both cases, the ANNs algorithm computes the so-called net input, which procedure runs repeatedly for all the neurons constructing the network. In the learning strategy of the Kohonen or self-organizing type of neural networks, the neuron with the maximal output of the dot product or the minimal ED is chosen as the "winning" or "central" neuron [18,19]. This iterative procedure (i.e., learning of the network) results in calculation of the minimum EDs for each input object (compound) to the "central" neuron and construction of a so-called "minimum ED space" (MEDS) [20]. Therefore, the model's coverage, i.e., the boundary to which a model is applicable could be simply defined by selection of the training set object with maximal value for ED within the MEDS (Fig. 1).
Comparing to some recommended distance-based methods widely used for solving the AD problem for linear models (e.g., the well known leverage approach [3]), a very small number of publications observing the problem of distance-based AD estimation for non-linear models can be found in the literature [21][22][23]. In a comprehensive comparative study published recently, Fjodorova et al. demonstrated a successful utilization of ED metrics for estimation of AD in case of non-linear ANNs-based classification models [24].
This study presents an effective methodology for graphical assessment of AD for non-linear ANNs-based prediction models (descriptor space vs. model response space) -counter-propagation artificial neural network (CP ANN) models by taking into account the so-called "minimum ED space" (MEDS) as a function of the total number of objects (compounds from the training/test/external validation set), coupled with a standard residual analysis. We focused on a detailed explanation along with a graphical depiction of the AD for three pre-built and validated CP ANN models, as well as on a thorough in-depth assessment of the detected outliers. Furthermore, the same case studies were approached by the partial least squares (PLS) method and the leverage-based AD assessment of the models was performed for the purpose of comparison of both methods.

Data and computational methods
In order to define and assess the practical applicability for a given non-linear ANNs model utilizing the MEDS methodology, a developed and validated ANN model must be available. For these purposes we used our previously published CP ANN predictive models developed for three different datasets. The data, modeling strategies utilized as well as the CP ANN prediction models developed are already elaborated in details [25][26][27], and therefore we give here only a short abridgement. The biological assay data (a total of 88 compounds) which belong to different chemical classes (nucleobases, nucleosides, nucleotides, and various endogenous molecules, dyes and drugs) were used as an initial data source for development of the in silico CP ANN prediction model. The model was built to estimate the transport activity of the transmembrane protein bilitranslocase for a large set of diverse endogenous compounds and xenobiotics (Table 1) [25].
The biological activity values were known for 81 compounds. For the majority of them (a total of 75 compounds) the inhibition constants (K I [mmol L −1 ], expressed in logarithmic units as pK I ) were experimentally determined [25], while for the rest 6 compounds the inhibition constants were obtained from the literature [28][29][30]. For the remaining 7 compounds (which are part of the total 88 compounds) the experimental assays are not yet finished.
The effect of 75 compounds on bilitranslocase transport activity was evaluated spectrophotometrically in rat hepatical plasma membrane vesicles. The initial rate of the bromosulphophtalein Table 1 List of all 88 compounds (Dataset 1) separated as training, external test, and external validation set, together with their activity classification code (ACC), the experimental/predicted inhibition constants (pKI-exp, pK I-pred ), as well as their calculated Euclidean distances (ED) [25]. (BSP) passage into vesicles (called electrogenic BSP uptake) was used as the measure of bilitranslocase transport activity [28,[31][32][33][34].

ID
The experimental assays performed, successfully identified 28 compounds as active (of which 25 compounds as competitive ("C") and 3 compounds as non-competitive inhibitors ("NC")), and 53 compounds as inactive ("I"). For all compounds determined as inactive ("I"), a pK I-exp value of (pK I-exp = −2.000) that corresponds to a hypothetical concentration of 100 mmol L −1 was assigned (Table 1) [25]. The geometry of all 88 chemical structures was initially optimized by using AM1 semi-empirical procedure [35], whereas Kohonen artificial neural network (KANN) [36,37] was employed for dataset division (Table 1) into training set (45 compounds), external test set (10 compounds), and external validation set (33 compounds). The training set was additionally split into an internal training set (Table 1; 35 compounds indexed by b) and internal test set (Table 1; 10 compounds indexed by c) for the optimization of the ANN parameters [25]. Counter-propagation artificial neural network (CP ANN) [38,39] which can be considered as an extension of KANN, was exploited for development of the 14-parameters CP ANN predictive model (R 2 = 0.89). The model was built using the internal training set, optimized with the internal test set (Q 2 te/F3 = 0.89), as well as externally validated using the external validation set (Q 2 ext/F3 = −1.83) which was not used during the model development [25]. The model's predictive ability quantificators, were calculated according to the following equations (Eqs. (1) and (2)) [40,41]: where n tr designates the number of training set objects, whereas n (te/ext) is the number of test set, i.e., number of external validation set objects, respectively. Here we would like to stress that the Q 2 parameter as defined by (Eq. (2)) might be negative in case of large error in prediction of the external validation set compared to the range of the target values in the training set [40]. For the purpose of this study, an additional PLS model (R 2 = 0.78, Q 2 te/F3 = 0.47, Q 2 ext/F3 = 0.15) was developed by using the same dataset division as well as the number of molecular descriptors, as described previously.

Dataset 2
The second dataset consists of 59 trypsin inhibitors, initially selected from the Brookhaven PDB database, that were experimentally tested for their inhibitory potency (pK I ) against the trypsin  [47]. (a) Determination of ED between two points A and B in a simple two-dimensional Cartesian (Euclidean) space, defined by their two-dimensional Cartesian coordinates. (b) Mapping of multidimensional input data (e.g., compounds) encoded as multidimensional descriptor vectors onto two-dimensional grid of neurons (Kohonen map) and determination of the minimal ED (the red line, min(W j − Xs) → ED j ,s) for each input object to the "central" neuron (Wc,s) [38]. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) enzyme. All inhibitors are heterogeneous, non-covalently bound to the same site on the enzyme surface. The dataset, modeling strategy applied, as well as the CP ANN prediction model are described in details in our previous work [26]. The best model

Fig. 2. Euclidean geometry and calculation of EDs between objects
.85) was developed by using 97 molecular descriptors, while 26, 15 and 18 compounds in the training, internal test, and external validation set, respectively, were used for its modeling and validation. Furthermore, the same dataset division was used for building two additional QSAR models taking into account a reduced number of molecular descriptors (7 out of total pool of 97 molecular descriptors were selected) for the purpose of this study; CP ANN (R 2 The descriptors reduction here was performed explicitly to avoid obtaining an ill-conditioned diagonalized matrix [42] constructed of extremely large diagonal elements (such as the leverage values, h ii ) which are usually impractical for work and further interpretation (e.g., identification of structurally-influential outliers).

Dataset 3
The third dataset includes the retention factors (k) of ionized inorganic and organic acids measured on ion-exchange analytical column (IonPac AS18, Dionex) at 25 mM eluent (KOH) [27]. CP ANN and PLS models were built to predict retention factor of ions from molecular structures of their corresponding acids. Starting with a pre-selected pool of total 64 molecular descriptors, the best CP ANN model was build (

From trained network to definition of the "minimum Euclidean distance space"
No matter which modeling strategy (linear or non-linear) will be utilized for developing of a QSAR model for chemical management purposes, two substantial things must be taken into account: (1) chemical structures must be represented numerically in a form of molecular descriptors, and (2) a defined experimentallydetermined endpoint (e.g., pK I for Dataset 1 and 2, or retention factor k for Dataset 3) must be available for each compound. Since the molecular descriptors are unique numerical values obtained by quantification of various structural and/or physico-chemical properties of the molecule, they could be successfully used not only for prediction of the endpoint values of unknown compounds, but also for performing similarity measurements between two or more molecular structures. For those purposes, different distance (similarity) metrics are available (e.g., Euclidean distance (ED), Mahalanobis distance, Manhattan (City-block) distance, etc.) [13,14,43], which can be exploited in several different ways [44][45][46][47]. Among them, the Euclidean distance (ED) metric is one of the most widely used metrics for the determination of similarity. The most simple form of ED could be defined as a distance between two points in a two-dimensional Cartesian space (known as Cartesian or Euclidean plane) [15]. Therefore, if A and B are two points defined by their two-dimensional Cartesian coordinates A(x A1 , x A2 ) and B(x B1 , x B2 ), respectively (Fig. 2a), then the ED between them can be simply calculated using the following equation (Eq. (3)) [48]: The smaller the ED between the points A and B is, the higher the similarity between them is. Although, this similarity-measurement concept is very simple, it can be effectively used for solving more complex problems. The QSAR modeling methods are constructed for handling large amount of data describing a so-called multidimensional data space, where each chemical structure is usually encoded in a form of a multidimensional descriptor vector. Consequently, the similarity measures between the objects must be calculated in such complex space according to the dimensionality of the descriptor vector.
In order to simplify the problem of complexity, a data compression (mapping) from the multidimensional space to a space of lower complexity is required, which is the basis of the Kohonen artificial neural networks (KANN) [36,49,50]. KANN provides a so-called self-organizing maps (SOMs) which are not comparable to the Cartesian space mentioned previously, since the metrics are not conserved. It is conserving the topology of the multidimensional input data (usually multidimensional descriptor vectors) within a two-dimensional network of neurons (as a result of unsupervised competitive learning), which are captivated with objects in a way where similar objects are situated on same or close neighbor neurons [38].
The mapping of the objects progresses iteratively over a nonlinear algorithm (known as training of the network), which is mainly based on object-similarity determination through utilization of ED metrics. It can be literally described as "winner takes all strategy", which means that for each multidimensional input vector, only the most excited neuron (or the so-called "winning" or "central" neuron) is selected and all the corrections are performed around it. The learning algorithm selects the "central" neuron according to the minimal ED calculated between the multidimensional input vector X s (x s1 , x s2 , x s3 , . . ., x si , . . ., x sm ) for each input object (e.g., a compound from the training set) and all the weight vectors W j (w j1 , w j2 , w j3 , . . ., w ji , . . ., w jm ), using the following equation (Eq. (4)) [38]: where N net describes the total number of neurons constructing the network, while W c,s designates the selected "central" neuron to which the object s is the most similar ( Fig. 2b). At the end of the learning process (developing of the model), every object is assigned to its "central" or closest neuron; there are as much minimum EDs calculated, as is the total number of input (training) objects, which together define a so-called "minimum ED space" (MEDS) [20]. Since, each training compound used for building of the ANN model is characterized with its unique ED, the MEDS concept could be effectively used for assessing of the model's applicability domain (AD) which boundaries can be simply defined by selection of the maximal ED (hereafter named as ED crit ) within the MEDS (Fig. 1). For the purpose of assessing the AD, the MEDS boundaries should be extended to the internal test compounds involved in the model parameters optimization (internal test set distances). Consequently, each object (e.g., a compound from the external test/external validation set) entering the trained network (model) with calculated ED greater than ED crit , can be considered as it is outside of the model's AD, and vice versa. The distances in MEDS depend on the dimension of the descriptor vectors entering the network. In order to unify their scale, they should be normalized by division with the total number of molecular descriptors used for modeling. Hereafter, normalized values are considered for ED and ED crit .

Descriptor space versus model response space
While the MEDS concept is carrying only the structural information (encoded in a form of molecular descriptors) extracted from the trained Kohonen map for the compounds used in the QSAR study, it can be applied only for assessment of the possible outliers within the compound's descriptors space. Although, these data could be extremely helpful for detection of the structurally-influential outliers that could affect the quality of the QSAR model (e.g., a two dimensional plot of the EDs vs. total number of compounds), they do not tell anything about the accuracy of predictions, and therefore the AD of the model would be only partially defined. To alleviate this problem as well as to define as much as informative domain of model's applicability, the model's predictability information (model response space) must be additionally incorporated into the AD definition.
Contrary to KANNs which are constructed of only one layer of neurons, the counter-propagation artificial neural networks (CP ANNs) contain an additional layer located exactly below the Kohonen layer, and therefore it can be regarded as an extension of KANN [37,38,50]. This additional layer is trained in exactly the same manner as the Kohonen layer, i.e., it receives the vector of target (dependent) variables T s (e.g., experimentally-determined values of K I ), and as a result of the learning procedure controlled by the Kohonen network above, outputs their predicted (response) values which reflect the model's response space (Fig. 1) [51]. Taking these additional information into account, the AD of the CP ANN predictive model could be defined in a more informative way through expression of the MEDS of the model as a function of the standardized residuals.

Partial least squares model development
In order to assess the performances of the MEDS-based AD method developed for non-linear CP ANN models with a well known leverage approach [3] used in multiple linear regression or partial least squares (PLS) regression modeling, the linear models were built with the same data as non-linear ones. PLS models were developed taking into account the same settings (number and type of molecular descriptors as well as the same training/internal test/external validation set division) as described previously for the neural network models (see Sections 2.1.1-2.1.3).

The hat matrix and influential observations -leverage approach
Analogously to the MEDS-based AD approach defined previously, the leverage approach which is usually applied for regression diagnostics (outliers identification) in case of linear models (e.g., MLR, PLS) [52,53], was additionally included in this study. Although, both methods are substantially diverse with regards to the difference between the distance metrics on which are based, they were jointly utilized just for comparative purposes.
In the linear regression it is often very useful to determine the influence of a given y i value (e.g., experimentally-determined endpoint values such as K I ) over each predictedŷ i value. Contrary to the relationship between y i andŷ i , which interpretation can be easily performed through implementation of a simple residual analysis (y i −ŷ i ), the influence of the independent variables x i (e.g., calculated molecular descriptors) on the model might be difficult to determine. The solution to this problem lies into the so-called hat matrix H [54,55], which maps the vector of observed values to the vector of fitted values. It is an n × n symmetric matrix which diagonal elements h ii (known as leverage values) directly reflect the structural influence of a compound to the values predicted by the model (i.e., a distance metric which shows how far a compound is from the model experimental space) [53].
Therefore, for a given set of compounds encoded in a form of calculated molecular descriptors, one can solve the hat matrix H through an implementation of a simple set of matrix algebra rules [54,55]. Let us X tr be a multidimensional n × m matrix carrying the structural information m j (calculated molecular descriptors) for each training set compound n i separately. The calculation of the hat matrix H of the original X tr matrix requires several consecutive steps: (1) generation of a transposed version (X T tr = m × n) of the original descriptor matrix X tr .
(2) multiplication of the transposed matrix X T tr and the original matrix X tr : (X T tr X tr ). (3) inversion of the product matrix (X T tr X tr ) obtained by multiplication in (2): (X T tr X tr ) −1 .
(4) multiplication of the original X tr , the inverse (X T tr X tr ) −1 , and the transposed matrix X T tr .
The obtained result is an n × n symmetric matrix (H, hat matrix) [54] which maps the y i,tr values intoŷ i,tr (Eq. (5)): H tr = X tr (X T tr X tr ) −1 X T tr →ŷ tr = H tr y tr ↔ŷ tr = X tr (X T tr X tr ) −1 X T tr y tr (5) whereŷ tr and y tr are the predicted and experimentally-determined endpoint values for the training set compounds, respectively, whereas H tr designates the hat matrix for the training set which diagonal elements (h ii,tr ) describe the distance of each training set compound to the structural centroid of the model [53]. The calculation of the hat matrix as well as the extraction of the diagonal elements (leverages) for any test/external validation set compound entering the model, can be performed as for the training set described above, but slightly modified (Eq. (6)): (1) generation of a transposed version (X T te/ext = m × n) of the original descriptor matrix of test data X te/ext . (2) generation of a clone version of the inverse matrix solved above for the training data: (X T tr X tr ) −1 .
(3) multiplication of the original X te/ext , the inverse (X T tr X tr ) −1 , and the transposed matrix X T te/ext .
H te/ext = X te/ext (X T tr X tr ) −1 X T te/ext →ŷ te/ext = H te/ext y te/ext ↔ŷ te/ext = X te/ext (X T tr X tr ) −1 X T te/ext y te/ext (6) whereŷ te/ext and y te/ext are the predicted and experimentallydetermined endpoint values for the test/external validation set compounds, respectively, whereas H te/ext designates the hat matrix for the test/external validation set which diagonal elements (h ii,te/ext ) describe the distance of each test/external validation set compound to the structural centroid of the training set (model) [53]. One of the recommended hat-based methods for AD investigation in case of linear QSAR models is the widely known leverage approach [3,53,56]. The method offers a graphical assessment of the leverage values (h ii ), as a function of the standardized crossvalidated residuals (Williams plot) [53,57] and it is suitable not only for detection of the structurally-influential outliers, but also for determination of the response outliers. The model predictions should be referred as unreliable for those compounds for which h ii diagonal elements are greater than the cut-off leverage value (h*). These compounds are located far from the structural centroid of the model, and therefore could be referred as structurally-influential outliers, too. The cut-off leverage value (h*) is usually defined as (Eq. (7)): where p is the total number of descriptors used for developing of the QSAR model, while n is the total number of the training set compounds. Moreover, the compounds for which the calculated standardized residual values are greater than three standard deviation units (>±3 ), could be considered as response outliers.

Results and discussion
No matter how good (robust, meaningful, and validated) a developed QSAR model could be, its inherent limitation is data-driven correlation pursuit: it cannot be expected to give reliable predictions for the property under investigation for the compounds out of the domain of chemical space defined by the training data. In order to be used for chemical management (screening) purposes, the QSAR model should have a clearly defined domain of applicability, and therefore the property estimations for only those compounds situated within the model's AD boundaries can be considered reliable [3][4][5]. Among the various methods available today for AD estimation of established QSAR models, the distance-based approaches (e.g., ED-based or leverage-based) proved to be particularly useful not only in the case of linear models [53], but also for non-linear models such as artificial neural networks [24]. The following study involves a simple and effective approach for AD estimation of a non-linear QSAR models [25][26][27] developed using CP ANN modeling method through the utilization of the MEDS concept elaborated in Section 2.2.

Dataset 1
In order to demonstrate the practical significance of the MEDS concept for graphical assessment of the AD (Fig. 1), a developed and validated predictive CP ANN model was used. As described previously, the MEDS concept for AD assessment is simply a distance-based method (based on ED metrics), and therefore the availability of calculated ED data is crucial. For these purposes, as a result of the network training which process runs iteratively (i.e., learning of the network), the minimum EDs (Table 1) between each input object (e.g., a compound expressed as a multidimensional descriptor vector) entering the network and the so-called "central" neuron are calculated (Fig. 2b) according to the (Eq. (4)) [38]. Once calculated, the AD of the CP ANN model could be easily defined as a two-dimensional column plot (Fig. 3) which expresses the calculated minimum EDs (MEDS) for each compound (e.g., training/external test/external validation set) as a function of the total number of objects (88 compounds).
As shown in Fig. 3, the AD coverage of our CP ANN prediction model is simply defined by the threshold value (ED crit = 0.350), which corresponds to the training set object with maximal value for ED (Table 1; ID = 50, Digoxin). Therefore, all other objects (e.g., external test or external validation set compounds) for which the calculated ED is greater than ED crit , could be considered as being outside of the model AD. According to Fig. 3, a total of 9 compounds (ID = 57, 64, 66, 70, 73, 77, 80, 83, and 88) belonging to the external validation set (33 compounds), could be distinguished as potential outliers (Table 1) with (ED > ED crit = 0.350). Except two compounds (ID = 57 (Deltorphin; a heptapeptide of exogenic origin) and 88 (␤-Estradiol; steroidal hormone)) which belong to different therapeutic classes, i.e., selective ␦-opioid agonist (analgesic) and selective estrogen receptor modulator (contraceptive), respectively, the rest of the identified outliers are mainly non-steroidal antiinflammatory agents. These compounds are structurally very different comparing to the compounds used for building of the CP ANN model which are mainly purine and pyrimidine derivatives, and therefore their calculated EDs are much larger than the selected ED threshold (ED crit = 0.350). Consequently, the identified outlying  (Table 1); the central neurons are colored according to the calculated ED values in the range between dark blue (minimal ED value) and strong red (maximal ED value), while the intact (empty) neurons were colored in white. objects could be considered as pure structurally-influential outliers, i.e., outside of the CP ANN model's AD. All the objects identified as outliers (Fig. 3), were experimentally-determined as inactive ("I") compounds ( Table 1).
The MEDS analysis of the CP ANN model (Fig. 3) makes possible to assess the AD regarding the compound's structural space. Thus it is capable to detect only those objects (compounds) which are structurally different comparing to the compounds used for developing of the model, taking into account only the structural information (i.e., compounds descriptor space) stored in the framework of the trained Kohonen map of the CP ANN architecture (Fig. 1). In order to define the domain of applicability regarding the response data (the molecular property or activity values), the information about the predictivity of the CP ANN model (response space) must be additionally included into the AD definition. As a result, the AD of our CP ANN predictive model could now be defined as a two-dimensional dot plot representing the model response space (depicted in a form of calculated standardized residuals) as a function of the previously defined MEDS (Fig. 4a). In this way, one can identify those compounds for which the predicted activity values are questionable [53,56].
Similarly to the column plot represented in Fig. 3, the structurally-influential outliers (signed by the corresponding ID numbers in Fig. 4a) can be easily (visually) determined  7)). The compounds identified as outliers (outside of the model's AD) by both methods are signed by their corresponding ID numbers ((a) Table 1; (b) Supporting information in Table S1). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) (ED > ED crit = 0.350), however only two of them (ID = 70 and 88) can be recognized as correctly predicted by the model, with calculated standardized residual values in the boundaries between ±3 units [53,56]. The rest 7 outlying compounds regarding structural domain (ID = 57, 64, 66, 73, 77, 80 and 83) are also out of the 3 boundaries (with large prediction errors, see Table 1 and Ref. [25]) and therefore they could be recognized as pure response outliers, as well. A closer look at Fig. 4a indicates five additional response outliers from the external validation set (ID = 63, 67, 74, 76, and 79). According to the calculated ED values, these external validation set compounds are located within the domain boundaries as defined by ED crit = 0.350, however their calculated standardized residual values greater than ±3 units pinpoint to objects wrongly predicted by the model (Table 1). In summary, 9 compounds from the external validation set were found out of structural AD and thus the predictions are considered unreliable, and 7 of them were really poorly predicted. On the other hand, 23 compounds were found within the structural AD and thus expected as reliably predicted by the model, and 18 of them were predicted correctly, only 5 of them were response outliers. Thus, by assessing the structural AD outliers only, we got rid of 7 out of 12 response outliers. This is important because the assessing of structural AD is affordable also in case of unknown experimental values of properties of new active/toxic compounds.

Dataset 1
As demonstrated in Figs. 3 and 4a, the MEDS-based AD assessment of our CP ANN prediction model [25] performed so far, showed reasonable and in some instance satisfactory results. Namely, all compounds which were identified as potential structurallyinfluential outliers (Table 1) are indeed structurally very different than the compounds used for developing of the model, and therefore they could be expected to be situated far from the structural centroid of the model, i.e., outside of the model's AD. Comparing to the well known leverage approach [3] and particularly its graphical representation for AD estimation (Williams plot; standardized residuals vs. leverages) [53], a certain degree of similarity between both methods is evident (distance-based methods) [8,9]. On the other hand, both methods are considerably different in regard to the distance metrics used (Euclidean distances vs. leverages) and consequently limited to the nature of the developed QSAR model (non-linear, i.e., linear model, respectively) to which can be applied, however, for comparison purposes we decided to evaluate and demonstrate both of them.
Analogously to the MEDS-based AD approach (Fig. 4a) for which definition the calculated EDs for each investigated object are required, the leverage-based method for AD estimation (Williams plot) [53] requires the leverage values (h ii ), which are the diagonal elements of the hat matrix (H; see Eqs. (5) and (6)) [52,53]. In order to compare both methods as well as to assess the performances of the MEDS-based AD approach applied to our CP ANN prediction model (Fig. 4a) [25], a linear model (14-parameters PLS model) was additionally developed (see Section 2.1.1) using the same data (same number and type of molecular descriptors and same training/external test/external validation set division were exploited) to which the leverage-based method was applied (Fig. 4b). The experimental and predicted activity values (pK I-exp , pK I-pred ) and calculated leverage values (h ii ) for each investigated compound are available as Supporting information in Table S1.
As shown in Fig. 4, both methods give comparable distribution of compounds in the AD plots, however, the vertical border (cut-off value) of the AD showing the limit of the model with regard to the structural space is much more restrictive for the PLS model (Fig. 4b).
Consequently, the external validation set compounds (ID = 57, 64, 66, 70, 73, 77, 80, 83, and 88) which were identified as structurallyinfluential outliers by MEDS-based approach in the CP ANN model (Fig. 4a), were identified as extreme points (h > h* = 1.00) in the Williams plot of the PLS model (Fig. 4b), too. While the majority of these compounds (more precisely the objects with ID = 57, 64, 66, 73, 77, 80 and 83) were determined as response outliers (points with calculated standardized residual values greater than ±3 ) on the MEDS plot (Fig. 4a), the same compounds could be identified as acceptably predicted by the PLS model (see Supporting information, Table S1; Fig. 4b). Comparing to the MEDS-based approach (Fig. 4a), the leverage-based approach (Fig. 4b) identified an external test set compound thymol blue (thymolsulphonephthalein, ID = 53) as a structurally-influential outlier (h ID=53 = 1.277). Indeed, this compound is structurally very different comparing to the training/internal test set compounds, and therefore it is located far from the structural centroid of the PLS model (Fig. 4b). It is also far from the training/internal test set ensembles (ED ID=53 = 0.309; Fig. 4a), nevertheless, it is situated within the CP ANN model AD boundaries since the cut-off value (ED crit ) is determined as 0.350 by the training compound Digoxin (ID = 50). Surprisingly, of total 33 compounds from the external validation set, the leverage-based approach identified only 7 compounds (ID = 62, 63, 72, 75, 78, 82, and 85) situated within the PLS model AD (Fig. 4b), while the majority of them are located out of the model's domain (h > h*), although they are structurally-similar to the compounds used for developing of the model. On the other hand, these compounds were correctly identified by the MEDS-based approach as non-outlying objects (inside the model's applicability domain) as shown in Fig. 4a.

Dataset 2
The Dataset 2 of total 59 trypsin inhibitors was used to build three models, as described in Section 2.  [26], while the other two models were built additionally utilizing a reduced number of molecular descriptors. In order to avoid the obtaining an ill-conditioned hat matrix [42] (constructed of extremely large elements) and consequently extremely large leverage values h ii which are usually not very useful for construction of the Williams plot, we decided to reduce the number of molecular descriptors in a way where the number of molecular descriptors not exceed the total number of training set objects. Taking this limitation into account and for the purpose of this study, the additional CP ANN and PLS models were built by using only 7 out of 97 molecular descriptors through elimination of those descriptors for which the intercorrelation coefficients were more than or equal than 0.40 [58]. Although the performance of the obtained models with 7 descriptors was worse comparing to the original one (CP ANN model: As demonstrated in Fig. 5a, the CP ANN model constructed by using 97 molecular descriptors covers very well the structural domain of the compounds from the external validation set -all the objects are situated within the MEDS boundary with calculated EDs significantly smaller than the critical ED value (ED crit = 0.153) that corresponds to the training set object (ID = 17). However, two external validation set compounds (ID = 42 and 59) could be determined as pure response outliers (see Supporting information, Table S2). On the other hand, the CP ANN model constructed by using 7 molecular descriptors assigns three external validation set compounds (ID = 34, 42, and 56 with EDs > ED crit = 0.279) to be situated out of the MEDS (Fig. 5b). They are also determined outside of the AD of the PLS model (h > h* = 0.59) as represented on its Williams plot (Fig. 5c). Chemical structure of the compounds with ID = 34 and 56 do not contain SO 2 group as most of the compounds in the training set, while the outlier ID = 42 contains three fluor atoms, unlike the rest of the compounds in the Dataset 2. While no response outliers could be identified in the PLS model's AD depicted on the Williams plot (Fig. 5c), only one external validation set compound (ID = 42) is identified as a common response outlier on the MEDS plots ( Fig. 5a  and b).

Dataset 3
Similarly to the Dataset 2, the first model of the retention factor (k) was taken from previous work [27]  The applicability domain assessment for all three models is shown in Fig. 6.
As demonstrated in Fig. 6, one can observe satisfactory performances in all three cases (almost all objects are situated within the models AD). In the AD plot for the CP ANN model constructed by using 64 molecular descriptors (Fig. 6a) only one object from the external validation set is evidently structurally-influential outlier, ID = 20; This is sulfurous acid ant its 64-dimensional descriptor vector shows significant difference from the most similar structure in the training set, which is sulfuric acid that has a different number of oxygen atoms bound to sulfur atom with a different oxidation state. No outlying objects could be identified in the 13-parameters CP ANN and PLS models AD as to be located outside of their structural centroids (Fig. 6b and c). On the other hand, regarding the models response space, the AD for the 13-parameters CP ANN model (Fig. 6b) identified an external validation set object (ID = 9) as a response outlier, hence it is acceptably predicted by 64-parameters CP ANN and 13-parameters PLS models as depicted in Fig. 6a and c (see Supporting information, Table S3).

Is the MEDS-based AD estimation affected by the neural network architecture?
All the MEDS-based AD estimations performed so far, were established by using a single-architecture CP ANN prediction  Table S2). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) model. On the other hand, when working with non-linear QSAR modeling methods such as ANN, the colloquial strategy is to train different ANN architectures before to make a final decision which ANN model will be accepted (usually according to the highest value for the cross-validated coefficient of correlation for the model) [51]. According to this, the probability to find more than one acceptable model (trained under conditions different than those used to develop the best one) must be also taken into account, and therefore a crucial question arises: Is the MEDS-based AD estimation really affected by the ANN architecture employed? In order to give an answer to this question, a side-by-side comparative evaluation of the ADs for different ANN models of Dataset 1 developed under various network architectures must be performed as demonstrated in Fig. 7. Fig. 7 shows the MEDS-based AD plots for three CP ANN models for prediction of the bilitranslocase transport activity for 88  Table S3). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) compounds (Dataset 1) developed under different conditions (network architectures). Namely, alongside our published CP ANN model [25] which AD plot is represented in Fig. 7b (6 × 6 network), for comparative purposes we selected two more models developed under two vicinal network architectures, i.e., 5 × 5 network and 7 × 7 network, which AD plots are depicted in Fig. 7a and c, respectively. Comparing to the 6 × 6 model's AD (Fig. 7b), no significant differences were determined between all three cases in regard to the identified outliers. The external validation set compounds (ID = 57, 64, 66, 70, 73, 77, 80, 83, and 88) which were previously identified as extreme points, i.e., structurallyinfluential outliers (ED > ED crit ) as well as the response outliers (ID = 63, 67, 74, 76, and 79), were found to be outside of the AD in all three cases. The only questionable compound predicted by 5 × 5 (Fig. 7a) and 7 × 7 (Fig. 7c) models as structurally-influential outlier is the compound hydrochlorothiazide (ID = 74; ED ID=74[5×5] = 0.357 Fig. 7. Side-by-side comparative assessment of the MEDS-based AD for three CP ANN prediction models for estimation of the bilitranslocase transport activity developed using different network architectures: (a) 5 × 5, (b) 6 × 6 (our published model, Dataset 1) [25], and (c) 7 × 7. The normalized ED threshold values (a) ED crit = 0.350, (b) ED crit = 0.350, and (c) ED crit = 0.354, respectively, correspond to the training set object with maximal ED value. The training set objects (45 compounds) are depicted as solid blue rectangles, external test set objects (10 compounds) as solid red circles, while the external validation set objects (33 compounds) are represented as solid green triangles. The objects identified as potential outliers (ED > ED crit ) are signed with their corresponding ID number ((a, c) Supporting information Table S4; (b) Table 1). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) and ED ID=74[7×7] = 0.356) that belongs to the thiazide diuretic's therapeutic class. Indeed, hydrochlorothiazide is structurally very different than the compounds used for developing of the CP ANN models (no similar compound can be found) and therefore should be expected to be outside of the model's AD, even it was experimentally determined as bilitranslocase competitive inhibitor (Table 1). Hence, the same compound is situated within the 6 × 6 model's AD as shown in Fig. 7b.

Conclusion
In this study a novel, simple and effective distance-based methodology for AD estimation in case of established predictionlike ANN models was introduced taking into account the so-called "minimum Euclidean distance space" (MEDS) concept. The method offers a graphical depiction of the ANNs model AD for fast and accurate visual determination of the detected structurally-influential outliers. The performances of our MEDS-based AD estimation concept were thoroughly evaluated through three case studies utilizing a pre-built and validated CP ANN prediction models. In the model for the estimation of the transport activity of the transmembrane protein bilitranslocase for a diverse set of compounds, the method identified 9 compounds out of total 33 compounds (used for external validation of the CP ANN model) as pure structurallyinfluential outliers with calculated EDs greater than the critical threshold value (ED crit ), of which the majority were also determined as response outliers (objects with standardized residual values greater than ±3 units). The structural analysis of these compounds clearly confirmed that no similarity exist between them and the chemical structures used for developing of the CP ANN model, and therefore it is very likely to be situated far from the structural centroid of the model. Furthermore, the PLS modeling methodology was applied on the same data and the leverage-based AD was estimated and compared with the MEDS-based AD results. Although, both AD estimation methods are essentially different with regards to the distance metrics utilized (EDs vs. leverages), they gave comparable outcome. The same outlying objects identified by using the MEDS-based AD approach were again identified as extreme points on the assessed Williams plot (h > h*), together with a much larger ensemble of external validation set objects. These results explicitly show that the calculated cut-off leverage value (h*) is much more restrictive for the PLS model, comparing to the ED crit for the CP ANN model. Finally, the side-by-side comparative assessment of the AD plots generated for different CP ANN models developed under various ANN architectures close to the optimal one showed no significant differences in the identified outliers, allowing us to conclude that the MEDS-based AD assessment is not significantly affected by the neural network size.
From pharmacological point of view, the majority of the compounds identified as outliers throughout this case study belong to the class of non-steroidal antiinflammatory drugs (NSAIDs) which are totally different than the training set compounds. Therefore we believe that their permanent appearance out of the model AD pinpoint to some degree of weakness of the model for prediction of NSAIDs. Additionally, the same methodology was applied on two more case studies and similar outcomes were observed.
In conclusion, our MEDS-based approach demonstrated satisfactory performances for proficient assessment of the AD in case of non-linear predictive distance-based ANN models such as KANN and CP ANN, as has been already shown for classification CP ANN models [24]. This method is supposed to improve the choice of standard methods needed in providing the high quality validations in the QSAR world.

Notes
All the AD plots provided in this study were generated and analyzed by using the CPNN-AD Builder and MLR-AD Builder available upon request.