Chemometrics approach for the prediction of structure–activity relationship for membrane transporter bilitranslocase

Membrane transport proteins are essential for cellular uptake of numerous salts, nutrients and drugs. Bilitranslocase is a transporter specific for water-soluble organic anions and is the only known carrier of nucleotides and nucleotide-like compounds. Experimental data on bilitranslocase ligand specificity for 120 compounds were used to construct classification models using counter-propagation artificial neural networks (CP-ANNs) and support vector machines (SVMs). A subset of active compounds with experimentally determined transport rates was used to build predictive QSAR models for estimation of transport rates of unknown compounds. Several modelling methods and techniques were applied, namely CP-ANN, genetic algorithm, self-organizing mapping and multiple linear regression. The best predictions were achieved using CP-ANN coupled with a genetic algorithm, with an external validation parameter Q²_V of 0.96. The applicability domains of the models were defined to determine the chemical space in which reliable predictions can be obtained. The models were applied for the estimation of bilitranslocase transport activity for two sets of pharmaceutically interesting compounds, antioxidants and antiprions. We found that relative planarity and a high potential for hydrogen bond formation are the common structural features of anticipated substrates of bilitranslocase. These features may serve as guidelines in the design of new pharmaceuticals transported by bilitranslocase.


Introduction
Historically, the primary route of drug uptake into cells was considered to be passive diffusion through the lipid bilayer, and this notion was widely accepted in the pharmaceutical industry for a long time. Recently, a considerable body of evidence has pointed to the essential role of carrier-mediated transport in drug uptake into cells [1,2,4,5]. For most pharmaceutical drugs in use today these mechanisms were not specifically considered during drug design, but rather occurred serendipitously, by hitchhiking on membrane carriers normally used for the transport of nutrients and intermediary metabolites [4]. However, the carrier-mediated transport of potential agents is now being seriously considered because of its high potential for improving the effectiveness, selectivity and safety of drugs and dietary supplements [1,5]. One such membrane carrier is bilitranslocase, a transmembrane protein primarily involved in bilirubin transfer from blood to liver cells.
Bilitranslocase (BTL) is a plasma membrane transporter (T.C. 2.A.65.1.1) specific for water-soluble organic anions (bile pigments, dietary flavonoids, nucleotides) [6][7][8][9][10][11]. It has been experimentally confirmed that BTL is localized on the vascular endothelium and on absorptive (gastrointestinal) and excretory (hepatic and renal) epithelia [12,13]. The functions of BTL do not overlap with those of other organic anion transporters, and probably its most distinctive feature is the ability to transport nucleotides, which no other known membrane transporter is able to do [11,14]. Several transporters of nucleosides and nucleobases have, however, been reported [15,16], namely concentrative nucleoside transporters belonging to the SLC28 gene family, which enable a unidirectional high-affinity transport of nucleosides and their analogues into the cells by coupling their transport to the inward-directed sodium gradient [17], and equilibrative nucleoside transporters from the SLC29 gene family, which mediate a bidirectional, sodium-independent facilitated diffusion with a low affinity [18]. The capacity of BTL to transport anthocyanins [6,12], which are natural glycosylated molecules, suggests the possible development of water-soluble, orally effective drugs. Thus, a comprehensive knowledge of BTL would open new possibilities in drug discovery, which could be of great interest for the pharmaceutical industry. Several studies have focused on elucidating the structure of BTL, but its three-dimensional (3D) structure remains unresolved. The structures of the four transmembrane subunits of BTL have been successfully predicted by chemometrics modelling [20]. Furthermore, the stability of these four BTL transmembrane regions has been studied with the use of molecular dynamics simulations, and the 3D structures of two transmembrane regions have been resolved by nuclear magnetic resonance (NMR) spectroscopy [22,23].
To gain some insight into the transport mechanism of structurally unresolved transmembrane proteins, quantitative structure-activity relationship (QSAR) methods can be applied [24][25][26][27], since they are independent of the host protein structure. In the past decade the transport activity of BTL for many different ligands has been characterized with a biological assay [6,7,10-14], and the experimental data obtained formed the basis for the development of several QSAR models for prediction of BTL transport ability for specific sets of polyaromatic organic compounds. The models were built with the use of counter-propagation artificial neural networks (CP-ANN) for sets of anthocyanin and flavonol derivatives [10] and various endogenous compounds, xenobiotics, and purine and pyrimidine derivatives [11,28].
In this study we present novel QSAR models for the prediction of BTL transport activity that were developed using the combined data from our previous studies [10,11]. These models cover a wider chemical space than previous ones and can be used for the prediction of BTL transport activity of new drug candidates such as nucleotide-like anticancer drugs [29,30], antivirals [31], and phenol-based drugs [32,33] with potential antioxidant or anticancer activity. The modelling procedure is based on the nonlinear CP-ANN and support vector machines (SVM), and linear multiple linear regression (MLR) methods, following the principles set out by the Organisation for Economic Co-operation and Development (OECD) [34].
Furthermore, two sets of pharmaceutically interesting compounds are evaluated with the newly developed QSAR models, including several antioxidants (pulvinic acid and coumarin derivatives) [35,36] and small aromatic organic compounds with the potential of being therapeutic agents for prion diseases [37], with the aim of creating a list of the most promising active compounds for further experimental studies regarding bioavailability. Antioxidants are important for their potential to scavenge reactive oxygen species, which is a high-priority domain in medical research today, while antiprion compounds are widely investigated for their potential therapeutic effects in neurodegenerative diseases. In order to set guidelines for further drug design studies, the most promising compounds from the two studied datasets are highlighted and the structural properties that affect their BTL transport ability are discussed.
Materials and methods

Dataset and structural descriptors

Dataset for model construction
The dataset contains 22 anthocyanins and 21 flavonols [10], and 77 compounds belonging to different chemical classes (nucleotides, nucleosides, nucleobases, various endogenous molecules, dyes and drugs) [11], together with the corresponding experimental data on BTL transport activity from previous studies. Among the 120 compounds included in this study, 70 compounds are inactive (I), while 50 compounds are active (A). Two-dimensional (2D) chemical structures for all 120 compounds can be found in Table S1 (in the supplementary material which is available via the multimedia link on the online article webpage), while Table 1 presents structure names and experimental values only for the active compounds. For modelling purposes, all Ki values (reported in mM) are expressed in logarithmic units as pKi (calculated from Ki in M).
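The unit conversion in the last sentence can be illustrated with a minimal Python sketch. It assumes the usual definition pKi = -log10(Ki / M); the function name and the example value are hypothetical and are not taken from Table 1.

```python
import math

def pki_from_ki_mm(ki_mm: float) -> float:
    """Convert an inhibition constant given in mM to pKi on the molar scale."""
    if ki_mm <= 0:
        raise ValueError("Ki must be positive")
    ki_molar = ki_mm * 1e-3        # mM -> M
    return -math.log10(ki_molar)   # assumed definition: pKi = -log10(Ki / M)

# Hypothetical example value (not from Table 1): Ki = 20 mM
print(round(pki_from_ki_mm(20.0), 2))  # 1.7
```

Under this definition a pKi of 1.7 corresponds to a Ki of roughly 20 mM.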

Two pharmaceutically interesting datasets for model application
The data include two sets of pharmaceutically interesting compounds with undetermined BTL transport activity: (1) a set of 111 antioxidants [35,36] (ID = 201-311; see Table S2 in the supplementary material which is available via the multimedia link on the online article webpage), and (2) a set of 109 active small antiprion compounds [37] (ID = 401-509; see Table S3 in the supplementary material which is available via the multimedia link on the online article webpage).

Descriptors of chemical structures
The equilibrated 3D structures were obtained using the 2D chemical structures as an input to the AM1 semi-empirical method of the MOPAC software [40]. The Codessa software [41] was used to calculate the molecular descriptors, which are used for the mathematical representation of molecules and are generally grouped into five categories: constitutional, topological, geometrical, electrostatic and quantum-chemical. The Codessa output resulted in more than 380 molecular descriptors for each molecule.

Computational methods and software
For the modelling purposes, QSARINS software was applied for the development of multiple linear regression (MLR) models [42], our in-house FORTRAN based software was employed for the development of counter-propagation artificial neural network (CP-ANN) models [43][44][45], and the LIBSVM software package [46] was used for the construction of support vector machines (SVM) models. Kohonen neural networks were applied for the reduction of the number of variables and splitting of the data, while a genetic algorithm (GA) was employed for the selection of influential variables [47].

Initial reduction of the number of descriptors
Since reducing the number of initially calculated descriptors is crucial for computational modelling, we first removed all descriptors with low variance (s² < 0.03). The remaining pool of 175 molecular descriptors was normalized to zero mean and unit standard deviation, resulting in a 120 × 175 data matrix. A further reduction of the number of variables was achieved on the basis of a similarity criterion, by mapping the transposed dataset (175 × 120 data matrix) onto a 7 × 7 Kohonen neural network, where descriptors were mapped according to their similarity and consequently several of them were placed onto the same neurons. From each neuron only two descriptors, those with the minimal and maximal Euclidean distances (EDs), were selected into the final pool of 66 molecular descriptors used for the modelling.
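The first two steps (variance filter and normalization) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population variance is used (the text does not say whether s² is the sample or population variance), and the descriptor names and values are invented.

```python
from statistics import mean, pstdev, pvariance

def drop_low_variance(columns, threshold=0.03):
    """Keep only descriptor columns whose variance is at least `threshold`."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) >= threshold}

def autoscale(values):
    """Scale one descriptor column to zero mean and unit standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical toy descriptors for four molecules (names and values are made up).
descriptors = {
    "n_aromatic_bonds": [6.0, 12.0, 0.0, 6.0],
    "nearly_constant": [1.00, 1.01, 1.00, 1.01],
}
kept = drop_low_variance(descriptors)
scaled = {name: autoscale(vals) for name, vals in kept.items()}
print(sorted(scaled))  # ['n_aromatic_bonds']
```

The subsequent Kohonen-based selection of two descriptors per neuron is not reproduced here.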

Splitting of the data into the training, test and validation set
To avoid inconsistent results, a careful splitting of the initial data is needed. This was performed following the recommended methodology for building QSAR models [48]; the training set (TR) was used for the learning of the network, the test set (TE) for defining the optimal model parameters (internal validation) and the validation set (V) for the external validation of the developed models. The division of compounds into the TR, TE, and V sets was done according to the optimal distribution of the 120 compounds, described with 66 molecular descriptors, on the Kohonen top map. This distribution was obtained by varying the parameters of the network until the maximal occupancy of neurons and the minimal average error per object were achieved. The optimal distribution was attained using a network with the following parameters: 13 × 13 neuron grid, 1000 learning epochs, 0.5 maximal learning rate, 0.01 minimal learning rate, non-toroidal boundary conditions, and a triangular correction function of the neighbourhood. The optimal Kohonen top map with the distribution of the 120 compounds is presented in Figure 1.
The compounds are distributed on the Kohonen top map in accordance with their structural similarity; similar structures are positioned on the same or neighbouring neurons. Figure 1 depicts the grouping of compounds in three clear clusters: the first one contains nucleobases and their derivatives (ID = 1-65), the second one consists of structurally related anthocyanins and flavonols (ID = 78-96 and 100-120, respectively), and the third one is comprised of various drugs (ID = 121-153). The division of 120 compounds into the TR (70), TE (31) and V (19) set was performed on the basis of their distribution on the Kohonen map (shown in Figure 1), taking careful consideration to cover entire chemical spaces of each of the three initial datasets. The ratio of the division was 58:26:16 for the TR, TE and V set, respectively. Data used for classification purposes enclosed both active and inactive compounds, represented equally in all three sets. For predictive models only the active compounds were considered and consequently the 70 inactive compounds were omitted. Therefore, only the remaining 50 active compounds were distributed in the TR (30), TE (13) and V (7) set. The corresponding ratio of the data division was 60:26:14 for the TR, TE and V set, respectively.
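For readers unfamiliar with Kohonen maps, the following is a minimal pure-Python sketch of how such a map can be trained and how neuron occupancy can be measured. It is not the in-house software used in this study; the linear learning-rate decay, the shrinking neighbourhood radius and the exact form of the triangular correction are assumptions, and the toy data are invented.

```python
import math
import random

def winner(weights, x):
    """Grid position of the neuron closest to x in Euclidean distance."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def train_som(data, nx, ny, epochs, lr_max=0.5, lr_min=0.01, seed=0):
    """Train a small non-toroidal Kohonen map; returns {(ix, iy): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(i, j): [rng.uniform(-1, 1) for _ in range(dim)]
               for i in range(nx) for j in range(ny)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            frac = step / max(total_steps - 1, 1)
            lr = lr_max - (lr_max - lr_min) * frac   # assumed linear decay
            radius = max(nx, ny) * (1.0 - frac)      # assumed shrinking neighbourhood
            win = winner(weights, x)
            for (i, j), w in weights.items():
                # ring distance on a flat grid (no wrap-around = non-toroidal)
                d = max(abs(i - win[0]), abs(j - win[1]))
                if d <= radius:
                    h = 1.0 - d / (radius + 1.0)     # triangular correction
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights

def occupancy(weights, data):
    """Fraction of neurons that are the winner for at least one object."""
    return len({winner(weights, x) for x in data}) / len(weights)

# Invented toy data: two groups of ten 2D points.
rng = random.Random(1)
data = [[rng.gauss(c, 0.1), rng.gauss(c, 0.1)] for c in (0.0, 5.0) for _ in range(10)]
som = train_som(data, nx=3, ny=3, epochs=20)
print(f"occupancy: {occupancy(som, data):.2f}")
```

Objects that share a winning neuron, or sit on neighbouring neurons, are treated as structurally similar, which is the property used above to spread each cluster over the TR, TE and V sets.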

Development of classification models
With the aim of developing a reliable and accurate classification model that could classify the compounds into active (A) and inactive (I) ones, two types of supervised learning algorithms were used: counter-propagation neural networks (CP-ANN) and support vector machines (SVM). Both algorithms are well known and frequently used methods giving good classification results [49].
The input data for both algorithms were 120 compounds, described with 66 calculated molecular descriptors and the corresponding BTL inhibition constants (pKi). These were the basis for the separation of active from inactive compounds, the threshold for the separation being set at a pKi of 1.7 log units; this value had already been determined in the previous study of Župerl et al. [11].
All the obtained models were internally evaluated with the TE set and the V set was used for the external validation. The efficiencies of all the models were evaluated using classification coefficients based on the number of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions; accuracy (equation (1)), sensitivity (equation (2)), specificity (equation (3)) and Matthews correlation coefficient (equation (4)) [50].
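The four coefficients can be computed directly from the confusion-matrix counts. The sketch below uses the standard textbook definitions of accuracy (AC), sensitivity (SE), specificity (SP) and the Matthews correlation coefficient (MCC); the counts in the example are hypothetical and do not reproduce any table of this study.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Return AC, SE, SP and MCC from confusion-matrix counts."""
    ac = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    se = tp / (tp + fn)                    # sensitivity (true positive rate)
    sp = tn / (tn + fp)                    # specificity (true negative rate)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"AC": ac, "SE": se, "SP": sp, "MCC": mcc}

# Hypothetical counts for 50 active and 70 inactive compounds.
example = classification_metrics(tp=42, tn=68, fp=2, fn=8)
print({name: round(value, 2) for name, value in example.items()})
```

MCC ranges from -1 to 1 and remains informative when the two classes are of unequal size, as is the case here (50 active versus 70 inactive compounds).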
To obtain the classification model with a minimal error for the internal test set (root mean square error, RMS_TE), the network parameters were varied. The best CP-ANN classification model was constructed with 93 epochs of training, a neuron grid size of 13 × 13, maximal and minimal learning rates of 0.5 and 0.01, respectively, non-toroidal boundary conditions, and a triangular correction function of the neighbourhood. The cumulative AC, SE, SP, and MCC of all objects obtained by the best CP-ANN classification model were 0.91, 0.84, 0.97, and 0.82, respectively.
The SVM classification was first performed using different kernel functions (linear, polynomial, Gaussian). In each case the algorithm parameters were determined using a grid search, where the optimal set of parameters was defined as the one that produced the lowest mean square error for the cross-validation of the TR set. The optimal SVM classification model was selected based on the accuracy of the V set. Best performance was achieved using a third-degree polynomial kernel function, parameter C set at one, and gamma at 0.03. The cumulative AC, SE, SP and MCC of all objects obtained by the top SVM classification model were 0.88, 0.86, 0.90, and 0.76, respectively. The classification results (AC, SE, SP, and MCC) for all three sets (TR, TE and V) obtained by the best CP-ANN and SVM classification models are given in Table 2.

Application of classification models on two pharmaceutically interesting datasets
The best CP-ANN and SVM classification models were applied to two sets of compounds with unknown transport activity: (1) a set of 111 antioxidants (ID = 201-311, Table S2 in the supplementary material which is available via the multimedia link on the online article webpage); and (2) a set of 109 antiprion compounds (ID = 401-509, Table S3 in the supplementary material which is available via the multimedia link on the online article webpage). The compounds were classified as active only if their predicted pKi values were higher than 1.7. The classification results (i.e. the list of compounds predicted as active) for both datasets are given in Table 2. Taken together, the two classification methods predicted 48 compounds as active (29 antioxidants and 19 antiprions). Although the SVM model gives a lower number of active compounds (20 compounds) than the CP-ANN model (41 compounds), the results of the two methods mostly overlap.
Note: Bold indicates compounds classified as active by both classification models.

Development of predictive models
The models for predicting BTL transport activity were built on the basis of 50 active compounds from the initial dataset (Table 1), which were divided into 30, 13 and 7 compounds for the TR, TE and V set, respectively. All compounds were described with 66 molecular descriptors and the corresponding pKi values, which were directly used as an input for the modelling algorithms.
First the MLR predictive model was built. Since the MLR modelling strategy in the QSARINS software incorporates leave-one-out cross-validation for the internal validation of the models, the division into the TR and TE set was not necessary. Therefore a merged TR and TE set (43 compounds) was used for the model development and the V set (seven compounds) for the external model validation. The best predictive ability of the MLR model was achieved using seven molecular descriptors (Table S4 in the supplementary material which is available via the multimedia link on the online article webpage), with reasonably high internal (Q²_LOO = 0.92) and external (Q²_V = 0.77) validation parameters. All Q² validation parameters in this study were calculated using the Q²_F3 formula proposed by Consonni et al. [51]. The best MLR model is presented in Figure 2(a) as a plot of experimental versus predicted pKi values.
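As a reading aid, the Q²_F3 parameter can be written as a short Python function. The sketch assumes the form in which this formula is commonly stated: the mean squared prediction error of the evaluated set divided by the mean squared deviation of the training responses from their mean, subtracted from one. The function name and the toy values are illustrative only.

```python
def q2_f3(y_obs, y_pred, y_train):
    """Q2_F3 = 1 - (PRESS / n_evaluated) / (TSS_train / n_train)."""
    n_eval, n_train = len(y_obs), len(y_train)
    mean_train = sum(y_train) / n_train
    press = sum((y - yp) ** 2 for y, yp in zip(y_obs, y_pred))  # prediction error
    tss = sum((y - mean_train) ** 2 for y in y_train)           # training-set spread
    return 1.0 - (press / n_eval) / (tss / n_train)

# Invented toy values: four training responses, two external predictions.
print(round(q2_f3([2.0, 3.0], [2.5, 2.5], [1.0, 2.0, 3.0, 4.0]), 3))  # 0.8
```

Because the denominator depends only on the training set, values obtained for different evaluated sets (LOO, TE, V) of the same model are directly comparable.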
Next the CP-ANN algorithm was applied. In order to obtain the predictive model with a minimal error of the internal test set (RMS_TE), different network parameters were varied. Among more than a thousand tested models, the optimal model (RMS_TE = 0.75) was obtained with a network dimension of 7 × 7, trained for 14 epochs, with maximal and minimal learning rates of 0.5 and 0.01, respectively. The low internal (Q²_TE = 0.66) and external (Q²_V = 0.57) validation parameters were ascribed to the high number of variables (66) used for the model construction. To further reduce the number of molecular descriptors used in the modelling procedure, we coupled the CP-ANN with a genetic algorithm (GA). More than 200 GA runs with different initial parameters (4-20 initial genes, 10-30 survivors, 0.5-1% mutation rate) were performed with the goal of defining the set of most significant variables based on the lowest error of the TR and TE set (minimal RMS_TR+TE). During the 600 generations of each GA run a population of 100 chromosomes evolved, with varying parameters of the network: number of neurons, number of learning epochs (50-1000) and learning rate (0.1-0.7). The best CP-ANN_GA model, with high internal (Q²_TE = 0.93) and external (Q²_V = 0.96) validation parameters, was obtained using 12 molecular descriptors, with a network dimension of 10 × 10, 100 learning epochs, and maximal and minimal learning rates of 0.5 and 0.01, respectively. Figure 2(b) depicts the regression plot of the experimental versus predicted pKi values obtained by the best predictive CP-ANN model coupled with GA. The predicted pKi values obtained from the best CP-ANN_GA model for all 50 active compounds used for the modelling are given in Table 1. According to Figure 2, the best nonlinear modelling method (CP-ANN_GA) resulted in a model with better performance compared with the best linear one (MLR).
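The principle of GA-based descriptor selection can be sketched as follows. This is a generic, simplified genetic algorithm, not the software used in this study: each chromosome is a bit mask over the descriptors, the best chromosomes survive unchanged, and children are produced by one-point crossover and random mutation. In the study the fitness would be the CP-ANN error (RMS_TR+TE); here a hypothetical toy error function stands in for it.

```python
import random

def select_descriptors(n_desc, fitness, pop_size=100, generations=600,
                       survivors=20, initial_genes=10,
                       mutation_rate=0.01, seed=0):
    """Tiny GA over bit-mask chromosomes; lower fitness(mask) is better."""
    rng = random.Random(seed)

    def random_mask():
        on = set(rng.sample(range(n_desc), initial_genes))
        return [i in on for i in range(n_desc)]

    pop = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[:survivors]              # elitism: best masks survive unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_desc)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]           # random bit-flip mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

# Hypothetical stand-in for the model error: three "informative" descriptors
# must be selected, and every superfluous descriptor costs a small penalty.
informative = {3, 17, 42}

def toy_error(mask):
    chosen = {i for i, g in enumerate(mask) if g}
    return len(informative - chosen) + 0.1 * len(chosen - informative)

best = select_descriptors(66, toy_error, generations=50)
print(sorted(i for i, g in enumerate(best) if g))
```

Because the survivors are carried over unchanged, the best error in the population can never increase from one generation to the next.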
To compare CP-ANN predictive modelling with another non-linear modelling method, the support vector regression (SVR) algorithm was applied. When all 66 initially selected molecular descriptors were used for the modelling, the internal (Q²_TE) and external (Q²_V) validation parameters of the resulting SVR model reached values of 0.83 and 0.67, respectively, which is comparable with or even slightly better than in the case of the corresponding CP-ANN model. The SVR algorithm was also applied to the development of a model using only the 10 molecular descriptors from the best CP-ANN_GA model. The resulting internal (Q²_TE) and external (Q²_V) validation parameters were 0.46 and 0.43, respectively, which is quite poor and significantly worse than in the case of the best CP-ANN_GA model.
For further study the best CP-ANN_GA model was selected, not only for the reasons of a superior performance with the reduced set of descriptors, but also for the well-established assessment of the applicability domain, which is lacking in case of SVR modelling.

Applicability domains of the developed models
In QSAR modelling the applicability domain (AD) represents the chemical space (chemical information with regard to the properties and structures of compounds) that was used in the development of the model. It has to be defined for each model to determine the reliability of predictions for compounds not used in the model development [34]. The structural space that the model covers is defined with the compounds in the training and internal test set; consequently the model's predictions are reliable only for compounds within this space. The distance-based approach for the applicability assessment is widely used and applicable for both linear and nonlinear modelling methods [48,52,53]. The compound is labelled as out of the domain when the distance between the compound and the centre of the training data set exceeds a defined threshold. For the AD evaluation of the MLR predictive model, a leverage based approach was used [48,52] where a so-called Williams plot is obtained (Figure 3(a)), which shows standardized residuals as a function of leverage values, while for the CP-ANN_GA predictive model, the Euclidean distance (ED) approach was used [53], where the compounds of the training and internal test set define the minimum ED space (MEDS) (Figure 3(b)).
The boundaries of the residuals of predicted properties are set at ±3σ for both methods and they determine the response outliers. The boundaries concerning the structural similarity of molecules differ between the two methods. In the case of the MLR predictive model, the leverage boundary was determined by the critical hat value h* = 0.56 [48], while for the CP-ANN_GA predictive model, the threshold distance was determined as the maximal ED value from the TR and TE set (ED_crit = 0.17) [53]. The critical hat value h* and ED_crit determine structural outliers of the MLR and CP-ANN_GA model, respectively. According to Figure 3(a) (MLR model), two structural outliers (ID = 75, 78) and one response outlier (ID = 96) are detected; two compounds (ID = 75, 96) are from the TR set and one (ID = 78) from the V set. In Figure 3(b) (CP-ANN_GA model), two compounds (ID = 154, 11) from the V set are considered as structural outliers. The compound with ID = 154 is sulphobromophthalein, which is indeed structurally quite different from the compounds used for the model construction. Additionally, one compound (ID = 95) from the V set is recognized as a response outlier.
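The two kinds of boundary can be expressed in a few lines of Python. The sketch assumes the common convention h* = 3(p + 1)/n for the critical hat value, where p is the number of model descriptors and n the number of compounds used to build the model; the text does not state the formula, but this convention gives 0.56 for seven descriptors and 43 compounds. The function names and the example compound values are hypothetical.

```python
def critical_hat(n_descriptors, n_training):
    """Warning leverage h* = 3(p + 1)/n (assumed convention)."""
    return 3.0 * (n_descriptors + 1) / n_training

def ad_flags(distance, std_residual, distance_limit, residual_limit=3.0):
    """Return (structural_outlier, response_outlier) for one compound.

    `distance` is the leverage (MLR) or the Euclidean distance (CP-ANN_GA);
    `distance_limit` is h* or ED_crit, respectively.
    """
    return distance > distance_limit, abs(std_residual) > residual_limit

print(round(critical_hat(7, 43), 2))   # 0.56
print(ad_flags(0.20, 1.2, 0.17))       # (True, False): outside the ED_crit = 0.17 boundary
```

A compound can thus be a structural outlier while still being predicted well, and vice versa, which is why the two boundaries are reported separately.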

Structural variables influencing the BTL transport activity
The influential variables indicate which chemical features correlate with the inhibition constants and can thus highlight structural details of compounds that have an important influence on the BTL transport activity. To interpret and understand this mechanism, a detailed analysis of the selected descriptors was carried out. Furthermore, quantitative information about the most influential descriptors was collected on the basis of the top 121 CP-ANN_GA models selected by the following criteria: RMS_TE ≤ 0.4, RMS_V ≤ 1.3, and the number of selected descriptors ≤ 20. The percentage of occurrence of each molecular descriptor in the 121 models was determined, and the 11 most frequent variables are listed as a consensus descriptor set in Table S4 in the supplementary material which is available via the multimedia link on the online article webpage.
The most frequently selected molecular descriptor is quantum-chemical, describing the minimal nuclear-electron attraction for a carbon atom, and is present in 80% of the 121 models. The second most common descriptor (72%) is topological, describing the average bonding information content, and the third most common (60%) is the hydrogen acceptor dependent area-weighted surface charge of hydrogen bonding donor atoms (HDSA-2), which is directly connected to hydrogen bonding and encodes features responsible for polar interactions between molecules. A comparison of the 12 molecular descriptors from the best CP-ANN_GA predictive model M1 with the 11 most frequently selected variables shows that most of them overlap and thus describe the same structural features.
In general the selected descriptors are in agreement with the influential variables published in our previous studies [10,11,28], which highlighted descriptors related to the ability of compounds to form hydrogen bonds (HDSA-2), the shape and compactness of compounds (Information content, Wiener index) and the ionic properties of compounds (FPSA-2, PPSA-2, PPSA-3). Therefore, with this new robust model, which covers a larger structural domain, we can generalize and support the hypothesis that active molecules should be capable of forming hydrogen bonds and that BTL has the capacity to transport ionic species. Furthermore, a detailed screening of the shape of active compounds (competitive or noncompetitive inhibitors) reveals a third important feature, which is the planarity of molecules. This is mostly reflected in two selected descriptors: the number of aromatic bonds and the Gravitation index. The importance of the planarity of molecules for BTL transport had already been observed and experimentally confirmed in previous studies [10,11,28,39].

Application of the predictive models on two pharmaceutically interesting datasets
Two pharmaceutically interesting datasets (antioxidant and antiprion compounds) with undetermined BTL transport activity were evaluated with the proposed prediction models, MLR and CP-ANN_GA. Only the compounds predicted as active by the two above-described classification models (SVM and CP-ANN, Table 2) were tested with the developed predictive models (MLR and CP-ANN_GA). The results of the classification (SVM and CP-ANN) and predictive models (CP-ANN_GA; model M2) for the set of antioxidants and antiprions are given in Tables S2 and S3 in the supplementary material which is available via the multimedia link on the online article webpage.
The main concern with the newly developed models is the reliability of predictions for compounds of diverse chemical structures; therefore we first had to check whether these compounds fall into the AD. Because the biological properties of these compounds are not yet determined, the applicability assessment was possible only for the structural domain, so modified AD plots had to be used; these were the so-called Insubria graphs in the case of MLR [48,52], and MEDS graphs in the case of CP-ANN_GA models [53]. The boundaries for the chemical space of the MLR and CP-ANN_GA models are again determined by the critical hat value (h*) and ED_crit, respectively (x-axis same as in Figure 3), while the y-axis shows predicted values instead of standardized residuals (Figure 4). Predictions for compounds that fall outside the model structural domain are considered unreliable and thus should not be included in further investigations.
The Insubria graph (Figure 4(a)) of the MLR model indicates that 20 out of 29 antioxidants and 10 out of 19 antiprions fall within the boundaries of the structural domain. In the MEDS graph of the CP-ANN_GA model (Figure 4(b)) all 29 antioxidants are positioned out of the AD and only two antiprions (ID = 451, 503) are placed inside the structural AD.
Even though the best CP-ANN_GA model has good predictive ability (i.e. low prediction error for the V set), it is not applicable to the set of antioxidant and antiprion compounds. This can be improved by choosing model parameters that broaden the AD at the price of reducing the accuracy of predictions. Therefore, several other predictive CP-ANN_GA models were evaluated against these new criteria, and only the models with a broader AD and still acceptable accuracy were considered. First, the models were ranked in order of decreasing Q²_V; the six best models are shown in Table 3. Then the compounds from the two pharmaceutically interesting datasets were input to the models to check whether they fall into the ADs (see Table 3).
Model M2 is the second best CP-ANN_GA model, with a slightly lower validation parameter (Q²_V = 0.91) than M1 but with a significantly broader AD for the antioxidants. The new MEDS graph for model M2 applied to the two pharmaceutically interesting datasets is shown in Figure 5. As can be seen in Table 3 and in Figure 5, 17 out of 29 antioxidants are located in the model AD. Model M4, with a lower validation parameter (Q²_V = 0.86), has even more antioxidants (20) inside the AD. For antiprion compounds, however, only two compounds are placed in the AD of models M1 and M4, and three compounds are placed in the AD of model M2. Hence, for various untested compounds it is important that the appropriate model is selected as a compromise between accuracy and applicability.
Several antioxidants (Table S2 in the supplementary material which is available via the multimedia link on the online article webpage) are predicted to pass the membrane through BTL and lie within the AD of the prediction model M2. All these compounds have large systems of conjugated double bonds, which are the cause of the planarity of these molecules. The least planar compounds are 204 and 206, which have relatively flexible sugar moieties attached. Since it is known that the attachment of sugar groups can influence the transport of substances through various transporters [54], this could also be the case with BTL, with the sugar moieties acting not as disruptors of the planarity of these compounds but rather as facilitators of transport. However, it is interesting that none of the coumarin-based antioxidants (ID = 295-311) were selected as active, despite their planarity and relative resemblance to compounds 203, 204, 206 and 208. All the active compounds also contain an abundance of oxygen moieties, mostly in the form of hydroxyl groups, but also lactones, ethers, esters and free carboxylic groups. Conjugated double bonds are probably crucial for the antioxidant activity through the stabilization of the free radical, whereas the abundance of oxygen atoms may suggest a high capacity for hydrogen bonding, which is thought to be important in BTL transport. Besides this, compound 226 contains a cyano group, compound 252 a couple of chlorine atoms, compound 259 two nitro groups, compound 260 a bromine atom, and compound 263 a thiophene group; however, none of these seem to have a major influence on the transport activity, since no other compound with these substituents was classified as active.
Compound 201 (norbadione) is a polyphenol, extracted from mushrooms; it was found to be an effective in vivo antioxidant, but unfortunately rather toxic [55]. Compounds 203 (quercetin), 204 ((+)-rutin trihydrate) [56], 205 (curcumin) [57], 206 (esculin) [58], 208 (scopoletin) [59] and 234 (gallic acid) [60] are also known antioxidants. Since our results indicate these compounds can be transported through BTL, we suggest a further experimental evaluation of these findings because BTL transport could turn out to be an effective delivery system for such compounds, especially due to its ubiquity, which would in turn give better support to the clinical use of these antioxidants. Compounds 451 (diketopiperazine derivative), 480 (Congo red derivative), and 503 (hematein) from the antiprion dataset (Table S3 in the supplementary material which is available via the multimedia link on the online article webpage) are also predicted as active by our classification models and are placed within the AD of the prediction model M2. Again the most noteworthy common features of these compounds are their planarity, and the abundance of oxygen and/or nitrogen atoms, indicating their high capacity for forming hydrogen bonds. Compounds 451 and 480 were synthesized for the purposes of the prion disease studies and were experimentally determined as active against prion accumulation [61,62]. Hematein (CAS 475-25-2) is a natural compound isolated from Caesalpinia sappan and is known to be used as an analgesic and anti-inflammatory agent in oriental medicine [63,64]. Furthermore, its potential for anti-inflammatory and anti-atherogenic activity was experimentally studied and shows promising results [65,66]. A study by Kocisko et al. [67] estimated that hematein acts also as an antiprion therapeutic agent. 
In light of these findings on the ability of BTL to transport these compounds, they could serve as leads in the future design of antiprion therapeutics with potentially improved bioavailability.

Conclusions
In this study, QSAR models were developed to predict the transport activities of some pharmaceutically interesting compounds through the membrane transporter BTL; the ability of BTL to transport heterogeneous organic molecules as well as some drugs had already been determined in previous studies. First, classification models for the prediction of BTL transport ability were built using CP-ANN and SVM algorithms on the basis of 120 compounds and the corresponding experimental data on the BTL ligand specificity. The CP-ANN and SVM models showed similar statistical parameters for cumulative accuracy (0.91 and 0.88, respectively), selectivity (0.84 and 0.86, respectively) and specificity (0.97 and 0.90, respectively). The predictive models were constructed from active compounds only, using the CP-ANN and MLR methods. The best model performance was achieved with CP-ANN coupled with a GA, with a good external validation parameter (Q_V^2 = 0.96) and a low number of variables used (12). The applicability domains of the predictive models were assessed using the Euclidean distance based approach for CP-ANN models and the leverage based approach for MLR models. The proposed models give reliable predictions of the transport ability by BTL for various small aromatic compounds.
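A Euclidean distance based applicability domain check, as mentioned above for the CP-ANN models, can be illustrated with a minimal sketch. It rests on assumptions that are not taken from this study: the distance is measured to the nearest training compound in descriptor space (as a stand-in for the distance to a CP-ANN neuron), and the threshold is the largest nearest-neighbour distance found within the training set. The descriptor values and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_distance(query, references):
    """Distance from a query compound to its closest reference compound."""
    return min(euclidean(query, r) for r in references)


def ad_threshold(training):
    """Assumed threshold: the largest nearest-neighbour distance observed
    inside the training set (each compound compared with all the others)."""
    return max(
        nearest_distance(x, training[:i] + training[i + 1:])
        for i, x in enumerate(training)
    )


def in_domain(query, training, threshold):
    """A query lies within the AD if it is no farther from the training set
    than the threshold."""
    return nearest_distance(query, training) <= threshold


# Hypothetical two-descriptor training set (already scaled).
training = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
threshold = ad_threshold(training)

print(in_domain([0.5, 0.5], training, threshold))  # close to the training set
print(in_domain([5.0, 5.0], training, threshold))  # far from the training set
```

In practice the descriptors would need to be scaled in the same way as for model building, and the choice of threshold strongly affects how many query compounds are accepted.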
The classification models were applied to the experimentally untested compounds from the antioxidant and antiprion datasets, and some of these compounds were estimated to be likely transported by BTL. The compounds classified as active were then submitted to the developed predictive models, and we checked whether they lay within the applicability domain. On the basis of a compromise between the accuracy of the model and the extent of the applicability domain, it was concluded that for antioxidants the CP-ANN_GA models were better than the MLR models, while for most of the antiprions the CP-ANN_GA models were not applicable due to severe structural differences. The MLR model with a lower number of descriptors proved to be more suitable for the prediction of antiprion transport ability because of its broader applicability domain.
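The leverage based applicability domain check used for MLR models can be sketched as follows. The leverage of a compound with descriptor row x (including the intercept term) is h = x (X^T X)^-1 x^T, where X is the training design matrix. The warning threshold h* = 3(p + 1)/n, with p descriptors and n training compounds, is a widely used convention and is an assumption here, since the threshold applied in this study is not stated in this section. The single-descriptor data and all function names are hypothetical.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def invert(m):
    """Gauss-Jordan inversion with partial pivoting for a small square matrix."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leverage(x, xtx_inv):
    """Leverage of one compound; a leading 1.0 is added for the intercept."""
    row = [1.0] + list(x)
    tmp = [sum(r * xtx_inv[i][j] for i, r in enumerate(row))
           for j in range(len(row))]
    return sum(t * r for t, r in zip(tmp, row))


# Hypothetical training set with a single descriptor.
training = [[1.0], [2.0], [3.0], [4.0], [5.0]]
design = [[1.0] + row for row in training]
xtx_inv = invert(matmul(transpose(design), design))

n, p = len(training), len(training[0])
h_star = 3.0 * (p + 1) / n  # assumed warning leverage

for query in ([3.0], [10.0]):
    h = leverage(query, xtx_inv)
    print(query, round(h, 3), "inside AD" if h <= h_star else "outside AD")
```

Because h* scales with (p + 1)/n, the warning threshold depends directly on the number of descriptors and training compounds, which is one reason the domains of two MLR models built on the same data can differ.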
The analysis of the resulting predictions for the compounds within the applicability domains of the CP-ANN_GA models shows that they share some common features, most notably the planarity of large moieties of these molecules and an abundance of functional groups capable of hydrogen bond formation. These findings, together with the newly developed models, could serve as guidelines in the design of different therapeutics, not only antioxidants or antiprions but also nucleotide-like and phenol-based anticancer and antiviral drugs.