Gradual in Silico Filtering for Druglike Substances

The suitability of decision trees in comparison to support vector machines for classifying chemical compounds into drugs and nondrugs was investigated. To meet the requirements of screening virtual compound libraries, schemes for successive filtering steps with gradually increasing computational cost are outlined. Decision trees and support vector machines achieved similar prediction accuracy on the compound data sets used. By using rapidly computable variables such as druglikeness indices, XlogP, and the molar refractivity, at least 39% of the nondrugs can be filtered out, while retaining more than 83% of the actual drugs. Computationally more demanding descriptors such as specific substructure queries and quantum chemically derived variables can be postponed to subsequent classification schemes applied to the reduced set of compounds, whereby up to 92% of the nondrugs can be sorted out without losing considerably more drugs. Using all available computed descriptors simultaneously in the first step did not yield significantly better results. Furthermore, the generated decision trees were used to derive guidelines for the design of druglike substances. The numerical margins found at the branching points suggest several criteria that separate drugs from nondrugs: a molecular weight higher than 230, a molar refractivity higher than 40, and the presence of one or more rings as well as one or more functional groups. Also reported are the additional parameters required to compute XlogP, SlogP, and the molar refractivity for boron- and silicon-containing compounds.
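
The gradual filtering idea can be illustrated with a minimal Python sketch. It is not the decision trees or support vector machines of the study: the Compound record, the function names, and the example values are hypothetical, the descriptors are assumed to be precomputed, and combining the four reported criteria with a logical AND is an assumption made only for illustration. The costly second stage is a placeholder that counts how often it is called, to show that expensive descriptors are evaluated only for compounds that survive the cheap stage.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Compound:
    """Hypothetical record holding precomputed, cheap descriptors."""
    name: str
    mol_weight: float
    molar_refractivity: float
    ring_count: int
    functional_group_count: int


def passes_cheap_criteria(c: Compound) -> bool:
    # Thresholds taken from the criteria listed in the abstract;
    # joining them with AND is an assumption of this sketch.
    return (
        c.mol_weight > 230
        and c.molar_refractivity > 40
        and c.ring_count >= 1
        and c.functional_group_count >= 1
    )


def gradual_filter(
    compounds: Iterable[Compound],
    stages: List[Callable[[Compound], bool]],
) -> List[Compound]:
    """Apply the stages in order; each stage sees only the survivors so far."""
    survivors = list(compounds)
    for stage in stages:
        survivors = [c for c in survivors if stage(c)]
    return survivors


def demo() -> Tuple[List[str], int]:
    expensive_calls = 0

    def expensive_stage(c: Compound) -> bool:
        # Placeholder for a costly classifier (e.g. substructure queries or
        # quantum chemically derived variables); here it accepts everything.
        nonlocal expensive_calls
        expensive_calls += 1
        return True

    library = [
        Compound("A", 310.4, 85.0, 2, 3),   # meets all four criteria
        Compound("B", 120.1, 30.2, 0, 1),   # too light, no ring
        Compound("C", 250.0, 45.0, 1, 0),   # no functional group
    ]
    survivors = gradual_filter(library, [passes_cheap_criteria, expensive_stage])
    return [c.name for c in survivors], expensive_calls
```

Ordering the stages from cheapest to most expensive keeps the cost of the later stages proportional to the size of the reduced compound set rather than to the full library.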