Posted on 2023-12-13, 19:11. Authored by Thomas M. Whitehead, Joel Strickland, Gareth J. Conduit, Alexandre Borrel, Daniel Mucs, Irene Baskerville-Abraham
Imputation machine learning (ML) surpasses traditional approaches
in modeling toxicity data. The method was tested on an open-source
data set comprising approximately 2500 ingredients with limited in vitro and in vivo data obtained from
the OECD QSAR Toolbox. By leveraging the relationships between different
toxicological end points, imputation extracts more valuable information
from each data point compared to well-established single end point
methods, such as ML-based Quantitative Structure Activity Relationship
(QSAR) approaches, improving the coefficient of determination by up to
approximately 0.2. A significant aspect of this
methodology is its resilience to the inclusion of extraneous chemical
or experimental data. While additional data typically introduces a
considerable level of noise and can hinder the performance of single end
point QSAR modeling, imputation models remain unaffected. This implies
a reduction in the need for laborious manual preprocessing tasks such
as feature selection, thereby making data preparation for ML analysis
more efficient. This successful test, conducted on open-source data,
validates the efficacy of imputation approaches in toxicity data analysis.
This work opens the way for applying similar methods to other types
of sparse toxicological data matrices. We therefore discuss the development
of regulatory authority guidelines for accepting imputation models, a
key step toward wider adoption of these methods.
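
As a rough illustration of the idea described above (using the relationship between end points to fill gaps in a sparse matrix, then scoring with the coefficient of determination), the following is a minimal sketch. It is not the authors' method: it assumes a toy two-column matrix, a simple least-squares line between the two end points, and made-up held-out values. The names `r_squared`, `fit_line`, and `impute_column` are hypothetical helpers introduced only for this example.

```python
from statistics import mean


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def impute_column(rows, source, target):
    """Fill missing `target` entries from `source`, fitting only on rows
    where both end points were measured (assumed toy imputer)."""
    paired = [(r[source], r[target]) for r in rows
              if r[source] is not None and r[target] is not None]
    slope, intercept = fit_line([p[0] for p in paired],
                                [p[1] for p in paired])
    filled = []
    for row in rows:
        row = list(row)
        if row[target] is None and row[source] is not None:
            row[target] = slope * row[source] + intercept
        filled.append(row)
    return filled


# Hypothetical toy data: (end point A, end point B); None marks a missing value.
toy = [
    (1.0, 2.1),
    (2.0, 3.9),
    (3.0, 6.2),
    (4.0, None),
    (5.0, None),
]
held_out_b = [8.0, 9.9]  # made-up reference values for the two missing entries

completed = impute_column(toy, source=0, target=1)
predicted_b = [completed[3][1], completed[4][1]]
score = r_squared(held_out_b, predicted_b)
print(f"R^2 on imputed end point B: {score:.3f}")
```

A realistic imputation model would draw on many end points and chemical descriptors at once rather than a single pairwise line; the sketch only shows how observed entries in one column can inform missing entries in another.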