figshare
Browse

The toxicity data sourced from TOXRIC database

Version 4 2024-10-24, 03:27
Version 3 2024-10-09, 13:48
Version 2 2024-10-09, 13:44
Version 1 2024-10-09, 13:03
dataset
posted on 2024-10-24, 03:27 authored by Jiang LuJiang Lu

The expanded predictive toxicology dataset is sourced from TOXRIC, a comprehensive and standardized toxicology database. The toxric_30_datasets contains 30 assay datasets with ~150,000 measurements related to five categories. These categories span a range of toxicity assessment, including genetic toxicity, organic toxicity, clinical toxicity, developmental and reproductive toxicity, and reactive toxicity. These 30 toxic datasets were given in toxric_30_datasets.zip.

The acute toxicity dataset includes 59 various toxicity endpoints with 80,081 unique compounds represented using SMILES strings, and 122,594 usable toxicity measurements described by continuous values with a unified toxicity chemical unit: -log(mol/kg). The larger the measurement value, the stronger the toxicity intensity of the corresponding compound towards a certain endpoint. The 59 acute toxicity endpoints involve 15 different species including mouse, rat, rabbit, guinea pig, dog, cat, bird wild, quail, duck, chicken, frog, mammal, man, women, and human, 8 different administration routes including intraperitoneal, intravenous, oral, skin, subcutaneous, intramuscular, parenteral, and unreported, and 3 different measurement indicators including LD50 (lethal dose 50%), LDLo (lethal dose low), and TDLo (toxic dose low). In this dataset, each compound only has toxicity measurement values concerning a small number of toxicity endpoints, so this dataset is very sparse with nearly 97.4% of compound-to-endpoint measurements missing. Meanwhile, this dataset is also extremely data-unbalanced with some endpoints having tens of thousands of toxicity measurements available, e.g., mouse-intraperitoneal-LD50 has 36,295 measurements, mouse-oral-LD50 has 23,373 measurements, and rat-oral-LD50 has 10,190 measurements, etc, while some endpoints contain only around 100 measurements like mouse-intravenous-LDLo, rat-intravenous-LDLo, frog-subcutaneous-LD50, and human-oral-TDLo, etc. The sparsity and unbalance of this dataset present acute toxicity evaluation as a challenging issue. Among the 59 endpoints, 21 endpoints with less than 200 measurements were considered small-sized endpoints, and 11 endpoints with more than 1000 measurements were treated as large-sized endpoints. Three endpoints targeting humans, human-oral-TDLo, women-oral-TDLo, and man-oral-TDLo, are typical small-sized endpoints, with only 140, 156, and 163 available toxicity measurements, respectively (The acute toxicity intensity measurement values of the 80,081 compounds concerning 59 acute toxic endpoints, as well as the 5-fold random splits, were provided in the multiple_endpoint_acute_toxicity_dataset.zip. The molecular fingerprints or feature descripors of the 80,081 compounds, such as Avalon, Morgan, and AtomPair, were given in the all_descriptors.txt).

Funding

National Key R&D Program of China (2023YFC2604400) and the National Natural Science Foundation of China (62103436)

History