Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants’ Activities and Properties
Posted on 2023-07-05 - 19:40
In this study, we introduce the count-based Morgan fingerprint (C-MF) to represent the chemical structures of contaminants and develop machine learning (ML)-based predictive models for their activities and properties. Compared with the binary Morgan fingerprint (B-MF), C-MF not only records the presence or absence of an atom group but also quantifies its count in a molecule. We employ six ML algorithms (ridge regression, support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), XGBoost, and CatBoost) to develop models on 10 contaminant-related data sets using both C-MF and B-MF, and compare the two representations in terms of predictive performance, interpretation, and applicability domain (AD). Our results show that C-MF outperforms B-MF in predictive performance on nine of the 10 data sets. The advantage of C-MF over B-MF depends on the ML algorithm, and the performance enhancements are proportional to the difference in the chemical diversity of the data sets as calculated by B-MF and by C-MF. Model interpretation results show that C-MF-based models can elucidate the effect of atom group counts on the target and have a wider range of SHAP values. AD analysis shows that C-MF-based models have an AD similar to that of B-MF-based ones. Finally, we developed a “ContaminaNET” platform to deploy these C-MF-based models for free use.
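The difference between the two representations can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the integer atom-group identifiers, the 16-slot vector length, and the helper names `fold_counts`, `to_binary`, and `tanimoto` are all hypothetical, and a real workflow would obtain the identifiers from a cheminformatics toolkit. The min/max (generalized Tanimoto) similarity is used here only as one plausible way to show how counts can change the apparent similarity of two molecules; the abstract does not state how chemical diversity was calculated.

```python
def fold_counts(group_ids, n_slots=16):
    """Fold atom-group identifiers into a fixed-length count vector (C-MF-like)."""
    vec = [0] * n_slots
    for gid in group_ids:
        vec[gid % n_slots] += 1  # each occurrence adds one to its slot
    return vec


def to_binary(count_vec):
    """Collapse counts to presence/absence (B-MF-like)."""
    return [1 if c > 0 else 0 for c in count_vec]


def tanimoto(a, b):
    """Min/max Tanimoto similarity; equals the usual bit-based form on 0/1 vectors."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 1.0


# Hypothetical identifiers: molecule B contains the same atom groups as
# molecule A, but atom group 3 occurs three times instead of once.
mol_a = [3, 7, 12]
mol_b = [3, 3, 3, 7, 12]

count_a, count_b = fold_counts(mol_a), fold_counts(mol_b)
bin_a, bin_b = to_binary(count_a), to_binary(count_b)

sim_binary = tanimoto(bin_a, bin_b)    # binary vectors are identical
sim_count = tanimoto(count_a, count_b)  # count vectors differ in slot 3

print(f"binary similarity: {sim_binary:.2f}")
print(f"count similarity:  {sim_count:.2f}")
```

In this toy case the binary vectors cannot distinguish the two molecules, whereas the count vectors can, which is the kind of extra information a count-based fingerprint makes available to a regression model.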
CITE THIS COLLECTION
Zhong, Shifa; Guan, Xiaohong (2023). Count-Based Morgan Fingerprint: A More Efficient and Interpretable Molecular Representation in Developing Machine Learning-Based Predictive Regression Models for Water Contaminants’ Activities and Properties. ACS Publications. Collection. https://doi.org/10.1021/acs.est.3c02198