Benchmark_AFLOW_Data_Sets_for_Machine_Learning.zip (402.5 MB)

Benchmark AFLOW Data Sets for Machine Learning

dataset

posted on 2020-03-08, 07:20 authored by Conrad Clement, Steven Kauwe, Taylor Sparks

Materials informatics is increasingly finding ways to exploit machine learning algorithms. Techniques such as decision trees, ensemble methods, support vector machines, and a variety of neural network architectures are used to predict likely material characteristics and property values. Supplemented with laboratory synthesis, applications of machine learning to compound discovery and characterization represent one of the most promising research directions in materials informatics. A shortcoming of this trend, in its current form, is a lack of standardized materials data sets on which to train, validate, and test model effectiveness. Applied machine learning research depends on benchmark data to make sense of its results. Fixed, predetermined data sets allow for rigorous model assessment and comparison. Machine learning publications that don't refer to benchmarks are often hard to contextualize and reproduce. In this data descriptor article, we present a collection of data sets of different material properties taken from the AFLOW database. We describe them, the procedures that generated them, and their use as potential benchmarks. We provide a compressed ZIP file containing the data sets, and a GitHub repository of associated Python code. Finally, we discuss opportunities for future work incorporating the data sets and creating similar benchmark collections.

Funding

NSF CAREER Award DMR 1651668

Keywords

AFLOW; Benchmark Data Sets; Machine Learning; Materials Informatics; Computational Chemistry; Pattern Recognition and Data Mining; Cheminformatics

Licence

MIT
