
OpenREACT-CHON-EFH: Open REaction Dataset of Atomic ConfiguraTions comprising C, H, O, N with Energies, Forces, and Hessians

Version 4 2025-05-29, 20:37
Version 3 2025-05-29, 20:16
Version 2 2025-05-29, 20:15
Version 1 2025-05-29, 20:09
dataset
posted on 2025-05-29, 20:37 authored by Austin Rodriguez, Justin S. Smith, Jose L. Mendoza-Cortes

These datasets were used to train and test Machine Learning Interatomic Potentials (MLIPs) for the article "Does Hessian Data Improve the Performance of Machine Learning Potentials?".

RTP Dataset (Reactant–Transition State–Product Dataset):

The RTP dataset forms the core training and evaluation set and consists of 35,087 molecular geometries sampled from 11,961 unique elementary reactions. For each reaction, three critical geometries are included: the optimized reactant, transition state (TS), and product. Each geometry is labeled with its DFT-computed potential energy, atomic forces, and Hessian matrix, calculated at the ωB97X-D/6-31G(d) level of theory. These geometries are stationary points (critical points) on the potential energy surface, and the dataset serves as the foundation for training the MLIPs to reproduce energies, gradients, and curvatures.
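The label layout described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each geometry is available as NumPy arrays: positions and forces of shape (N, 3), a scalar energy, and a Cartesian Hessian of shape (3N, 3N). The actual file format and array names of this dataset are not stated here, so the function names, the tolerance, and the toy Hessians below are illustrative only. The sketch also counts negative Hessian eigenvalues, the usual way to tell a minimum (none) from a first-order saddle point such as a TS (one).

```python
import numpy as np

def check_labels(positions, energy, forces, hessian):
    """Validate array shapes for one labeled geometry; returns the atom count."""
    positions = np.asarray(positions, dtype=float)
    forces = np.asarray(forces, dtype=float)
    hessian = np.asarray(hessian, dtype=float)
    n_atoms = positions.shape[0]
    assert positions.shape == (n_atoms, 3)
    assert np.ndim(energy) == 0
    assert forces.shape == (n_atoms, 3)
    assert hessian.shape == (3 * n_atoms, 3 * n_atoms)
    return n_atoms

def count_negative_modes(hessian, tol=1e-6):
    """Count Hessian eigenvalues below -tol (tol is an assumed threshold).

    Translations and rotations give near-zero eigenvalues in a Cartesian
    Hessian, so a small tolerance keeps them from being counted.
    """
    h = np.asarray(hessian, dtype=float)
    h = 0.5 * (h + h.T)  # symmetrize to remove numerical noise
    eigvals = np.linalg.eigvalsh(h)
    return int(np.sum(eigvals < -tol))

# Toy 2-atom example with synthetic (not dataset) Hessians.
toy_positions = np.zeros((2, 3))
toy_forces = np.zeros((2, 3))
minimum_like = np.diag([1.0, 2.0, 3.0, 0.0, 0.0, 0.0])
saddle_like = np.diag([-1.0, 2.0, 3.0, 0.0, 0.0, 0.0])

n_atoms = check_labels(toy_positions, 0.0, toy_forces, minimum_like)
n_neg_minimum = count_negative_modes(minimum_like)
n_neg_saddle = count_negative_modes(saddle_like)
```

At a converged stationary point the forces should also be close to zero, which is a quick sanity check before using Hessian labels in training.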

IRC Dataset (Intrinsic Reaction Coordinate Dataset):

To assess the extrapolation performance of the trained MLIPs along continuous reaction pathways, a dataset of 34,248 geometries was compiled from 600 Intrinsic Reaction Coordinate (IRC) paths, each corresponding to a distinct elementary reaction in the RTP dataset. These geometries were obtained by following the minimum energy path (MEP) from the transition state to both the reactant and product wells using quantum chemistry calculations at the ωB97X-D/6-31G(d) level of theory. Although these geometries are not used in training, they provide a rigorous benchmark for evaluating how well MLIPs generalize beyond the training data and how accurately they model transition state connectivity and reaction dynamics.
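As a rough illustration of how such a benchmark might be scored, the sketch below computes a mean absolute error (MAE) between model predictions and reference labels along one path. MAE is an assumed metric chosen for illustration, not necessarily the one used in the article, and the energy values are made-up numbers rather than dataset entries.

```python
import numpy as np

def mean_absolute_error(predicted, reference):
    """Mean absolute error over all elements of two equally shaped arrays."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if predicted.shape != reference.shape:
        raise ValueError("predicted and reference must have the same shape")
    return float(np.mean(np.abs(predicted - reference)))

# Hypothetical energies for three points along one IRC path (arbitrary units).
reference_energies = np.array([0.0, 1.0, 0.5])
predicted_energies = np.array([0.1, 0.9, 0.5])

energy_mae = mean_absolute_error(predicted_energies, reference_energies)
```

The same function applies unchanged to force arrays of shape (N, 3) or Hessians of shape (3N, 3N), since it averages over all elements.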

NMS Dataset (Normal Mode Sampling Dataset):

To evaluate MLIP robustness on off-equilibrium, perturbed structures, 62,527 geometries were generated via Normal Mode Sampling (NMS). These structures are derived by displacing intermediate IRC geometries along their vibrational modes with random amplitudes, simulating thermal fluctuations and non-equilibrium distortions. The properties of these perturbed structures were calculated at the ωB97X-D/6-31G(d) level of theory. This dataset allows the model's stability and accuracy to be tested on more realistic, noisy molecular structures, such as those encountered in molecular dynamics simulations or under experimental conditions.
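The displacement step described above can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes the vibrational modes are already available as Cartesian displacement vectors of shape (M, N, 3), and it draws amplitudes from a Gaussian with a hypothetical width `scale`. The amplitude distribution actually used for this dataset is not stated here.

```python
import numpy as np

def normal_mode_sample(positions, modes, scale=0.05, rng=None):
    """Displace a geometry along its modes with random amplitudes.

    positions: (N, 3) Cartesian coordinates.
    modes:     (M, N, 3) displacement vectors, one per vibrational mode.
    scale:     assumed standard deviation of the Gaussian amplitudes.
    """
    rng = np.random.default_rng() if rng is None else rng
    positions = np.asarray(positions, dtype=float)
    modes = np.asarray(modes, dtype=float)
    # One random amplitude per mode.
    amplitudes = rng.normal(0.0, scale, size=modes.shape[0])
    # Sum of amplitude-weighted modes gives an (N, 3) displacement.
    displacement = np.tensordot(amplitudes, modes, axes=1)
    return positions + displacement

# Toy diatomic with a single stretching mode along x (synthetic data).
toy_positions = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
stretch_mode = np.array([[[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]) / np.sqrt(2.0)

sampled = normal_mode_sample(
    toy_positions, stretch_mode, rng=np.random.default_rng(0)
)
```

Passing a seeded generator makes the sampled geometries reproducible; each sampled structure would then need its own reference calculation to obtain labels.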
