Hessian QM9 Dataset

Version 4 2024-12-12, 10:12
Version 3 2024-09-18, 15:43
Version 2 2024-09-18, 15:28
Version 1 2024-07-30, 10:41
Dataset posted on 2024-12-12, 10:12, authored by Nicholas Williams


Overview

Hessian QM9 is the first database of equilibrium configurations and numerical Hessian matrices. It covers 41,645 molecules from the QM9 dataset, computed at the $\omega$B97x/6-31G* level of theory. Molecular Hessians were calculated in vacuum, as well as in water, tetrahydrofuran (THF), and toluene using an implicit solvation model.

A preprint article associated with this dataset is available here.

Data records

The dataset is stored in Hugging Face's dataset format. There is one dataset for each of the four environments (vacuum, THF, toluene, and water), and each contains the vibrational analysis of the same 41,645 optimized geometries. Each record's label follows the QM9 molecule labelling system given by Ramakrishnan et al.

Please note that only molecules containing H, C, N, and O were considered. Fluorine-containing molecules were excluded because QM9 contains too few of them to describe the chemical environment of fluorine atoms well. Including them might have reduced the overall precision of models trained on our data.

Load the dataset:

Use the following Python script to load the dataset dictionary:

```python
from datasets import load_from_disk

# root_directory must point to the local folder holding the downloaded dataset
dataset = load_from_disk(root_directory)
print(dataset)
```


Expected output:


```python
DatasetDict({
    vacuum: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    thf: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    toluene: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    }),
    water: Dataset({
        features: ['energy', 'positions', 'atomic_numbers', 'forces', 'frequencies', 'normal_modes', 'hessian', 'label'],
        num_rows: 41645
    })
})
```
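
As a quick sanity check on a loaded record, the sketch below compares the number of Hessian entries with the $(3N)^2$ expected for a molecule of $N$ atoms. The field names come from the feature list above. The helper names `count_scalars` and `summarize_record` and the toy record are illustrative assumptions. The storage shape of the `hessian` field is not stated on this page, so the code counts entries regardless of nesting.

```python
def count_scalars(value):
    """Count scalar entries in an arbitrarily nested list or tuple."""
    if isinstance(value, (list, tuple)):
        return sum(count_scalars(v) for v in value)
    return 1


def summarize_record(record):
    """Compare the Hessian size of one record with the expected (3N)^2."""
    n_atoms = len(record["atomic_numbers"])
    dof = 3 * n_atoms  # Cartesian degrees of freedom
    return {
        "label": record["label"],
        "n_atoms": n_atoms,
        "hessian_entries": count_scalars(record["hessian"]),
        "expected_hessian_entries": dof * dof,
    }


# Synthetic stand-in for one row; all values are placeholders, not real data.
toy_record = {
    "label": "toy",
    "atomic_numbers": [8, 1, 1],
    "hessian": [[0.0] * 9 for _ in range(9)],
}

summary = summarize_record(toy_record)
print(summary)
```

With the real data, a row such as `dataset["vacuum"][0]` could be passed in place of `toy_record`, assuming the default output format in which rows are returned as dictionaries of Python lists.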


DFT Methods

All DFT calculations were carried out using the NWChem software package. The density functional used was $\omega$B97x with a 6-31G* basis set, chosen to create data compatible with the ANI-1/ANI-1x/ANI-2x datasets. The self-consistent field (SCF) cycle was considered converged when changes in total energy and density were less than 1e-6 eV. All molecules in the set are neutral with a multiplicity of 1. The Mura-Knowles radial quadrature and Lebedev angular quadrature were used for numerical integration. Structures were optimized in vacuum and in three solvents (tetrahydrofuran, toluene, and water) using an implicit solvation model.

The Hessian matrices, vibrational frequencies, and normal modes were computed for the 41,645 optimized molecular geometries using the finite-difference method.
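
To illustrate what a finite-difference Hessian involves, here is a minimal sketch that differentiates an analytic gradient with central differences and then symmetrizes the result. The scheme, the step size, and the toy potential $E(x, y) = x^2 y + \sin y$ are assumptions made for illustration. This page does not state which settings were used in NWChem.

```python
import math


def toy_gradient(x):
    """Analytic gradient of the toy potential E(x, y) = x^2 * y + sin(y)."""
    return [2.0 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]


def numerical_hessian(grad, x, step=1e-4):
    """Central-difference Hessian built from an analytic gradient function."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += step
        x_minus[j] -= step
        g_plus, g_minus = grad(x_plus), grad(x_minus)
        # Column j holds d(grad_i)/d(x_j) for every component i.
        for i in range(n):
            hess[i][j] = (g_plus[i] - g_minus[i]) / (2.0 * step)
    # Average off-diagonal pairs so the result is exactly symmetric.
    for i in range(n):
        for j in range(i + 1, n):
            mean = 0.5 * (hess[i][j] + hess[j][i])
            hess[i][j] = hess[j][i] = mean
    return hess


H = numerical_hessian(toy_gradient, [1.0, 0.5])
print(H)
```

The analytic Hessian of this toy potential is $\begin{pmatrix} 2y & 2x \\ 2x & -\sin y \end{pmatrix}$, which gives a direct check on the numerical result. For a molecule, `x` would be the flattened Cartesian coordinates and `grad` the negative of the forces.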

Example model weights

An example model trained on Hessian data is included in this dataset. The model is an E(3)-equivariant graph neural network built with the `e3x` package. Full details of the model, including its architecture, will be provided in an upcoming publication. To load the model weights, use:


```python
import jax.numpy as jnp

params = jnp.load('params_train_f128_i5_b16.npz', allow_pickle=True)['params'].item()
```
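
Once loaded, it can help to see which parameter groups the file contains. The sketch below walks a nested dictionary and lists the path to each leaf. It assumes `params` is a nested dictionary of arrays, which this page does not confirm. The helper `list_param_paths` and the keys in `toy_params` are hypothetical stand-ins, not the real parameter names.

```python
def list_param_paths(tree, prefix=""):
    """Return slash-separated paths to every non-dict leaf of a nested dict."""
    paths = []
    for key, value in tree.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            paths.extend(list_param_paths(value, path))
        else:
            paths.append(path)
    return paths


# Hypothetical stand-in for the loaded parameters; keys and values are placeholders.
toy_params = {
    "embed": {"kernel": [0.0]},
    "dense": {"kernel": [0.0], "bias": [0.0]},
}

paths = list_param_paths(toy_params)
print(paths)
```

Calling `list_param_paths(params)` on the loaded weights would print the actual layer names, if the assumption about the structure holds.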
