OrbNet Denali Training Data
This repository contains the data for the paper "OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy". The data set consists of molecular geometries and the corresponding energy labels calculated at the DFT and semi-empirical levels.
Citation
Anders S. Christensen(1,a), Sai Krishna Sirumalla(1,a), Zhuoran Qiao(2), Michael B. O'Connor(1), Daniel G. A. Smith(1), Feizhi Ding(1), Peter J. Bygrave(1), Animashree Anandkumar(3,4), Matthew Welborn(1), Frederick R. Manby(1), and Thomas F. Miller III(1,2) "OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy" (2021) https://arxiv.org/abs/2107.00299
a) Indicates equal contribution
1. Entos, Inc., Los Angeles, CA 90027, USA
2. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125, USA
3. Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA 91125, USA
4. NVIDIA, Santa Clara, CA 95051, USA
The following files are included:
Filename | Description | MD5 checksum |
---|---|---|
denali_labels.tar.gz | .csv file with energy labels and other metadata | bc9b612f75373d1d191ce7493eebfd62 |
denali_xyz_files.tar.gz | Archive with .xyz geometry files | edd35e95a018836d5f174a3431a751df |
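The checksums in the table above can be used to confirm that the archives downloaded intact. The following is a minimal sketch using only the Python standard library; the helper names `md5sum` and `verify_downloads` are hypothetical, and the script assumes both archives sit in the current working directory.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file table in this README.
EXPECTED_MD5 = {
    "denali_labels.tar.gz": "bc9b612f75373d1d191ce7493eebfd62",
    "denali_xyz_files.tar.gz": "edd35e95a018836d5f174a3431a751df",
}


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each archive name to True/False (checksum match) or None (file not found)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = md5sum(path) == expected if path.is_file() else None
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        status = {True: "OK", False: "MISMATCH", None: "not found"}[ok]
        print(f"{name}: {status}")
```

A `MISMATCH` result usually points to an incomplete or corrupted download, in which case the archive should be fetched again.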
Geometry data
The geometries are stored in XYZ+ format, which is compatible with the standard .xyz format but additionally has the charge and multiplicity annotated on the comment (2nd) line. The coordinates are in units of ångström.
For example, a water molecule with a charge of 0 and a spin-multiplicity of 1 (i.e. singlet) can be specified in this format as:
3
0 1
O -1.08201 1.07900 -0.02472
H -0.09268 1.08664 0.01745
H -1.37137 1.24781 0.90715
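A minimal sketch of a reader for this format is shown here. It assumes the comment line lists the charge first and the multiplicity second, as in the water example above (`0 1`). The names `parse_xyz_plus` and `XYZPlusGeometry` are hypothetical helpers for illustration and are not shipped with the data set.

```python
from typing import List, NamedTuple, Tuple


class XYZPlusGeometry(NamedTuple):
    charge: int
    multiplicity: int
    symbols: List[str]
    coordinates: List[Tuple[float, float, float]]  # in ångström


def parse_xyz_plus(text: str) -> XYZPlusGeometry:
    """Parse the contents of an XYZ+ file into charge, multiplicity, and atoms."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    # Assumed order on the comment line: charge first, then multiplicity.
    charge_str, mult_str = lines[1].split()[:2]
    symbols, coordinates = [], []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        symbols.append(symbol)
        coordinates.append((float(x), float(y), float(z)))
    if len(symbols) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(symbols)}")
    return XYZPlusGeometry(int(charge_str), int(mult_str), symbols, coordinates)


# The water example from this README.
WATER = """3
0 1
O -1.08201 1.07900 -0.02472
H -0.09268 1.08664 0.01745
H -1.37137 1.24781 0.90715
"""

water = parse_xyz_plus(WATER)
print(water.charge, water.multiplicity, water.symbols)  # 0 1 ['O', 'H', 'H']
```

Because XYZ+ only adds information to the comment line, tools that read standard .xyz files should still be able to load the coordinates directly.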
The directory structure of the geometry data contained within denali_xyz_files.tar.gz is as follows:
xyz_files/
├── mol_id1/
│ ├──sample_id0.xyz
│ ├──sample_id1.xyz
│ ├──sample_id2.xyz
│ ├──sample_id3.xyz
│ └──sample_id4.xyz
├── mol_id2/
│ ├──sample_id0.xyz
│ ├──sample_id1.xyz
│ ├──sample_id2.xyz
│ └──sample_id3.xyz
├── ... etc
Each mol_id uniquely identifies a molecule, with the various conformer geometries for that molecule stored in the corresponding folder.
Those geometries are in turn identified by a unique sample_id identifier.
Grouping the geometries by mol_id is used in the OrbNet loss function; see Eq. 3 in the paper.
Note that not all molecules have multiple geometries.
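The folder layout described above can be indexed with a few lines of Python. This is a sketch only: it assumes the archive has been extracted so that the top-level folder is `xyz_files/`, as in the tree shown, and the function names `index_geometries` and `count_single_geometry_molecules` are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def index_geometries(root="xyz_files"):
    """Map each mol_id (folder name) to the sorted sample_ids (file stems) inside it."""
    index = defaultdict(list)
    for xyz_path in sorted(Path(root).glob("*/*.xyz")):
        index[xyz_path.parent.name].append(xyz_path.stem)
    return dict(index)


def count_single_geometry_molecules(index):
    """Count molecules that have exactly one geometry on disk."""
    return sum(1 for samples in index.values() if len(samples) == 1)


if __name__ == "__main__":
    geometries = index_geometries()
    print(f"{len(geometries)} molecules indexed")
    print(f"{count_single_geometry_molecules(geometries)} with a single geometry")
```

The resulting dictionary gives the per-molecule grouping directly, which is convenient when geometries of the same molecule need to be batched together.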
Training labels
The training labels (i.e. the wB97X-D3/def2-TZVP and GFN1-xTB energies, in units of Hartree) and the training and test/validation splits are provided in the file denali_labels.csv. All molecules are in singlet states.
The .csv file contains the following columns:
Column | Description |
---|---|
sample_id | A unique hash generated from the QM input, also corresponds to the .xyz filename of that geometry |
subset | The data source for that geometry, please refer to the paper for a detailed description of the various subsets |
mol_id | Identifier for the parent molecule |
test_set | True if the geometry is part of the test/validation set of neutral molecules |
test_set_plus | True if the geometry is part of the test/validation set of charged molecules |
prelim_1 | True if the geometry is part of the 10% OrbNet Denali training set |
training_set_plus | True if the geometry is part of the full OrbNet Denali training set |
charge | The charge of the molecule |
dft_energy | wB97X-D3/def2-TZVP energy calculated with Qcore 0.8.17 in Hartree |
xtb1_energy | GFN1-xTB energy calculated with Qcore 0.8.17 in Hartree |
The .csv file can be loaded in Python, for example using pandas.
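As a sketch of what that might look like, the snippet below loads the table, selects a split by one of the boolean columns listed above, and builds the path to the matching geometry file. It assumes that denali_labels.csv has been extracted into the working directory and that the geometries live under `xyz_files/<mol_id>/<sample_id>.xyz`; the helper names `load_labels`, `select_split`, and `geometry_path` are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_labels(csv_path="denali_labels.csv"):
    """Load the label table, keeping the identifier columns as strings."""
    return pd.read_csv(csv_path, dtype={"sample_id": str, "mol_id": str})


def select_split(labels, column):
    """Return the rows where a boolean split column (e.g. 'test_set') is True."""
    return labels[labels[column].eq(True)]


def geometry_path(row, root="xyz_files"):
    """Build the assumed path <root>/<mol_id>/<sample_id>.xyz for one table row."""
    return Path(root) / row["mol_id"] / f"{row['sample_id']}.xyz"


if __name__ == "__main__":
    labels = load_labels()
    train = select_split(labels, "training_set_plus")
    test = select_split(labels, "test_set")
    print(f"{len(train)} training geometries, {len(test)} test geometries")
    print(geometry_path(train.iloc[0]))
```

Reading `sample_id` and `mol_id` as strings avoids pandas reinterpreting identifiers that happen to look numeric, which would break the lookup of the corresponding .xyz files.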