OrbNet Denali Training Data

Version 2 2021-07-02, 16:33
Version 1 2021-07-01, 09:49
Dataset posted on 2021-07-02, 16:33, authored by Anders S. Christensen, Sai Krishna Sirumalla, Zhuoran Qiao, Michael B. O'Connor, Daniel G. A. Smith, Feizhi Ding, Peter J. Bygrave, Animashree Anandkumar, Matthew Welborn, Frederick R. Manby, Thomas F. Miller III

This repository contains the data for the paper "OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy". The data set consists of molecular geometries and the corresponding energy labels calculated at the DFT and semi-empirical levels.

Citation

Anders S. Christensen(1,a), Sai Krishna Sirumalla(1,a), Zhuoran Qiao(2), Michael B. O'Connor(1), Daniel G. A. Smith(1), Feizhi Ding(1), Peter J. Bygrave(1), Animashree Anandkumar(3,4), Matthew Welborn(1), Frederick R. Manby(1), and Thomas F. Miller III(1,2) "OrbNet Denali: A machine learning potential for biological and organic chemistry with semi-empirical cost and DFT accuracy" (2021) https://arxiv.org/abs/2107.00299

a) Indicates equal contribution

  1. Entos, Inc., Los Angeles, CA 90027, USA
  2. Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125, USA
  3. Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA 91125, USA
  4. NVIDIA, Santa Clara, CA 95051, USA

Contents

The following files are included:

Filename                 Description                                      MD5 checksum
denali_labels.tar.gz     .csv file with energy labels and other metadata  bc9b612f75373d1d191ce7493eebfd62
denali_xyz_files.tar.gz  Archive with .xyz geometry files                 edd35e95a018836d5f174a3431a751df
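
The checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names md5sum and is_intact are hypothetical, and the code assumes the archives were saved under the filenames listed above.

```python
import hashlib

# MD5 checksums as listed in the file table above.
EXPECTED_MD5 = {
    "denali_labels.tar.gz": "bc9b612f75373d1d191ce7493eebfd62",
    "denali_xyz_files.tar.gz": "edd35e95a018836d5f174a3431a751df",
}


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path, expected_md5):
    """Compare the MD5 digest of a file with an expected hex string."""
    return md5sum(path) == expected_md5.lower()
```

For example, is_intact("denali_labels.tar.gz", EXPECTED_MD5["denali_labels.tar.gz"]) should return True for an uncorrupted download.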

Geometry data

The geometries are stored in XYZ+ format, which is compatible with the standard .xyz format but additionally has the charge and spin multiplicity annotated in the comment (2nd) line. The coordinates are in units of ångström.

For example, a water molecule with a charge of 0 and a spin-multiplicity of 1 (i.e. singlet) can be specified in this format as:

3
0 1
O   -1.08201   1.07900  -0.02472
H   -0.09268   1.08664   0.01745
H   -1.37137   1.24781   0.90715
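
A file in this format can be read with a few lines of Python. The sketch below is not part of the data set; the names Geometry and parse_xyz_plus are hypothetical, and it assumes that the comment line starts with the charge followed by the multiplicity, as in the water example above.

```python
from typing import List, NamedTuple, Tuple


class Geometry(NamedTuple):
    charge: int
    multiplicity: int
    symbols: List[str]
    coords: List[Tuple[float, float, float]]  # in ångström


def parse_xyz_plus(text: str) -> Geometry:
    """Parse the contents of one XYZ+ file into a Geometry."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])
    # Assumed order on the comment line: charge first, then multiplicity.
    charge_str, mult_str = lines[1].split()[:2]
    symbols, coords = [], []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        symbols.append(symbol)
        coords.append((float(x), float(y), float(z)))
    if len(symbols) != n_atoms:
        raise ValueError("atom count does not match the first line")
    return Geometry(int(charge_str), int(mult_str), symbols, coords)
```

Because the first line and the atom lines follow the standard .xyz layout, tools that ignore the comment line should still be able to read these files.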

The directory structure of the geometry data contained within denali_xyz_files.tar.gz is as follows:

xyz_files/
├── mol_id1/
│   ├──sample_id0.xyz
│   ├──sample_id1.xyz
│   ├──sample_id2.xyz
│   ├──sample_id3.xyz
│   └──sample_id4.xyz
├── mol_id2/
│   ├──sample_id0.xyz
│   ├──sample_id1.xyz
│   ├──sample_id2.xyz
│   └──sample_id3.xyz
├── ... etc

Each mol_id uniquely identifies a molecule, with the various conformer geometries for that molecule stored in the corresponding folder. Those geometries are in turn identified by a unique sample_id. Grouping the geometries by mol_id is used in the OrbNet loss function; see Eqn. 3 in the paper. Note that not all molecules have multiple geometries.
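
The folder layout makes it straightforward to collect the conformers of each molecule. The following is a minimal sketch, assuming the archive has been extracted so that the layout shown above is available on disk; the function name group_conformers is hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_conformers(root):
    """Map each molecule folder name to the sorted list of its geometry file stems."""
    groups = defaultdict(list)
    # Each geometry file sits exactly one level below the root directory.
    for xyz_path in sorted(Path(root).glob("*/*.xyz")):
        groups[xyz_path.parent.name].append(xyz_path.stem)
    return dict(groups)
```

Calling group_conformers("xyz_files") would then return a dictionary with one entry per molecule folder, where molecules with a single geometry map to a one-element list.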

Training labels

The training labels (i.e. the wB97X-D3/def2-TZVP and GFN1-xTB energies, in units of Hartree) and the training and test/validation splits are provided in the file denali_labels.csv. All molecules are in singlet states.

The .csv file contains the following columns:

Column             Description
sample_id          A unique hash generated from the QM input, also corresponds to the .xyz filename of that geometry
subset             The data source for that geometry, please refer to the paper for a detailed description of the various subsets
mol_id             Identifier for the parent molecule
test_set           True if the geometry is part of the test/validation set of neutral molecules
test_set_plus      True if the geometry is part of the test/validation set of charged molecules
prelim_1           True if the geometry is part of the 10% OrbNet Denali training set
training_set_plus  True if the geometry is part of the full OrbNet Denali training set
charge             The charge of the molecule
dft_energy         wB97X-D3/def2-TZVP energy calculated with Qcore 0.8.17 in Hartree
xtb1_energy        GFN1-xTB energy calculated with Qcore 0.8.17 in Hartree

The .csv file can be loaded in Python, for example using pandas.
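
As an illustration, the sketch below loads the labels with pandas, selects the full training set and the combined test/validation sets, and computes DFT energies relative to the lowest-energy geometry of each molecule. It is only a sketch under stated assumptions: the function names and the dft_rel_kcal column are hypothetical, the split columns are assumed to contain only True/False so that pandas reads them as booleans, and the conversion factor is the commonly used value of about 627.5095 kcal/mol per Hartree.

```python
import pandas as pd

HARTREE_TO_KCAL_PER_MOL = 627.509474  # commonly used conversion factor


def load_labels(csv_source):
    """Read denali_labels.csv (a path or file-like object) into a DataFrame."""
    return pd.read_csv(csv_source)


def split_train_test(labels):
    """Return (full training set, neutral plus charged test/validation sets)."""
    # Assumes the split columns were parsed as booleans.
    train = labels[labels["training_set_plus"]]
    test = labels[labels["test_set"] | labels["test_set_plus"]]
    return train, test


def add_relative_dft_energy(labels):
    """Add DFT energies relative to the lowest-energy geometry per mol_id."""
    out = labels.copy()
    lowest = out.groupby("mol_id")["dft_energy"].transform("min")
    # Hypothetical column name, in kcal/mol.
    out["dft_rel_kcal"] = (out["dft_energy"] - lowest) * HARTREE_TO_KCAL_PER_MOL
    return out
```

A typical call would be labels = load_labels("denali_labels.csv") after extracting denali_labels.tar.gz. The 10% training subset can be selected in the same way through the prelim_1 column.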

Funding

DE-AC02-05CH11231

DE-AC05-00OR22725
