OrbNet Denali Training Data
<p>This repository contains the data for the paper "OrbNet Denali: A machine learning potential for biological and
organic chemistry with semi-empirical cost and DFT accuracy". The data set consists of geometries of molecules
and the corresponding energy labels calculated at the DFT and semi-empirical levels.</p>
<h2><a href="https://gist.github.com/andersx/9780f0f9d7be2f604872062cac4f8086#citation"></a>Citation</h2>
<p>Anders S. Christensen(1,a), Sai Krishna Sirumalla(1,a), Zhuoran
Qiao(2), Michael B. O'Connor(1), Daniel G. A. Smith(1), Feizhi Ding(1),
Peter J. Bygrave(1), Animashree Anandkumar(3,4), Matthew Welborn(1),
Frederick R. Manby(1), and Thomas F. Miller III(1,2)
"OrbNet Denali: A machine learning potential for biological and organic
chemistry with semi-empirical cost and DFT accuracy" (2021) https://arxiv.org/abs/2107.00299</p>
<p>a) Indicates equal contribution</p>
<ol><li>Entos, Inc., Los Angeles, CA 90027, USA</li><li>Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA 91125, USA</li><li>Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA 91125, USA</li><li>NVIDIA, Santa Clara, CA 95051, USA</li></ol>
<h2><a href="https://gist.github.com/andersx/9780f0f9d7be2f604872062cac4f8086#contents"></a>Contents</h2>
<p>The following files are included:</p>
<table>
<tr>
<th><strong>Filename</strong></th>
<th><strong>Description</strong></th>
<th><strong>MD5 checksum</strong></th>
</tr>
<tr>
<td><code>denali_labels.tar.gz</code></td>
<td>Archive containing the <code>.csv</code> file with energy labels and other metadata</td>
<td><code>bc9b612f75373d1d191ce7493eebfd62</code></td>
</tr>
<tr>
<td><code>denali_xyz_files.tar.gz</code></td>
<td>Archive with <code>.xyz</code> geometry files</td>
<td><code>edd35e95a018836d5f174a3431a751df</code></td>
</tr>
</table>
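<p>After downloading, the archives can be checked against the MD5 checksums in the table above. The following is a minimal sketch using only the Python standard library; the helper names <code>md5sum</code> and <code>verify_downloads</code> are illustrative and not part of this repository, and the archives are assumed to sit in the current directory.</p>

```python
import hashlib
from pathlib import Path

# Checksums copied from the table above.
EXPECTED_MD5 = {
    "denali_labels.tar.gz": "bc9b612f75373d1d191ce7493eebfd62",
    "denali_xyz_files.tar.gz": "edd35e95a018836d5f174a3431a751df",
}


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected filename to True/False (checksum match) or None (file missing)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5sum(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, status in verify_downloads().items():
        print(f"{name}: {labels[status]}")
```

<p>A <code>MISMATCH</code> usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.</p>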
<h2><a href="https://gist.github.com/andersx/9780f0f9d7be2f604872062cac4f8086#geometry-data"></a>Geometry data</h2>
<p>The geometries are stored in XYZ+ format, which is compatible with the standard <code>.xyz</code> format but additionally has the
charge and spin multiplicity annotated on the comment (second) line. The coordinates are in units of Ångström.</p>
<p>For example, a water molecule with a charge of 0 and a spin-multiplicity of 1 (i.e. singlet) can be specified in this format as:</p>
<pre><code>3
0 1
O -1.08201 1.07900 -0.02472
H -0.09268 1.08664 0.01745
H -1.37137 1.24781 0.90715
</code></pre>
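<p>A file in this format can be read with a few lines of Python. The sketch below is illustrative only: <code>parse_xyz_plus</code> is a hypothetical helper, not part of this repository, and it assumes that the comment line holds the charge first and the spin multiplicity second, as in the water example above.</p>

```python
def parse_xyz_plus(text):
    """Parse XYZ+ text into (charge, multiplicity, atoms).

    atoms is a list of (symbol, x, y, z) tuples with coordinates in Angstrom.
    Assumes the comment line reads "<charge> <multiplicity>".
    """
    lines = text.splitlines()
    n_atoms = int(lines[0].strip())

    # Second line: charge and spin multiplicity (the XYZ+ extension).
    fields = lines[1].split()
    charge, multiplicity = int(fields[0]), int(fields[1])

    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    if len(atoms) != n_atoms:
        raise ValueError(f"expected {n_atoms} atoms, found {len(atoms)}")
    return charge, multiplicity, atoms


WATER = """3
0 1
O -1.08201 1.07900 -0.02472
H -0.09268 1.08664 0.01745
H -1.37137 1.24781 0.90715
"""

if __name__ == "__main__":
    charge, multiplicity, atoms = parse_xyz_plus(WATER)
    print(charge, multiplicity, [symbol for symbol, *_ in atoms])
```

<p>Because the first two lines follow the standard <code>.xyz</code> layout, readers that ignore the comment line should also accept these files, although they will not pick up the charge or multiplicity.</p>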
<p>The directory structure of the geometry data contained within <code>denali_xyz_files.tar.gz</code> is as follows:</p>
<pre><code>xyz_files/
├── mol_id1/
│ ├──sample_id0.xyz
│ ├──sample_id1.xyz
│ ├──sample_id2.xyz
│ ├──sample_id3.xyz
│ └──sample_id4.xyz
├── mol_id2/
│ ├──sample_id0.xyz
│ ├──sample_id1.xyz
│ ├──sample_id2.xyz
│ └──sample_id3.xyz
├── ... etc
</code></pre>
<p>Each <code>mol_id</code> uniquely identifies a molecule, with the various conformer geometries for that molecule stored in the corresponding folder.
Those geometries are in turn identified by a unique <code>sample_id</code> identifier.
The grouping of geometries by <code>mol_id</code> is used in the OrbNet loss function; see Eq. 3 in the paper.
Note that not all molecules have multiple geometries.</p>
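<p>The folder layout shown above can be turned into a simple molecule-to-conformer index. The following sketch assumes the archive has been extracted so that each molecule folder sits directly under <code>xyz_files/</code> and each geometry file is named after its sample identifier; the function names are illustrative and not part of this repository.</p>

```python
from pathlib import Path


def index_conformers(root):
    """Return {molecule folder name: sorted list of geometry file stems}."""
    index = {}
    for mol_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        sample_ids = sorted(f.stem for f in mol_dir.glob("*.xyz"))
        if sample_ids:
            index[mol_dir.name] = sample_ids
    return index


def multi_conformer_molecules(index):
    """Return the molecule identifiers that have more than one geometry."""
    return [mol_id for mol_id, samples in index.items() if len(samples) > 1]


if __name__ == "__main__":
    root = Path("xyz_files")  # assumed extraction directory
    if root.is_dir():
        index = index_conformers(root)
        multi = multi_conformer_molecules(index)
        print(f"{len(index)} molecules, {len(multi)} with multiple geometries")
```

<p>Keeping the grouping explicit makes it straightforward to batch conformers of the same molecule together, which is what a loss term defined over groups of geometries would need.</p>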
<h2><a href="https://gist.github.com/andersx/9780f0f9d7be2f604872062cac4f8086#training-labels"></a>Training labels</h2>
<p>The training labels (i.e. the wB97X-D3/def2-TZVP and GFN1-xTB
energies, in units of Hartree) and the training and test/validation splits are provided in
the file <code>denali_labels.csv</code>. All molecules are singlet states.</p>
<p>The <code>.csv</code> file contains the following columns:</p>
<table>
<tr>
<th><strong>Column</strong></th>
<th><strong>Description</strong></th>
</tr>
<tr>
<td><code>sample_id</code></td>
<td>A unique hash generated from the QM input; it also corresponds to the <code>.xyz</code> filename of that geometry</td>
</tr>
<tr>
<td><code>subset</code></td>
<td>The data source for that geometry; please refer to the paper for a detailed description of the various subsets</td>
</tr>
<tr>
<td><code>mol_id</code></td>
<td>Identifier for the parent molecule</td>
</tr>
<tr>
<td><code>test_set</code></td>
<td>True if the geometry is part of the test/validation set of neutral molecules</td>
</tr>
<tr>
<td><code>test_set_plus</code></td>
<td>True if the geometry is part of the test/validation set of charged molecules</td>
</tr>
<tr>
<td><code>prelim_1</code></td>
<td>True if the geometry is part of the 10% OrbNet Denali training set</td>
</tr>
<tr>
<td><code>training_set_plus</code></td>
<td>True if the geometry is part of the full OrbNet Denali training set</td>
</tr>
<tr>
<td><code>charge</code></td>
<td>The charge of the molecule</td>
</tr>
<tr>
<td><code>dft_energy</code></td>
<td>wB97X-D3/def2-TZVP energy calculated with Qcore 0.8.17 in Hartree</td>
</tr>
<tr>
<td><code>xtb1_energy</code></td>
<td>GFN1-xTB energy calculated with Qcore 0.8.17 in Hartree</td>
</tr>
</table>
<p>The <code>.csv</code> file can be loaded in Python, for example using pandas.</p>
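<p>As a minimal sketch of loading the labels, the snippet below reads the file with <code>pandas.read_csv</code> and selects rows by one of the split columns from the table above. It assumes that <code>denali_labels.csv</code> has been extracted from the archive into the working directory and that the split columns hold values pandas can interpret as booleans; <code>load_labels</code>, <code>select_split</code>, and <code>conformers_per_molecule</code> are illustrative names, not part of this repository.</p>

```python
import pandas as pd

# Split columns described in the table above.
SPLIT_COLUMNS = ["test_set", "test_set_plus", "prelim_1", "training_set_plus"]


def load_labels(csv_path):
    """Read the label file (path or file-like object) into a DataFrame."""
    return pd.read_csv(csv_path)


def select_split(labels, column):
    """Return the rows for which the given split column is true."""
    if column not in SPLIT_COLUMNS:
        raise ValueError(f"unknown split column: {column}")
    return labels[labels[column].astype(bool)]


def conformers_per_molecule(labels):
    """Count the number of labelled geometries for each mol_id."""
    return labels.groupby("mol_id")["sample_id"].count()


if __name__ == "__main__":
    labels = load_labels("denali_labels.csv")  # assumed extraction path
    train = select_split(labels, "training_set_plus")
    print(f"{len(train)} training geometries")
    print(train[["dft_energy", "xtb1_energy"]].describe())  # energies in Hartree
```

<p>The <code>sample_id</code> and <code>mol_id</code> columns can then be used to locate the matching geometry file for each row in the extracted <code>.xyz</code> archive.</p>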