258 files

QM1B: one billion quantum mechanical simulations containing 9-11 heavy atoms

dataset
Posted on 2023-11-08, 17:38. Authored by Alexander Mathiasen, Hatem Helal, Kerstin Klaser, Paul Balanca, Josef Dean, Carlo Luschi, Dominique Beaini, Andrew Fitzgibbon, Dominic Masters

This is the permanent storage location for the dataset described in "Generating QM1B with PySCF IPU" and documented by the accompanying datasheet.

QM1B is a low-resolution DFT dataset generated using PySCF IPU. It is composed of one billion training examples containing 9-11 heavy atoms. It was created by taking 1.09M SMILES strings from the GDB-11 database and computing molecular properties (e.g. HOMO-LUMO gap) for a set of up to 1000 conformers per molecule.

We provide both the source code for PySCF IPU and dataset tools for using the QM1B dataset.

Dataset schema

See the QM1B datasheet for detailed documentation following the datasheets for datasets framework.

The QM1B dataset is stored in the open-source columnar Apache Parquet format, with the following schema:

  • smile: The SMILES string taken from GDB-11. There are up to 1000 rows (i.e. conformers) with the same SMILES string.
  • atoms: String representing the atom symbols of the molecule, e.g. "COOH".
  • z: Integer representation of atoms used by SchNet (the atomic numbers).
  • energy: The energy of the molecule computed by PySCF IPU (unit eV).
  • homo: The energy of the Highest Occupied Molecular Orbital (HOMO) (unit eV).
  • lumo: The energy of the Lowest Unoccupied Molecular Orbital (LUMO) (unit eV).
  • N: The number of atomic orbitals for the specific DFT computation (depends on the basis set, STO-3G).
  • std: The standard deviation of the energy over the last five iterations of running PySCF IPU, used as the convergence criterion std < 0.01 (unit eV).
  • y: The HOMO-LUMO Gap (unit eV).
  • pos: The atom positions (unit Bohr).
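As a rough illustration of how the columns above fit together, here is a minimal Python sketch that works on rows already loaded into plain dictionaries (in practice they would come from a Parquet reader such as `pandas.read_parquet`). The sample rows and the helper names are hypothetical and not taken from QM1B or its tooling. The sketch also assumes that `y` is stored as `lumo - homo` and that `pos` is a list of `[x, y, z]` triples per atom, which the schema does not spell out.

```python
from collections import defaultdict

# Bohr radius in angstrom (CODATA 2018); pos is documented as being in Bohr.
BOHR_TO_ANGSTROM = 0.529177210903

# Hypothetical rows shaped like the schema; values are made up for illustration.
rows = [
    {"smile": "CCO", "homo": -5.0, "lumo": 3.0, "y": 8.0, "std": 0.001,
     "pos": [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]},
    {"smile": "CCO", "homo": -4.0, "lumo": 2.0, "y": 6.0, "std": 0.5,
     "pos": [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]},
    {"smile": "CC", "homo": -6.0, "lumo": 4.0, "y": 10.0, "std": 0.002,
     "pos": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]},
]


def is_converged(row, threshold=0.01):
    """Apply the documented convergence criterion std < 0.01 (eV)."""
    return row["std"] < threshold


def gap_is_consistent(row, tol=1e-6):
    """Check the assumption y == lumo - homo, up to a small tolerance."""
    return abs((row["lumo"] - row["homo"]) - row["y"]) <= tol


def positions_in_angstrom(row):
    """Convert atom positions from Bohr to angstrom."""
    return [[c * BOHR_TO_ANGSTROM for c in atom] for atom in row["pos"]]


def group_by_smiles(rows):
    """Collect conformers (rows) that share the same SMILES string."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["smile"]].append(row)
    return dict(groups)


converged = [row for row in rows if is_converged(row)]
conformers = group_by_smiles(rows)
first_positions = positions_in_angstrom(rows[0])
```

Filtering on `std` before training or analysis drops conformers whose DFT run did not meet the stated criterion, and grouping on `smile` recovers the set of conformers for each molecule.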

Further examples for working with this dataset are available in the accompanying GitHub repository, where we welcome contributions.

Research Institution(s)

Graphcore Research

I confirm there is no human personally identifiable information in the files or description shared

  • Yes

I confirm the files and description shared may be publicly distributed under the license selected

  • Yes

Competing Interest Statement

The sole funder of this project was Graphcore.