Posted on 2021-05-04, 17:07. Authored by Lowik Chanussot, Abhishek Das, Siddharth Goyal, Thibaut Lavril, Muhammed Shuaibi, Morgane Riviere, Kevin Tran, Javier Heras-Domingo, Caleb Ho, Weihua Hu, Aini Palizhati, Anuroop Sriram, Brandon Wood, Junwoong Yoon, Devi Parikh, C. Lawrence Zitnick, Zachary Ulissi
Catalyst discovery and optimization is key to solving many societal
and energy challenges including solar fuel synthesis, long-term energy
storage, and renewable fertilizer production. Despite considerable
effort by the catalysis community to apply machine learning models
to the computational catalyst discovery process, it remains an open
challenge to build models that can generalize across both elemental
compositions of surfaces and adsorbate identity/configurations, perhaps
because datasets have been smaller in catalysis than in related fields.
To address this, we developed the OC20 dataset, consisting of 1,281,040
density functional theory (DFT) relaxations (∼264,890,000 single-point
evaluations) across a wide swath of materials, surfaces, and adsorbates
(nitrogen, carbon, and oxygen chemistries). We supplemented this dataset
with randomly perturbed structures, short timescale molecular dynamics,
and electronic structure analyses. The dataset comprises three central
tasks indicative of day-to-day catalyst modeling and comes with predefined
train/validation/test splits to facilitate direct comparisons with
future model development efforts. We applied three state-of-the-art
graph neural network models (CGCNN, SchNet, and DimeNet++) to each
of these tasks as baseline demonstrations for the community to build
on. In almost every task, no upper limit on model size was identified,
suggesting that even larger models are likely to improve on initial
results. The dataset and baseline models are both provided as open
resources, along with a public leaderboard, to encourage community
contributions to solve these important tasks.