MPtrj_2022.9_full.json (11.35 GB)

Materials Project Trajectory (MPtrj) Dataset

Version 2 2023-08-25, 22:20
Version 1 2023-07-20, 01:57
Posted on 2023-08-25, 22:20, authored by Bowen Deng

This data file is the MPtrj dataset.

The JSON file contains 1,580,395 structures, 1,580,395 energies, 7,944,833 magnetic moments, 49,295,660 forces, and 14,223,555 stresses that were used to train the pretrained CHGNet.

The structures and labels are parsed from all the GGA/GGA+U static/relaxation trajectories in the 2022.9 version of the Materials Project, using a selection method that avoids incompatible calculations and duplicated structures.

The format of the json file looks like this:




-'structure': dictionary of pymatgen.core.Structure

-'uncorrected_total_energy': [eV] raw energy from VASP output

-'corrected_total_energy': [eV] VASP total energy after MP2020 compatibility

-'energy_per_atom': [eV/atom] corrected energy per atom; this is the energy label used to train CHGNet

-'ef_per_atom': [eV/atom] formation energy per atom

-'e_per_atom_relaxed': [eV/atom] corrected energy per atom of the relaxed structure; this is the energy you can find for the mp-id on the Materials Project website

-'ef_per_atom_relaxed': [eV/atom] formation energy per atom of the relaxed structure

-'force': [eV/Å] force on the atoms

-'stress': [kBar] stress on the cell

-'magmom': [muB] magmom on the atoms

-'bandgap': [eV] bandgap
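A minimal sketch of reading the labels listed above with plain Python. It assumes the file is a nested mapping of material id to frame id to a dictionary with those keys; that nesting, the sample ids, and every numeric value below are illustrative assumptions, not data from MPtrj. The real file is 11.35 GB, so loading it in one call needs a correspondingly large amount of memory.

```python
import json

# Hypothetical miniature stand-in for MPtrj_2022.9_full.json (values are made up).
sample_text = json.dumps({
    "mp-0000": {
        "mp-0000-0-0": {
            "structure": {"placeholder": "pymatgen Structure dict goes here"},
            "uncorrected_total_energy": -10.0,
            "corrected_total_energy": -10.5,
            "energy_per_atom": -5.25,
            "force": [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1]],
            "stress": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
            "magmom": [0.5, None],
            "bandgap": 0.0,
        }
    }
})

data = json.loads(sample_text)


def iter_frames(dataset):
    """Yield (material_id, frame_id, labels) for every frame in the nested mapping."""
    for material_id, frames in dataset.items():
        for frame_id, labels in frames.items():
            yield material_id, frame_id, labels


# Collect the training energy label and the atom count (one force vector per atom).
rows = [
    (material_id, frame_id, labels["energy_per_atom"], len(labels["force"]))
    for material_id, frame_id, labels in iter_frames(data)
]
print(rows)
```

With pymatgen installed, `Structure.from_dict(labels["structure"])` should rebuild the structure object from a real entry; the placeholder dictionary used here is not a valid structure.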






1. The frame id has the syntax 'task_id-calc_id-ionic_step', where 'calc_id' is 0 (second) or 1 (first) in the double relaxation process for each Materials Project relaxation task.

2. Since MPtrj is a diverse dataset that contains both GGA and GGA+U calculations, which have different energy values, MP2020 compatibility is applied to the VASP raw energies to make GGA and GGA+U energies mutually compatible. The 'energy_per_atom' (which is after the MP2020 correction) is used for training the pretrained CHGNet.
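The frame id syntax in note 1 can be unpacked roughly as follows. This is a sketch that assumes the task id may itself contain a hyphen (as in an id of the form "mp-..."), so it splits from the right; the helper names and the sample id are hypothetical.

```python
from typing import NamedTuple


class FrameId(NamedTuple):
    task_id: str
    calc_id: int
    ionic_step: int


def parse_frame_id(frame_id: str) -> FrameId:
    """Split 'task_id-calc_id-ionic_step' on the last two hyphens only."""
    task_id, calc_id, ionic_step = frame_id.rsplit("-", 2)
    return FrameId(task_id, int(calc_id), int(ionic_step))


def relaxation_order(calc_id: int) -> str:
    """Map calc_id to its place in the double relaxation: 1 is first, 0 is second."""
    return {1: "first", 0: "second"}[calc_id]


example = parse_frame_id("mp-0000-1-7")  # made-up frame id
print(example, relaxation_order(example.calc_id))
```

Splitting from the right keeps the task id intact even when it contains hyphens, which a plain `split("-")` would not.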


3. Some MAGMOM labels are missing in MPtrj; these are stored as None. A None label does not mean the MAGMOM is 0. CHGNet is trained on the absolute value of the DFT magmom, i.e. the absolute value of the labels contained in MPtrj. The conversion is automatic if you use the dataset we provide, see:
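The following is not the CHGNet dataset code referred to above. It is a minimal standalone sketch of the rule in note 3, for readers who process the labels themselves: missing labels stay None and are never replaced by 0, and present labels are converted to their absolute value. Whether None appears for a whole structure or for individual atoms is not stated here, so the hypothetical helper handles both cases.

```python
from typing import List, Optional, Sequence


def abs_magmoms(
    magmoms: Optional[Sequence[Optional[float]]],
) -> Optional[List[Optional[float]]]:
    """Return absolute magnetic moments, preserving None for missing labels."""
    if magmoms is None:
        # The whole structure has no magmom label: keep it missing.
        return None
    return [None if m is None else abs(m) for m in magmoms]


labelled = abs_magmoms([-1.5, None, 0.2])  # made-up values in muB
unlabelled = abs_magmoms(None)
print(labelled, unlabelled)
```

Keeping None distinct from 0.0 lets a training loop mask unlabelled atoms out of the magmom loss instead of treating them as non-magnetic.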

4. The stress values in the MPtrj JSON are raw stress values from VASP. CHGNet outputs stress in units of GPa, which is -0.1 * the VASP raw stress in the MPtrj dataset. This unit conversion is also implemented in the CHGNet dataset, so you don't have to convert the VASP stress units before passing them to the dataset object.
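For readers who use the raw labels outside the CHGNet dataset object, the relation in note 4 can be written out as below. This is a sketch assuming the stress is stored as a nested list of numbers; the function name and the sample values are hypothetical. Do not apply it before passing data to the CHGNet dataset, which already performs the conversion.

```python
KBAR_TO_GPA = 0.1  # 1 kBar = 0.1 GPa


def vasp_stress_to_chgnet_gpa(stress_kbar):
    """Convert raw VASP stress [kBar] to the CHGNet convention [GPa]: -0.1 * raw."""
    return [[-KBAR_TO_GPA * value for value in row] for row in stress_kbar]


raw_stress = [  # made-up values in kBar
    [10.0, 0.0, 0.0],
    [0.0, -20.0, 0.0],
    [0.0, 0.0, 5.0],
]
stress_gpa = vasp_stress_to_chgnet_gpa(raw_stress)
print(stress_gpa)
```

The factor 0.1 is the kBar-to-GPa unit change, and the minus sign accounts for the opposite sign convention between the raw VASP values and the CHGNet output.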


If you use CHGNet or MPtrj dataset, please cite:


title={CHGNet as a pretrained universal neural network potential for charge-informed atomistic modelling},


journal={Nature Machine Intelligence},

author={Deng, Bowen and Zhong, Peichen and Jun, KyuJung and Riebesell, Janosh and Han, Kevin and Bartel, Christopher J. and Ceder, Gerbrand},






