figshare

Data for HEAD_TED

Version 4 2025-05-02, 06:48
Version 3 2025-04-08, 07:37
Version 2 2025-01-15, 10:33
Version 1 2024-11-18, 11:52
dataset
posted on 2025-05-02, 06:48 authored by Fan Fan, Bin Xi, Xianghu Meng, Han Wang, Bowen Zhang, Qingbo Xu, Wei Feng, Xiaoman Wang, Hongbo Zhang, Feng Zhou, Zhenming Liu, Wenbiao Zhou, Bo Huang
# HEAD_TED Data Collections
## Data Inventory
This collection comprises a set of files generated and utilized in our study. The following archives and tables are included:
- [x] Raw_conformations.tar.gz
- [x] Mininplace_conformations.tar.gz
- [x] All_results_extended_tables.tar.gz
- [x] PoseBusters_detailed_results.tar.gz
- [x] LigBoundConf_TED_PoseCheck.csv
- [x] GM-5K_mmff94_min.sdf.tar.gz
- [x] GM-5K_raw.sdf.tar.gz
- [x] GM-5K_collection.csv
- [x] GM-1K_mmff94_min.sdf.tar.gz
- [x] GM-1K_raw.sdf.tar.gz
- [x] GM-1K_collection.csv
- [x] DFT-5K.sdf.tar.gz
- [x] DFT-5K.csv

## Data File Descriptions
### Pre- and Post-Optimization Conformations
The `Raw_conformations.tar.gz` archive contains the initial, unoptimized molecular conformations generated by five generative models, in `.sdf` format. The `Mininplace_conformations.tar.gz` archive contains the same set of molecules after in situ refinement with the OPLS3e force field, with the protein pocket held fixed. Each archive contains five independent `.sdf` files, named after the generative model that produced them. Within each `.sdf` file, individual molecular entries are distinguished by unique identifiers in the molecule headers.
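
As a minimal sketch of how the molecule identifiers could be pulled out of these archives, the snippet below walks the `.sdf` members of a `.tar.gz` file and reads the header (first) line of every molecule block, using only the Python standard library. The function names and the demo records are hypothetical, and the sketch assumes the archives are plain gzip-compressed tarballs of UTF-8 text files.

```python
import io
import tarfile


def iter_sdf_blocks(lines):
    """Yield (header, block_lines) for each molecule block in an SDF stream.

    Blocks are terminated by a line containing only '$$$$'; the first line
    of a block is its header (title) line.
    """
    block = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.strip() == "$$$$":
            if block:
                yield block[0].strip(), block
            block = []
        else:
            block.append(line)
    # Tolerate a final block that lacks the '$$$$' terminator.
    if any(item.strip() for item in block):
        yield block[0].strip(), block


def iter_archive_ids(archive_path):
    """Yield (member_name, mol_id) for every .sdf member of a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".sdf")):
                continue
            handle = io.TextIOWrapper(tar.extractfile(member), encoding="utf-8")
            for mol_id, _ in iter_sdf_blocks(handle):
                yield member.name, mol_id


# Hypothetical two-molecule SDF text, used only to exercise the parser.
demo_sdf = (
    "mol_A\n  demo\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
    "mol_B\n  demo\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
)
demo_ids = [mol_id for mol_id, _ in iter_sdf_blocks(demo_sdf.splitlines())]
```

Calling `iter_archive_ids("Raw_conformations.tar.gz")` on a downloaded copy should then give identifiers that can be matched against the same identifiers in the minimized archive.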

### Extended Results Tables
The `All_results_extended_tables.tar.gz` archive encompasses five independent `.csv` files, each named according to the generative model used. These tables provide a comprehensive overview of the generated molecules, with the following columns:

- **mol_id**: Unique identifier extracted from the header of each molecule block in the respective `.sdf` file.
- **smiles**: Simplified Molecular Input Line Entry System (SMILES) representation of the molecule.
- **model**: The generative model responsible for producing the molecule.
- **pdb_id**: The Protein Data Bank (PDB) identifier of the protein target pocket from which the binding site information was derived.
- **qed**: Quantitative Estimate of Drug-likeness (QED) score, calculated using RDKit.
- **sas**: Synthetic Accessibility (SA) score, calculated using RDKit.
- **mininplace_failure**: A binary indicator of whether local optimization within the binding pocket failed:
  - `0`: Optimization was successful.
  - `1`: Optimization failed.
- **raw_PoseBusters_ligand_protein_interaction_invalidity**: A binary indicator assessing the validity of ligand-protein interactions in the unoptimized conformation, as determined by PoseBusters:
  - `0`: Valid interaction geometry.
  - `1`: Invalid interaction geometry.
- **raw_PoseBusters_ligand_conformation_invalidity**: A binary indicator assessing the validity of the ligand's internal conformation in the unoptimized state, as determined by PoseBusters:
  - `0`: Valid ligand conformation.
  - `1`: Invalid ligand conformation.
- **min_PoseBusters_ligand_protein_interaction_invalidity**: A binary indicator assessing the validity of ligand-protein interactions in the minimized conformation, as determined by PoseBusters:
  - `0`: Valid interaction geometry.
  - `1`: Invalid interaction geometry.
- **min_PoseBusters_ligand_conformation_invalidity**: A binary indicator assessing the validity of the ligand's internal conformation in the minimized state, as determined by PoseBusters:
  - `0`: Valid ligand conformation.
  - `1`: Invalid ligand conformation.
- **HEAD_ligand_protein_interaction_invalidity**: An indicator assessing the validity of ligand-protein interactions in the unoptimized conformation, as determined by HEAD:
  - `0`: Valid interaction geometry.
  - `1`: Invalid interaction geometry.
  - `-1`: Unsupported conformation. It may contain elements outside {H, C, N, O, F, S, Cl}, or an unexpected error may have occurred during loading.
- **HEAD_ligand_conformation_invalidity**: An indicator assessing the validity of the ligand's internal conformation in the unoptimized state, as determined by HEAD:
  - `0`: Valid ligand conformation.
  - `1`: Invalid ligand conformation.
  - `-1`: Unsupported conformation. It may contain elements outside {H, C, N, O, F, S, Cl}, or an unexpected error may have occurred during loading.
- **TED_torsion_energy_irrationality**: An indicator of the plausibility of the predicted torsion energy, as determined by TED:
  - `0`: Torsion energy is considered rational.
  - `1`: Torsion energy is considered irrational.
  - `-1`: The conformation is unsupported by the TED model.

The original, detailed output files from the PoseBusters analysis are provided in the compressed archive `PoseBusters_detailed_results.tar.gz`. This archive contains 10 independent `.csv` files covering all analyzed conformations, both the raw and the in situ optimized structures.
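
The indicator columns above can be combined to select molecules that pass several checks at once. The following is a minimal sketch using the standard `csv` module; the helper names and demo rows are hypothetical, and it assumes every indicator cell holds `0`, `1`, or `-1` as text. Treating `-1` as "not passed" is a choice made here for illustration, not something the dataset prescribes.

```python
import csv
import io


def flag(row, column):
    """Read an indicator cell as an integer (accepts '0', '1', '-1', '0.0')."""
    return int(float(row[column]))


def passes_all(row, columns):
    """True when every listed indicator is 0 (valid / rational)."""
    return all(flag(row, column) == 0 for column in columns)


def is_unsupported(row, columns):
    """True when any listed indicator is -1 (unsupported conformation)."""
    return any(flag(row, column) == -1 for column in columns)


# Hypothetical rows using two of the documented column names.
demo_csv = io.StringIO(
    "mol_id,HEAD_ligand_conformation_invalidity,TED_torsion_energy_irrationality\n"
    "m1,0,0\n"
    "m2,1,0\n"
    "m3,0,-1\n"
)
columns = [
    "HEAD_ligand_conformation_invalidity",
    "TED_torsion_energy_irrationality",
]
rows = list(csv.DictReader(demo_csv))
kept = [row["mol_id"] for row in rows if passes_all(row, columns)]
unsupported = [row["mol_id"] for row in rows if is_unsupported(row, columns)]
```

Reporting the unsupported rows separately, rather than silently dropping them, keeps failure rates comparable across models.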

### LigBoundConf Extension Data

The `LigBoundConf_TED_PoseCheck.csv` file contains data used for the comparative analysis of PoseCheck and TED on a set of optimized ligand-bound conformations. The columns are defined as follows:


- **mol_id**: The ligand identifier as used in the original LigBoundConf publication.

- **LCSE**: The Local Conformational Strain Energy value (in kcal/mol) as reported in the original LigBoundConf publication.

- **TED_torsion_energy_irrationality**: As described in the "Extended Results Tables" section.

- **PoseCheck_strain_energy**: The strain energy calculated by PoseCheck (unit not stated; presumably kcal/mol).


### GM-5K and GM-1K Subsets

The GM-5K and GM-1K datasets represent subsets of 5,000 and 1,000 molecules, respectively, extracted from the larger collection. In addition to the original conformations, this release includes two new compressed `.sdf` archives containing conformations optimized using the MMFF94 force field: `GM-5K_mmff94_min.sdf.tar.gz` and `GM-1K_mmff94_min.sdf.tar.gz`. The molecule headers within these optimized `.sdf` files retain the identifiers corresponding to the raw conformations in the original datasets, facilitating direct comparison.


Detailed DFT single-point energies (for both the original and MMFF94-optimized conformations), along with HEAD analysis results, are provided in the `.csv` files `GM-1K_collection.csv` and `GM-5K_collection.csv`.


In addition to the columns described above, the `GM-1K_collection.csv` file contains the following columns:


- **DFT_single_point_energy_raw**: Single-point energy (in kcal/mol) of the unoptimized raw conformation calculated using Density Functional Theory (DFT).

- **DFT_single_point_energy_min**: Single-point energy (in kcal/mol) of the MMFF94-optimized conformation calculated using Density Functional Theory (DFT).

- **dE**: The energy difference ($\Delta E = E_{raw} - E_{opt}$) in kcal/mol, representing the change in energy upon MMFF94 optimization.

- **HEAD_invalid_atoms**: A list of the atoms flagged as invalid by HEAD, each with its associated high-energy response value (in kcal/mol). If no invalid atoms are detected, this field contains `None`. For example, `[(2, 'C', 40.962)]` indicates that carbon atom number 2 (indexed from 1) was flagged with a high-energy response of 40.962 kcal/mol. Note that this energy value is a relative indicator and may not represent the absolute energy.

- **information_entropy_label**: A binary indicator where 1 signifies an invalid conformation detected solely by the information entropy approach; otherwise, 0.
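
Because `HEAD_invalid_atoms` is stored as text, it has to be parsed before use. The sketch below assumes each cell is either `None` (or empty) or a Python-style list literal like the example above; `parse_invalid_atoms` and `to_zero_based` are hypothetical helper names, not part of the dataset.

```python
import ast


def parse_invalid_atoms(cell):
    """Parse a HEAD_invalid_atoms cell into a list of (index, element, energy).

    Returns an empty list when the cell is empty or contains 'None'.
    """
    text = (cell or "").strip()
    if text in ("", "None"):
        return []
    # literal_eval only evaluates Python literals, so it is safe on file input.
    parsed = ast.literal_eval(text)
    return [(int(idx), str(elem), float(energy)) for idx, elem, energy in parsed]


def to_zero_based(atoms):
    """Convert the documented 1-based atom numbers to 0-based indices."""
    return [(idx - 1, elem, energy) for idx, elem, energy in atoms]


example = parse_invalid_atoms("[(2, 'C', 40.962)]")
```

The 0-based conversion is useful when the indices are passed to toolkits that number atoms from 0.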


The `GM-5K_collection.csv` file includes all the columns mentioned above for `GM-1K_collection.csv`, as well as the following additional columns:


- **MM/GBSA**: The binding energy (in kcal/mol) between each molecule and its target protein, calculated with the Molecular Mechanics/Generalized Born Surface Area (MM/GBSA) method. The Amber ff14SB force field is used for the protein and the General Amber Force Field 2 (GAFF2) for organic molecules.

- **HEAD_E_bind**: The binding energy (in kcal/mol) between each ligand and its target protein, calculated with the HEAD approach: $E_{bind} = E^{bound}_{complex} - E^{isolated}_{protein} - E^{isolated}_{ligand}$.

- **PoseCheck_num_clashes**: Number of steric clashes detected in the binding pose by PoseCheck.

- **SMINA_docking_score**: The docking score of the input binding pose computed by SMINA with the `score_only` option.

- **DrugPose_score**: The similarity score calculated by the DrugPose approach, where a higher value indicates a binding pose more similar to that of the reference ligand.

- **DrugPose_ligand_protein_interaction_invalidity**: An indicator assessing the validity of the protein-ligand interaction, determined solely by DrugPose by comparing the DrugPose_score with a threshold of 50:
  - `0`: Valid interaction geometry.
  - `1`: Invalid interaction geometry.
  - `-1`: Not supported.


### DFT-5K

The conformations of the DFT-5K dataset are provided in the `DFT-5K.sdf.tar.gz` archive. All molecules in this file have been optimized with DFT under constraints on specific dihedrals. Their single-point energies were then recalculated with a higher-level DFT method than the one used for optimization.


Each torsion fragment in DFT-5K is represented by 24 conformations, grouped under the same name in the header, corresponding to different dihedral angle values ranging from -180° to 180° in 15° increments (-180° is equivalent to 180°). Each molecule block in the `.sdf` file includes the following properties:

- **TORSION_ATOMS**: Indices of the four atoms defining the dihedral, starting from 0.

- **DIHEDRAL_ANGLE**: The value (in degrees) of the dihedral being investigated.

- **DFT_SINGLE_POINT_ENERGY**: Single-point energy (in kcal/mol) calculated using the DFT method.
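
These properties are stored as standard SDF data items (a `> <TAG>` line followed by the value and a blank line). The sketch below reads them from one molecule block without any cheminformatics dependency. The function name and the demo block, including all of its values, are hypothetical. Values are returned as raw strings because the exact text layout of `TORSION_ATOMS` is not specified here.

```python
import re

# Matches data-item header lines such as '>  <DIHEDRAL_ANGLE>'.
_TAG = re.compile(r"^>.*<([^<>]+)>")


def read_sdf_properties(block_lines):
    """Return {tag: value_text} for the data items of one SDF molecule block."""
    props = {}
    tag = None
    values = []
    for line in block_lines:
        match = _TAG.match(line)
        if match:
            tag, values = match.group(1), []
        elif tag is not None:
            if line.strip() == "":
                # A blank line closes the current data item.
                props[tag] = "\n".join(values)
                tag = None
            else:
                values.append(line.rstrip())
    if tag is not None:  # last item not followed by a blank line
        props[tag] = "\n".join(values)
    return props


# Hypothetical block with made-up values, used only to exercise the parser.
demo_block = [
    "fragment_0001",
    "  demo",
    "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    ">  <TORSION_ATOMS>",
    "0 1 2 3",
    "",
    ">  <DIHEDRAL_ANGLE>",
    "-180.0",
    "",
    ">  <DFT_SINGLE_POINT_ENERGY>",
    "-1234.5",
    "",
]
props = read_sdf_properties(demo_block)
```

Grouping blocks by their header name and sorting by `float(props["DIHEDRAL_ANGLE"])` would then reconstruct the 24-point profile of each torsion fragment.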


In addition to the `.sdf` file, the dataset includes a `DFT-5K.csv` file containing the following columns:

- **DFT_id**: The name in the header of each molecule block in the `.sdf` file. Conformations of the same torsion fragment at different dihedral angles share the same DFT_id.

- **xTB_dih_relative_energies**: A string of relative energies, joined by `--`, for the conformations optimized with constraints and evaluated with GFN2-xTB. The order of the energies corresponds to dihedral angles from -180° to 180° in 15° increments (24 values in total). **Note**: These are relative energies, obtained by subtracting the minimum energy among the 24 conformations, and therefore differ from the single-point energies listed in previous sections.

- **model_dih_relative_energies**: Same format as above, but with relative energies predicted by TED-Model.
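
The `--`-joined strings can be expanded into (angle, energy) pairs as sketched below. The function name and demo values are hypothetical. The sketch assumes the 24 values start at -180° and end at 165°, with the 180° endpoint omitted because it is equivalent to -180°. That reading is consistent with the description above but is not stated explicitly.

```python
def parse_profile(cell, start=-180.0, step=15.0, expected=24):
    """Split a '--'-joined energy string into (dihedral_angle, energy) pairs.

    Assumes the first value belongs to `start` degrees and each following
    value is `step` degrees further along the scan.
    """
    energies = [float(item) for item in cell.split("--")]
    if len(energies) != expected:
        raise ValueError(f"expected {expected} energies, got {len(energies)}")
    return [(start + i * step, energy) for i, energy in enumerate(energies)]


# Hypothetical profile: 24 made-up relative energies, 0.0 through 2.3.
demo_cell = "--".join(f"{0.1 * i:.1f}" for i in range(24))
profile = parse_profile(demo_cell)
# Angle at which the relative energy is lowest.
min_angle = min(profile, key=lambda pair: pair[1])[0]
```

Parsing both `xTB_dih_relative_energies` and `model_dih_relative_energies` this way gives two profiles on the same angle grid, which can be compared point by point.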



