BGC-Prophet

Software posted on 2025-04-01, 07:34, authored by Haohong Zhang, Yuguo Zha, Shuai Luo, Qilong Lai, Haobo Zhang, Ying Ye, Yonghui Zhang, Hong Bai, Kang Ning


BGC-Prophet is a deep learning approach that uses a language processing neural network model to identify known biosynthetic gene clusters (BGCs) and to extrapolate novel ones. For more information, visit the GitHub repository (https://github.com/HUST-NingKang-Lab/BGC-Prophet).

Installation


BGC-Prophet can be installed using pip:

pip install bgc_prophet

Alternatively, an offline installation package is available on the GitHub release page:

pip install bgc_prophet-0.1.0-py3-none-any.whl

BGC-Prophet is developed in Python 3 and uses PyTorch for model training and inference. GPU devices are recommended for acceleration.
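Because a GPU is recommended rather than required, a wrapper script can decide at run time which value to pass to the --device option used by the predict and classify commands. The following is a minimal sketch, not part of BGC-Prophet: pick_device is a hypothetical helper, and it assumes that "cpu" is an accepted value for --device, which this page does not confirm.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Assumption: "cpu" is a valid value for --device (only "cuda" appears on this page).
device = pick_device()
print(f"--device {device}")
```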


Usage


BGC-Prophet Pipeline


BGC-Prophet can detect and classify BGCs in genomic sequences through the following steps:

1. Extracting word embeddings for genes using the ESM2 model.

2. Organizing genomes and splitting each one into sequences of 128 genes.

3. Identifying BGC genes using a trained detection model.

4. Classifying detected BGCs and saving results in a CSV file.


Example command for running the pipeline:

bgc_prophet pipeline --genomesDir ./pathtogenomesdirectory/ --modelPath ./pathto/annotator.pt --saveIntermediate --name nameoftask --threshold 0.5 --max_gap 3 --min_count 2 --classifierPath ./pathto/classifier.pt --classify_t 0.5

Use bgc_prophet pipeline --help for detailed parameter explanations.
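When the pipeline is launched from a script, assembling the command as an argument list avoids quoting mistakes. The sketch below uses only the flags shown in the example command above. The function name build_pipeline_command, its parameter names, and the sample paths are hypothetical, and the defaults simply mirror the example values rather than any documented defaults.

```python
import shlex


def build_pipeline_command(genomes_dir, model_path, classifier_path, name,
                           threshold=0.5, max_gap=3, min_count=2,
                           classify_t=0.5, save_intermediate=True):
    """Build the argv list for `bgc_prophet pipeline` without running it."""
    argv = [
        "bgc_prophet", "pipeline",
        "--genomesDir", genomes_dir,
        "--modelPath", model_path,
        "--name", name,
        "--threshold", str(threshold),
        "--max_gap", str(max_gap),
        "--min_count", str(min_count),
        "--classifierPath", classifier_path,
        "--classify_t", str(classify_t),
    ]
    if save_intermediate:
        argv.append("--saveIntermediate")
    return argv


# Hypothetical paths, for illustration only.
argv = build_pipeline_command("./genomes/", "./model/annotator.pt",
                              "./model/classifier.pt", "demo")
print(shlex.join(argv))
# To execute the command: subprocess.run(argv, check=True)
```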


Model Download


Trained models can be downloaded from the GitHub release page:

wget https://github.com/HUST-NingKang-Lab/BGC-Prophet/files/12733164/model.tar.gz

• annotator.pt: Detects BGC genes

• classifier.pt: Classifies BGCs
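Before running anything, it can be useful to confirm that the downloaded archive really contains both model files. The sketch below looks them up by file name. The internal layout of model.tar.gz is not described here, so the helper (locate_models, a hypothetical name) searches every directory level instead of assuming a fixed path.

```python
import os
import tarfile

EXPECTED = ("annotator.pt", "classifier.pt")


def locate_models(archive_path):
    """Map each expected model file name to its path inside the archive."""
    found = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            base = os.path.basename(member.name)
            if member.isfile() and base in EXPECTED:
                found[base] = member.name
    missing = [name for name in EXPECTED if name not in found]
    if missing:
        raise FileNotFoundError(f"not found in archive: {missing}")
    return found


# Only inspect the archive if it has already been downloaded.
if os.path.exists("model.tar.gz"):
    print(locate_models("model.tar.gz"))
```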


Step-by-Step Operations


1. Get Embeddings

Extract gene embeddings using the ESM2-8M model:

bgc_prophet extract esm2_t6_8M_UR50D ./genome.fasta ./lmdb_genomes --toks_per_batch 40960 --include mean
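The --include mean option asks for one vector per protein, obtained by averaging the per-residue representations produced by the language model. The toy sketch below shows that pooling step on plain Python lists. It illustrates the idea only and is not the extraction code itself; real esm2_t6_8M_UR50D vectors have 320 dimensions rather than 2.

```python
def mean_pool(token_embeddings):
    """Average per-residue vectors into one fixed-size vector."""
    if not token_embeddings:
        raise ValueError("need at least one residue embedding")
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]


# Three residues, each with a 2-dimensional toy embedding.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(toy)
print(pooled)  # [3.0, 4.0]
```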

2. Organize Genomes

Prepare genomes for processing:

bgc_prophet organize --genomesDir ./genomesFastaDirectory/ --outputPath ./output/ --name organize --threads 10

3. Split Sequences

Split genome sequences into gene segments:

bgc_prophet split --genomesPath ./output/organize.csv --outputPath ./output/ --name split --threads 10
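Conceptually, the split step cuts each genome's ordered gene list into fixed-size windows (length 128, as stated in the pipeline overview). The sketch below assumes non-overlapping windows with a shorter final window. Whether the tool itself overlaps or pads windows is not stated here, and split_into_windows is a hypothetical name.

```python
def split_into_windows(gene_ids, window=128):
    """Cut an ordered gene list into consecutive, non-overlapping windows."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [gene_ids[i:i + window] for i in range(0, len(gene_ids), window)]


# A toy genome with 300 genes yields two full windows and one partial one.
genes = [f"gene_{i}" for i in range(300)]
windows = split_into_windows(genes)
print([len(w) for w in windows])  # [128, 128, 44]
```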

4. Gene Prediction

Detect BGC genes using a trained model:

bgc_prophet predict --datasetPath ./output/split.csv --modelPath ./annotator.pt --outputPath ./output/ --lmdbPath ./lmdb_genomes --name prediction --device cuda --saveIntermediate

5. Output Processing

Merge genes into BGCs and filter results:

bgc_prophet output --datasetPath ./output/split.csv --outputPath ./output/ --loadIntermediate ./output/intermediate_prediction.npy --name output --threshold 0.5 --max_gap 3 --min_count 2
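One plausible reading of the three filtering parameters is sketched below. It assumes that a gene counts as a BGC gene when its score is at least --threshold, that flagged genes separated by at most --max_gap unflagged genes are merged into one cluster, and that clusters with fewer than --min_count flagged genes are dropped. These semantics are assumptions inferred from the parameter names, and merge_genes is a hypothetical function; bgc_prophet output --help is the authoritative description.

```python
def merge_genes(probs, threshold=0.5, max_gap=3, min_count=2):
    """Group flagged genes into (start, end) index ranges, inclusive."""
    clusters = []
    current = []  # indices of flagged genes in the cluster being built
    for i, p in enumerate(probs):
        if p < threshold:
            continue  # gene not flagged as part of a BGC
        # Close the current cluster if too many unflagged genes lie in between.
        if current and i - current[-1] - 1 > max_gap:
            clusters.append(current)
            current = []
        current.append(i)
    if current:
        clusters.append(current)
    return [(c[0], c[-1]) for c in clusters if len(c) >= min_count]


# Toy per-gene scores: two groups separated by four low-scoring genes.
probs = [0.9, 0.8, 0.1, 0.7, 0.1, 0.1, 0.1, 0.1, 0.6, 0.2, 0.9, 0.95]
print(merge_genes(probs))  # [(0, 3), (8, 11)]
```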

6. BGC Classification

Apply a trained classifier to detected BGCs:

bgc_prophet classify --datasetPath ./output/output.csv --classifierPath ./pathto/classifier.pt --outputPath ./output/ --lmdbPath ./lmdb_genomes --name classify --device cuda

Final results are saved as a CSV file containing detection and classification outputs.
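The resulting CSV can be summarized with the Python standard library. The column layout of the file is not documented here, so the sketch below takes the column name as an argument and fails clearly if it is absent. The name count_by_column and the "class" column and file path in the usage comment are hypothetical.

```python
import csv
from collections import Counter


def count_by_column(csv_path, column):
    """Count how many rows carry each distinct value of one CSV column."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"{column!r} not in header {reader.fieldnames}")
        return Counter(row[column] for row in reader)


# Hypothetical usage; check the real header of your result file first:
# print(count_by_column("./output/classify.csv", "class"))
```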


Publication


Deciphering the Biosynthetic Potential of Microbial Genomes Using a BGC Language Processing Neural Network Model. bioRxiv.


Maintainers

Qilong Lai (Institute of Neuroscience, Chinese Academy of Sciences) – laiqilong@hust.edu.cn

Shuai Yao (Academy for Advanced Interdisciplinary Studies, Peking University) – yaoshuai@stu.pku.edu.cn

Yuguo Zha (School of Life Science and Technology, Huazhong University of Science & Technology) – hugozha@hust.edu.cn

Kang Ning (School of Life Science and Technology, Huazhong University of Science & Technology) – ningkang@hust.edu.cn

Funding

National Key R&D Program of China (Grant Nos. 2021YFA0910500, SQ2023YFA1800082, and 2018YFC0910502)

National Natural Science Foundation of China (Grant Nos. 32071465, 31871334, and 31671374)
