BGC-Prophet

Posted on 2025-04-01, 07:34. Authored by Haohong Zhang, Yuguo Zha, Shuai Yao, Qilong Lai, Haobo Zhang, Ying Ye, Yonghui Zhang, Hong Bai, Kang Ning

BGC-Prophet is a deep learning approach that uses a language processing neural network model to identify known biosynthetic gene clusters (BGCs) and to extrapolate novel ones. For more information, visit the GitHub repository (https://github.com/HUST-NingKang-Lab/BGC-Prophet).

Installation

BGC-Prophet can be installed using pip:

pip install bgc_prophet

Alternatively, an offline installation package is available on the GitHub release page:

pip install bgc_prophet-0.1.0-py3-none-any.whl

BGC-Prophet is developed in Python 3 and uses PyTorch for model training and inference. GPU devices are recommended for acceleration.
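
Since a GPU is recommended but may not be present, it can help to check what PyTorch sees before launching a long run. The helper below is a minimal sketch, not part of BGC-Prophet: pick_device is a hypothetical name, and it relies only on PyTorch's torch.cuda.is_available(). It prints "cuda" when a GPU is visible and "cpu" otherwise; whether the BGC-Prophet commands accept "cpu" as a --device value is an assumption to confirm with --help.

```shell
# pick_device: print "cuda" if PyTorch reports a usable GPU, otherwise "cpu".
# Hypothetical helper; also falls back to "cpu" if python3 or torch is missing.
pick_device() {
  if python3 -c 'import sys, torch; sys.exit(0 if torch.cuda.is_available() else 1)' 2>/dev/null; then
    echo cuda
  else
    echo cpu
  fi
}
```

The result could then be passed to a later step, for example as --device "$(pick_device)".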

Usage

BGC-Prophet Pipeline

BGC-Prophet can detect and classify BGCs in genomic sequences through the following steps:

1. Extracting word embeddings for genes using the ESM2 model.

2. Organizing genomes and splitting them into gene sequences of length 128.

3. Identifying BGC genes using a trained detection model.

4. Classifying detected BGCs and saving results in a CSV file.

Example command for running the pipeline:

bgc_prophet pipeline --genomesDir ./pathtogenomesdirectory/ --modelPath ./pathto/annotator.pt --saveIntermediate --name nameoftask --threshold 0.5 --max_gap 3 --min_count 2 --classifierPath ./pathto/classifier.pt --classify_t 0.5

Use bgc_prophet pipeline --help for detailed parameter explanations.

Model Download

Trained models can be downloaded from the GitHub release page:

wget https://github.com/HUST-NingKang-Lab/BGC-Prophet/files/12733164/model.tar.gz

• annotator.pt: Detects BGC genes

• classifier.pt: Classifies BGCs
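
After downloading, the archive has to be unpacked before the two model files can be passed to --modelPath and --classifierPath. The sketch below shows one way to confirm that both files are in place. check_models is a hypothetical helper, and the directory that the archive unpacks into is an assumption, so adjust the path to whatever tar actually produces.

```shell
# Unpack the downloaded archive first (run manually):
#   tar -xzf model.tar.gz

# check_models DIR: succeed only if both model files listed above exist in DIR.
# Hypothetical helper; reports each missing file on stderr.
check_models() {
  local dir="$1"
  local missing=0
  for f in annotator.pt classifier.pt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_models ./model would return a non-zero status if either file is absent (./model is a placeholder directory name).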

Step-by-Step Operations

1. Get Embeddings

Extract gene embeddings using the ESM2-8M model:

bgc_prophet extract esm2_t6_8M_UR50D ./genome.fasta ./lmdb_genomes --toks_per_batch 40960 --include mean

2. Organize Genomes

Prepare genomes for processing:

bgc_prophet organize --genomesDir ./genomesFastaDirectory/ --outputPath ./output/ --name organize --threads 10

3. Split Sequences

Split genome sequences into gene segments of length 128:

bgc_prophet split --genomesPath ./output/organize.csv --outputPath ./output/ --name split --threads 10

4. Gene Prediction

Detect BGC genes using a trained model:

bgc_prophet predict --datasetPath ./output/split.csv --modelPath ./annotator.pt --outputPath ./output/ --lmdbPath ./lmdb_genomes --name prediction --device cuda --saveIntermediate

5. Output Processing

Merge genes into BGCs and filter results:

bgc_prophet output --datasetPath ./output/split.csv --outputPath ./output/ --loadIntermediate ./output/intermediate_prediction.npy --name output --threshold 0.5 --max_gap 3 --min_count 2
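
Because the prediction step saves its intermediate scores, the output step can presumably be repeated with different thresholds without rerunning prediction. The loop below is a sketch under that assumption. sweep_thresholds and the output_t... naming scheme are illustrative choices, and only flags shown in the command above are used.

```shell
# sweep_thresholds T1 [T2 ...]: rerun the output step once per threshold,
# reusing the saved intermediate predictions. Stops at the first failure.
sweep_thresholds() {
  local t
  for t in "$@"; do
    bgc_prophet output \
      --datasetPath ./output/split.csv \
      --outputPath ./output/ \
      --loadIntermediate ./output/intermediate_prediction.npy \
      --name "output_t${t}" \
      --threshold "$t" --max_gap 3 --min_count 2 || return 1
  done
}
```

Calling sweep_thresholds 0.3 0.5 0.7 would run the output step three times, each under a distinct --name so that results are not overwritten.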

6. BGC Classification

Apply a trained classifier to detected BGCs:

bgc_prophet classify --datasetPath ./output.csv --classifierPath ./pathto/classifier.pt --outputPath ./output/ --lmdbPath ./lmdb_genomes --name classify --device cuda

Final results are saved as a CSV file containing detection and classification outputs.
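
The detection steps above (organize, split, predict, output) can be chained in a small wrapper so that a failure in one step stops the rest. This is a sketch rather than an official script: run_detection_steps is a hypothetical name, the file names follow the example commands above (it is assumed that each step writes files named after its --name into --outputPath), and the embeddings from step 1 are assumed to exist already in the LMDB directory. Classification is left out because its input path depends on the local layout.

```shell
# run_detection_steps GENOMES_DIR ANNOTATOR_PT LMDB_DIR
# Runs organize -> split -> predict -> output with the flags documented above,
# stopping at the first step that fails.
run_detection_steps() {
  local genomes="$1" model="$2" lmdb="$3"
  local out=./output/   # same output directory as in the commands above

  bgc_prophet organize --genomesDir "$genomes" --outputPath "$out" \
    --name organize --threads 10 &&
  bgc_prophet split --genomesPath "${out}organize.csv" --outputPath "$out" \
    --name split --threads 10 &&
  bgc_prophet predict --datasetPath "${out}split.csv" --modelPath "$model" \
    --outputPath "$out" --lmdbPath "$lmdb" --name prediction \
    --device cuda --saveIntermediate &&
  bgc_prophet output --datasetPath "${out}split.csv" --outputPath "$out" \
    --loadIntermediate "${out}intermediate_prediction.npy" --name output \
    --threshold 0.5 --max_gap 3 --min_count 2
}
```

A call might look like run_detection_steps ./genomesFastaDirectory/ ./annotator.pt ./lmdb_genomes.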

Publication

Deciphering the Biosynthetic Potential of Microbial Genomes Using a BGC Language Processing Neural Network Model: bioRxiv

Maintainers

• Qilong Lai (Institute of Neuroscience, Chinese Academy of Sciences) – laiqilong@hust.edu.cn

• Shuai Yao (Academy for Advanced Interdisciplinary Studies, Peking University) – yaoshuai@stu.pku.edu.cn

• Yuguo Zha (School of Life Science and Technology, Huazhong University of Science & Technology) – hugozha@hust.edu.cn

• Kang Ning (School of Life Science and Technology, Huazhong University of Science & Technology) – ningkang@hust.edu.cn

Funding

National Key R&D Program of China (Grant Nos. 2021YFA0910500, SQ2023YFA1800082, and 2018YFC0910502)

National Natural Science Foundation of China (Grant Nos. 32071465, 31871334, and 31671374)
