BGC-Prophet
BGC-Prophet is a deep learning approach that uses a language-processing neural network model to identify known biosynthetic gene clusters (BGCs) and extrapolate novel ones. For more information, visit the GitHub repository (https://github.com/HUST-NingKang-Lab/BGC-Prophet).
Installation
BGC-Prophet can be installed using pip:
pip install bgc_prophet
Alternatively, an offline installation package is available on the GitHub release page:
pip install bgc_prophet-0.1.0-py3-none-any.whl
BGC-Prophet is written in Python 3 and uses PyTorch for model training and inference. A GPU is recommended to speed up both.
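Before running the commands below, it can help to check whether PyTorch sees a GPU. The following is a minimal sketch, not part of BGC-Prophet: `pick_device` is a hypothetical helper, and it assumes that `--device` accepts `cpu` as well as the `cuda` value shown in the examples.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # PyTorch is the framework BGC-Prophet runs on
    except ImportError:
        # PyTorch is missing; fall back to the CPU label.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
# Assumption: "cpu" is a valid value for --device; check --help to confirm.
print(f"Suggested flag: --device {device}")
```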
Usage
BGC-Prophet Pipeline
BGC-Prophet can detect and classify BGCs in genomic sequences through the following steps:
1. Extracting word embeddings for genes using the ESM2 model.
2. Organizing genomes and splitting them into sequences of 128 genes.
3. Identifying BGC genes using a trained detection model.
4. Classifying detected BGCs and saving results in a CSV file.
Example command for running the pipeline:
bgc_prophet pipeline --genomesDir ./pathtogenomesdirectory/ --modelPath ./pathto/annotator.pt --saveIntermediate --name nameoftask --threshold 0.5 --max_gap 3 --min_count 2 --classifierPath ./pathto/classifier.pt --classify_t 0.5
Use bgc_prophet pipeline --help for detailed parameter explanations.
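When the pipeline is launched from a script, for example once per batch of genomes, the command can be assembled in Python. This is a sketch only: `build_pipeline_command` is a hypothetical helper, the paths are placeholders, and the flags are copied from the example command above rather than from the tool's source.

```python
import shlex


def build_pipeline_command(genomes_dir, annotator, classifier, name,
                           threshold=0.5, max_gap=3, min_count=2,
                           classify_t=0.5):
    """Return the argument list for `bgc_prophet pipeline`."""
    return [
        "bgc_prophet", "pipeline",
        "--genomesDir", genomes_dir,
        "--modelPath", annotator,
        "--saveIntermediate",
        "--name", name,
        "--threshold", str(threshold),
        "--max_gap", str(max_gap),
        "--min_count", str(min_count),
        "--classifierPath", classifier,
        "--classify_t", str(classify_t),
    ]


# Placeholder paths; replace them with real locations.
cmd = build_pipeline_command("./genomes/", "./model/annotator.pt",
                             "./model/classifier.pt", "demo")
print(shlex.join(cmd))
# To run it for real: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.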
Model Download
Trained models can be downloaded from the GitHub release page:
wget https://github.com/HUST-NingKang-Lab/BGC-Prophet/files/12733164/model.tar.gz
• annotator.pt: Detects BGC genes
• classifier.pt: Classifies BGCs
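After downloading, the archive can be unpacked and the two model files located with the Python standard library. This is a sketch under an assumption: the internal layout of model.tar.gz is not documented here, so the hypothetical helper `extract_models` searches the extracted tree for the two file names instead of relying on fixed paths.

```python
import tarfile
from pathlib import Path


def extract_models(archive, dest):
    """Unpack the archive into dest and return paths to both model files."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    found = {}
    for name in ("annotator.pt", "classifier.pt"):
        # Assumption: each file appears once somewhere in the archive.
        matches = sorted(dest.rglob(name))
        if not matches:
            raise FileNotFoundError(f"{name} not found under {dest}")
        found[name] = matches[0]
    return found


# Only runs if the archive has already been downloaded.
if Path("model.tar.gz").exists():
    for name, path in extract_models("model.tar.gz", "./model").items():
        print(f"{name}: {path}")
```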
Step-by-Step Operations
1. Get Embeddings
Extract gene embeddings using the ESM2-8M model:
bgc_prophet extract esm2_t6_8M_UR50D ./genome.fasta ./lmdb_genomes --toks_per_batch 40960 --include mean
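The `--include mean` option presumably follows the convention of the ESM extraction script, where the per-residue embeddings of a protein are averaged into a single vector. Assuming that meaning, the toy example below shows the pooling step in plain Python. `mean_pool` is a hypothetical helper, and the two-dimensional vectors are for readability only; real ESM2 embeddings are much wider.

```python
def mean_pool(token_embeddings):
    """Average per-residue vectors into one vector for the whole gene."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / n for i in range(dim)]


# Three residues, each with a toy 2-dimensional embedding.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gene_vector = mean_pool(tokens)
print(gene_vector)  # [3.0, 4.0]
```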
2. Organize Genomes
Prepare genomes for processing:
bgc_prophet organize --genomesDir ./genomesFastaDirectory/ --outputPath ./output/ --name organize --threads 10
3. Split Sequences
Split genome sequences into gene segments:
bgc_prophet split --genomesPath ./output/organize.csv --outputPath ./output/ --name split --threads 10
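Conceptually, splitting cuts each genome's ordered gene list into fixed-length sequences (128 genes, per the pipeline overview). The sketch below illustrates the idea with a hypothetical helper, `split_genes`. It assumes non-overlapping windows and a shorter final window; how bgc_prophet itself handles overlap and padding is not stated here.

```python
def split_genes(gene_ids, length=128):
    """Cut an ordered list of gene IDs into consecutive windows."""
    return [gene_ids[i:i + length] for i in range(0, len(gene_ids), length)]


# A made-up genome with 300 genes.
genes = [f"gene_{i}" for i in range(300)]
chunks = split_genes(genes)
print([len(c) for c in chunks])  # [128, 128, 44]
```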
4. Gene Prediction
Detect BGC genes using a trained model:
bgc_prophet predict --datasetPath ./output/split.csv --modelPath ./annotator.pt --outputPath ./output/ --lmdbPath ./lmdb_genomes --name prediction --device cuda --saveIntermediate
5. Output Processing
Merge genes into BGCs and filter results:
bgc_prophet output --datasetPath ./output/split.csv --outputPath ./output/ --loadIntermediate ./output/intermediate_prediction.npy --name output --threshold 0.5 --max_gap 3 --min_count 2
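The three filtering parameters can be illustrated with a small sketch. The semantics here are assumptions, not taken from the tool's source: a gene counts as a BGC gene when its predicted probability is at least `--threshold`, BGC genes separated by at most `--max_gap` other genes are merged into one cluster, and clusters with fewer than `--min_count` BGC genes are discarded. `merge_bgc_genes` is a hypothetical helper.

```python
def merge_bgc_genes(probs, threshold=0.5, max_gap=3, min_count=2):
    """Return (first_index, last_index) for each merged cluster."""
    clusters = []
    current = []  # indices of BGC genes in the cluster being built
    for i, p in enumerate(probs):
        if p < threshold:
            continue  # not a BGC gene under the assumed rule
        # Start a new cluster if too many non-BGC genes lie in between.
        if current and i - current[-1] - 1 > max_gap:
            clusters.append(current)
            current = []
        current.append(i)
    if current:
        clusters.append(current)
    # Keep only clusters with enough BGC genes.
    return [(c[0], c[-1]) for c in clusters if len(c) >= min_count]


# Made-up per-gene probabilities for one genome segment.
probs = [0.9, 0.1, 0.8, 0.7, 0.1, 0.1, 0.1, 0.1, 0.6, 0.2, 0.9, 0.95]
print(merge_bgc_genes(probs))  # [(0, 3), (8, 11)]
```

In this example, genes 3 and 8 are separated by four non-BGC genes, which exceeds the gap of 3, so they fall into separate clusters.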
6. BGC Classification
Apply a trained classifier to detected BGCs:
bgc_prophet classify --datasetPath ./output.csv --classifierPath ./pathto/classifier.pt --outputPath ./output/ --lmdbPath ./lmdb_genomes --name classify --device cuda
Final results are saved as a CSV file containing detection and classification outputs.
Publication
Deciphering the Biosynthetic Potential of Microbial Genomes Using a BGC Language Processing Neural Network Model: bioRxiv
Maintainers
• Qilong Lai (Institute of Neuroscience, Chinese Academy of Sciences) – laiqilong@hust.edu.cn
• Shuai Yao (Academy for Advanced Interdisciplinary Studies, Peking University) – yaoshuai@stu.pku.edu.cn
• Yuguo Zha (School of Life Science and Technology, Huazhong University of Science & Technology) – hugozha@hust.edu.cn
• Kang Ning (School of Life Science and Technology, Huazhong University of Science & Technology) – ningkang@hust.edu.cn