A Critical Evaluation of Deep-Learning Based Phylogenetic Inference Programs Using Simulated Data Sets

dataset
posted on 2025-01-06, 12:36 authored by Yixiao Zhu, Xing-Xing Shen

This repository contains 13 optimized machine learning models (7 for CNN and 6 for PhyDL), their training data, and the data sets simulated to evaluate PhyDL and Tree_learning (CNN).

Each folder corresponds to one topic in our study:

Folder "Simulated data for PhyDL inference" contains the test data sets used to evaluate PhyDL models under LBA (simuLBA.LGFG, simuLBA.C20F.1050sites and simuLBA.C60F.1098sites) and LBR (simuLBR.LGFG, simuLBR.C20F.1050sites and simuLBR.C60F.1098sites) conditions. The trees used to simulate these data sets are provided in simuLBA.fourtaxa and simuLBR.fourtaxa; each tree corresponds to 100 alignments. For instance, the name simuLBA.1050sites.2.3.16 indicates the 16th alignment simulated from tree LBA.tre.2.3. Note that alignments simulated under LG+C20+F+Γ and LG+C60+F+Γ are in PHYLIP format, whereas alignments simulated under LG+F+Γ are in FASTA format.
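The naming convention described above can be sketched as a small parser. This is only an illustration: the pattern is inferred from the single example name simuLBA.1050sites.2.3.16, so files in other subfolders (for example the LGFG sets) may not follow it, and the names `AlignmentName` and `parse_alignment_name` are hypothetical helpers, not part of the repository.

```python
import re
from typing import NamedTuple


class AlignmentName(NamedTuple):
    condition: str   # "LBA" or "LBR"
    n_sites: int     # alignment length encoded in the name
    tree_file: str   # tree the alignment was simulated from
    replicate: int   # index of the alignment for that tree


# Assumed layout: simu<COND>.<N>sites.<tree id>.<replicate>
_PATTERN = re.compile(r"^simu(LBA|LBR)\.(\d+)sites\.(.+)\.(\d+)$")


def parse_alignment_name(name: str) -> AlignmentName:
    """Split an alignment name into condition, length, source tree and replicate."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the assumed pattern: {name!r}")
    condition, n_sites, tree_id, replicate = match.groups()
    # The tree name is rebuilt as <COND>.tre.<tree id>, as in the example above.
    tree_file = f"{condition}.tre.{tree_id}"
    return AlignmentName(condition, int(n_sites), tree_file, int(replicate))
```

Calling `parse_alignment_name("simuLBA.1050sites.2.3.16")` yields the tree name `LBA.tre.2.3` and replicate 16, matching the example in the text.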

Folder "Simulated data for CNN inference" contains the test data sets used in the general evaluation of CNN (BL.INDEL.1000.tar.bz2 and BL.INDEL.1000.tar.bz2) and the data sets for 13 regions in branch-length space (REGIONS.INDEL.1000.tar.bz2 and REGIONS.NOGAP.1000.tar.bz2).

Folder “CNN_Training_data” contains the data sets used to train CNN_NOGAP.Ori, CNN_FA and CNN_FE. For CNN_NOGAP.Extragaps and CNN_INDEL.Extragap, the training data were generated by randomly replacing 10% of the characters with gaps in the original data (i.e., the training data of CNN_NOGAP.Ori and CNN_INDEL.Ori). Note that CNN50K (Suvorov et al., 2020) is named CNN_INDEL.Ori here.
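The gap-replacement step can be illustrated with a minimal sketch. It is not the script used in the study: the gap symbol "-", the choice to sample exactly 10% of all positions across the whole alignment (rather than per sequence or per site), and the function name `add_random_gaps` are all assumptions made for illustration.

```python
import random
from typing import List, Optional


def add_random_gaps(sequences: List[str], fraction: float = 0.10,
                    gap: str = "-", seed: Optional[int] = None) -> List[str]:
    """Return copies of the sequences with a fraction of positions set to gaps.

    Positions are drawn without replacement over the whole alignment
    (assumption). A position that already holds a gap stays a gap, so the
    number of newly introduced gaps can be smaller for INDEL data.
    """
    rng = random.Random(seed)
    total = sum(len(seq) for seq in sequences)
    n_gaps = round(total * fraction)
    chosen = set(rng.sample(range(total), n_gaps))

    masked = []
    offset = 0  # index of the first character of the current sequence
    for seq in sequences:
        masked.append("".join(
            gap if offset + i in chosen else ch for i, ch in enumerate(seq)
        ))
        offset += len(seq)
    return masked
```

Passing a fixed `seed` makes the masking reproducible, which helps when the same perturbed alignments need to be regenerated.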

Folder “PhyDL_Training_data” contains the training and validation data sets used to train DNN_LBA10K, DNN_LBA40K, DNN_LBR10K, DNN_LBR40K, DNN_60K and DNN_160K. Each alignment is attached to its reference tree using the ".link_to_alignment" function in the ete3 package, and the result is pickled.
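The attach-and-pickle step might look roughly like the sketch below. It assumes the ete3 package is installed and uses its `PhyloTree.link_to_alignment` method; the function names and the choice to pickle one tree per payload are hypothetical, since the exact layout of the pickled files (single tree, list of trees, or something else) is not described here.

```python
import pickle


def link_and_pickle(newick: str, fasta_text: str) -> bytes:
    """Attach a FASTA alignment to its reference tree and serialize the tree."""
    # Imported lazily so this module can be loaded without ete3 installed.
    from ete3 import PhyloTree

    tree = PhyloTree(newick)
    # After linking, each leaf carries the matching sequence as `leaf.sequence`.
    tree.link_to_alignment(alignment=fasta_text, alg_format="fasta")
    return pickle.dumps(tree)


def load_linked_tree(payload: bytes):
    """Restore a pickled tree; ete3 must be importable when unpickling."""
    # Only unpickle data from a source you trust.
    return pickle.loads(payload)
```

Once a tree is restored, the sequence of each taxon can be read with a loop such as `for leaf in tree.iter_leaves(): print(leaf.name, leaf.sequence)`.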

Folders “CNN_models” and “PhyDL_models” contain the newly trained models.


Please let me know if you have any questions about them.

Email: xingxingshen@zju.edu.cn

