pcbi.1011498.g002.tif (738.47 kB)

Representation of a typical workflow using the reported tools.

posted on 2023-11-07, 18:22 authored by Dimitrios Vasileiou, Christos Karapiperis, Ismini Baltsavia, Anastasia Chasapi, Dag Ahrén, Paul J. Janssen, Ioannis Iliopoulos, Vasilis J. Promponas, Anton J. Enright, Christos A. Ouzounis

Pre-processing may start with a genome collection (database symbol, upper left), optionally mixed with a curated sequence resource such as UniProt (database symbol in green, upper left). MagicMatch can optionally be used to cross-index entries at the sequence level or simply to identify them. The sequence collection can be submitted to GeneCAST to mask compositional bias and prepare the query for sensitive searches (disk symbol with Q, lower left). For genome-scale analysis, cogent_utils can generate species codes for the reference (target) set, creating a uniformly named sequence set (disk symbol with R, lower middle, optionally mixed with UniProt or any other annotated collection). Sequence comparisons are executed with BLAST or other options, with query Q vs. reference R (or, in the all-vs-all case, the disk symbol in green-blue gradient, upper middle). The vertical gray line divides this pre-processing phase from the next phase and marks the computationally intensive, long wall-time step. Two output alternatives, which are not mutually exclusive, are shown: the pairs-list (in pink, upper right) and full alignments (also in pink, lower right). The pairs-list can be processed with clustt_utils, which launches Tribe-MCL and generates protein families, or used as input for network visualization with BioLayout or similar software. The full alignments can be further processed for GeneRAGE or DifFuse for multi-domain or gene-fusion detection, respectively, as well as inspected and parsed for multiple alignments.
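The step from sequence comparison to a pairs-list can be illustrated with a minimal Python sketch. It assumes BLAST was run with the standard 12-column tabular output (query ID in column 1, subject ID in column 2, E-value in column 11). The function name `blast_tab_to_pairs`, the E-value cutoff, and the three-column (query, target, weight) layout are hypothetical choices for illustration. The caption does not specify the pairs-list format that clustt_utils or BioLayout expect, so this is not a description of those tools.

```python
import math


def blast_tab_to_pairs(lines, max_evalue=1e-5):
    """Reduce BLAST tabular hits to a weighted pairs-list (illustrative only).

    Assumes the standard 12-column tabular layout. The cutoff and the
    -log10(E-value) weight are assumptions, not settings from the figure.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip rows that do not have the 12 standard columns
        query, target = fields[0], fields[1]
        evalue = float(fields[10])
        if query == target or evalue > max_evalue:
            continue  # drop self-hits and weak hits
        # Cap the weight so that an E-value of 0.0 does not give infinity.
        weight = 200.0 if evalue == 0.0 else min(200.0, -math.log10(evalue))
        key = (query, target)
        # Keep only the strongest hit for each (query, target) pair.
        if weight > best.get(key, 0.0):
            best[key] = weight
    return [(q, t, w) for (q, t), w in sorted(best.items())]


if __name__ == "__main__":
    # Made-up example rows, not data from the study.
    hits = [
        "Q1\tR1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200",
        "Q1\tR1\t60.0\t50\t20\t0\t1\t50\t1\t50\t1e-10\t60",
        "Q1\tR2\t30.0\t40\t28\t0\t1\t40\t1\t40\t0.5\t20",
    ]
    for query, target, weight in blast_tab_to_pairs(hits):
        print(f"{query}\t{target}\t{weight:.1f}")
```

Keeping one weighted row per sequence pair gives the kind of edge list that graph clustering and network visualization tools typically consume. The full alignments in the other output branch would have to be kept separately.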