2025-01-14 - PAG Talk - Kmerseek - Olga Botvinnik - Seanome
Identifying functions of proteins using traditional methods is challenging as remote homologs often lack detectable sequence similarity despite maintaining functional similarity. Through exceptional manual effort, scientists have functionally annotated ~500,000 proteins in UniProtKB, but these annotations are heavily biased toward well-studied organisms, greatly undersampling the millions of species on Earth and their billions of unexplored sequences. We present a novel method for identifying remote protein homologs through data augmentation, reducing the 20-letter amino acid alphabet into a binary representation (hydrophobic vs. polar) and using short subsequences (10-30 amino acids) to identify biochemically similar protein domains. Our approach effectively expands the searchable sequence space by several orders of magnitude, with ongoing benchmarking against established orthology detection and functional annotation methods. In the marine chordate Botryllus schlosseri, our method identifies functional annotations missed by conventional tools like BLAST, HMMER, and Foldseek. For example, in the previously unannotated Botryllus histocompatibility factor (BHF), our approach reveals multiple functional domains such as nuclear localization sequences, DNA binding regions, and membrane tether interactions; these have been validated through in vitro experiments showing fluorescently tagged BHF localizing to the cellular membrane. This work establishes protein k-mer data augmentation as an effective approach for bridging major evolutionary gaps in sequence annotation, providing a foundation for characterizing the vast number of proteins with unknown function and expanding our understanding of protein diversity across the tree of life.
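
As an illustration of the alphabet reduction and k-mer comparison described above, the following is a minimal Python sketch, not the Kmerseek implementation. The abstract does not specify which residues count as hydrophobic or how k-mers are compared, so the `HYDROPHOBIC` set, the function names, and the use of Jaccard similarity are all assumptions made for this example.

```python
# Minimal sketch: reduce protein sequences to a hydrophobic/polar (HP)
# alphabet, then compare them by their shared k-mers.

# ASSUMPTION: this residue assignment is illustrative only; published
# hydrophobic/polar groupings differ on residues such as G, P, C, and T.
HYDROPHOBIC = set("AFILMVWY")


def to_hp(seq: str) -> str:
    """Re-encode an amino acid sequence as 'h' (hydrophobic) or 'p' (polar)."""
    return "".join("h" if aa in HYDROPHOBIC else "p" for aa in seq.upper())


def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def hp_kmer_similarity(seq_a: str, seq_b: str, k: int = 10) -> float:
    """ASSUMPTION: Jaccard similarity stands in for the comparison step."""
    return jaccard(kmers(to_hp(seq_a), k), kmers(to_hp(seq_b), k))


if __name__ == "__main__":
    # Two made-up peptides with no identical residue at any position
    # but the same hydrophobic/polar pattern.
    a = "LKVAEDLIKRG"
    b = "IRLVDEVMRKS"
    print(to_hp(a), to_hp(b))
    print("amino acid k-mer similarity:", jaccard(kmers(a, 10), kmers(b, 10)))
    print("HP k-mer similarity:", hp_kmer_similarity(a, b, k=10))
```

The two example peptides are invented for this sketch. They share no 10-mers in the 20-letter alphabet, so their amino acid similarity is 0.0, yet both reduce to the same HP string and their HP k-mer similarity is 1.0. This shows how the reduced alphabet can match sequences whose biochemical pattern is conserved while their residues have diverged.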