Variant prediction in the age of Machine Learning
The items in this collection are part of the study titled "Variant prediction in the age of Machine Learning".
Abstract:
Over the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single nucleotide variants (SNVs) in genome coding regions. Historically, all models have been limited by the inadequate sizes of experimentally curated datasets and by the lack of a standardized definition of impact. The emergence of protein language models (pLMs) has raised an important question: Can machines learn the language of life from unannotated protein sequence data well enough to identify significant errors in the protein “words”? Our analysis suggests that some pLMs perform as well as or better than existing supervised methods. pLM performance, however, varies with the type of impact being predicted. New methods of variant evaluation are particularly needed in the space where existing tools underperform. Consequently, further analysis is needed to establish pLM performance for “dark matter” proteins – those with no homologs in pLM training data.