Using AI to Annotate Viral Proteins

Viruses are abundant in nature and have wide-ranging impacts as killers and manipulators of their bacterial hosts in every environment yet studied. Annotating viral proteins often involves collecting environmental samples and determining the proteins' amino acid sequences to infer their structure, function, and diversity. However, annotation of “virus-like” protein sequences currently relies on alignment-based methods (i.e., comparing the sequences of newly discovered proteins with those of already-annotated viral proteins), which are hobbled by a severe lack of well-characterized viral proteins.

In research published online on January 29 in Nature Microbiology, Libusha Kelly, Ph.D., and colleagues investigated whether large, artificial-intelligence-based protein language models (similar to ChatGPT but for proteins) could be used to annotate viral proteins that can’t be annotated using current methods. When their ChatGPT-like model was applied to global ocean virome data, it expanded the annotated fraction of viral protein families by 29%. When applied to previously unannotated viral protein sequences, the model identified an integrase (a viral enzyme that integrates viral DNA into host cell DNA) and a protein that forms part of the viral capsid shell, both of which are widespread in the global oceans. This novel application of language models to biology lays the groundwork for using similar models to improve protein annotation in other biological systems.

Dr. Kelly is an associate professor of systems & computational biology and of microbiology & immunology at Einstein, and a member of the National Cancer Institute–designated Montefiore Einstein Comprehensive Cancer Center.