The key advance of EvoRank, a new approach to training models for protein engineering, is to harness the natural variation in millions of proteins shaped by evolution over billions of years. The method distills the patterns needed to solve practical problems in biotechnology.
Researchers at The University of Texas at Austin have created a new artificial intelligence model that could pave the way for more effective and less harmful medical treatments, along with new preventive methods. The AI model helps design protein-based therapies and vaccines by drawing on the underlying logic of nature’s evolutionary history.
The AI advance, called EvoRank, is a notable example of how AI can drive transformative change in biomedical research and in biotechnology as a whole. The scientists presented their findings at the International Conference on Machine Learning, and a related paper in Nature Communications describes the use of a broader AI framework to pinpoint beneficial mutations in proteins.
A significant challenge in building better protein-based biotechnologies is the shortage of experimental data about proteins needed to train AI models. These models need to understand how specific proteins function in order to engineer them for particular applications. EvoRank’s key idea is to leverage the natural variation observed in millions of proteins produced by evolution over billions of years and extract the underlying patterns that lead to viable solutions for biotechnology problems.
“Nature has been refining proteins for over 3 billion years, making mutations or substituting amino acids and preserving those that benefit organisms,” said Daniel Diaz, a research scientist in computer science and co-lead of the Deep Proteins group, which comprises specialists in computer science and chemistry at UT. “EvoRank learns to evaluate the evolution we observe today, essentially distilling the principles governing protein evolution and applying these principles to guide the creation of new protein-based applications, including drug development, vaccines, and various biomanufacturing tasks.”
UT houses one of the nation’s premier AI research programs and is home to the National Science Foundation-funded Institute for Foundations of Machine Learning (IFML), directed by computer science professor Adam Klivans, who also co-leads Deep Proteins. Recently, the Advanced Research Projects Agency for Health awarded a grant of nearly $2.5 million to the Deep Proteins team and vaccine researcher Jason McLellan, a UT professor of molecular biosciences, in collaboration with the La Jolla Institute for Immunology. The funding will allow the UT team to apply AI to protein engineering research aimed at creating vaccines against herpesviruses.
“Creating proteins with capabilities that are not found in natural proteins is a persistent grand challenge in the life sciences,” Klivans explained. “This task aligns perfectly with what generative AI models excel at, as they can synthesize extensive databases of known biochemistry and then produce novel designs.”
Unlike Google DeepMind’s AlphaFold, which uses AI to predict the shape and structure of proteins from their amino acid sequences, the Deep Proteins group’s AI systems suggest the modifications most likely to improve a specific function, such as how readily a protein can be developed into a new biotechnology.
McLellan’s laboratory is already generating various versions of viral proteins based on designs produced by AI and is currently assessing their stability and other characteristics.
“The models have proposed substitutions that we wouldn’t have initially considered,” McLellan mentioned. “They are effective, but they are not predictions we would have made, which means they are discovering new solutions for stability.”
Protein therapeutics typically have fewer side effects and can be safer and more effective than other options. The global market, currently estimated at $400 billion, is poised to grow by more than 50% over the next decade. However, developing a protein-based medication is often slow, expensive, and risky. It can take a decade or more and cost upwards of $1 billion to go from drug design to completed clinical trials, and even then the chances of a new drug receiving FDA approval are only about 1 in 10. Moreover, to be therapeutically viable, proteins often must be genetically engineered to reach the stability or yield needed for drug development, and the right modifications are usually found through cumbersome trial and error in the laboratory.
If EvoRank and its related UT-developed framework, Stability Oracle, are adapted for commercial use, the industry could significantly reduce the time and cost of drug development, and the tools would provide a roadmap for reaching better designs faster.
Drawing on existing databases of naturally occurring protein sequences, the creators of EvoRank aligned different versions of the same protein found in diverse organisms, from starfish to oak trees to humans, and compared them. At any given position in the protein, evolution may have found several different amino acids to be beneficial. For example, nature might select tyrosine 36% of the time, histidine 29% of the time, and lysine 14% of the time, and, critically, *never* leucine. This wealth of existing data reveals an underlying logic in protein evolution. Researchers can rule out options that evolution suggests would compromise the protein’s function, and they use this information to train the new machine learning algorithm. Through this ongoing feedback, the model learns which amino acids nature has favored during protein evolution and which substitutions are plausible in nature.
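The column-by-column comparison described above can be illustrated with a short Python sketch. It is not the EvoRank training code, which the article does not describe in detail. It only shows how the frequency of each amino acid at one aligned position could be tallied and turned into a ranking, with never-observed residues set apart. The toy alignment and the function names `column_frequencies` and `rank_column` are hypothetical and chosen purely for illustration.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(aligned_seqs, position):
    """Fraction of sequences carrying each amino acid at one alignment column.

    Gaps ("-") and non-standard symbols are ignored.
    """
    residues = [seq[position] for seq in aligned_seqs if seq[position] in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def rank_column(freqs):
    """Split amino acids into observed (most frequent first) and never observed."""
    observed = sorted(
        (aa for aa, f in freqs.items() if f > 0),
        key=lambda aa: (-freqs[aa], aa),
    )
    unobserved = sorted(aa for aa, f in freqs.items() if f == 0)
    return observed, unobserved


# Hypothetical toy alignment: the same short protein region from seven organisms.
# Real alignments would span far more sequences and positions.
toy_alignment = [
    "MKYLA",
    "MRYLA",
    "MKYIA",
    "MKHLA",
    "MRHLA",
    "MKKLA",
    "MK-LA",
]

freqs = column_frequencies(toy_alignment, position=2)
observed, unobserved = rank_column(freqs)
print("observed, most frequent first:", observed)
print("leucine ever seen at this position:", "L" not in unobserved)
```

In this toy column, tyrosine, histidine, and lysine appear and leucine never does, mirroring the example in the text. A ranking of this kind is one plausible form of the evolutionary signal a model could be trained on; how EvoRank actually uses it is not specified here.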
Next, Diaz aims to develop a “multicolumn” version of EvoRank that can assess the simultaneous impact of multiple mutations on a protein’s structure and stability. He is also interested in creating new tools to predict how a protein’s structure correlates with its function.
In addition to Klivans and Diaz, computer science graduate student Chengyue Gong and UT alumnus James M. Loy co-authored these studies. Tianlong Chen and Qiang Liu were contributors to EvoRank, while Jeffrey Ouyang-Zhang, David Yang, Andrew D. Ellington, and Alex G. Dimakis assisted with Stability Oracle. The research received funding from the NSF, the Defense Threat Reduction Agency, and The Welch Foundation.