Researchers have created a language model that reads mRNA sequences and optimizes them for vaccine development, a tool that also shows potential for studying molecular biology more broadly.
The same type of artificial intelligence that gained attention for writing software and passing the bar exam can now interpret the genetic code.
The genetic code, which contains the instructions for every biological function, follows rules not unlike those of human languages. Every sequence in a genome obeys an intricate grammar and syntax, and just as a small change in wording can transform a sentence's meaning, a small variation in a biological sequence can dramatically change what it encodes.

Princeton University researchers, led by machine learning expert Mengdi Wang, are using language models to home in on partial genome sequences and optimize those sequences for biological and medical research. That work is already underway: in a paper published April 5 in the journal Nature Machine Intelligence, the authors describe a language model that uses its powers of semantic representation to design more effective mRNA sequences.
mRNA vaccines, like those used against COVID-19, deliver strands of messenger RNA that instruct the body's cells to produce a protein, priming the immune system to respond.
Understanding Genetic Information Flow
Scientists have a straightforward way of describing how genetic information flows, known as the central dogma of biology: information moves from DNA to RNA to proteins, and proteins create the structures and carry out the functions of cells.
In the final step, known as translation, the cell reads the messenger RNA (mRNA) and builds the corresponding protein. But mRNA is fascinating because only a portion of it codes for the protein. The rest is never translated, yet these untranslated regions control crucial aspects of the translation process itself.
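To make that layout concrete, here is a toy Python sketch (not from the paper): only the stretch between the start codon and the first stop codon is translated, while the flanking untranslated regions are skipped. The abbreviated codon table and example sequence are purely illustrative.

```python
# Toy illustration (not from the paper): only the coding region of an
# mRNA, from the start codon AUG to the first stop codon, is translated
# into protein; the 5' and 3' untranslated regions (UTRs) are not.

CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GCU": "A", "GAA": "E",
    "UAA": "*", "UAG": "*", "UGA": "*",   # stop codons
    # ...abbreviated; a real table maps all 64 codons
}

def translate(mrna: str) -> str:
    """Translate the coding region, skipping the UTRs."""
    start = mrna.find("AUG")              # the 5' UTR ends here
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":             # stop codon: the 3' UTR follows
            break
        protein.append(amino_acid)
    return "".join(protein)

#                5' UTR     coding region     3' UTR
print(translate("GGCUCA" + "AUGUUUGCUGAA" + "UAAGCCGG"))  # prints MFAE
```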
The effectiveness of mRNA vaccines hinges on regulating protein production. To improve that process, the researchers concentrated on the untranslated region of the mRNA, aiming to raise translation efficiency and ultimately build better vaccines.
By training the model on sequences from a range of species and focusing it on the untranslated region, the researchers generated numerous optimized sequences, which they then tested in lab experiments to confirm their effectiveness. The top-performing sequences surpassed existing benchmarks for vaccine development, increasing overall protein production efficiency by 33%.
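A design loop of this kind might look like the sketch below, in which a trained model scores candidate untranslated regions and only the top scorers go on to lab validation. The `score` method stands in for whatever predictor the model exposes; it, and the random mutation scheme, are assumptions for illustration rather than the paper's method.

```python
# Hypothetical optimize-then-validate loop; `model.score` is an assumed
# interface for a trained predictor of translation efficiency.
import random

NUCLEOTIDES = "ACGU"

def mutate(utr: str, n_changes: int = 2) -> str:
    """Propose a variant by rewriting a few random positions."""
    seq = list(utr)
    for pos in random.sample(range(len(seq)), n_changes):
        seq[pos] = random.choice(NUCLEOTIDES)
    return "".join(seq)

def design_utrs(seed_utr: str, model, n_candidates: int = 1000, keep: int = 10):
    """Generate candidates, rank them by predicted translation
    efficiency, and return the top performers for lab testing."""
    candidates = {mutate(seed_utr) for _ in range(n_candidates)}
    return sorted(candidates, key=model.score, reverse=True)[:keep]
```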
Even a slight increase in protein production efficiency can have a significant impact, the researchers said. In their view, mRNA vaccines are a major advance for new treatments, with the potential not only to fight COVID-19 but also to protect against other diseases and cancers.
Wang, a professor of electrical and computer engineering and the leader of the study, said the model's success could open the door to new discoveries. By analyzing mRNA from different species, the model uncovered new information about gene regulation, a process fundamental to life that scientists believe holds keys to understanding diseases and disorders. Language models like this one could play a central role in that research.
Wang’s collaborators include researchers from the biotech company RVAC Medicines and the Stanford University School of Medicine.
Understanding Disease
This new method is not entirely unlike the large language models that power today's AI chatbots, but it has some key distinctions. Instead of being trained on an enormous corpus of text from the internet, this model was trained on a smaller set of sequences, and its training incorporated additional knowledge about protein production, including structural and energy-related data.
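One way to wire such auxiliary knowledge into a sequence model is sketched below, assuming PyTorch: a small Transformer encodes the RNA tokens, and its pooled representation is concatenated with extra per-sequence features such as a secondary-structure score and a minimum free energy. The layer sizes, feature choices, and class name are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class UTRRegressor(nn.Module):
    """Illustrative sketch: a sequence encoder combined with auxiliary
    features (e.g., minimum free energy); not the published model."""

    def __init__(self, vocab_size: int = 6, d_model: int = 128, n_aux: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model + n_aux, 1)  # predict, e.g., efficiency

    def forward(self, tokens: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(tokens)).mean(dim=1)  # pool over positions
        return self.head(torch.cat([h, aux], dim=-1))

# Example: a batch of 8 tokenized UTRs of length 50, each with 2 aux features
model = UTRRegressor()
scores = model(torch.randint(0, 6, (8, 50)), torch.randn(8, 2))  # shape (8, 1)
```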
The researchers used the trained model to generate a collection of 211 new sequences, each optimized for a desired function, chiefly an increase in translation efficiency. Higher translation efficiency means more of the encoded protein, and proteins like the spike protein targeted by COVID-19 vaccines are what trigger the body's immune response to infectious disease.
Prior studies have developed language models for decoding different biological sequences, such as proteins and DNA. However, this was the first language model to concentrate on the untranslated region of mRNA. Aside from improving overall efficiency, the model was also capable of predicting the performance of a sequence across various conditions.
Wang explained that the main difficulty in creating this language model was understanding the complete context of the available data. Training a model requires not only the raw data and its features, but also the downstream consequences of those features. For example, if a program is meant to filter spam from email, each email it trains on would be labeled “spam” or “not spam.” Throughout this process, the model develops semantic representations that enable it to identify which sequences of words indicate a “spam” label. This is where the true meaning lies.
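In code, the spam analogy boils down to pairing each raw input with its downstream label and letting the model learn which features predict it. A minimal sketch with scikit-learn, using made-up emails:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Each training example pairs raw text with its downstream label.
emails = ["win a free prize now", "meeting moved to 3pm",
          "claim your free reward", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)        # word-count representation
classifier = LogisticRegression().fit(features, labels)

# The learned weights encode which words signal a "spam" label.
print(classifier.predict(vectorizer.transform(["free prize inside"])))  # [1]
```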
Wang emphasized that focusing on a single narrow dataset and building a model around it would not have been useful enough.
That left Wang facing a challenge: the model was operating at the frontier of biological understanding, and the data she was working with were messy, so she needed to try something new.
Wang explained, “Part of my dataset comes from a study that measured efficiency, while another part comes from a study that measured expression levels. Additionally, we gathered unannotated data from various sources.” Bringing all these parts together to create a cohesive and comprehensive dataset for training a sophisticated language model was a significant challenge.
Wang added, “Training a model is not just about inputting data and letting the machine work its magic. It requires careful organization and curation of the data to ensure effective training.” Assembling all of these sequences, and pairing them with the collected labels, had never been attempted before.
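A sketch of what that curation step could look like: records from the efficiency study, the expression study, and the unannotated sources are merged into one data type, with missing labels left empty so a training loop can mask them. The field names and helper function are hypothetical, not drawn from the paper.

```python
# Hypothetical assembly of a heterogeneous training set; field names
# and the assemble() helper are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UTRExample:
    """One curated training record; unknown labels stay None."""
    sequence: str
    efficiency: Optional[float] = None   # from the efficiency study
    expression: Optional[float] = None   # from the expression study

def assemble(efficiency_data, expression_data, unlabeled):
    """Merge heterogeneous sources into a single training set."""
    examples = [UTRExample(s, efficiency=e) for s, e in efficiency_data]
    examples += [UTRExample(s, expression=x) for s, x in expression_data]
    examples += [UTRExample(s) for s in unlabeled]
    return examples

dataset = assemble([("GGCACU", 0.8)], [("AUCGGC", 12.5)], ["CCGUAA"])
```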