Using generative artificial intelligence, chemists have developed a method that can predict thousands of 3D genome structures in minutes, significantly faster than the experimental techniques currently used for structure analysis.
Every cell in your body contains the same genetic sequence, yet each cell expresses only a subset of its genes. These cell-specific gene expression patterns, which distinguish a brain cell from a skin cell, are partly determined by the three-dimensional structure of the genetic material, which controls the accessibility of each gene.
Scientists at MIT have now developed a way to determine these 3D genome structures using generative AI. Their method can predict thousands of genome structures in minutes, making it much faster than the experimental techniques currently used for structure analysis.
This breakthrough allows researchers to more easily examine how the 3D arrangement of the genome impacts gene expression patterns and functions in individual cells.
“Our objective was to predict the three-dimensional genome structure based solely on the DNA sequence,” explains Bin Zhang, an associate professor of chemistry and the study’s senior author. “With this capability, which aligns our method with leading experimental techniques, exciting new opportunities arise.”
MIT graduate students Greg Schuette and Zhuohan Lao are the principal authors of the study published today in Science Advances.
From sequence to structure
Within the nucleus, DNA combines with proteins to form a complex called chromatin, which organizes itself across several levels, enabling cells to pack two meters of DNA into a nucleus that measures just one-hundredth of a millimeter across. Long strands of DNA coil around proteins known as histones, creating a structure akin to beads on a string.
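The scale of that packing is easy to check with a little arithmetic. The short Python sketch below uses only the two figures quoted above; the variable names are illustrative, and dividing a length by a diameter gives only a rough sense of the compaction involved.

```python
# Figures from the text, expressed in whole micrometers to keep the math exact.
dna_length_um = 2_000_000      # two meters of DNA
nucleus_diameter_um = 10       # one-hundredth of a millimeter

# How many nucleus diameters long the DNA would be if stretched out.
length_to_diameter_ratio = dna_length_um // nucleus_diameter_um
print(f"DNA is about {length_to_diameter_ratio:,} times longer than the nucleus is wide")
```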
Chemical markers called epigenetic modifications can be added to DNA at specific points, and these markers, which differ by cell type, influence how chromatin folds and how accessible the nearby genes are. These variations in chromatin shape help dictate which genes are active in different types of cells or at various times within a single cell.
Over the past two decades, researchers have developed experimental methods for determining chromatin structures. One widely used technique, called Hi-C, works by linking together neighboring DNA strands within the cell’s nucleus. Scientists can then determine which segments were located near each other by breaking the DNA into small fragments and sequencing them.
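To give a feel for how sequencing results become proximity information, here is a minimal Python sketch of the counting step. It is not the researchers' pipeline: the `contact_counts` function, the read-pair format, and the 10,000-base-pair bin size are all assumptions made for illustration, and real Hi-C analysis also involves alignment, filtering, and normalization.

```python
from collections import Counter

def contact_counts(read_pairs, bin_size):
    """Count how often two genomic bins were found linked together.

    read_pairs: iterable of (position_a, position_b) in base pairs
                (hypothetical, simplified input format).
    """
    counts = Counter()
    for pos_a, pos_b in read_pairs:
        bin_a, bin_b = pos_a // bin_size, pos_b // bin_size
        # Store each pair in sorted order so (a, b) and (b, a) are the same contact.
        counts[(min(bin_a, bin_b), max(bin_a, bin_b))] += 1
    return counts

# Made-up example data: four linked fragment pairs on one chromosome.
pairs = [(1_200, 48_000), (3_500, 51_000), (47_000, 2_000), (10_500, 12_000)]
counts = contact_counts(pairs, bin_size=10_000)
for (bin_a, bin_b), n in sorted(counts.items()):
    print(f"bins {bin_a} and {bin_b}: {n} contact(s)")
```

Bins that are linked more often are inferred to sit closer together in the nucleus.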
This technique can be applied to large cell populations to obtain an average chromatin structure or used on individual cells to explore structures specific to those cells. However, methods like Hi-C are labor-intensive, often requiring about a week to accumulate data from a single cell.
To address these challenges, Zhang and his team built a model that draws on recent advances in generative AI to predict chromatin structures in individual cells quickly and accurately. Their AI model rapidly analyzes DNA sequences and predicts the chromatin structures those sequences might produce.
“Deep learning excels in recognizing patterns,” Zhang notes. “It enables us to analyze lengthy DNA segments, comprising thousands of base pairs, to uncover crucial information encoded within those base pairs.”
The model, named ChromoGen, consists of two main components. The first is a deep learning model trained to “read” the genome, which analyzes the information encoded in the underlying DNA sequence together with chromatin accessibility data, which are widely available and cell type-specific.
The second component is a generative AI model that predicts biologically accurate chromatin shapes. It was trained on more than 11 million chromatin configurations obtained from Dip-C (an adaptation of Hi-C) experiments on 16 cells from a human B lymphocyte line.
When combined, the first component informs the generative model about how the cell type-specific environment influences the formation of different chromatin structures, effectively capturing the relationship between DNA sequence and structure. For each DNA sequence, the scientists use their model to generate many possible structures. This is important because DNA can be quite disordered, meaning a single sequence can give rise to many different conformations.
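One way to see why many structures per sequence are needed is to treat the predictions as an ensemble and ask how often two points on the DNA end up close together. The Python sketch below does this with a simple random walk standing in for the generative model; `random_walk_conformation`, the bead count, and the distance cutoff are hypothetical choices for illustration and do not reflect how ChromoGen works internally.

```python
import math
import random

def random_walk_conformation(n_beads, step=1.0, rng=random):
    """Hypothetical stand-in for one sampled structure: a 3D random walk of beads."""
    coords = [(0.0, 0.0, 0.0)]
    for _ in range(n_beads - 1):
        # Draw a random direction, then rescale it to the fixed step length.
        dx, dy, dz = (rng.gauss(0, 1) for _ in range(3))
        norm = math.sqrt(dx * dx + dy * dy + dz * dz) or 1.0
        x, y, z = coords[-1]
        coords.append((x + step * dx / norm, y + step * dy / norm, z + step * dz / norm))
    return coords

def contact_frequency(ensemble, i, j, cutoff):
    """Fraction of conformations in which beads i and j lie within `cutoff` of each other."""
    hits = sum(1 for conf in ensemble if math.dist(conf[i], conf[j]) <= cutoff)
    return hits / len(ensemble)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
ensemble = [random_walk_conformation(20, rng=rng) for _ in range(1000)]

print(contact_frequency(ensemble, 0, 1, cutoff=1.5))   # neighboring beads
print(contact_frequency(ensemble, 0, 19, cutoff=1.5))  # beads at opposite ends
```

No single conformation captures this information; only the spread across the whole ensemble shows how likely a given contact is.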
“A key difficulty in predicting genome structure is that there isn’t a single correct solution; instead, there exists a range of structures for any given genome segment. Predicting this complex and high-dimensional statistical distribution presents considerable challenges,” Schuette explains.
Rapid analysis
Once fully trained, the model can produce predictions much more rapidly than Hi-C or other experimental approaches.
“While running experiments might take six months to yield a few dozen structures from a specific cell type, our model can generate a thousand structures in a particular area within just 20 minutes using a single GPU,” Schuette states.
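For a rough sense of the gap Schuette describes, the figures in his quote can be turned into structures per hour. The Python sketch below is a back-of-the-envelope estimate only: “a few dozen” is taken to mean 36 and “six months” to mean 182 days, both assumptions made here, and it sets aside the fact that the model's structures cover one particular region.

```python
from fractions import Fraction

# Model figures from the quote: a thousand structures in 20 minutes on one GPU.
model_per_hour = Fraction(1000 * 60, 20)

# Experimental figures: ASSUMED values for "a few dozen" and "six months".
assumed_structures = 36
assumed_hours = 182 * 24
experiment_per_hour = Fraction(assumed_structures, assumed_hours)

speedup = model_per_hour / experiment_per_hour
print(f"Model: {int(model_per_hour)} structures per hour")
print(f"Rough speedup under these assumptions: about {int(speedup):,}x")
```

Changing the assumed values shifts the result, but the difference remains several orders of magnitude.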
After training the model, the researchers used it to generate structure predictions for more than 2,000 DNA sequences and compared them with the experimentally determined structures for those sequences. They found that the model’s predictions closely matched the experimental data.
“We generally examine hundreds or thousands of conformations for each sequence, providing a solid representation of the diversity inherent in the structures for a specific area,” Zhang shares. “If experiments are repeated in different cells, it’s highly likely that a very different conformation will be observed. This is what our model aims to predict.”
The team also found that the model can accurately predict chromatin structures for cell types not included in its training data. This suggests that the model could be useful for analyzing how chromatin structures differ between cell types and how those differences affect cell function. It could also be used to explore different chromatin states within a single cell and how changes between them influence gene expression.
Another promising avenue is examining how mutations in specific DNA sequences alter chromatin configurations, providing insight into how such mutations may contribute to diseases.
“There are numerous intriguing questions we believe this model can help address,” Zhang concludes.
The researchers have made all of their data and the model accessible for others to use.
The research was supported by the National Institutes of Health.