Our bodies consist of approximately 75 trillion cells. But what specific role does each cell play, and how do the cells of a healthy person compare to those of someone with an illness? To uncover these differences, vast amounts of data must be examined and interpreted. Researchers are employing machine learning techniques for this purpose and have now explored self-supervised learning as a method for analyzing over 20 million cells.
Single-cell technology has advanced significantly in recent years. It allows scientists to study tissues at the level of individual cells and to identify the various roles of different cell types. Such analyses can, for example, compare healthy lung cells with cells altered by smoking, lung cancer, or COVID-19, revealing how these conditions affect the cellular composition of the lung.
As such analyses progress, the volume of data generated continues to grow. Researchers aim to use machine learning to reinterpret existing datasets, extract meaningful insights from patterns in the data, and apply these findings to other research areas.
Self-supervised learning as a new approach
Fabian Theis, who leads the Chair of Mathematical Modelling of Biological Systems at TUM, and his team have investigated how well self-supervised learning handles large datasets compared with traditional methods. Their findings were published recently in Nature Machine Intelligence. This type of machine learning works with unlabelled data, meaning no pre-categorized sample data is needed. Because unlabelled data exists in abundance, the approach can draw on very large volumes of information.
Self-supervised learning rests on two main techniques. In masked learning, part of the input data is concealed so that the model learns to predict the missing pieces. In contrastive learning, the model learns to distinguish similar data points from dissimilar ones.
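To make the two techniques concrete, the following is a minimal sketch of both objectives applied to gene-expression vectors, written in PyTorch. The network sizes, the masking rate, the noise augmentation, and all names here are illustrative assumptions, not details taken from the study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative toy encoder/decoder over gene-expression vectors.
# Dimensions are assumptions, not taken from the paper.
N_GENES = 2000
EMBED_DIM = 128

encoder = nn.Sequential(nn.Linear(N_GENES, 512), nn.ReLU(), nn.Linear(512, EMBED_DIM))
decoder = nn.Linear(EMBED_DIM, N_GENES)

x = torch.rand(64, N_GENES)  # a mini-batch of 64 cells

# --- Masked learning: hide a random subset of genes, predict them back ---
mask = torch.rand_like(x) < 0.25          # mask ~25% of entries (assumed rate)
x_masked = x.masked_fill(mask, 0.0)       # zero out the masked genes
recon = decoder(encoder(x_masked))
masked_loss = F.mse_loss(recon[mask], x[mask])  # score only the hidden entries

# --- Contrastive learning: pull two views of the same cell together ---
def augment(v):
    # Simple noise augmentation, an assumption chosen for illustration.
    return v + 0.01 * torch.randn_like(v)

z1 = F.normalize(encoder(augment(x)), dim=1)
z2 = F.normalize(encoder(augment(x)), dim=1)
logits = z1 @ z2.T / 0.07                 # cosine similarities, temperature 0.07
targets = torch.arange(len(x))            # matching views sit on the diagonal
contrastive_loss = F.cross_entropy(logits, targets)  # InfoNCE-style objective
```

The masked objective scores the model only on the hidden entries, while the contrastive objective treats two augmented views of the same cell as a positive pair and all other cells in the batch as negatives.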
Using both techniques, the team analyzed over 20 million individual cells and compared the results with those obtained through classical learning methods. They focused on tasks like cell type prediction and gene expression reconstruction when evaluating the different approaches.
Prospects for the development of virtual cells
The study’s outcomes indicate that self-supervised learning particularly enhances performance in transfer tasks, in which a model pretrained on a large dataset is applied to smaller datasets. Results for zero-shot cell predictions, tasks carried out without any task-specific training, are also encouraging. The comparison of masked and contrastive learning revealed that masked learning is more effective for large single-cell datasets.
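To illustrate what a zero-shot prediction can look like in this setting, the sketch below labels unseen cells with a pretrained encoder and no fine-tuning, using a nearest-neighbour lookup against annotated reference cells. The function and the nearest-neighbour scheme are hypothetical stand-ins, not the evaluation protocol from the paper.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(encoder, reference_x, reference_labels, query_x):
    """Label query cells with the label of their nearest reference cell.

    The pretrained encoder is used as-is (zero-shot): no weights are updated.
    """
    with torch.no_grad():
        ref = F.normalize(encoder(reference_x), dim=1)  # embed annotated cells
        qry = F.normalize(encoder(query_x), dim=1)      # embed unseen cells
    similarity = qry @ ref.T            # cosine similarity, query vs. reference
    nearest = similarity.argmax(dim=1)  # index of the closest annotated cell
    return reference_labels[nearest]    # transfer its cell-type label
```

A transfer task, by contrast, would update the pretrained encoder on the smaller dataset rather than using it frozen.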
The researchers are leveraging their findings to develop virtual cells, which are intricate computer models that represent the diversity of cells from various datasets. These models hold potential for analyzing cellular changes often observed in diseases. The insights from this study provide valuable guidance on how to train and optimize these models more effectively.