A novel artificial intelligence (AI) tool integrates information from medical imaging with textual data to forecast cancer outcomes and treatment responses.
Combining visual data (like microscopic images, X-rays, CT scans, and MRIs) with textual information (including physician notes and interdisciplinary communications) plays a crucial role in cancer treatment. While AI has been useful for examining images and identifying disease-related irregularities, building computational models that successfully merge multiple types of data has proven challenging.
Researchers at Stanford Medicine have now created an AI model that integrates both visual and textual information. After training on 50 million medical images from standard pathology slides and more than 1 billion pathology-related texts, the model outperformed standard methods in predicting outcomes for patients with many types of cancer. It could also predict which patients with lung or gastroesophageal cancers were likely to respond to immunotherapy and identify melanoma patients at higher risk of recurrence.
The model is called MUSK, short for “multimodal transformer with unified mask modeling.” It differs significantly from traditional uses of AI in clinical settings, and the researchers believe it has the potential to change how AI supports patient care.
“MUSK can accurately forecast the outcomes for patients with various types and stages of cancer,” said Ruijiang Li, MD, an associate professor of radiation oncology. “We developed MUSK because, in real-world clinical practice, doctors rarely depend on a single type of data for making decisions. Our goal was to use various data types to gain deeper insights and achieve more precise predictions regarding patient outcomes.”
Dr. Li, who is affiliated with the Stanford Cancer Institute, is one of the senior authors of the study published in Nature on January 8. Postdoctoral scholars Jinxi Xiang, PhD, and Xiyue Wang, PhD, led this research.
While the use of AI tools in clinical environments has grown, their primary application has been in diagnostics (such as verifying if an image shows signs of cancer), rather than prognostics (predicting possible clinical outcomes and individual treatment effectiveness).
One challenge is that these models require large amounts of labeled data (for example, a lung tissue slide annotated to show a cancerous tumor) and corresponding paired data (the clinical notes about the patient from whom the tumor was taken). However, well-organized and annotated datasets are difficult to find.
An Accessible Tool
In AI terminology, MUSK is a foundation model. Foundation models are pretrained on vast datasets and can then be tailored with further training to accomplish specific tasks. Because the researchers designed MUSK to use unpaired multimodal data, which does not meet the traditional requirements for training such models, they significantly increased the data available for the initial training phase. As a result, the later specialized training requires much smaller datasets. MUSK therefore serves as a ready-to-use tool that physicians can adapt to specific clinical questions.
“The greatest unmet clinical need is for models that assist physicians in directing patient therapy,” Li explained. “Should this patient receive this medication? Or might another treatment be more suitable? Currently, doctors rely on information such as disease staging and specific genetic or protein markers to guide these decisions, but this isn’t always precise.”
The research team gathered microscopic tissue slides, the associated pathology reports, and follow-up data (including patient outcomes) from The Cancer Genome Atlas, a national database, covering 16 major cancer types, including breast, lung, colorectal, pancreatic, kidney, bladder, and head and neck cancers. They used these data to train MUSK to predict disease-specific survival, the percentage of patients who have not died of a specific disease over a defined period.
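To make the definition of disease-specific survival concrete, here is a minimal sketch of how that percentage could be computed for a small cohort. The patient records, the `Patient` fields, and the `disease_specific_survival` function are hypothetical illustrations, not data or code from the study. The sketch also assumes every patient was followed for the whole period; real analyses have to account for patients lost to follow-up, typically with survival-analysis methods such as Kaplan-Meier estimation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    # Months from diagnosis to death; None means alive at the end of follow-up.
    months_to_death: Optional[float]
    # True only if the death was caused by the cancer being studied.
    died_of_disease: bool = False


def disease_specific_survival(patients, period_months):
    """Percentage of patients who did not die of the disease within the period.

    Simplifying assumption: every patient was observed for the full period.
    """
    if not patients:
        raise ValueError("cohort is empty")
    deaths = sum(
        1
        for p in patients
        if p.died_of_disease
        and p.months_to_death is not None
        and p.months_to_death <= period_months
    )
    return 100.0 * (len(patients) - deaths) / len(patients)


# Hypothetical five-patient cohort, evaluated at five years (60 months).
cohort = [
    Patient(None),                         # alive
    Patient(None),                         # alive
    Patient(18.0, died_of_disease=True),   # died of the disease within 5 years
    Patient(30.0, died_of_disease=False),  # died of another cause: not counted
    Patient(70.0, died_of_disease=True),   # died of the disease after 5 years
]
dss_5yr = disease_specific_survival(cohort, period_months=60)
print(f"5-year disease-specific survival: {dss_5yr:.0f}%")
```

Only one of the five hypothetical patients died of the disease within the period, so deaths from other causes and deaths after the cutoff do not lower the figure.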
MUSK’s predictions for disease-specific survival across all cancer types were accurate 75% of the time. In contrast, standard predictions based purely on a patient’s cancer stage and other clinical risk factors achieved a 64% accuracy rate.
Moreover, the researchers trained MUSK to analyze thousands of data points to identify which patients with lung or gastroesophageal cancers would likely benefit from immunotherapy.
“Right now, the main criterion for determining whether to administer a specific immunotherapy to a patient is whether their tumor expresses a protein known as PD-L1,” Li noted. “PD-L1 is a single-protein biomarker. In contrast, leveraging AI to evaluate hundreds or thousands of various data types—including tissue imaging, patient demographics, medical histories, prior treatments, and lab results derived from clinical notes—can significantly enhance our ability to ascertain who stands to benefit.”
For patients with non-small cell lung cancer, MUSK accurately identified those who would benefit from immunotherapy treatment around 77% of the time, while the conventional method using PD-L1 expression was correct only about 61% of the time.
These favorable results were echoed when the researchers trained MUSK to identify which melanoma patients were most likely to relapse within five years of initial treatment. In this case, the model was accurate roughly 83% of the time, about 12% more accurate than predictions from other foundation models.
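The comparisons reported above can be read either as percentage-point gaps or as relative improvements, and the two readings give different numbers. The short sketch below works through both, using only the figures quoted in this article; the function names are illustrative.

```python
def point_gap(model_pct, baseline_pct):
    """Absolute difference, in percentage points."""
    return model_pct - baseline_pct


def relative_gain(model_pct, baseline_pct):
    """Improvement expressed as a percentage of the baseline."""
    return 100.0 * (model_pct - baseline_pct) / baseline_pct


# (MUSK accuracy, conventional-method accuracy) as reported in the article.
reported = {
    "disease-specific survival": (75, 64),
    "lung cancer immunotherapy response": (77, 61),
}

for task, (musk, baseline) in reported.items():
    print(
        f"{task}: +{point_gap(musk, baseline)} points, "
        f"{relative_gain(musk, baseline):.1f}% relative improvement"
    )
```

The melanoma result is left out because the article gives only MUSK's accuracy (roughly 83%) and a gain of about 12%, without stating the baseline or whether the gain is in percentage points or relative terms.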
“The unique aspect of MUSK is its capacity to incorporate unpaired multimodal data during pretraining, significantly enhancing the scale of data compared to the paired datasets needed by other models,” Li emphasized. “We observed that across all clinical prediction scenarios, models that integrate diverse data types consistently outperform those relying solely on imaging or text. Utilizing unpaired multimodal data within AI models like MUSK represents a significant advancement in enhancing the ability of AI to support physicians in improving patient care.”
Researchers from Harvard Medical School also contributed to this work.
The study received funding from the National Institutes of Health (grants R01CA222512, R01CA233578, R01CA269599, R01CA285456, R01CA290715, and R01DE030894), as well as from the Stanford Institute for Human-Centered Artificial Intelligence.