Researchers have found that an artificial intelligence (AI) model can accurately answer medical quiz questions designed to test health professionals' ability to diagnose patients from clinical images and brief textual descriptions. However, physicians noted that the AI made errors when describing the images and when explaining the reasoning behind its conclusions.
A study by researchers at the National Institutes of Health (NIH) found that an AI model performed well on medical quiz questions designed to assess medical professionals' ability to diagnose patients from clinical images and brief written summaries. However, physician-graders found that the model made mistakes when describing the images and when explaining the reasoning behind its answers. The findings, which shed light on AI's potential applications in health care, were published in npj Digital Medicine. The study was led by researchers from NIH's National Library of Medicine (NLM) and Weill Cornell Medicine in New York City.
“The incorporation of AI into healthcare presents significant potential as a resource for aiding medical professionals in diagnosing patients more efficiently, thereby enabling earlier treatment,” stated Stephen Sherry, Ph.D., Acting Director of NLM. “However, as this research indicates, AI has not yet reached a level of sophistication to replace the essential human experience for precise diagnoses.”
The AI model and the physicians answered questions from the New England Journal of Medicine (NEJM) Image Challenge, an online quiz that presents real clinical images along with short descriptions of patient symptoms and asks users to select the correct diagnosis from a list of options.
The AI was asked to answer 207 Image Challenge questions and to provide a written rationale justifying each diagnosis. The prompt specified that each rationale should include a description of the image, a summary of relevant medical knowledge, and a step-by-step account of the reasoning that led to the answer.
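For readers curious how such a prompt might be structured in practice, the sketch below shows one plausible way to pose an Image Challenge-style question to a vision-capable model through the OpenAI Python client. It is not the study's actual code: the model name, image URL, and answer options are placeholders.

```python
# Illustrative sketch only; hypothetical prompt, not the study's code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question_text = (
    "A patient presents with the findings shown in the image. "
    "Which of the following is the most likely diagnosis?\n"
    "A) option 1  B) option 2  C) option 3  D) option 4  E) option 5"  # placeholders
)

instructions = (
    "Answer the multiple-choice question. In your rationale, include: "
    "(1) a description of the image, (2) a summary of relevant medical knowledge, "
    "and (3) step-by-step reasoning leading to your answer."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": instructions + "\n\n" + question_text},
            # Clinical image supplied alongside the text prompt (placeholder URL)
            {"type": "image_url", "image_url": {"url": "https://example.com/case-image.jpg"}},
        ],
    }],
)

print(response.choices[0].message.content)
```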
Nine physicians from different specialties answered the same questions, first in a "closed-book" setting (without consulting any outside materials) and then in an "open-book" setting (with access to external resources). The researchers then showed the physicians the correct answers along with the AI's answers and rationales, and asked them to evaluate the AI's image descriptions, knowledge summaries, and reasoning steps.
The study showed that both the AI and the doctors were generally successful in identifying the correct diagnosis. Notably, the AI model performed better than the physicians in the closed-book scenario, while the physicians outshone the AI when using open-book resources, particularly on the more challenging questions.
Importantly, the physician evaluations found that the AI made frequent errors when describing the medical images and explaining its reasoning, even when it chose the correct final diagnosis. For example, when shown a photo of a patient's arm with two lesions, a physician would readily recognize that both lesions were caused by the same condition. The AI, however, failed to link the two lesions to a single diagnosis because they were photographed at different angles, creating the illusion of different colors and shapes.
The researchers stress that these insights highlight the necessity for further evaluation of multimodal AI technologies before they are integrated into clinical practice.
“This technology has the potential to assist clinicians in enhancing their skills with insights based on data that could lead to better clinical outcomes,” remarked Zhiyong Lu, Ph.D., a Senior Investigator at NLM and one of the study’s lead authors. “Recognizing the risks and constraints of this technology is imperative for effectively utilizing its potential in the field of medicine.”
The study used GPT-4V (Generative Pre-trained Transformer 4 with Vision), a "multimodal AI model" that can process multiple types of data, including text and images. The researchers note that, while this study is relatively small, it sheds light on the potential of multimodal AI to support health care professionals' decision-making. Further research is needed to understand how such models compare with physicians' diagnostic expertise.
This study included contributions from partners at NIH’s National Eye Institute and NIH Clinical Center; the University of Pittsburgh; UT Southwestern Medical Center in Dallas; New York University Grossman School of Medicine; Harvard Medical School and Massachusetts General Hospital in Boston; Case Western Reserve University School of Medicine in Cleveland; University of California San Diego; and the University of Arkansas in Little Rock.