Hospitals are already using artificial intelligence (AI) to improve patient care, but does this technology actually enhance doctors’ diagnostic capabilities? A recent study offers some unexpected insights.
As hospitals increasingly deploy artificial intelligence to improve patient care, a recent study indicates that ChatGPT Plus offers doctors no significant improvement in diagnostic accuracy compared with traditional methods.
Conducted by Andrew S. Parsons, MD, MPH, and his team at UVA Health, the study enrolled 50 physicians in family medicine, internal medicine, and emergency medicine to evaluate the efficacy of ChatGPT Plus. The doctors were split into two groups: one used ChatGPT Plus to diagnose complex cases, while the other relied on conventional tools such as medical reference sites (for example, UpToDate) and Google. The researchers then assessed the accuracy of the diagnoses from both groups and found them to be comparable.
Notably, ChatGPT on its own outperformed both groups, hinting at its potential to further improve patient care. The researchers emphasize, however, that physicians will need additional training and hands-on experience with the technology to realize its full benefits.
For the time being, the researchers advise using ChatGPT as a supportive tool rather than a replacement for human doctors.
“Our study reveals that AI on its own can be a powerful and effective tool for diagnosis,” noted Parsons, who teaches clinical skills to medical students and co-leads the Clinical Reasoning Research Collaborative at the University of Virginia School of Medicine. “We were surprised to discover that incorporating a human physician actually decreased diagnostic accuracy, though it did enhance efficiency. This suggests we need formal training on how to optimally use AI.”
ChatGPT for Disease Diagnosis
Chatbots built on “large language models” are growing increasingly popular for their ability to hold human-like conversations, take patient histories, show empathy, and even work through complex medical cases. They still require a physician’s involvement, however.
Eager to explore the most effective ways to utilize this advanced tool, Parsons and his colleagues conducted a randomized, controlled trial at three top hospitals: UVA Health, Stanford, and Harvard’s Beth Israel Deaconess Medical Center.
The participating doctors diagnosed “clinical vignettes” based on actual patient care scenarios, which included detailed patient histories, physical examinations, and lab results. The researchers then evaluated the results based on diagnostic accuracy and speed.
Doctors using ChatGPT Plus achieved a median diagnostic accuracy of 76.3%, while those using the traditional approach reached 73.7%. The ChatGPT group also made diagnoses slightly faster, taking 519 seconds compared with 565 seconds for the conventional group.
The researchers were surprised by how well ChatGPT Plus performed on its own, posting a median diagnostic accuracy of more than 92%. They believe this reflects the specific prompts used in the study, suggesting that physicians would benefit from training in effective prompting. Alternatively, healthcare organizations could purchase predefined prompts to build into clinical workflows and documentation.
However, the researchers also caution that ChatGPT Plus might not perform as well in real-world practice, where many other facets of clinical reasoning come into play, including the consequences of treatment decisions. They are calling for further research to evaluate large language models in these areas and are currently undertaking a similar study focused on management decision-making.
“As AI becomes more integrated into healthcare, it’s vital to find ways to leverage these tools to enhance both patient care and the experience of physicians,” Parsons remarked. “This study indicates that optimizing our collaboration with AI in clinical settings will require substantial effort.”
To build on these findings, the study sites have launched a bicoastal AI evaluation network called ARiSE (AI Research and Science Evaluation) to further investigate GenAI outputs in healthcare. More details are available on the ARiSE website.