Are AI Doctors the Future of Medical Conversations?

Researchers have developed a new method to assess how well AI models make clinical decisions in realistic situations that closely resemble actual patient interactions. Their analysis indicates that while large language models diagnose well when answering exam-style questions, they struggle when they must gather and reason over information through conversation. The researchers recommend a series of guidelines to enhance the efficacy of AI tools and ensure they align with real-world practices before being implemented in clinical settings.

Artificial intelligence tools like ChatGPT are being praised for their potential to ease clinicians’ workloads by triaging patients, taking medical histories, and even offering preliminary diagnoses.

These large language models are increasingly used by patients to understand their symptoms and medical test results.

But despite their strong performance on standardized medical exams, how effective are these tools in scenarios that reflect everyday clinical interactions?

A new study from researchers at Harvard Medical School and Stanford University suggests they are not so effective.

In the study, published January 2 in Nature Medicine, the research team introduced an evaluation framework dubbed CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and applied it to four large language models to assess their performance in settings that closely resemble actual patient interactions.

While all four models performed well on medical exam-like questions, their efficacy declined significantly when simulating real-world conversations.

This finding highlights a twofold need: more realistic evaluations that accurately determine whether AI models are ready for clinical use, and improvements to the models’ ability to diagnose from genuine interactions before they are deployed in clinical settings.

The team believes that evaluation tools like CRAFT-MD can provide a more precise assessment of AI models for real-world applicability and help enhance their clinical performance.

“Our study uncovers an intriguing paradox — these AI models excel during medical board exams, yet they falter during the typical dynamics of a doctor’s visit,” said Pranav Rajpurkar, the senior author of the study and an assistant professor of biomedical informatics at Harvard Medical School. “The fluid nature of medical conversations, which requires the ability to ask timely questions, piece together fragmented information, and analyze symptoms, presents unique challenges that extend beyond merely answering multiple-choice questions. Transitioning from standardized testing to these natural discussions results in considerable drops in diagnostic precision, even for the most advanced AI models.”

A better test for assessing AI’s real-world performance

Currently, AI models are evaluated by having them answer multiple-choice medical questions, typically drawn from national exams for medical graduates or certification tests for residents.

“This method assumes that all pertinent information is presented in a clear and concise manner, often using medical jargon that simplifies the diagnostic process, but real life is much messier,” explained Shreya Johri, study co-first author and a doctoral student at Harvard Medical School’s Rajpurkar Lab. “We need a testing framework that reflects the complexities of reality more accurately and is therefore better at predicting a model’s actual performance.”
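
To make that contrast concrete, here is a purely hypothetical pair of prompts (not drawn from the study): the first packages a case as a tidy, exam-style vignette with answer choices, while the second is the kind of unstructured opening message a simulated patient might send.

```python
# Hypothetical illustration of the two evaluation settings described above.
# Neither prompt is taken from the study's clinical scenarios.

exam_style_question = (
    "A 45-year-old woman presents with a 3-day history of dysuria, urinary "
    "frequency, and suprapubic tenderness. Temperature is 37.2 C. Which of "
    "the following is the most likely diagnosis?\n"
    "(A) Acute cystitis  (B) Pyelonephritis  (C) Urethritis  (D) Nephrolithiasis"
)

conversational_opening = (
    "Hi doctor, it's been burning when I go to the bathroom for a few days, "
    "and I feel like I have to go all the time. Is that something I should "
    "worry about?"
)
```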

CRAFT-MD was designed to provide exactly this kind of more realistic assessment.

CRAFT-MD simulates real-world interactions by assessing how well large language models can gather information about symptoms, medications, and family history, and then arrive at a diagnosis. One AI agent plays the patient, responding in a conversational, natural manner, while another AI agent grades the accuracy of the final diagnosis made by the large language model under test. Human experts then review each interaction, judging the models’ ability to collect essential patient information, diagnose accurately when presented with fragmented data, and adhere to prompts.
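
That description suggests a simple agent-in-the-loop structure. The following Python sketch is an illustrative reconstruction of that loop under stated assumptions, not the authors’ code: the prompts, the stopping heuristic, and the complete() helper are placeholders for whichever LLM interface and protocol the framework actually uses.

```python
# Minimal sketch of a CRAFT-MD-style conversational evaluation loop.
# Illustrative only: prompts, agent roles, and the complete() placeholder
# are assumptions, not the framework's actual implementation.

from dataclasses import dataclass


def complete(system_prompt: str, transcript: list[dict]) -> str:
    """Placeholder for a chat-completion call to whatever LLM backs each agent."""
    raise NotImplementedError("Wire this to your model provider of choice.")


@dataclass
class CaseVignette:
    case_summary: str    # full clinical picture, visible only to the patient agent
    true_diagnosis: str  # ground-truth label, visible only to the grader agent


def run_case(case: CaseVignette, max_turns: int = 10) -> dict:
    """Let the model under test interview a simulated patient, then grade it."""
    doctor_sys = ("You are a clinician. Ask the patient focused questions, one at "
                  "a time, then state your answer as 'Diagnosis: ...'.")
    patient_sys = (f"You are a patient. Your situation: {case.case_summary} "
                   "Answer conversationally and reveal only what is asked.")
    transcript: list[dict] = []

    for _ in range(max_turns):
        # The model under test decides what to ask, or commits to a diagnosis.
        doctor_turn = complete(doctor_sys, transcript)
        transcript.append({"role": "doctor", "content": doctor_turn})
        if "diagnosis:" in doctor_turn.lower():  # crude stop condition for the sketch
            break
        # The patient-simulator agent replies with fragmented, natural-language detail.
        patient_turn = complete(patient_sys, transcript)
        transcript.append({"role": "patient", "content": patient_turn})

    # A grader agent compares the stated diagnosis against the ground truth.
    grader_sys = ("You are a grader. Reply 'correct' or 'incorrect': does the final "
                  f"diagnosis in the transcript match '{case.true_diagnosis}'?")
    verdict = complete(grader_sys, transcript)
    return {"transcript": transcript, "verdict": verdict}
```

In a full pipeline, transcripts and verdicts produced this way would then be spot-checked by human experts, as the study describes.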

The researchers utilized CRAFT-MD to evaluate four AI models—both commercial and open-source—across 2,000 clinical scenarios that reflect common conditions in primary care and cover 12 medical specialties.

All models exhibited limitations, particularly in conducting clinical conversations and reasoning based on input from patients. This shortcoming affected their capacity to take comprehensive medical histories and provide suitable diagnoses. For example, they often struggled to ask pertinent questions needed for thorough patient history, overlooked crucial information during data gathering, and had difficulty synthesizing disparate pieces of information. The accuracy of these models diminished when presented with open-ended information compared to multiple-choice options. Additionally, they performed worse during dynamic back-and-forth dialogues—typical of real-world conversations—relative to structured conversations.

Recommendations for enhancing AI’s real-world effectiveness

Based on their findings, the research team offers several recommendations for both developers of AI models and regulators responsible for evaluating and approving these systems.

These recommendations include:

  • Use conversational, open-ended questions that more accurately reflect the unstructured nature of doctor-patient interactions in the design, training, and testing of AI tools
  • Assess models based on their ability to ask the right questions and gather essential information
  • Create models capable of following multiple conversations and integrating the gathered information
  • Design AI models that can combine textual data (such as conversation notes) with non-textual data (like images and EKGs)
  • Develop advanced AI agents that can understand non-verbal communication such as facial expressions, tone of voice, and body language

Moreover, they recommend incorporating both AI agents and human experts in the evaluation process, as depending solely on human evaluators can be labor-intensive and costly. CRAFT-MD demonstrated efficiency, processing 10,000 conversations in 48 to 72 hours, compared to nearly 500 hours that a human-based approach would require for patient simulations and around 650 hours for expert evaluations. Using AI evaluators as the primary method also reduces the risk of exposing real patients to untested AI tools.
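
Taking those reported figures as rough approximations, a quick back-of-envelope calculation puts the difference at seconds of wall-clock time per conversation for the automated pipeline versus several minutes of human effort per conversation.

```python
# Back-of-envelope comparison using the throughput figures reported above
# (10,000 conversations in 48-72 hours vs. roughly 500 + 650 human hours).

conversations = 10_000
ai_hours_low, ai_hours_high = 48, 72
human_hours = 500 + 650  # simulated patients plus expert evaluation

print(f"AI pipeline: {ai_hours_low * 3600 / conversations:.0f}-"
      f"{ai_hours_high * 3600 / conversations:.0f} s per conversation")
print(f"Human-based: {human_hours * 60 / conversations:.1f} min per conversation")
# -> AI pipeline: 17-26 s per conversation
# -> Human-based: 6.9 min per conversation
```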

The researchers anticipate that CRAFT-MD will be periodically updated and optimized to incorporate advancements in patient-AI interactions.

“As a physician scientist, my interest lies in AI models that can effectively and ethically enhance clinical practices,” said Roxana Daneshjou, co-senior author and an assistant professor of Biomedical Data Science and Dermatology at Stanford University. “CRAFT-MD provides a framework that closely aligns with real-world interactions, thereby advancing our ability to assess AI model performance in healthcare.”

Authorship, funding, disclosures

Other authors include Jaehwan Jeong and Hong-Yu Zhou from Harvard Medical School; Benjamin A. Tran from Georgetown University; Daniel I. Schlessinger from Northwestern University; Shannon Wongvibulsin from UCLA; Leandra A. Barnes, Zhuo Ran Cai, and David Kim from Stanford University; and Eliezer M. Van Allen from Dana-Farber Cancer Institute.

This work was supported by the HMS Dean’s Innovation Award as well as a Microsoft Accelerate Foundation Models Research grant awarded to Pranav Rajpurkar, with additional support for SJ through the IIE Quad Fellowship.

Daneshjou disclosed receiving personal fees from DWA, Pfizer, L’Oreal, and VisualDx, as well as stock options from MDAlgorithms and Revea, all outside the submitted work, and has a pending patent for TrueImage. Schlessinger co-founded FixMySkin Healing Balms, holds shares in Appiell Inc. and K-Health, and consults for multiple companies, including AbbVie and Sanofi. Van Allen is an advisor to companies including Enara Bio and Manifold Bio, holds equity in various firms, and has filed institutional patents. He also serves on the editorial board of Science Advances.