Are AI Chatbots Suitable for Hospitals? Exploring Benefits & Challenges

Large language models can pass medical exams with flying colors, but using them for diagnoses would currently be grossly negligent: medical chatbots make hasty diagnoses, do not adhere to treatment guidelines, and would put patients at risk. That is the conclusion of a team at the Technical University of Munich (TUM) that systematically assessed whether this form of artificial intelligence (AI) is suitable for everyday clinical practice. Despite these shortcomings, the researchers see potential in the technology and have proposed a method for testing the reliability of future medical chatbots.

Large language models are computer programs trained on vast amounts of text. Specially tailored versions of the technology behind ChatGPT can now pass the final exams of a medical degree almost flawlessly. But could such an AI take over the tasks of doctors in an emergency room? Could it order the right tests, arrive at the correct diagnosis, and draw up a treatment plan based on a patient's symptoms?

An interdisciplinary team led by Daniel Rückert, Professor of Artificial Intelligence in Healthcare and Medicine at TUM, addressed this question in a study published in the journal Nature Medicine. For the first time, doctors and AI experts jointly and systematically investigated how well different versions of the open-source large language model Llama 2 perform in making diagnoses.

Simulating the process from emergency room to treatment

To evaluate the capabilities of these complex algorithms, the researchers used anonymized patient data from a clinic in the United States. From a larger dataset, they selected 2,400 cases in which patients had come to the emergency room with abdominal pain. Each case description ended in one of four diagnoses and a corresponding treatment plan and included all of the data used to reach that diagnosis, from the medical history and blood values to imaging results.

“We structured the data in a manner that enabled the algorithms to simulate authentic hospital procedures and decision-making processes,” explained Friederike Jungmann, an assistant physician in the radiology department at TUM’s Klinikum rechts der Isar and the lead author of the study alongside computer scientist Paul Hager. “The program only had access to the same information available to real doctors. For instance, it had to independently decide whether to order a blood test and utilize this information to guide subsequent decisions, ultimately formulating a diagnosis and treatment plan.”
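
The staged setup Jungmann describes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's released evaluation platform: query_model stands in for a call to a locally hosted open-source model such as Llama 2, and the case fields and test names are invented for demonstration.

```python
# Illustrative sketch of the staged evaluation described above: the model
# only ever sees information it has explicitly requested. All names and
# case fields here are assumptions, not code from the study.

EXAMPLE_CASE = {
    "history": "Adult patient presenting to the emergency room with abdominal pain.",
    "blood work": "Elevated white blood cell count and CRP.",
    "imaging": "Ultrasound shows a thickened gallbladder wall.",
    "diagnosis": "cholecystitis",
}

def query_model(prompt: str) -> str:
    """Placeholder for a call to a locally hosted open-source LLM (e.g. Llama 2)."""
    raise NotImplementedError

def run_case(case: dict) -> dict:
    """Reveal information stepwise, mimicking real clinical decision-making."""
    context = f"Patient history: {case['history']}\n"
    ordered = []
    # Step 1: the model must decide on its own which tests to order.
    reply = query_model(context + "Which diagnostic tests do you order?")
    for test in ("blood work", "imaging"):
        if test in reply.lower():
            ordered.append(test)
            context += f"{test} results: {case[test]}\n"  # reveal only ordered tests
    # Step 2: diagnosis and treatment plan based solely on the gathered data.
    diagnosis = query_model(context + "State your final diagnosis.")
    plan = query_model(context + f"Diagnosis: {diagnosis}\nPropose a treatment plan.")
    return {"ordered_tests": ordered, "diagnosis": diagnosis, "plan": plan}
```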

The team found that none of the large language models consistently requested all of the necessary examinations. In fact, the programs' diagnoses became less accurate the more case information they were given. They often ignored treatment guidelines and sometimes ordered examinations that would have had serious health consequences for real patients.

Comparison with human doctors

In the second part of the study, the researchers compared AI diagnoses for a subset of the data with diagnoses made by four doctors. While the human doctors were correct in 89% of the diagnoses, the best large language model achieved only 73%. Each model recognized some conditions better than others: in one extreme case, a model correctly diagnosed gallbladder inflammation in only 13% of cases.
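
Per-condition figures such as the 13% result come from breaking accuracy down by diagnosis. A minimal sketch of such a tally, assuming predictions and gold labels are parallel lists of strings (again an illustration, not the study's code):

```python
from collections import defaultdict

def per_diagnosis_accuracy(predictions, gold_labels):
    """Accuracy broken down by condition, e.g. how often gallbladder
    inflammation (cholecystitis) cases were labeled correctly."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold in zip(predictions, gold_labels):
        total[gold] += 1
        correct[gold] += pred == gold
    return {d: correct[d] / total[d] for d in total}

# e.g. {'cholecystitis': 0.13, 'diverticulitis': 0.78, ...}  (illustrative values)
```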

Another problem that rules out routine clinical use is the models' lack of robustness: the diagnosis produced by a large language model depended, among other things, on the order in which it received the information. Linguistic subtleties also influenced the result, for example whether the program was asked for a 'Main Diagnosis', a 'Primary Diagnosis', or a 'Final Diagnosis'. In clinical practice these terms are usually interchangeable.
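
Such sensitivity can be measured directly. Below is a minimal sketch of a consistency check: the three phrasings are the ones named in the article, while the function names, the shuffling, and the agreement metric are illustrative assumptions.

```python
import random

PHRASINGS = ("Main Diagnosis", "Primary Diagnosis", "Final Diagnosis")

def query_model(prompt):
    """Placeholder for the same local open-source model call as in the earlier sketch."""
    raise NotImplementedError

def consistency_score(case_sections):
    """Query the model with interchangeable phrasings and a shuffled order
    of information; a robust model should give the same diagnosis each time."""
    answers = []
    for phrase in PHRASINGS:
        sections = list(case_sections)
        random.shuffle(sections)             # vary the order of information
        prompt = "\n".join(sections) + f"\nState the {phrase}."
        answers.append(query_model(prompt))
    most_common = max(set(answers), key=answers.count)
    return answers.count(most_common) / len(answers)  # 1.0 = fully consistent
```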

Why ChatGPT was not tested

The team deliberately chose not to test the commercial large language models from OpenAI (ChatGPT) and Google, for two reasons. First, the provider of the hospital data prohibited the use of these models for data protection reasons. Second, experts strongly advise using only open-source software for healthcare applications. “Only with open-source models do hospitals have the control and knowledge needed to ensure patient safety. We have to know what data a model was trained on in order to evaluate it properly. Companies typically keep their training data secret, which makes unbiased assessments difficult,” said Paul Hager. “It is also risky to build critical medical infrastructure on external services that can update and modify their models however they please. In the worst case, a service that numerous clinics depend on could be shut down because it is not profitable.”

Rapid technological advancement

This technology is advancing rapidly. “It is conceivable that in the future a large language model could excel at deriving diagnoses from medical histories and test results,” remarks Prof. Daniel Rückert. “We have therefore made our evaluation platform accessible to all research groups interested in assessing large language models in a clinical setting.” Rückert envisions a significant role for the technology in supporting physicians, for example in discussing cases. “However, users must remain aware of the technology's limitations and quirks when developing applications,” emphasized the medical AI expert.