
Ensuring Accuracy: The Importance of Human Oversight in AI Advancements

State-of-the-art artificial intelligence systems known as large language models (LLMs) are poor medical coders, according to researchers at the Icahn School of Medicine at Mount Sinai. Their study, published in the April 19 online issue of NEJM AI, highlights the need to refine and validate these technologies before using them in clinical settings. The study analyzed a list of more than 27,000 unique diagnosis and procedure codes from a year of routine care in the Mount Sinai Health System, excluding any identifiable patient information.

Using the description for each code, the researchers prompted models from OpenAI, Google, and Meta to output the most accurate medical codes. The generated codes were then compared with the original codes, and the errors were analyzed for patterns.

The researchers found that all of the large language models studied, including GPT-4, GPT-3.5, Gemini-pro, and Llama-2-70b, showed limited accuracy (below 50 percent) in reproducing the original medical codes, a significant gap in their usefulness for medical coding. GPT-4 performed best, with the highest exact match rates for ICD-9-CM (45.9 percent), ICD-10-CM (33.9 percent), and CPT codes (49.8 percent).
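The exact match comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's own evaluation code: the function names, the normalization rule (ignoring case, whitespace, and dots), and the toy records are all assumptions made for the example.

```python
from collections import defaultdict


def normalize(code: str) -> str:
    """Uppercase and drop whitespace and dots (an assumed normalization rule)."""
    return code.strip().upper().replace(".", "")


def exact_match_accuracy(records):
    """Compute the exact match rate per code system.

    records: iterable of (code_system, original_code, generated_code) tuples.
    Returns a dict mapping each code system to its fraction of exact matches.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for system, original, generated in records:
        totals[system] += 1
        if normalize(original) == normalize(generated):
            hits[system] += 1
    return {system: hits[system] / totals[system] for system in totals}


# Made-up toy records for illustration only; these are not data from the study.
toy_records = [
    ("ICD-10-CM", "E11.9", "E11.9"),  # exact match
    ("ICD-10-CM", "I10", "I11.9"),    # mismatch
    ("CPT", "99213", "99213"),        # exact match
    ("CPT", "99214", "99213"),        # mismatch
]

accuracy = exact_match_accuracy(toy_records)
print(accuracy)
```

Grouping by code system mirrors how the results are reported separately for ICD-9-CM, ICD-10-CM, and CPT codes. A stricter comparison could skip the normalization step entirely.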

GPT-4 also had the highest rate of generating codes that, although incorrect, still conveyed the correct meaning. For instance, when provided with the ICD-9-CM description “nodular prostate without urinary obstruction,” GPT-4 produced a code for “nodular prostate,” illustrating its relatively sophisticated understanding of medical terminology. However, even counting these technically correct codes, an unacceptably large number of errors remained.

The next best-performing model, GPT-3.5, tended to be more vague. It had the highest proportion of incorrectly generated codes that were accurate but more general than the precise codes. For example, when given the ICD-9-CM description “unspecified adverse effect of anesthesia,” GPT-3.5 produced a code for “other specified adverse effects, not elsewhere classified.”

“Our research highlights the crucial importance of thorough evaluation and improvement before implementing AI technologies in sensitive operational areas such as medical coding,” explains study corresponding author Ali Soroush, MD, MS, Assistant Professor of Data-Driven and Digital Medicine (D3M), and Medicine (Gastroenterology), at Icahn Mount Sinai. The researchers caution that while AI has great potential, it needs to be approached with care and continuous development to ensure its reliability and effectiveness in health care.

One potential use for these models in the health care industry is automating the assignment of medical codes for reimbursement and research based on clinical text.

Previous studies have shown that newer large language models have difficulty with numerical tasks, but their accuracy in assigning medical codes from clinical text had not been thoroughly investigated across different models, says co-senior author Eyal Klang, MD, Director of the D3M’s Generative AI Research Program. “So, we wanted to see if these models could accurately match a medical code to its official text description.”

The researchers suggested that combining LLMs with expert knowledge could automate the extraction of medical codes, which could improve billing accuracy and lower administrative costs in health care.

“This study highlights the current strengths and limitations of AI in health care, highlighting the importance of careful consideration and further improvement before implementation,” says co-senior author Girish Nadkarni, MD, MPH, Irene and Dr. Arthur M. Fishberg Professor of Medicine at Icahn Mount Sinai, Director of The Charles Bronfman Institute of Personalized Medicine, and System Chief of D3M.

The researchers acknowledge, however, that the study’s artificial task may not accurately reflect real-world scenarios. The research team’s next step is to create personalized LLM tools for precise medical data extraction and billing code assignment to enhance the quality and efficiency of health care operations.

The study is titled “Generative Large Language Models Are Poor Medical Coders: A Benchmarking Analysis of Medical Code Querying.” It was authored by several individuals affiliated with Icahn Mount Sinai, including Benjamin S. Glicksberg, Eyal Zimlichman, Yiftach Barash, Robert Freeman, and Alexander W. Charney. Support for this research was provided by the AGA Research Foundation’s 2023 AGA-Amgen Fellowship-to-Faculty Transition Award AGA2023-32-06 and an NIH UL1TR004419 award. The researchers affirm that the study was conducted without the use of any Protected Health Information (“PHI”).