An experiment was conducted to evaluate six generative large language models against students enrolled in an online introductory course on biomedical and health informatics. Results indicated that the AI models outperformed up to three-quarters of the students in the course.
William Hersh, M.D., a long-time educator of medical and clinical informatics at Oregon Health & Science University, became intrigued by the increasing role of artificial intelligence in education. He questioned how AI would perform in his classroom.
As a result, he decided to conduct an experiment.
He evaluated six generative large language models, including ChatGPT, in an online version of his esteemed introductory course in biomedical and health informatics, to see how they measured up against actual students. The findings, published in the journal npj Digital Medicine, indicated that the AI models performed better than up to three-quarters of the human students.
“This raises concerns about potential cheating, but there’s a more significant issue at play,” Hersh expressed. “How can we be sure that our students are genuinely learning and mastering the information and skills needed for their future careers?”
As a professor specializing in medical informatics and clinical epidemiology in the OHSU School of Medicine, Hersh is particularly sensitive to emerging technologies. “The integration of technology in education is not new,” he noted, reflecting on his own high school experience in the 1970s, which saw the shift from slide rules to calculators.
However, he emphasized that the advent of generative AI signifies a dramatic leap forward.
“Clearly, everyone should possess a fundamental level of knowledge in their respective fields,” Hersh stated. “What foundational knowledge should we expect individuals to have in order to think critically?”
Large-language models
Hersh, along with co-author Kate Fultz Hollis, an informatician at OHSU, analyzed the knowledge assessment scores of 139 students enrolled in the 2023 introductory course on biomedical and health informatics. They submitted course-related assessment materials to six generative AI models. Depending on the model, the AI scored between the 50th and 75th percentile on both multiple-choice and short-answer questions from quizzes and a final exam.
“The findings from this study pose major questions for the future of student evaluation across numerous academic fields,” the authors asserted.
This research marks the first instance of a thorough comparison between large-language models and students over the entirety of an academic course within the biomedical sector. Hersh and Fultz Hollis noted that knowledge-based subjects like this one may be particularly suitable for generative AI, unlike more participative courses that focus on developing complex skills and capabilities.
Hersh reminisced about his medical school days.
“During my time as a medical student, one of my supervisors advised me to have all the knowledge memorized,” he recollected. “Even in the 1980s, that was a significant challenge. The extent of medical knowledge has long exceeded an individual’s capacity to memorize everything.”
Keeping the human element
Nonetheless, he believes there is a delicate balance between effectively utilizing technological resources for enhancing education and becoming overly dependent, to the detriment of real learning. The core aim of an academic health center like OHSU is to produce healthcare professionals adept at patient care and proficient in utilizing clinical data and information in practical scenarios.
In that regard, he stressed that medicine will always require a human touch.
“Healthcare professionals often handle straightforward tasks, but there are also complex situations demanding critical judgment calls,” he explained. “In those circumstances, having a broader viewpoint is advantageous, even if one doesn’t have every detail memorized.”
As the fall term approaches, Hersh is not particularly concerned about cheating.
“I refresh the course content annually,” he noted. “In any scientific sphere, advancements occur constantly, and generative AI may not always be current with all developments. This means we will need to create newer or more nuanced assessments that won’t yield straightforward answers from ChatGPT.”