A new study led by researchers at UCL reveals that large language models, a type of AI that analyzes text, can more accurately predict the results of proposed neuroscience studies than human experts.
Published in Nature Human Behaviour, the study shows that LLMs trained on large text corpora can identify patterns in scientific papers, enabling them to forecast experimental outcomes more accurately than human experts.
The researchers say this points to the potential of LLMs as tools for accelerating research, going well beyond simple knowledge retrieval.
Dr. Ken Luo, the lead author from UCL Psychology & Language Sciences, stated: “With the rise of generative AI like ChatGPT, much research has concentrated on the question-answering capabilities of LLMs, highlighting their exceptional talent in summarizing vast amounts of learned information. However, instead of focusing solely on their ability to retrieve past data, we aimed to see if LLMs can synthesize knowledge to predict future results.”
“Advancing science typically relies on experimentation through trial and error, which can be time-consuming and costly. Even the most adept researchers might miss critical insights found in the literature. Our research aims to determine whether LLMs can spot patterns in extensive scientific texts and forecast experimental results.”
The international team began by developing BrainBench, a benchmark for assessing how well LLMs can predict the results of neuroscience studies.
BrainBench consists of pairs of neuroscience study abstracts. In each pair, one abstract is from a genuine study and describes the research background, methods, and actual results; the other has the same background and methods but presents plausible yet incorrect results crafted by neuroscience experts.
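For illustration, a single BrainBench item could be represented with a small data structure along the lines of the hypothetical Python sketch below; the class name, field names, and the way the two candidate abstracts are assembled are assumptions for clarity, not the benchmark's actual format.

```python
# Hypothetical sketch of a single BrainBench item: shared background and
# methods paired with real and expert-altered results sections.
from dataclasses import dataclass

@dataclass
class BrainBenchItem:
    background: str       # shared study background
    methods: str          # shared methods description
    real_results: str     # results reported by the published study
    altered_results: str  # plausible but incorrect results written by experts

    def abstracts(self) -> tuple[str, str]:
        """Assemble the two candidate abstracts a model (or expert) must choose between."""
        stem = f"{self.background} {self.methods} "
        return stem + self.real_results, stem + self.altered_results
```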
The team tested 15 different general-purpose LLMs against 171 human neuroscience experts (all screened to confirm their expertise) to see whether the AI or the human experts could correctly identify the abstract containing the real study results.
All LLMs surpassed the human experts, achieving an average accuracy of 81% compared with the experts’ 63%. Even when the researchers restricted the comparison to responses from people with the highest self-assessed expertise in a given area of neuroscience, the experts’ accuracy was still lower, at 66%. The researchers also found that when LLMs were more confident in their answers, they were more likely to be correct.* This suggests a future in which human experts could collaborate with well-calibrated AI models.
The researchers then adapted an existing LLM (a version of Mistral, an open-source model) by fine-tuning it on neuroscience literature. The resulting model, named BrainGPT, was even better at predicting study results, achieving 86% accuracy compared with 83% for the general-purpose version of Mistral.
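As a rough illustration of this kind of domain adaptation, the sketch below fine-tunes an open-weight causal language model on a plain-text corpus using low-rank adapters (LoRA). The model name, file path, hyperparameters, and the choice of the Hugging Face transformers, peft, and datasets libraries are assumptions for the example; the study's actual training recipe is not reproduced here.

```python
# Sketch: adapt a general-purpose causal LM to neuroscience text with LoRA adapters.
# All names and hyperparameters are illustrative, not the study's settings.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

base = "mistralai/Mistral-7B-v0.1"            # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Train only small low-rank adapter matrices instead of all base-model weights.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Hypothetical corpus: one neuroscience abstract or article per line.
corpus = load_dataset("text", data_files={"train": "neuro_corpus.txt"})["train"]
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
                       batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="braingpt-sketch", num_train_epochs=1,
                           per_device_train_batch_size=1, gradient_accumulation_steps=8,
                           learning_rate=2e-4, logging_steps=50),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```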
Professor Bradley Love, a senior author from UCL Psychology & Language Sciences, remarked: “Given our findings, we believe it won’t be long until scientists utilize AI tools to design optimal experiments for their questions. Although our focus was neuroscience, our methodology is broadly applicable across all scientific domains.”
“What is truly remarkable is the predictive power of LLMs regarding neuroscience literature. This success implies much of science is not as novel as believed but aligns with existing patterns found in research. It raises a question about whether scientists are being innovative and exploratory enough.”
Dr. Luo added: “Building upon our findings, we are creating AI tools to aid researchers. We envision a future wherein scientists can present their proposed experiment designs and expected outcomes, with AI providing predictions about the likelihood of various results. This would facilitate quicker iterations and enhance decision-making in experimental design.”
The study received support from the Economic and Social Research Council (ESRC), Microsoft, and a Royal Society Wolfson Fellowship, and included researchers from UCL, University of Cambridge, University of Oxford, Max Planck Institute for Neurobiology of Behavior (Germany), Bilkent University (Turkey), along with various institutions across the UK, US, Switzerland, Russia, Germany, Belgium, Denmark, Canada, Spain, and Australia.
Note:
* When presented with two abstracts, the LLM assigns each a perplexity score reflecting how surprising the text is given the model’s learned knowledge and the shared context (background and methods); the abstract with the lower perplexity is taken as the model’s choice. The researchers gauged an LLM’s confidence as the difference in perplexity between the real and altered abstracts: the larger the gap, the higher the confidence, and higher-confidence choices were more likely to be correct.
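A minimal sketch of this two-alternative scoring, assuming a Hugging Face causal language model: for simplicity it scores each full abstract rather than only the results section conditioned on the shared background and methods, and the model name is a placeholder.

```python
# Score two candidate abstracts by perplexity and pick the less surprising one;
# the perplexity gap serves as a rough confidence measure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def perplexity(text: str) -> float:
    """Exponentiated average negative log-likelihood the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per predicted token
    return torch.exp(loss).item()

def choose(abstract_a: str, abstract_b: str) -> tuple[str, float]:
    """Return the label of the lower-perplexity abstract and the confidence proxy."""
    ppl_a, ppl_b = perplexity(abstract_a), perplexity(abstract_b)
    return ("A" if ppl_a < ppl_b else "B"), abs(ppl_a - ppl_b)
```

The confidence proxy here mirrors the note above: the wider the gap between the two perplexities, the more clearly the model prefers one abstract over the other.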