A new study from researchers at UCL has found that large language models (LLMs), the technology behind popular generative AI platforms such as ChatGPT, give varying responses when asked to answer the same reasoning test and do not improve when given additional context. The findings, published in Royal Society Open Science, tested the most advanced LLMs using tests drawn from cognitive psychology.
Psychology tests of this kind are increasingly used to measure the reasoning capacity of artificial intelligence (AI), on the view that we should understand how these systems reason before assigning them tasks, especially tasks that involve decision-making. LLMs, which power generative AI apps like ChatGPT, have grown increasingly sophisticated, and their ability to create realistic text, images, audio and video has raised concerns. At the same time, they are prone to fabricating information, responding inconsistently and making mistakes in simple maths problems.
In the study, researchers from UCL examined the ability of seven LLMs to engage in rational reasoning. The authors defined a rational agent as one that reasons according to the rules of logic and probability, while an irrational agent does not adhere to these rules¹. The LLMs were given 12 common cognitive psychology tests to assess their reasoning skills, including the Wason task, the Linda problem and the Monty Hall problem². Human performance on these tasks is itself low: in recent studies, only 14% of participants solved the Linda problem.
In their attempts to answer the Wason task correctly, the models showed consistent irrationality and made basic mistakes, such as mistaking consonants for vowels, indicating a lack of true understanding of the task at hand. Even the best-performing model, GPT-4, only achieved a 90% accuracy rate, leaving room for improvement. The wide variation in performance among the different models also suggests there is still much to learn about how to effectively design and train AI systems for logical reasoning tasks.
While most humans would also fail to answer the Wason task correctly, it is unlikely that this would be because they do not know what a vowel is.
Olivia Macmillan-Scott, first author of the study from UCL Computer Science, said: “Based on our study results and other research on Large Language Models, it’s evident that these models do not yet ‘think’ like humans.”
“However, the model with the largest dataset, GPT-4, showed significant improvement compared to other models, indicating rapid progress. Nonetheless, it’s challenging to determine how this particular model reasons, since it operates as a closed system. It’s possible that there are other tools in use that you wouldn’t find in its predecessor, GPT-3.5.”
Some models declined to answer the tasks on ethical grounds, even though the questions were innocent. This is likely the result of safeguarding parameters that are not functioning as intended.
The researchers also provided additional context for the tasks, which has been shown to improve people’s responses. However, the LLMs tested did not show any consistent improvement.
Professor Mirco Musolesi, senior author of the study from UCL Computer Science, said: “The capabilities of these models are extremely surprising, especially for people who have been working with computers for many years. One interesting thing is that we still don’t fully understand how Large Language Models behave and why they sometimes get answers right or wrong. We now have ways to fine-tune these models, but it raises a question: if we try to fix their problems by teaching them, are we also passing on our own flaws? It’s fascinating how these LLMs make us reflect on our own reasoning and biases, and on whether we actually want perfectly rational machines. Do we want something that makes mistakes like we do, or do we want them to be flawless?”
The models tested were GPT-4, GPT-3.5, Google Bard, Claude 2, Llama 2 7b, Llama 2 13b and Llama 2 70b.
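The kind of inconsistency the study reports, where the same model gives different answers to an identical reasoning question, is straightforward to probe. Below is a minimal sketch of such a check; the `ask_model` function here simply fakes answers at random as a stand-in for a real LLM API call, and the names and prompt are assumptions made for this illustration, not the study's actual test harness.

```python
# Minimal sketch of a consistency probe: pose the same reasoning task to a model
# several times and tally the distinct answers it gives. `ask_model` below is a
# stand-in that fakes model output at random; swap in a real LLM API call for an
# actual evaluation (this is not the study's test harness).
import random
from collections import Counter


def ask_model(prompt: str) -> str:
    """Stand-in for a real LLM call: returns one of a few plausible answers
    at random, to mimic the response variability described in the article."""
    return random.choice(["E and 7", "E and 4", "E, 4 and 7"])


def probe_consistency(prompt: str, n_trials: int = 10) -> Counter:
    """Ask the same question n_trials times and count each distinct answer.
    A consistent, rational reasoner would produce one dominant answer."""
    answers = Counter()
    for _ in range(n_trials):
        answers[ask_model(prompt).strip()] += 1
    return answers


if __name__ == "__main__":
    wason_prompt = (
        "Rule: if a card has a vowel on one side, it has an even number on the "
        "other side. Cards show E, K, 4 and 7. Which cards must be turned over "
        "to check the rule?"
    )
    print(probe_consistency(wason_prompt))
```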
1 Stein E. (1996). Without Good Reason: The Rationality Debate in Philosophy and Cognitive Science. Clarendon Press.
2 These tasks and their solutions are available online. An example is the Wason task:
The Wason task
Consider the following rule: If a card has a vowel on one side, it has an even number on the other side.
You are presented with four cards:
- a) E
- b) K
- c) 4
- d) 7
Which of these cards must be turned over to check the rule?
Response: a) E and d) 7, because these are the only two cards that could break the rule.
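For readers who want the reasoning spelled out, here is a short Python sketch (an illustration written for this article, not code from the study) that checks which cards could falsify the rule. Only a visible vowel (the hidden side might be odd) or a visible odd number (the hidden side might be a vowel) can reveal a violation, which is why E and 7 are the cards to turn over.

```python
# Illustrative sketch of the Wason selection task logic; the card encoding and
# function names are assumptions made for this example, not the study's code.

VOWELS = set("AEIOU")


def is_vowel(face: str) -> bool:
    return face.upper() in VOWELS


def is_odd_number(face: str) -> bool:
    return face.isdigit() and int(face) % 2 == 1


def must_turn_over(visible_face: str) -> bool:
    """Return True if the hidden side could falsify the rule
    'if a card has a vowel on one side, it has an even number on the other'."""
    if is_vowel(visible_face):
        # The hidden side might be an odd number, which would break the rule.
        return True
    if is_odd_number(visible_face):
        # The hidden side might be a vowel, which would break the rule.
        return True
    # A consonant or an even number can never reveal a violation: the rule says
    # nothing about consonants, and 'vowel implies even' does not mean
    # 'even implies vowel'.
    return False


if __name__ == "__main__":
    cards = ["E", "K", "4", "7"]
    print([card for card in cards if must_turn_over(card)])  # ['E', '7']
```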