Artificial Intelligence has become adept at language, art creation, and even outsmarting chess champions. But can it handle abstract reasoning, the kind of tricky visual puzzles that can stump humans? Researchers from the USC Viterbi School of Engineering’s Information Sciences Institute (ISI) are putting AI’s cognitive skills to the test, evaluating multi-modal large language models (MLLMs) on visual tasks once considered the exclusive domain of human IQ tests. The results offer a window into how far AI has advanced and where it still struggles.
USC Viterbi ISI Research Assistants Kian Ahrabian and Zhivar Sourati recently investigated whether MLLMs can handle nonverbal abstract reasoning: tasks that require both visual understanding and logical thinking. They presented their findings at the Conference on Language Modeling (COLM 2024), held in Philadelphia, PA, October 7-9, 2024.
Jay Pujara, a research associate professor of computer science at USC Viterbi and co-author of the study, said, “Every day we are inundated with news about AI’s capabilities, which can often be quite unexpected. Our understanding of what new AI models can achieve is still very limited, and until we understand these limitations, we cannot make AI better, safer, and more useful. This study helps illuminate one of the areas where AI still struggles.”
The Challenge: Can AI See and Think?
“We wanted to see whether this latest generation of large models, which can process images, can reason on their own,” Ahrabian explained. “For example, if you see a yellow circle changing into a blue triangle, can the model apply the same pattern in a different scenario?”
To investigate, the team evaluated 24 different MLLMs on puzzles derived from Raven’s Progressive Matrices, a well-respected measure of abstract reasoning. They discovered that open-source models had significant difficulties. “They performed terribly. They couldn’t derive anything valuable from it,” Ahrabian candidly remarked.
In contrast, closed-source models like GPT-4V, which are developed by private companies and not available for public modification, performed better. These models typically benefit from more advanced training resources, including larger datasets and greater computing power, giving them a clear advantage. “We saw some noteworthy results with closed-source models,” Ahrabian added. “Notably, GPT-4V showed a reasonable capability for reasoning, though it’s still not flawless.”
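To make the setup concrete, here is a minimal sketch of what such an evaluation loop might look like, assuming an OpenAI-style chat completions API. The model name, puzzle file names, answer letters, and prompt wording are illustrative assumptions, not the paper’s actual harness.

```python
# Hypothetical sketch: send a Raven's-style puzzle image to a multi-modal
# model and compare its answer to the ground truth. File names, the model
# name, and the answer format are illustrative assumptions.
import base64
from pathlib import Path

from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()

def ask_model(image_path: str) -> str:
    """Send one puzzle image and ask the model to pick the missing panel."""
    image_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for a closed-source MLLM such as GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "This is a Raven's Progressive Matrices puzzle. "
                         "Which option (A-H) completes the 3x3 grid? "
                         "Answer with a single letter."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

# Score the model over a small (hypothetical) set of labeled puzzles.
puzzles = [("puzzle_001.png", "C"), ("puzzle_002.png", "F")]  # (path, answer)
correct = sum(ask_model(path) == answer for path, answer in puzzles)
print(f"Accuracy: {correct / len(puzzles):.0%}")
```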
Where the AI Stumbles
A vital aspect of the study involved pinpointing where these models were falling short. A significant concern was the AI’s capacity to accurately process visual data. “We aimed to find out whether the models could perceive finer details—like colors or lines intersecting—and if that was contributing to their errors,” Ahrabian explained.
To narrow down the issue, the researchers supplied detailed textual descriptions of the images, ensuring the models had all the relevant information in a different format. “Even when we removed the visual element and provided text alone, many models still struggled to reason effectively,” Sourati noted. This pointed to a crucial insight: the problem was not just visual processing but the reasoning itself. Knowing this helped the team pinpoint where the models fail and where future improvements should focus.
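This kind of ablation is straightforward to set up. The sketch below shows the text-only variant, assuming the same OpenAI-style API as above; the puzzle description is invented for illustration and is not from the study.

```python
# Sketch of the text-only ablation: the model receives a written description
# of each panel instead of the image, so any remaining failure points to
# reasoning rather than visual perception. The description below is invented.
from openai import OpenAI

client = OpenAI()

puzzle_description = (
    "Row 1: one black circle; two black circles; three black circles. "
    "Row 2: one white square; two white squares; three white squares. "
    "Row 3: one gray triangle; two gray triangles; [missing]. "
    "Options: A) three gray triangles, B) two black circles, C) one white square."
)

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in for any MLLM under test
    messages=[{
        "role": "user",
        "content": ("The following describes an abstract-reasoning puzzle. "
                    + puzzle_description
                    + " Which option completes the pattern? Answer with one letter."),
    }],
)
print(response.choices[0].message.content)
```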
The Path Forward: Improving AI’s Reasoning
One promising strategy the researchers examined was “Chain of Thought prompting,” in which the AI is guided to work through reasoning tasks step by step. This approach led to substantial improvements in some cases. “By giving the models hints, we saw performance improve by up to 100%,” Ahrabian said.
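The technique is easy to illustrate. The sketch below asks the same question twice, once directly and once with an instruction to reason step by step before answering; the question, prompt wording, and model name are assumptions for illustration, not the prompts used in the study.

```python
# Minimal Chain of Thought illustration: compare a direct prompt with one
# that asks the model to spell out its reasoning before answering.
from openai import OpenAI

client = OpenAI()

question = ("In each row the shape gains one side: triangle, square, pentagon. "
            "A new row starts with a square, then a pentagon. What comes next?")

direct_prompt = question + " Answer with the shape name only."
cot_prompt = (question + " Think step by step: first state the rule relating "
              "the shapes in a row, then apply it, then give the final shape.")

for label, prompt in [("direct", direct_prompt), ("chain of thought", cot_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in for any model under test
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"{label}: {response.choices[0].message.content}\n")
```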
Despite these persistent challenges, the researchers remain hopeful. The study’s findings highlight both the limitations of current AI and the promising avenues for future development. As these models evolve, USC’s research could play a pivotal role in creating AI that not only understands but also reasons, narrowing the gap between machine cognition and human thought.
New Research at a New Conference
Ahrabian and Sourati, Ph.D. candidates in the Thomas Lord Department of Computer Science, presented their paper, “The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models,” at COLM this week, the conference’s first edition.
Pujara, who also directs the Center on Knowledge Graphs at ISI, remarked, “AI is undergoing a significant transformation with the rise of language models. The emergence of new conferences like COLM to nurture this evolution is an excellent initiative to promote collaboration and inspire students keen on contributing to this swiftly progressing domain.”