People form beliefs about how well a large language model (LLM) performs based on their previous experiences with it. When there is a discrepancy between what a person believes and the model's actual capabilities, even a highly proficient model can falter in real-world applications.
One of the key strengths of LLMs is their versatility. The same machine learning model can help a graduate student draft an email or help a doctor diagnose cancer.
However, this broad applicability also poses challenges for systematic evaluation. Creating a benchmark dataset to evaluate a model against every conceivable question is simply not feasible.
In a recent paper, MIT researchers proposed a different strategy. They suggest that since humans decide when to use large language models, assessing a model’s performance must consider how individuals form beliefs about its abilities.
For instance, a graduate student needs to judge if the model can assist in drafting a specific email, while a clinician must evaluate which cases would best benefit from consulting the model.
Building on this concept, the researchers established a framework for evaluating an LLM based on its alignment with human beliefs regarding its performance on specific tasks.
They present a human generalization function, which models how people adjust their beliefs about an LLM’s abilities after interacting with it. The researchers then assess how well LLMs align with this function.
Their findings reveal that when models do not align well with the human generalization function, users might either overestimate or underestimate the model’s capabilities, potentially leading to unexpected failures. Interestingly, due to this misalignment, more advanced models can perform worse than simpler ones in critical situations.
“These tools are exciting because they are versatile, but we have to consider the human collaboration aspect,” says study co-author Ashesh Rambachan, an assistant professor of economics and principal investigator at the Laboratory for Information and Decision Systems (LIDS).
Joining Rambachan are lead author Keyon Vafa, a postdoc at Harvard University, and Sendhil Mullainathan, an MIT professor in the departments of Electrical Engineering and Computer Science and of Economics who is also a member of LIDS. Their research is set to be presented at the International Conference on Machine Learning.
Human Generalization
As we interact with others, we develop beliefs about their knowledge and abilities. For instance, if your friend often corrects grammar, you might assume they are also proficient at sentence construction, even if you’ve never tested this idea.
“Language models can seem very human-like. We aimed to demonstrate that this principle of human generalization applies to how people perceive language models,” says Rambachan.
To start, the researchers defined the human generalization function, which involves asking questions, observing how a person or an LLM responds, and then inferring how that person or model would respond to related questions.
If someone sees that an LLM can correctly answer questions about matrix inversion, they may conclude it can also handle simple arithmetic. A model that is misaligned with this function, one that fails on questions a person expects it to answer correctly, could fail when it is deployed.
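In code, this function can be thought of as a mapping from an observed interaction to a belief about a related question. The sketch below is only an illustration of that idea, with made-up probabilities; it is not the estimation procedure used in the paper.

```python
# A minimal sketch of the idea (illustrative only, not the paper's method):
# the human generalization function maps an observed interaction, a question
# plus whether it was answered correctly, to a belief about whether a
# related question will also be answered correctly.

from dataclasses import dataclass

@dataclass
class Observation:
    observed_question: str    # e.g. a matrix inversion question the person watched
    answered_correctly: bool  # whether the person or LLM got it right
    target_question: str      # the related question whose outcome is being judged

def human_generalization(obs: Observation) -> float:
    """Return an assumed belief that the target question will be answered correctly.

    This toy rule simply carries success or failure over to the related
    question; the paper instead measures the function from survey data.
    """
    return 0.9 if obs.answered_correctly else 0.2  # illustrative numbers only

belief = human_generalization(
    Observation("Invert this 2x2 matrix.", True, "What is 7 + 5?")
)
print(f"Believed chance of a correct answer on the related question: {belief:.2f}")
```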
With this definition established, the researchers created a survey to gauge how people generalize their experiences with LLMs and others.
Survey participants were shown questions that a person or an LLM had answered correctly or incorrectly, and were then asked whether they thought that person or LLM would answer a related question correctly. The survey produced a dataset of nearly 19,000 examples of how humans generalize about LLM performance across 79 diverse tasks.
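To make the structure of such data concrete, here is a hypothetical sketch of how records of this kind could be scored for alignment. The field layout and the simple matching rule are assumptions for illustration, not the released dataset's schema or the paper's metric.

```python
# Hypothetical sketch: score how often human generalization predicts what
# the LLM actually does on the related question.

from statistics import mean

# Each record pairs the human's belief about a related question with what
# the LLM actually did on that question.
records = [
    # (human_believes_llm_correct, llm_actually_correct)
    (True, True),
    (True, False),   # the human expected success but the model failed
    (False, False),
    (False, True),
]

def alignment_score(pairs):
    """Fraction of cases where the human's belief matches the LLM's actual behavior."""
    return mean(belief == actual for belief, actual in pairs)

print(f"Agreement with the human generalization function: {alignment_score(records):.2f}")
```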
Assessing Misalignment
The team found that participants performed well at predicting whether a human who answered one question correctly would correctly answer a related question, but they struggled more with generalizing about LLM performance.
“Human generalization is applied to language models, but it breaks down because LLMs do not exhibit expertise patterns in the same way humans do,” Rambachan explains.
Moreover, people were more likely to revise their beliefs about an LLM when it answered incorrectly than when it answered correctly. They also tended to believe that an LLM's performance on simple questions has little bearing on how it performs on more complex ones.
In situations where people put more weight on incorrect responses, simpler models outperformed more advanced models like GPT-4.
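One way to see how this can happen is with a toy scoring rule that penalizes unexpected failures, cases where a person expects a correct answer but the model gets it wrong, more heavily than other outcomes. The weighting and the example numbers below are assumptions for illustration, not results or metrics from the study.

```python
# Illustrative sketch: penalize "unexpected failures" (the person expects a
# correct answer but the model answers incorrectly) more than other outcomes.

def weighted_alignment(records, surprise_penalty=3.0):
    """records: list of (human_expects_correct, model_is_correct) pairs."""
    score = 0.0
    for expects_correct, is_correct in records:
        if expects_correct and not is_correct:
            score -= surprise_penalty   # the user relied on the model and it failed
        elif expects_correct == is_correct:
            score += 1.0                # belief matched actual behavior
    return score / len(records)

# A stronger but less predictable model can score worse than a weaker,
# more predictable one under this kind of weighting.
strong_but_surprising = [(True, True)] * 7 + [(True, False)] * 3   # 70% accurate
weak_but_predictable = [(True, True)] * 5 + [(False, False)] * 5   # 50% accurate

print(weighted_alignment(strong_but_surprising))  # (7 - 3 * 3.0) / 10 = -0.2
print(weighted_alignment(weak_but_predictable))   # 10 / 10 = 1.0
```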
“Better language models can mislead users into thinking they will perform well on related questions when they might actually not,” Rambachan states.
One possible reason people have trouble generalizing about LLMs is their novelty: people have far less experience interacting with LLMs than with other people.
“Over time, as we interact more with language models, we may see improvements,” he adds.
With this future in mind, the researchers plan to conduct more studies on how human beliefs about LLM capabilities evolve over time through interaction with models. They also aim to explore how to incorporate human generalization into the development of LLMs.
“When we train these algorithms or adjust them based on human feedback, we need to consider the human generalization function when measuring performance,” he emphasizes.
Meanwhile, the researchers hope their dataset can serve as a benchmark to examine how LLMs perform in relation to the human generalization function, potentially enhancing the effectiveness of models used in real-world applications.
This research was partially funded by the Harvard Data Science Initiative and the Center for Applied AI at the University of Chicago Booth School of Business.