Experts in linguistics and computer science have identified key reasons why AI large language models perform so poorly at mimicking human conversation.
In your daily conversations, pay attention to the natural moments that invite others to join in. Timing is crucial: speak up at the wrong moment and you can come across as overly aggressive, too shy, or simply awkward.
Taking turns to exchange ideas in conversation is a social skill, and while humans manage it fairly well most of the time, AI language systems struggle with it significantly.
A team of researchers at Tufts University has explored the reasons behind this gap in AI’s conversational abilities and proposed ways to make these systems better conversationalists.
During verbal interactions, people generally avoid talking over one another, taking turns to speak and listen. Each participant picks up on various cues to identify what linguists term “transition relevant places” (TRPs): points that occur regularly in conversation where a listener can either let the current speaker continue or take their own turn to speak.
JP de Ruiter, who specializes in psychology and computer science, notes that it was long believed that the “paraverbal” features of speech (tone, emphasis, pauses, and visual signals) were the primary indicators of TRPs.
“That helps a little bit,” de Ruiter explains, “but if you strip away the words and present only the prosody—the rhythm and melody of speech as if spoken through a sock—people struggle to identify TRPs.”
Conversely, if you provide just the linguistic content in a monotone voice, study participants can still recognize many of the same TRPs found in natural discussions.
“What we’ve realized is that the most crucial signal for turn-taking in conversation is the actual language content. The other signals are much less important,” de Ruiter states.
AI excels at spotting patterns in content. Yet when de Ruiter, graduate student Muhammad Umair, and research assistant professor Vasanth Sarathy tested a large language model on transcribed conversations, its ability to identify TRPs fell far short of human performance.
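The team’s evaluation setup isn’t reproduced here, but the basic idea can be illustrated. Below is a minimal sketch, assuming a small open model (GPT-2 via the Hugging Face transformers library) as a stand-in and using the predicted probability of a period as a crude end-of-turn signal at each word boundary; the model choice, the end-of-turn proxy, and the scoring scheme are illustrative assumptions, not the Tufts team’s method.

```python
# Hedged sketch: score each word boundary of an utterance by how strongly a
# causal language model predicts a turn-ending token (here, a period) there.
# High-scoring boundaries are candidate TRPs. Not the study's actual code.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # stand-in model
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def trp_scores(words, end_token="."):
    """Return P(end_token | words so far) after each word."""
    end_id = tokenizer.encode(end_token)[0]
    scores = []
    for i in range(1, len(words) + 1):
        ids = tokenizer.encode(" ".join(words[:i]), return_tensors="pt")
        with torch.no_grad():
            logits = model(ids).logits[0, -1]  # next-token logits
        scores.append(torch.softmax(logits, dim=-1)[end_id].item())
    return scores

utterance = "so I was thinking we could meet on Friday if that works".split()
for word, s in zip(utterance, trp_scores(utterance)):
    print(f"{word:>10s}  {s:.4f}")  # peaks suggest places a turn could end
```

A human listener, by contrast, tracks far more than the odds of a sentence-final token, which is one way to read the performance gap the researchers report.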
This limitation stems from the training data: AI models learn primarily from written text scraped from the internet, including Wikipedia articles, online forums, corporate websites, and news stories covering a vast range of topics. That corpus contains comparatively little transcribed spoken language, which is unscripted, uses simpler vocabulary and shorter sentences, and is structured differently from written prose.
Since AI wasn’t “raised” on conversation, it struggles to engage in dialogue in an organic, human-like way.
The researchers then asked whether a model trained primarily on written text could be improved by fine-tuning it on a smaller dataset of conversational exchanges, aiming for more natural dialogue. They found, however, that obstacles to human-like conversational ability remained.
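In outline, that refinement step is ordinary continued training on new text. Here is a minimal sketch, assuming a Hugging Face stack and a hypothetical file transcripts.txt containing turn-by-turn conversational transcripts, one utterance per line; the base model, data format, and hyperparameters are illustrative, not those from the study.

```python
# Hedged sketch of fine-tuning a written-text language model on
# conversational transcripts. All names and settings are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical corpus: one conversational turn per line.
data = load_dataset("text", data_files={"train": "transcripts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_set = data["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-conversational",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # continue causal-LM training on the spoken-language corpus
```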
They also warned that there might be an inherent limitation preventing AI from having a natural conversation. “We assume these large language models understand the content accurately, but that may not be true,” Sarathy noted. “They predict the next word based on superficial statistical patterns, while turn-taking requires deeper contextual understanding throughout a conversation.”
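Sarathy’s point is easy to see in code: at bottom, a causal language model assigns a probability to each possible next token given the text so far, and nothing else. A minimal illustration, again using GPT-2 purely as a stand-in:

```python
# Hedged sketch: inspect the top next-token predictions of a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok.encode("Are you coming to the", return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(lm(ids).logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode(i):>10s}  {p.item():.3f}")
```

Nothing in that computation models who holds the floor or what the speakers are trying to accomplish, which is the deeper, conversation-wide context that turn-taking appears to demand.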
“There is a chance these limitations can be addressed by pre-training large language models with a more extensive collection of naturally occurring spoken language,” said Umair, who focuses on human-robot interactions in his PhD research and is the lead author of the studies. “Though we have made available a new training dataset that aids AI in recognizing opportunities for speech within natural dialogue, gathering such extensive data to train current AI models poses a significant challenge. There simply isn’t as much conversational audio or transcripts compared to written material available online.”