Who is better at recognizing speech, humans or machines? A recent study indicates that modern automatic speech recognition (ASR) systems perform impressively well in noisy environments and in some cases outperform human listeners. However, these systems must be trained on vast amounts of data, whereas humans acquire comparable skills in far less time.
In recent years, ASR technology has made significant strides, particularly for widely spoken languages such as English. Before 2020, it was generally assumed that humans were far better than machines at understanding spoken language. However, some of the latest ASR systems are beginning to match human accuracy. The main goal in developing these systems has been to reduce the error rate, regardless of how humans perform in the same setting; after all, even humans do not recognize speech perfectly in noisy environments.
A new study by UZH computational linguist Eleanor Chodroff and her colleague Chloe Patman of the University of Cambridge evaluated two widely used ASR systems: Meta’s wav2vec 2.0 and OpenAI’s Whisper. They tested how well these systems recognized speech in speech-shaped noise (a steady background noise) and in realistic pub noise, with speakers either wearing or not wearing a cotton face mask.
OpenAI’s system shines — with one exception
The findings showed that humans generally retained an edge over both ASR systems. The exception was OpenAI’s most recent large ASR model, Whisper large-v3, which exceeded human performance in every tested condition except natural pub noise, where it merely matched human listeners. This result highlights Whisper’s ability to process the acoustic characteristics of speech and map them onto the intended message (the sentence). “This was astonishing as the sentences presented lacked context, making it challenging to predict specific words based on earlier words,” remarks Eleanor Chodroff.
Extensive training requirements
A closer look at these ASR systems underscores how remarkable human abilities are. Both systems rely on deep learning, but Whisper, the better performer, requires an enormous amount of training data. Meta’s wav2vec 2.0 was trained on 960 hours (equivalent to 40 days) of English audio, while the standard Whisper system was trained on over 75 years of speech data. The version that surpassed human performance was trained on an astounding 500 years of continuous speech. “Humans can achieve similar levels of performance in just a few years,” Chodroff states. “However, significant challenges persist for ASR in nearly all other languages.”
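To put these training figures on a common scale, the short sketch below converts them into hours. It uses only the numbers quoted above; the constant `DAYS_PER_YEAR`, the helper names, and the variable names are illustrative assumptions, and because the article gives the Whisper figures only approximately ("over 75 years", "500 years"), the results are rough estimates rather than official dataset sizes.

```python
# Rough unit conversions for the training-data figures quoted in the article.
# Assumption: a year is counted as 365.25 days of nonstop audio.
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25


def hours_to_days(hours: float) -> float:
    """Convert hours of continuous audio into days."""
    return hours / HOURS_PER_DAY


def years_to_hours(years: float) -> float:
    """Convert years of continuous audio into hours."""
    return years * DAYS_PER_YEAR * HOURS_PER_DAY


# Figures as stated in the article (the Whisper values are approximate).
wav2vec_hours = 960
whisper_years = 75
whisper_large_v3_years = 500

wav2vec_days = hours_to_days(wav2vec_hours)
whisper_hours = years_to_hours(whisper_years)
large_v3_hours = years_to_hours(whisper_large_v3_years)

# How many times more audio the largest Whisper model saw than wav2vec 2.0.
ratio = large_v3_hours / wav2vec_hours

print(f"wav2vec 2.0: {wav2vec_hours} hours = {wav2vec_days:.0f} days")
print(f"Whisper (standard): about {whisper_hours:,.0f} hours")
print(f"Whisper large-v3: about {large_v3_hours:,.0f} hours")
print(f"large-v3 vs. wav2vec 2.0: roughly {ratio:,.0f} times more audio")
```

Under these assumptions, the model that surpassed human listeners was trained on several thousand times more audio than wav2vec 2.0.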
Diverse error patterns
The research also highlights the different types of mistakes made by humans and ASR systems. English-speaking listeners almost always wrote grammatically correct sentences, but they often wrote sentence fragments rather than trying to transcribe every single word. By contrast, wav2vec 2.0 occasionally produced nonsensical output under the most challenging conditions. Whisper also tended to produce complete grammatical sentences, but it was more prone to filling gaps with entirely incorrect information.