Engineers have created an AI system that lets a person wearing headphones enroll a speaker simply by looking at them for a few seconds. Once enrolled, the system plays that speaker’s voice back in real time, even in noisy environments and as the listener moves around.

Noise-canceling headphones have become adept at creating a blank auditory canvas, but allowing specific sounds from the wearer’s surroundings to come through that erasure remains a challenge for researchers. The newest version of Apple’s AirPods Pro, for example, can automatically adjust sound levels when it detects that the wearer is in a crowded place. Real-time communication is essential, however, and it can be difficult to hear and understand a speaker in a noisy environment. This is frustrating, especially when the listener has little control over whom to listen to or when it happens.
To address this issue, a team from the University of Washington has developed an artificial intelligence system called “Target Speech Hearing.” This innovative system allows a user wearing headphones to focus on a person speaking for three to five seconds to “enroll” them. Once enrolled, the system cancels out all other sounds in the environment and plays only the enrolled speaker’s voice in real time. This means that the listener can move around in noisy places and still hear the speaker, even if they are no longer facing them.
The team shared its findings at the ACM CHI Conference on Human Factors in Computing Systems in Honolulu on May 14. The system has the potential to greatly improve the listening experience in noisy environments, and the proof-of-concept device is available for others to use and build on, though it is not currently available for commercial use.

According to senior author Shyam Gollakota, a professor in the Paul G. Allen School of Computer Science & Engineering at UW, AI is often associated with web-based chatbots, but this project focuses on using AI to personalize the auditory experience for headphone users, letting them clearly hear a single speaker in a noisy environment with multiple conversations.

To use the system, a person wearing off-the-shelf headphones with built-in microphones presses a button while directing their head toward the person speaking. The microphones pick up the speaker’s voice within a 16-degree margin of error on either side of the headset. The headphones then transmit this signal to an on-board computer running machine learning software, which learns the unique vocal patterns of the desired speaker. As the speaker continues to talk, the system becomes more adept at focusing on their voice, so it can keep playing it back to the listener with improved accuracy over time, even as both of them move around.
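To make this enroll-then-extract flow concrete, here is a minimal sketch of how such a pipeline might be structured: an enrollment encoder turns a few seconds of binaural audio into a speaker embedding, and a conditional extraction network then isolates that speaker from each incoming audio chunk. The module names, layer sizes, and PyTorch implementation below are illustrative assumptions, not the UW team’s code.

# Hypothetical sketch of a target-speech-hearing pipeline (not the authors' implementation).
import torch
import torch.nn as nn

class EnrollmentEncoder(nn.Module):
    """Maps a short binaural enrollment clip to a fixed-size speaker embedding."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(2, 64, kernel_size=400, stride=160)  # 2 channels: left/right mic
        self.gru = nn.GRU(64, emb_dim, batch_first=True)

    def forward(self, enroll_audio: torch.Tensor) -> torch.Tensor:
        # enroll_audio: (batch, 2, samples), e.g. 3-5 s at 16 kHz while facing the speaker
        feats = torch.relu(self.conv(enroll_audio)).transpose(1, 2)  # (batch, frames, 64)
        _, h = self.gru(feats)
        return h[-1]                                                 # (batch, emb_dim)

class TargetExtractor(nn.Module):
    """Masks the noisy mixture so only the enrolled speaker's voice remains."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.encode = nn.Conv1d(2, 256, kernel_size=16, stride=8)
        self.condition = nn.Linear(emb_dim, 256)
        self.mask = nn.Conv1d(256, 256, kernel_size=3, padding=1)
        self.decode = nn.ConvTranspose1d(256, 1, kernel_size=16, stride=8)

    def forward(self, mixture: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, 2, chunk_samples); speaker_emb: (batch, emb_dim)
        latent = torch.relu(self.encode(mixture))
        latent = latent * self.condition(speaker_emb).unsqueeze(-1)  # condition on the enrolled speaker
        mask = torch.sigmoid(self.mask(latent))
        return self.decode(latent * mask)                            # mono target-speaker audio

# Usage: enroll once, then stream short chunks through the extractor in real time.
encoder, extractor = EnrollmentEncoder(), TargetExtractor()
enroll_clip = torch.randn(1, 2, 16000 * 4)    # ~4 s of binaural audio captured during enrollment
speaker_emb = encoder(enroll_clip)
chunk = torch.randn(1, 2, 16000 // 10)        # 100 ms chunk from the headphone microphones
clean = extractor(chunk, speaker_emb)         # played back to the listener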
The system was tested on 21 subjects, who on average rated the clarity of the enrolled speaker’s voice nearly twice as high as that of the unfiltered audio.
This study builds on the team’s previous “semantic hearing” research, where users could choose specific sound classes they wanted to hear, such as birds or voices, and cancel out other sounds in the environment.
At the moment, the TSH system can enroll only one speaker at a time, and only when no other loud voice is coming from the same direction as the target speaker’s. If the user is not satisfied with the sound quality, they can run another enrollment on the speaker to improve the clarity.
The team is working to expand the system to earbuds and hearing aids in the future.
Other contributors to the study included Bandhav Veluri, Malek Itani, and Tuochao Chen, who are doctoral students at the Allen School, as well as Takuya Yoshioka, the research director at AssemblyAI. This research received funding from a Moore Inventor Fellow award, a Thomas J. Cabel Endowed Professorship, and a UW CoMotion Innovation Gap Fund.