Voice Search Tech Leap: Hinglish ASR Breakthrough

In a significant stride towards enhancing voice search technology, researchers Abhinav Goyal and Nikesh Garera have developed a novel approach to improve the accuracy and efficiency of Automatic Speech Recognition (ASR) systems, particularly for streaming applications like Voice Search. Their work focuses on creating accurate, low-latency ASR models for Hinglish, a blend of Hindi and English, which is widely used in India and other regions.

The researchers tackled a common challenge in streaming ASR systems: the trade-off between low latency and high accuracy. Unidirectional Long Short-Term Memory (LSTM) models trained with Connectionist Temporal Classification (CTC) can run in a streaming fashion, but they tend to lag behind full-context models in accuracy, because the future audio frames that non-streaming models exploit are simply not available in real-time scenarios. Goyal and Garera explored various modifications to the vanilla LSTM training process to boost accuracy without compromising the streaming capabilities.

One of the critical aspects they addressed was end-of-speech (EOS) detection, which is crucial for streaming applications to provide real-time feedback. They introduced a simple yet effective training and inference strategy for end-to-end CTC models that enables joint ASR and EOS detection. This approach eliminates the need for a separate voice activity detection (VAD) model, reducing latency and improving overall performance.
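The idea of joint ASR and EOS detection can be sketched as adding an end-of-speech token to the CTC vocabulary, so the decoder itself signals when to stop listening. The token name and the stopping rule below are illustrative assumptions, not the paper's exact recipe:

```python
# Hedged sketch: streaming greedy CTC decoding with an <eos> token in the
# vocabulary, so the model endpoints itself without a separate VAD model.
# Symbol names and the stop condition are our assumptions for illustration.

BLANK = "_"
EOS = "<eos>"

def streaming_decode(frame_label_stream):
    """Consume per-frame labels one at a time; halt as soon as <eos> appears."""
    decoded = []
    prev = None
    frames_consumed = 0
    for lab in frame_label_stream:
        frames_consumed += 1
        if lab == EOS:
            break  # endpoint detected: stop decoding immediately
        if lab != prev and lab != BLANK:
            decoded.append(lab)
        prev = lab
    return "".join(decoded), frames_consumed

text, used = streaming_decode(["h", "i", BLANK, EOS, "x", "x"])
# decoding halts at the <eos> frame; trailing frames are never consumed
```

Because the endpoint decision comes from the same network that does recognition, there is no second model to run and no hand-off delay, which is where the latency saving comes from.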

The researchers evaluated their model on Flipkart’s Voice Search platform, which handles a massive volume of approximately 6 million queries per day. The results were impressive, with their model achieving a word error rate (WER) of 3.69% without EOS and 4.78% with EOS. Moreover, the new model reduced search latency by approximately 1300 milliseconds, a significant 46.64% reduction compared to an independent VAD model.
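A quick back-of-the-envelope check ties the two latency figures together: a saving of about 1300 ms described as a 46.64% reduction implies a baseline endpointing latency of roughly 2800 ms with the independent VAD model. (This baseline figure is inferred from the reported numbers, not stated in the summary above.)

```python
# Implied baseline latency from the reported figures:
# saving of ~1300 ms == 46.64% of the VAD-based baseline.
saving_ms = 1300
reduction_fraction = 0.4664
implied_baseline_ms = saving_ms / reduction_fraction
print(round(implied_baseline_ms))  # ~2787 ms before the joint-EOS model
```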

The practical applications of this research are substantial, particularly in the realm of voice search and voice assistant technologies. By improving the accuracy and reducing the latency of ASR systems, this work paves the way for more responsive and efficient voice-based applications. For music and audio production, such advancements could enhance voice-controlled interfaces, making them more reliable and user-friendly. Additionally, accurate and low-latency ASR systems can improve real-time transcription services, benefiting musicians, producers, and audio engineers who rely on precise and timely transcriptions for their work. Read the original research paper here.
