Speech LLMs Fall Short of Human Perception in Groundbreaking HPSU Study

In the rapidly evolving landscape of speech technology, researchers have made significant strides with Speech Large Language Models (Speech LLMs), particularly in tasks like Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). However, a critical question remains: can these models truly match human-level auditory perception, especially when it comes to understanding the subtle nuances of real-world spoken language, such as latent intentions and implicit emotions? This question has been largely underexplored until now.

A team of researchers, including Chen Li, Peiji Yang, Yicheng Zhong, Jianxing Yu, Zhisheng Wang, Zihao Gou, Wenqing Chen, and Jian Yin, has introduced a new benchmark called Human-level Perception in Spoken Speech Understanding (HPSU), designed to rigorously evaluate the human-level perceptual and understanding capabilities of Speech LLMs. HPSU comprises over 20,000 expert-validated spoken language understanding samples in English and Chinese, covering tasks that range from basic speaker attribute recognition to complex inference of latent intentions and implicit emotions.

One of the significant challenges in developing such a benchmark is the scarcity of data and the high cost of manual annotation in real-world scenarios. To address this, the researchers developed a semi-automatic annotation process. This innovative approach fuses audio, textual, and visual information to enable precise speech understanding and labeling, thereby enhancing both annotation efficiency and quality.
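The article does not spell out the mechanics of that pipeline, but the general shape of a semi-automatic, confidence-gated annotation loop can be sketched in a few lines. In the hypothetical Python sketch below, the `Cues` and `Label` structures, the `fuse` rule, the 0.8 threshold, and the expert callback are all illustrative assumptions rather than details of the HPSU pipeline: automatic fusion of acoustic, textual, and visual cues proposes a draft label, and low-confidence cases are routed to a human expert.

```python
from dataclasses import dataclass
from typing import Callable

# Minimal sketch of a confidence-gated, semi-automatic annotation loop.
# All names, cue fields, and thresholds are illustrative assumptions.

@dataclass
class Cues:
    prosody: str   # acoustic cue, e.g. "flat" or "exaggerated" pitch contour
    text: str      # literal sentiment of the transcript, e.g. "positive"
    visual: str    # facial-expression cue from video, e.g. "smile" or "frown"

@dataclass
class Label:
    implicit_emotion: str
    confidence: float

def fuse(cues: Cues) -> Label:
    """Toy fusion rule: a mismatch between text and visual cues hints at implicit affect."""
    if cues.text == "positive" and cues.visual == "frown":
        return Label("implicit frustration", 0.9)
    if cues.text == "negative" and cues.visual == "smile":
        return Label("sarcasm", 0.85)
    return Label("neutral", 0.4)  # low confidence: defer to a human expert

def annotate(cues: Cues, expert_review: Callable[[Cues, Label], Label],
             threshold: float = 0.8) -> Label:
    """Auto-accept confident drafts; route uncertain ones to an expert annotator."""
    draft = fuse(cues)
    return draft if draft.confidence >= threshold else expert_review(cues, draft)

# Demonstration: an "expert" stub that simply confirms the draft label.
print(annotate(Cues(prosody="flat", text="positive", visual="frown"), lambda c, d: d))
```

The point of the gate is simply that only ambiguous samples consume expert time, which is how a pipeline like this could improve annotation efficiency without sacrificing label quality.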

The researchers systematically evaluated various open-source and proprietary Speech LLMs using the HPSU benchmark. The results were eye-opening: even the top-performing models fell considerably short of human capabilities in understanding genuine spoken interactions. This gap highlights the need for further development and refinement of Speech LLMs to achieve human-level perception and cognition.
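The summary does not detail the evaluation protocol, but benchmarks of this kind are typically scored by comparing per-sample model answers against expert-validated labels and a human baseline. The Python sketch below is purely illustrative: the sample schema, the `naive_model`, and the 0.95 human figure are assumptions for demonstration, not HPSU's release format or reported numbers.

```python
from typing import Callable

def evaluate(model: Callable[[str, str, list], str], samples: list) -> float:
    """Accuracy of a model on multiple-choice spoken-understanding samples."""
    correct = sum(model(s["audio"], s["question"], s["options"]) == s["answer"]
                  for s in samples)
    return correct / len(samples)

# Hypothetical samples in a made-up schema (audio path, question, options, gold answer).
samples = [
    {"audio": "clips/0001.wav",
     "question": "What does the speaker actually intend?",
     "options": ["genuine praise", "sarcasm"], "answer": "sarcasm"},
    {"audio": "clips/0002.wav",
     "question": "Which emotion is implied but never stated?",
     "options": ["contentment", "frustration"], "answer": "frustration"},
]

def naive_model(audio: str, question: str, options: list) -> str:
    return options[0]  # toy baseline: always pick the first option

human_baseline = 0.95  # placeholder figure for illustration, not a reported result

model_acc = evaluate(naive_model, samples)
print(f"model accuracy: {model_acc:.0%} vs. human baseline: {human_baseline:.0%}")
```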

The introduction of HPSU marks a significant step forward for speech technology. It provides a comprehensive framework for evaluating the capabilities of Speech LLMs and for guiding future research and development. As speech understanding systems grow more sophisticated and nuanced, HPSU offers a way to verify that these models truly comprehend the complexities of human communication, exposing the limits of the current state of the art while setting a clear direction for future work.
