On-device keyword spotting (KWS) systems, which let devices recognize and respond to specific spoken commands, are central to building private, low-latency, and energy-efficient voice interfaces. As voice technology matures, demand for such systems keeps growing, yet their development has been hampered by a persistent obstacle: the lack of specialized, multi-command training datasets.
Traditionally, data for KWS systems is collected through human recording, a process that is costly, slow, and difficult to scale. To address this, researchers Lu Gan and Xi Li have introduced SYNTTS-COMMANDS, a multilingual voice command dataset generated entirely with state-of-the-art Text-to-Speech (TTS) synthesis, using the CosyVoice 2 model together with speaker embeddings drawn from public corpora.
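To make the approach concrete, here is a minimal Python sketch of how such a corpus could be assembled. It is illustrative only: the `tts_wrapper` module, its `synthesize` and `load_speaker_embeddings` helpers, the command lists, and the paths are hypothetical stand-ins, not the actual SYNTTS-COMMANDS tooling or the CosyVoice 2 API.

```python
import csv
from pathlib import Path

import soundfile as sf

# Hypothetical wrapper around a TTS model such as CosyVoice 2; the real
# project exposes its own inference API, so treat these imports as
# illustrative placeholders.
from tts_wrapper import load_speaker_embeddings, synthesize

COMMANDS_EN = ["turn on the light", "turn off the light", "volume up"]
COMMANDS_ZH = ["开灯", "关灯", "增大音量"]  # example Chinese commands
SAMPLE_RATE = 16000  # a typical rate for KWS pipelines


def generate(commands, lang, out_dir, speakers):
    """Synthesize every (command, speaker) pair and log a manifest row."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / "manifest.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "command", "language", "speaker_id"])
        for label, text in enumerate(commands):
            for spk_id, embedding in enumerate(speakers):
                audio = synthesize(text, embedding)  # float waveform
                path = out_dir / f"{lang}_{label:03d}_{spk_id:04d}.wav"
                sf.write(path, audio, SAMPLE_RATE)
                writer.writerow([path.name, text, lang, spk_id])


# Speaker embeddings extracted from a public corpus (path is illustrative).
speakers = load_speaker_embeddings("public_corpus_embeddings/")
generate(COMMANDS_EN, "en", "syntts_commands/en", speakers)
generate(COMMANDS_ZH, "zh", "syntts_commands/zh", speakers)
```

The appeal of the approach is visible in the sketch: adding a new command or language means editing a text list and re-running the loop, rather than recruiting and recording new speakers.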
The researchers built a collection of English and Chinese commands and showed that synthetic speech can effectively replace human-recorded audio for training KWS classifiers. Benchmarks across a range of efficient acoustic models found that training on the synthetic dataset alone yields strong results: up to 99.5% accuracy on English and 98% on Chinese command recognition.
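The summary does not identify the specific acoustic models benchmarked, but depthwise-separable CNNs over MFCC features are a common design in this efficiency class for keyword spotting. The PyTorch sketch below shows what such a classifier looks like; the channel width, block count, and feature shape are illustrative defaults, not values taken from the paper.

```python
import torch
import torch.nn as nn


class DSConvBlock(nn.Module):
    """Depthwise-separable convolution: depthwise then pointwise."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)


class KWSNet(nn.Module):
    """Compact DS-CNN keyword classifier over MFCC 'images'."""

    def __init__(self, n_commands, n_mfcc=40):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
        )
        self.body = nn.Sequential(*[DSConvBlock(64, 64) for _ in range(4)])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # pool over time and frequency
            nn.Flatten(),
            nn.Linear(64, n_commands),
        )

    def forward(self, x):  # x: (batch, 1, n_mfcc, n_frames)
        return self.head(self.body(self.stem(x)))


# Shape check: a 1-second clip at 16 kHz with a 10 ms hop gives ~101 frames.
model = KWSNet(n_commands=20)
logits = model(torch.randn(8, 1, 40, 101))
print(logits.shape)  # torch.Size([8, 20])
```

At this scale the network has only a few tens of thousands of parameters, small enough to quantize and run on microcontroller-class hardware, which is exactly the setting the dataset targets.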
This research is a significant step forward for TinyML (Tiny Machine Learning), easing a data bottleneck that has long constrained the field. By providing a practical, scalable foundation for building on-device KWS systems, the work paves the way for more capable, private, and efficient voice interfaces on resource-constrained edge devices.
The dataset and source code are publicly available on GitHub, inviting further exploration and development from the broader research community in the shared effort to advance voice technology.