M3-TTS Revolutionizes Non-Autoregressive Speech Synthesis

Text-to-speech (TTS) technology has come a long way, but there’s still room for improvement, especially in non-autoregressive (NAR) TTS. NAR TTS promises faster, more efficient speech synthesis, yet it’s often held back by the need for precise length alignment between text and audio. Getting that alignment right is crucial for natural, expressive speech, but current methods rely on explicit duration modeling or pseudo-alignment, which can limit both the naturalness and the computational efficiency of the synthesized speech.
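To make the contrast concrete, here’s roughly what the duration-based approach M3-TTS sidesteps looks like. This is a minimal FastSpeech-style length-regulator sketch, not code from the paper; the `length_regulate` helper and the tensor shapes are purely illustrative.

```python
import torch

def length_regulate(text_emb: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """FastSpeech-style length regulator: expand each text token embedding by
    its predicted frame count so the text sequence matches the mel length.
    This explicit duration step is what joint attention can replace."""
    # text_emb: (T_text, dim); durations: (T_text,) integer frame counts.
    return torch.repeat_interleave(text_emb, durations, dim=0)

text_emb = torch.randn(5, 256)                 # 5 text tokens
durations = torch.tensor([3, 7, 2, 5, 4])      # predicted frames per token
frames = length_regulate(text_emb, durations)  # -> (21, 256) frame-level sequence
```

If the duration predictor is off, the expansion is off, and naturalness suffers downstream; that dependency is the bottleneck described above.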

Enter M3-TTS, a new NAR TTS paradigm developed by a team of researchers led by Xiaopeng Wang. M3-TTS stands for Multi-modal DiT Alignment & Mel-latent, and it’s built on a multi-modal diffusion transformer (MM-DiT) architecture. The key innovation here is the use of joint diffusion transformer layers for cross-modal alignment. This allows M3-TTS to achieve stable monotonic alignment between text and speech sequences of varying lengths, without the need for pseudo-alignment.
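The paper’s exact block design isn’t detailed here, but the core MM-DiT idea (each modality keeps its own norms and feed-forward weights while attending over one concatenated sequence) can be sketched in a few lines of PyTorch. Everything below, from the class name to the dimensions to the plain multi-head attention, is an illustrative assumption, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class JointDiTBlock(nn.Module):
    """Toy MM-DiT-style joint block: per-modality norms and MLPs,
    one attention pass over the concatenated text + speech sequence."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1_txt, self.norm1_spc = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm2_txt, self.norm2_spc = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_spc = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, txt: torch.Tensor, spc: torch.Tensor):
        # txt: (B, T_text, dim), spc: (B, T_speech, dim); lengths may differ.
        joint = torch.cat([self.norm1_txt(txt), self.norm1_spc(spc)], dim=1)
        attn_out, _ = self.attn(joint, joint, joint, need_weights=False)
        t = txt.shape[1]
        txt = txt + attn_out[:, :t]   # residual, text stream
        spc = spc + attn_out[:, t:]   # residual, speech stream
        txt = txt + self.mlp_txt(self.norm2_txt(txt))
        spc = spc + self.mlp_spc(self.norm2_spc(spc))
        return txt, spc

# Text (32 tokens) and noisy speech latents (200 frames) have different
# lengths; no duration predictor maps one onto the other. Alignment
# emerges from the joint attention itself.
txt = torch.randn(2, 32, 512)
spc = torch.randn(2, 200, 512)
txt, spc = JointDiTBlock()(txt, spc)
```

The design point is that neither stream is ever forcibly resized to match the other, which is exactly what lets the model skip pseudo-alignment.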

But M3-TTS doesn’t stop at alignment. It also uses single diffusion transformer layers to enhance the modeling of acoustic details, ensuring that the synthesized speech is not just natural but also high-fidelity. To top it off, the framework integrates a mel-VAE codec, which provides a threefold acceleration in training.
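The training speedup is easiest to see from the sequence lengths involved. Below is a toy mel-VAE sketch, assuming a convolutional encoder/decoder and a 4x temporal downsampling factor (both assumptions; the paper’s actual codec and compression ratio aren’t given here). The diffusion transformer then trains on the shorter latent sequence rather than raw mel frames.

```python
import torch
import torch.nn as nn

class MelVAE(nn.Module):
    """Illustrative mel-VAE: compresses a mel-spectrogram along time into a
    shorter latent sequence. The 4x stride here is an assumption, not the
    paper's actual compression ratio."""

    def __init__(self, n_mels: int = 80, latent_dim: int = 32, stride: int = 4):
        super().__init__()
        # Encoder emits 2 * latent_dim channels: mean and log-variance.
        self.encoder = nn.Conv1d(n_mels, 2 * latent_dim, kernel_size=stride, stride=stride)
        self.decoder = nn.ConvTranspose1d(latent_dim, n_mels, kernel_size=stride, stride=stride)

    def encode(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (B, n_mels, T) -> latent: (B, latent_dim, T // stride)
        mu, logvar = self.encoder(mel).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

mel = torch.randn(2, 80, 800)  # ~800 mel frames of audio
z = MelVAE().encode(mel)       # -> (2, 32, 200): a 4x shorter sequence
print(z.shape)                 # the diffusion model operates on this latent
```

Since transformer attention cost grows quadratically with sequence length, even a modest compression ratio translates into a large reduction in training compute, consistent with the reported threefold speedup.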

The researchers put M3-TTS to the test on the Seed-TTS and AISHELL-3 benchmarks, and the results were impressive. M3-TTS achieved state-of-the-art NAR performance, posting the lowest word error rates among NAR systems: 1.36% for English and 1.31% for Chinese. It also maintained competitive naturalness scores, showing that the accuracy gains don’t come at the expense of natural-sounding speech.

The code and demos for M3-TTS will be available at the project’s website, making it accessible for other researchers and developers to explore and build upon. This is a significant step forward for NAR TTS, and it will be exciting to see how this technology evolves and what new possibilities it unlocks in the field of speech synthesis.