In a significant leap forward for audio synthesis, researchers have introduced UniMoE-Audio, a groundbreaking model that unifies speech and music generation within a single dynamic-capacity framework, promising to revolutionize the way we create and interact with sound.
The team behind UniMoE-Audio, led by Zhenyu Liu and collaborators across several institutions, tackled a persistent challenge in the auditory domain: the separation of speech and music generation. This division has hindered the development of universal audio synthesis models due to inherent task conflicts and severe data imbalances. To bridge this gap, the researchers proposed a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Its Top-P routing strategy dynamically decides how many experts each token activates, letting the model adapt its compute to the complexity of the task at hand. The hybrid expert design comprises routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping, ensuring efficient and effective learning.
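To make the routing mechanism concrete, here is a minimal PyTorch sketch of an MoE layer combining Top-P routing with routed, shared, and null experts. It is an illustration under assumed hyperparameters (expert counts, a 0.7 threshold, simple feed-forward experts), not the paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopPMoELayer(nn.Module):
    """Sketch: hybrid MoE with Top-P routing (assumed hyperparameters)."""

    def __init__(self, dim: int, n_routed: int = 8, n_null: int = 2,
                 top_p: float = 0.7):
        super().__init__()
        self.top_p = top_p
        # The router scores routed experts AND null experts. Null experts
        # have no parameters and produce no output, so probability mass
        # routed to them simply skips computation for easy tokens.
        self.router = nn.Linear(dim, n_routed + n_null)
        self.routed_experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim))
            for _ in range(n_routed))
        # One shared expert is applied to every token, capturing
        # domain-agnostic features common to speech and music.
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        probs = F.softmax(self.router(x), dim=-1)            # (T, E)
        sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
        cum = sorted_p.cumsum(dim=-1)
        # Nucleus-style selection: keep the smallest prefix of experts whose
        # cumulative probability reaches top_p, so each token activates a
        # *variable* number of experts instead of a fixed top-k.
        keep = (cum - sorted_p) < self.top_p                 # (T, E) bool
        active = torch.zeros_like(probs, dtype=torch.bool)
        active.scatter_(-1, sorted_idx, keep)                # unsort the mask
        out = self.shared_expert(x)
        for e, expert in enumerate(self.routed_experts):     # null experts: no-op
            token_idx = active[:, e].nonzero(as_tuple=True)[0]
            if token_idx.numel() == 0:
                continue
            weight = probs[token_idx, e].unsqueeze(-1)
            out = out.index_add(0, token_idx, weight * expert(x[token_idx]))
        return out
```

Because the threshold applies to cumulative router probability, a token with a peaked routing distribution activates a single expert while an ambiguous one activates several, and any mass assigned to the parameter-free null experts is simply dropped, which is what allows computation to be skipped adaptively.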
To address the data imbalance issue, the researchers devised a three-stage training curriculum. The first stage, Independent Specialist Training, leverages the original datasets to instill domain-specific knowledge into each “proto-expert” without interference. In the second stage, MoE Integration and Warmup, these specialists are incorporated into the UniMoE-Audio architecture, warming up the gate module and shared expert on a subset of the balanced dataset. The final stage, Synergistic Joint Training, trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy.
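The staged freezing at the heart of this curriculum is straightforward to express in code. The sketch below, reusing the TopPMoELayer from above, shows one plausible wiring of Stages 2 and 3; the optimizer choice and learning rates are assumptions, not values from the paper:

```python
import torch

# Stage 1 is assumed done: the routed experts were each pre-trained as
# specialists on their own domain's original dataset.
layer = TopPMoELayer(dim=256)

# Stage 2: MoE Integration and Warmup. Only the gate (router) and the
# shared expert receive gradients; the specialist experts stay frozen so
# their domain knowledge is not disturbed while the router calibrates.
for p in layer.parameters():
    p.requires_grad = False
for module in (layer.router, layer.shared_expert):
    for p in module.parameters():
        p.requires_grad = True
warmup_optim = torch.optim.AdamW(
    (p for p in layer.parameters() if p.requires_grad), lr=1e-4)

# Stage 3: Synergistic Joint Training. Everything is unfrozen and the
# whole model trains end-to-end on the fully balanced dataset.
for p in layer.parameters():
    p.requires_grad = True
joint_optim = torch.optim.AdamW(layer.parameters(), lr=1e-5)
```

In practice each stage would also swap in its own data loader (the original datasets, then a balanced subset, then the fully balanced set), which is the part of the curriculum that directly counters the speech/music data imbalance.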
Extensive experiments demonstrated that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks but also shows superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. The model’s ability to dynamically allocate resources and adapt to different tasks makes it a powerful tool for various audio generation applications.
For music and audio production, UniMoE-Audio opens up new possibilities. Its unified approach to speech and music generation can streamline workflows, enabling creators to seamlessly integrate different audio elements within a single model. The dynamic-capacity framework allocates computation only where it is needed, making the model suitable for real-time applications and complex compositions. Moreover, its strong performance on both speech and music generation tasks suggests that UniMoE-Audio could be a valuable tool for sound designers, composers, and audio engineers, pushing the boundaries of creative expression in the auditory domain.
The researchers’ findings highlight the substantial potential of a specialized MoE architecture and a curated training strategy in advancing universal audio generation, paving the way for more innovative and integrated audio synthesis solutions. Read the original research paper here.