The world of audio generation has seen significant strides, but it has been largely stuck in mono, lacking the spatial cues that make sound feel truly immersive. Enter ViSAudio, an end-to-end framework that generates binaural spatial audio directly from silent video. This is a notable shift, as it bypasses the traditional two-stage pipeline that often leads to error accumulation and spatio-temporal inconsistencies.
The team behind ViSAudio, including Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, and Dahua Lin, has introduced the BiAudio dataset to support this task. It comprises approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, and was constructed through a semi-automated pipeline, providing a wide range of scenarios for training and evaluation.
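To make the shape of such data concrete, here is a minimal sketch of how one video-binaural audio pair might be organized. The field names (video_frames, audio_left, audio_right, camera_trajectory) are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical layout of a BiAudio-style sample; field names are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class BiAudioSample:
    video_frames: np.ndarray        # (T, H, W, 3) RGB frames from the silent video
    audio_left: np.ndarray          # (N,) waveform for the left ear
    audio_right: np.ndarray         # (N,) waveform for the right ear
    sample_rate: int                # audio sampling rate in Hz
    camera_trajectory: np.ndarray   # (T, 3) per-frame camera rotation, e.g. yaw/pitch/roll

def to_binaural(sample: BiAudioSample) -> np.ndarray:
    """Stack the two channels into a (2, N) binaural waveform."""
    return np.stack([sample.audio_left, sample.audio_right], axis=0)
```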
ViSAudio employs conditional flow matching with a dual-branch audio generation architecture. Two dedicated branches model the audio latent flows, one per channel, while a conditional spacetime module keeps the two channels consistent with each other without washing out their distinctive spatial characteristics. This design allows precise spatio-temporal alignment between the generated audio and the input video, adapting to viewpoint changes, sound-source motion, and diverse acoustic environments.
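To give a feel for the training objective, here is a minimal, self-contained sketch of conditional flow matching over two audio latent branches (left and right), conditioned on video features. The simple MLP backbone and all names here are illustrative assumptions; ViSAudio's actual architecture, in particular its conditional spacetime module, is considerably more sophisticated than this shared-conditioning placeholder.

```python
# Sketch of conditional flow matching for two audio latent branches (assumption, not the paper's code).
import torch
import torch.nn as nn

class BranchVelocityNet(nn.Module):
    """Predicts the flow velocity for one audio channel's latent."""
    def __init__(self, latent_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 512),  # +1 for the time step
            nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x_t, cond, t):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(left_net, right_net, z_left, z_right, video_cond):
    """Conditional flow matching loss summed over both branches.

    z_left / z_right: clean audio latents, shape (B, D)
    video_cond:       video-derived conditioning features, shape (B, C)
    """
    B = z_left.shape[0]
    t = torch.rand(B, 1)                       # random time step in [0, 1]
    loss = 0.0
    for net, z1 in ((left_net, z_left), (right_net, z_right)):
        z0 = torch.randn_like(z1)              # noise sample (start of the flow)
        zt = (1 - t) * z0 + t * z1             # point on the straight-line path
        target_v = z1 - z0                     # ground-truth velocity along the path
        pred_v = net(zt, video_cond, t)
        loss = loss + ((pred_v - target_v) ** 2).mean()
    return loss
```

At inference time, one would start from noise, integrate each branch's learned velocity field from t = 0 to t = 1 under the shared video conditioning, and decode the resulting latents back into left and right waveforms.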
The results are impressive. ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations. It generates high-quality binaural audio with spatial immersion that feels natural and coherent. This breakthrough could revolutionize the way we experience audio in virtual reality, video editing, and beyond.
The project website, https://kszpxxzmc.github.io/ViSAudio-project, offers more details and examples of ViSAudio’s capabilities. It’s an exciting development that pushes the boundaries of what’s possible in audio generation, bringing us one step closer to fully immersive, spatially aware soundscapes.



