In the world of multimedia production, the ability to manipulate audio with precision is paramount. Producers often need to handle audio tracks individually for editing, mixing, and creative control. However, current video-to-audio methods often fall short: given a scene with multiple sound sources, they generate one mixed soundtrack rather than the specific sound the user intends. This is where the work of researchers Junwon Lee, Juhan Nam, and Jiyoung Lee comes into play.
The team introduces a novel task: text-conditioned selective video-to-audio (V2A) generation. This approach aims to produce only the user-intended sound from a multi-object video. The challenge lies in the fact that visual features are often entangled, and region cues or prompts frequently fail to specify the sound source accurately. To tackle this, the researchers propose SelVA, a text-conditioned V2A model that uses text prompts as explicit selectors of the target sound source.
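To make the task concrete, here is a minimal sketch of what a selective V2A interface might look like in Python. The model object and the `generate_selective_audio` helper are hypothetical names for illustration, not the released API:

```python
from dataclasses import dataclass


@dataclass
class V2ARequest:
    video_path: str  # multi-object video, e.g. a dog barking next to a piano
    prompt: str      # text selector naming the intended sound source


def generate_selective_audio(model, request: V2ARequest):
    """Return audio for only the prompted source, not the full scene mixture."""
    # The prompt conditions the video encoding itself, so only
    # prompt-relevant visual features drive the audio generation.
    video_feats = model.encode_video(request.video_path, prompt=request.prompt)
    return model.generate_audio(video_feats)


# e.g. generate_selective_audio(model, V2ARequest("street.mp4", "a dog barking"))
```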
SelVA modulates the video encoder to extract distinctly prompt-relevant video features. It introduces supplementary tokens that sharpen cross-attention by suppressing text-irrelevant activations, which also keeps parameter tuning efficient. The result is robust semantic and temporal grounding: the generated audio aligns with the visual content in both meaning and timing.
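The mechanism can be pictured roughly as follows. This PyTorch sketch assumes a small set of learnable supplementary tokens appended to the prompt embeddings on the key/value side of a cross-attention layer, so attention mass that would otherwise land on text-irrelevant content is absorbed by those tokens. Names, shapes, and the layer's placement are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn


class PromptSelector(nn.Module):
    """Cross-attention that routes text-irrelevant content to supplementary tokens."""

    def __init__(self, dim: int = 512, n_supp: int = 8, n_heads: int = 8):
        super().__init__()
        # Learnable supplementary tokens that absorb prompt-irrelevant activations.
        self.supp_tokens = nn.Parameter(torch.randn(1, n_supp, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor):
        # video_feats: (B, T_v, dim) features from the video encoder
        # text_feats:  (B, T_t, dim) prompt embeddings
        B = video_feats.size(0)
        keys = torch.cat([text_feats, self.supp_tokens.expand(B, -1, -1)], dim=1)
        # Video tokens attend over text + supplementary tokens; attention mass
        # landing on the supplementary tokens is effectively suppressed content.
        attended, _ = self.cross_attn(video_feats, keys, keys)
        return self.norm(video_feats + attended)
```

Because only this light adapter and its tokens are trained while the backbone stays frozen, such a design would keep the number of tuned parameters small, consistent with the efficiency the authors report.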
One of the significant hurdles in training such models is the lack of mono (single-source) audio track supervision. To overcome this, SelVA uses a self-augmentation scheme that strengthens its ability to generate high-quality, isolated audio tracks. The model is evaluated on VGG-MONOAUDIO, a curated benchmark of clean single-source videos, where extensive experiments and ablations confirm consistent gains in audio quality, semantic alignment, and temporal synchronization.
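One plausible way to picture such a scheme: mix two single-source clips into a synthetic multi-source example, then supervise the model to recover only the prompted component. The sketch below is an assumption about what self-augmentation could look like, not the paper's exact recipe:

```python
import torch


def self_augment(audio_a: torch.Tensor, audio_b: torch.Tensor,
                 prompt_a: str, gain: float = 0.7):
    """Build one training triple (mixture, target, prompt) from two mono clips.

    audio_a, audio_b: (num_samples,) single-source waveforms, same sample rate.
    The mixture is the model input; the target stays the prompted source.
    """
    n = min(audio_a.numel(), audio_b.numel())
    mixture = audio_a[:n] + gain * audio_b[:n]
    # Normalize to avoid clipping after summation.
    mixture = mixture / mixture.abs().max().clamp(min=1e-8)
    return mixture, audio_a[:n], prompt_a
```

Under this kind of setup, the prompt names only one of the mixed sources, so any leakage from the other source into the output is directly penalized by the training loss.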
This research has significant implications for the multimedia production industry. By enabling precise control over individual sound sources in a video, SelVA opens up new possibilities for creative editing and mixing: producers can isolate and manipulate specific sounds with greater accuracy, leading to more refined and customized audio experiences.
The code and demo for SelVA are available at https://jnwnlee.github.io/selva-demo/, inviting developers, producers, and enthusiasts to explore and experiment with this groundbreaking technology. As the field of multimedia production continues to evolve, innovations like SelVA are poised to shape the future of audio and video editing, offering unprecedented levels of control and creativity.