In a significant stride towards enhancing the fidelity of video-to-audio generation, researchers have identified and addressed a critical yet overlooked issue in this burgeoning field. The phenomenon, termed “Insertion Hallucination” (IH), occurs when models generate acoustic events, such as speech or music, that lack corresponding visual sources in the video. This systemic risk, driven by dataset biases like off-screen sounds, has been entirely undetected by current evaluation metrics that focus primarily on semantic and temporal alignment.
To tackle this challenge, the research team, comprising Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, and Yiwei Wang, developed a systematic evaluation framework. This framework employs a majority-voting ensemble of multiple audio event detectors to identify instances of IH. The researchers also introduced two metrics, IH@vid and IH@dur, to quantify the prevalence and severity of the issue: IH@vid is the fraction of videos containing at least one hallucinated event, while IH@dur is the fraction of generated audio duration that is hallucinated.
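To make the two metrics concrete, the sketch below shows one plausible way to compute them from ensemble detector outputs. It is a minimal illustration, not the authors' implementation: the `Segment` format, the `visible_classes` set describing which event classes have an on-screen source, the 0.1 s voting resolution, and the per-clip averaging convention for IH@dur are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    label: str      # e.g. "speech", "music"
    start: float    # seconds
    end: float      # seconds

def majority_vote_hallucinated(detector_outputs, visible_classes, clip_dur, step=0.1):
    """Return the hallucinated duration (in seconds) of one generated clip.

    detector_outputs: list over detectors; each entry is a list of Segments
                      detected in the generated audio (hypothetical format).
    visible_classes:  event classes that have a visible source in the video.
    """
    n = len(detector_outputs)
    hallucinated = 0.0
    t = 0.0
    while t < clip_dur:
        # A time step is hallucinated if a strict majority of detectors hear
        # an event whose class has no on-screen source at that moment.
        votes = sum(
            any(s.start <= t < s.end and s.label not in visible_classes for s in segs)
            for segs in detector_outputs
        )
        if votes * 2 > n:
            hallucinated += step
        t += step
    return hallucinated

def ih_metrics(clips):
    """clips: list of (detector_outputs, visible_classes, clip_dur) tuples."""
    flagged, dur_frac = 0, 0.0
    for detector_outputs, visible_classes, clip_dur in clips:
        h = majority_vote_hallucinated(detector_outputs, visible_classes, clip_dur)
        flagged += 1 if h > 0 else 0
        dur_frac += h / clip_dur
    ih_vid = flagged / len(clips)    # fraction of videos with any hallucination
    ih_dur = dur_frac / len(clips)   # mean fraction of hallucinated duration
    return ih_vid, ih_dur
```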
Building on this foundation, the team proposed Posterior Feature Correction (PFC), a training-free inference-time method designed to mitigate IH. PFC operates in two passes: it first generates a draft audio output and detects hallucinated segments in it, then masks the video features at the corresponding timestamps and regenerates the audio. This keeps the generated audio more faithful to what is actually visible in the video.
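The following sketch illustrates the two-pass idea under stated assumptions. The interfaces are hypothetical: `generate_audio` stands in for the frozen V2A model's inference call, `detect_hallucinations` for the event-detector ensemble, and the features are assumed to be a NumPy-style per-frame array that can be masked by timestamp.

```python
import numpy as np

def posterior_feature_correction(video_feats, generate_audio, detect_hallucinations,
                                 frame_rate, mask_value=0.0):
    """Two-pass, training-free inference sketch (hypothetical interfaces).

    video_feats:                  per-frame visual features, shape [T, D].
    generate_audio(feats):        the V2A model's inference call.
    detect_hallucinations(audio): (start, end) times, in seconds, of audio
                                  events judged to have no visual source.
    """
    # Pass 1: generate a draft from the unmodified video features.
    draft_audio = generate_audio(video_feats)

    # Detect hallucinated segments in the draft.
    segments = detect_hallucinations(draft_audio)
    if not segments:
        return draft_audio

    # Mask the video features at the offending timestamps.
    corrected = np.array(video_feats, copy=True)
    for start, end in segments:
        lo = int(start * frame_rate)
        hi = int(end * frame_rate)
        corrected[lo:hi] = mask_value

    # Pass 2: regenerate audio from the corrected features.
    return generate_audio(corrected)
```

Because both passes reuse the pretrained model unchanged, this kind of correction adds only a second inference call and requires no retraining.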
Experiments conducted on several mainstream video-to-audio (V2A) benchmarks revealed that state-of-the-art models suffer from severe IH. The PFC method, however, reduced both the prevalence and the duration of hallucinations by more than 50% on average. Notably, this improvement was achieved without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization.
The practical applications of this research are substantial for the music and audio production industries. As video-to-audio generation technology becomes more integrated into creative workflows, the reliability and faithfulness of these models become paramount. By addressing IH, the PFC method ensures that generated audio is more accurate and contextually appropriate, enhancing the overall quality of multimedia content. This work paves the way for more reliable and faithful V2A models, ultimately benefiting creators and consumers alike.



