FoleyBench: Revolutionizing Video-to-Audio for Film Sound Design

In film post-production, AR/VR, and sound design, the ability to generate audio from video is becoming increasingly important. This process, known as video-to-audio generation (V2A), is particularly useful for creating Foley sound effects that are synchronized with on-screen actions. Foley is the filmmaking practice of creating custom sound effects to enhance the auditory experience of a scene. However, current evaluation methods for V2A models align poorly with practical Foley work, largely because no benchmark has been designed specifically for it.

A recent study by researchers Satvik Dixit, Koichi Saito, Zhi Zhong, Yuki Mitsufuji, and Chris Donahue sheds light on this issue. They found that 74% of videos from past evaluation datasets have poor audio-visual correspondence. Moreover, these datasets are dominated by speech and music, which are not the primary focus of Foley sound effects. To address this gap, the researchers introduced FoleyBench, the first large-scale benchmark explicitly designed for Foley-style V2A evaluation.

FoleyBench contains 5,000 (video, ground-truth audio, text caption) triplets, each featuring visible sound sources with audio causally tied to on-screen events. The dataset was built using an automated, scalable pipeline applied to in-the-wild internet videos from YouTube and Vimeo. Compared to past datasets, FoleyBench offers stronger coverage of the sound categories relevant to Foley. Each clip is also labeled with metadata capturing source complexity, UCS/AudioSet category, and video length, which makes it possible to break down model performance and failure modes by clip type.
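
To make the triplet-plus-metadata structure concrete, here is a minimal Python sketch of how one entry might be represented and how per-clip scores could be averaged by metadata field. It is an illustration only: the FoleyClip class, its field names, the "single"/"multi" complexity labels, the file names, and the scores are all hypothetical and do not come from the FoleyBench release or the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FoleyClip:
    """One benchmark entry. All field names here are hypothetical."""
    video_path: str
    audio_path: str
    caption: str
    source_complexity: str  # assumed labels, e.g. "single" or "multi"
    category: str           # UCS/AudioSet category name
    duration_s: float       # video length in seconds


def mean_score_by(clips, scores, key):
    """Average a per-clip score within each metadata group.

    `scores` maps video_path -> float; `key` selects the grouping field.
    """
    groups = defaultdict(list)
    for clip in clips:
        groups[key(clip)].append(scores[clip.video_path])
    return {name: mean(values) for name, values in groups.items()}


clips = [
    FoleyClip("a.mp4", "a.wav", "a door slams shut", "single", "doors", 8.0),
    FoleyClip("b.mp4", "b.wav", "footsteps on gravel", "single", "footsteps", 8.0),
    FoleyClip("c.mp4", "c.wav", "a door creaks while rain falls", "multi", "doors", 30.0),
]

# Placeholder scores for illustration, not results from the paper.
scores = {"a.mp4": 0.75, "b.mp4": 1.0, "c.mp4": 0.25}

by_category = mean_score_by(clips, scores, key=lambda c: c.category)
by_complexity = mean_score_by(clips, scores, key=lambda c: c.source_complexity)
```

Grouping by a key function rather than a fixed field keeps the same helper usable for any of the three metadata dimensions, including bucketed video length.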

The researchers benchmarked several state-of-the-art V2A models on FoleyBench, evaluating them on audio quality, audio-video alignment, temporal synchronization, and audio-text consistency. By giving the field a benchmark that matches how Foley is actually used, FoleyBench should make evaluation more meaningful and help drive the development of V2A models tailored to film post-production, AR/VR, and sound design.
