Creating interactive audio-visual content like video games has always been a complex task, even for humans. It requires a team of artists, developers, and designers working together for months to produce something polished and engaging. So, it's no surprise that AI, despite its prowess in generating text, audio, images, and videos, has struggled with this challenge. Current large language models (LLMs) can produce simple JavaScript games and animations, but the field lacks automated metrics for evaluating the results, and the models themselves fall short on complex, multi-agent content.
To address these issues, a team of researchers led by Alexia Jolicoeur-Martineau has developed a new metric and a multi-agent system. The metric, called AVR-Eval, is a relative measure of multimedia content quality using Audio-Visual Recordings (AVRs). It works by having an omni-modal model, which processes text, video, and audio, compare the AVRs of two pieces of content. A text model then reviews these evaluations to determine which content is superior. The researchers have shown that AVR-Eval can effectively distinguish between good content and broken or mismatched content.
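The comparison protocol described above can be sketched in a few lines. The following is a minimal illustration, not the authors' actual implementation: the class and function names are assumptions, and simple stubs stand in for the omni-modal and text models (the real system operates on audio-visual recordings, not strings).

```python
class OmniModelStub:
    """Stand-in for the omni-modal model: turns a recording into a text review."""
    def review(self, recording: str) -> str:
        # A real model would watch and listen to the AVR; the stub echoes metadata.
        return f"review of {recording}: length={len(recording)}"

class TextJudgeStub:
    """Stand-in for the text model that reads both reviews and picks a winner."""
    def judge(self, review_a: str, review_b: str) -> str:
        # Real judging weighs content quality; the stub prefers the longer review.
        return "A" if len(review_a) >= len(review_b) else "B"

def avr_eval(recording_a: str, recording_b: str, omni, judge) -> str:
    """Return 'A' or 'B': which recording is judged the better content."""
    return judge.judge(omni.review(recording_a), omni.review(recording_b))

winner = avr_eval("full_game.webm", "broken.webm", OmniModelStub(), TextJudgeStub())
```

The key design point is the two-stage judgment: the omni-modal model only describes what it sees and hears, while a separate text model makes the final relative call between the two reviews.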
The multi-agent system, dubbed AVR-Agent, is designed to generate JavaScript code from a bank of multimedia assets, including audio, images, and 3D models. The coding agent in this system selects relevant assets, generates multiple initial code candidates, and uses AVR-Eval to identify the strongest one. It then iteratively improves this code using feedback from an omni-modal agent that reviews the AVR.
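The loop just described (asset selection, best-of-N initial generation, then critique-driven revision) can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the paper's code: the stubs reduce "programs" to integers scored by value so the control flow can run end to end.

```python
import random

class CoderStub:
    """Stand-in coding agent: 'programs' are integers, higher meaning better."""
    def select_assets(self, task, assets):
        return assets[:2]                       # pick a couple of relevant assets
    def generate(self, task, assets):
        return random.randint(0, 50)            # a fresh candidate program
    def revise(self, program, critique):
        return program + random.randint(0, 10)  # feedback nudges quality upward

def compare(a, b):
    return "A" if a >= b else "B"               # stub for AVR-Eval's verdict

def feedback(program):
    return f"critique of program {program}"     # stub omni-modal review of the AVR

def avr_agent(task, assets, coder, n_initial=4, n_rounds=3):
    chosen = coder.select_assets(task, assets)
    # Best-of-N initial generation, ranked by pairwise AVR-Eval comparisons.
    best = coder.generate(task, chosen)
    for _ in range(n_initial - 1):
        candidate = coder.generate(task, chosen)
        if compare(candidate, best) == "A":
            best = candidate
    # Iterative improvement driven by critiques of the recording.
    for _ in range(n_rounds):
        revised = coder.revise(best, feedback(best))
        if compare(revised, best) == "A":
            best = revised
    return best

random.seed(0)
result = avr_agent("make a platformer", ["jump.wav", "hero.png", "tree.glb"], CoderStub())
```

Note that every acceptance decision, both among initial candidates and across revision rounds, routes through the same pairwise comparison, so the generator never needs an absolute quality score.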
The researchers conducted experiments on games and animations using AVR-Eval, which calculates the win rate of content A against content B. They found that content generated by AVR-Agent had a significantly higher win rate than content made through one-shot generation. However, the models struggled to leverage custom assets and AVR feedback effectively: content generated with these resources showed no higher win rate than content generated without them. This reveals a critical gap: while humans benefit greatly from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively.
This research highlights the fundamental differences between human and machine content creation approaches. It also underscores the need for further advancements in AI to bridge this gap and unlock the full potential of interactive audio-visual content generation.
