Neural audio codecs have advanced considerably for mono and stereo signals, but spatial audio has remained largely unexplored. Researchers Parthasaarathy Sudarsanam, Sebastian Braun, and Hannes Gamper have introduced the FOA Tokenizer, the first discrete neural spatial audio codec designed for first-order ambisonics (FOA). The codec builds on the WavTokenizer architecture, extending it to support four-channel FOA signals. The team has also introduced a spatial consistency loss that encourages the reconstructed signals to preserve directional cues, even under highly compressed representations.
The FOA Tokenizer compresses 4-channel FOA audio sampled at 24 kHz into 75 discrete tokens per second, a bit rate of 0.9 kbps. The researchers evaluated the codec under three conditions: simulated reverberant mixtures, non-reverberant clean speech, and FOA mixtures with real room impulse responses. The mean angular errors were 13.76 degrees, 3.96 degrees, and 25.83 degrees, respectively. Directional accuracy is therefore highest for clean speech and lower for reverberant and real-room material, but the results indicate that the codec retains usable spatial cues at a very low bit rate, which makes it a candidate for applications that need low-bitrate spatial audio compression.
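The figures above can be checked with a little arithmetic. The following is a minimal sketch, not the researchers' evaluation code: the function names are hypothetical, the codebook size is an inference that assumes a single codebook with fixed-length indices, and the angular error is shown with the generic definition (the angle between a reference and an estimated direction vector), which may differ in detail from the metric the researchers used.

```python
import math

def bits_per_token(bitrate_bps, tokens_per_second):
    """Bits spent on each token at a given bit rate and token rate."""
    return bitrate_bps / tokens_per_second

def angular_error_deg(reference, estimate):
    """Angle in degrees between two 3-D direction vectors (any length)."""
    dot = sum(a * b for a, b in zip(reference, estimate))
    norm_ref = math.sqrt(sum(a * a for a in reference))
    norm_est = math.sqrt(sum(b * b for b in estimate))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, dot / (norm_ref * norm_est)))
    return math.degrees(math.acos(cos_angle))

# 0.9 kbps spread over 75 tokens per second.
bits = bits_per_token(900, 75)

# Assumption: one codebook, so each token is a fixed-length index.
implied_codebook_size = 2 ** int(bits)

# Toy example: a source straight ahead, estimated slightly to one side.
error = angular_error_deg((1.0, 0.0, 0.0), (0.97, 0.24, 0.0))

print(bits, implied_codebook_size, round(error, 2))
```

Under these assumptions, each token carries 12 bits, which would correspond to a codebook of 4,096 entries. The mean angular error reported for each condition would then be the average of such per-source angles over the test set.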
Beyond compression, the FOA Tokenizer also offers practical benefits for downstream spatial audio tasks. The discrete latent representations produced by the codec can serve as input features for tasks such as sound event localization and detection, which the researchers demonstrated on the STARSS23 real recordings.
For music and audio production, the FOA Tokenizer points toward immersive audio delivered with minimal data usage. Its ability to preserve directional cues at a low bit rate could make it a useful tool for producers, engineers, and artists working with spatial audio. As the technology matures, further applications are likely to emerge.



