In anomalous sound detection (ASD), staying robust to distribution shifts is a significant hurdle, for example when a system encounters low signal-to-noise ratio (SNR) mixtures of machine and noise types it has not seen before. Current state-of-the-art systems typically extract embeddings from an adapted audio encoder and detect anomalies via nearest-neighbor search. However, fine-tuning these encoders on noisy machine sounds often induces a denoising effect: the encoder learns to suppress noise, which reduces the system's ability to generalize under mismatched mixtures or inconsistent labeling.
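The nearest-neighbor detection step described above can be illustrated with a minimal sketch. It assumes embeddings have already been extracted by some encoder, and it uses cosine distance averaged over the k closest reference embeddings as the anomaly score. The function names, the choice of cosine distance, and the value of k are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def knn_anomaly_scores(reference, queries, k=1):
    """Score each query by its mean cosine distance to the k nearest
    reference (normal) embeddings. Higher means more anomalous.

    reference: (n_reference, dim) embeddings of known-normal clips
    queries:   (n_queries, dim) embeddings of clips to score
    """
    ref = l2_normalize(np.asarray(reference, dtype=np.float64))
    qry = l2_normalize(np.asarray(queries, dtype=np.float64))
    dist = 1.0 - qry @ ref.T              # (n_queries, n_reference)
    k = min(k, ref.shape[0])
    # np.partition places the k smallest distances in the first k columns.
    nearest = np.partition(dist, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


# Synthetic stand-ins for encoder outputs (assumption: 16-dim embeddings).
rng = np.random.default_rng(0)
normal_bank = rng.normal(loc=1.0, scale=0.1, size=(200, 16))
normal_query = rng.normal(loc=1.0, scale=0.1, size=(1, 16))
anomalous_query = rng.normal(loc=-1.0, scale=0.1, size=(1, 16))

scores = knn_anomaly_scores(
    normal_bank, np.vstack([normal_query, anomalous_query]), k=3
)
print(scores)  # the second score should be much larger than the first
```

Because the score depends only on distances in embedding space, any shift in how the encoder represents noisy mixtures directly changes which clips look anomalous, which is why the quality of the mixture embeddings matters so much here.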
An alternative is a training-free system built on a frozen self-supervised learning (SSL) encoder. Such systems avoid the denoising issue and exhibit strong first-shot generalization, yet their performance can drop when mixture embeddings deviate from clean-source embeddings. To address this, researchers Phurich Saengthong, Tomoya Nishida, Kota Dohi, Natsuo Yamashita, and Yohei Kawaguchi have proposed a strategy for improving SSL backbones. Their approach, termed “retain-not-denoise,” focuses on better preserving information from mixed sound sources.
The proposed method combines a multi-label audio tagging loss with a mixture alignment loss. The alignment loss pulls the student's embedding of a mixture toward a convex combination of the teacher's embeddings of the clean and noise inputs. The idea is to retain the mixture information rather than denoise it, thereby improving the system's robustness. Controlled experiments on stationary, non-stationary, and mismatched noise subsets showed improved robustness under distribution shifts. The approach narrows the gap to oracle mixture representations, which are ideal representations that capture all relevant information from the mixed sound sources.
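A rough sketch of how such a combined objective could be computed is shown below. Everything beyond the two loss terms named above is an assumption made for illustration: the squared-error distance used for alignment, the per-example mixing weight `lam`, the balancing factor `beta`, and all function names are hypothetical and are not taken from the paper.

```python
import numpy as np


def multilabel_bce(logits, targets):
    """Numerically stable binary cross-entropy over independent labels,
    so a mixture can be tagged with several sound classes at once."""
    x = np.asarray(logits, dtype=np.float64)
    t = np.asarray(targets, dtype=np.float64)
    return np.mean(np.maximum(x, 0.0) - x * t + np.log1p(np.exp(-np.abs(x))))


def mixture_alignment_loss(student_mix, teacher_clean, teacher_noise, lam):
    """Squared distance between the student's mixture embedding and a convex
    combination of teacher embeddings of the clean and noise inputs.

    lam: per-example weight in [0, 1] (hypothetical; how the weight is
         chosen is not specified in the text above).
    """
    lam = np.asarray(lam, dtype=np.float64).reshape(-1, 1)
    target = lam * teacher_clean + (1.0 - lam) * teacher_noise
    return np.mean(np.sum((student_mix - target) ** 2, axis=1))


def combined_loss(tag_logits, tag_targets, student_mix,
                  teacher_clean, teacher_noise, lam, beta=1.0):
    """Tagging loss plus beta times the alignment loss (beta is assumed)."""
    tagging = multilabel_bce(tag_logits, tag_targets)
    alignment = mixture_alignment_loss(
        student_mix, teacher_clean, teacher_noise, lam
    )
    return tagging + beta * alignment


# Toy batch: 4 mixtures, 8-dim embeddings, 5 candidate tags.
rng = np.random.default_rng(0)
teacher_clean = rng.normal(size=(4, 8))
teacher_noise = rng.normal(size=(4, 8))
lam = np.array([0.8, 0.6, 0.5, 0.3])
student_mix = rng.normal(size=(4, 8))
tag_logits = rng.normal(size=(4, 5))
tag_targets = rng.integers(0, 2, size=(4, 5))

loss = combined_loss(tag_logits, tag_targets, student_mix,
                     teacher_clean, teacher_noise, lam)
print(loss)
```

The point of the alignment target is that it keeps a contribution from the noise embedding: a student that simply discarded the noise would match only the clean term and would still be penalized whenever `lam` is below 1.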
The practical implications for audio processing and anomalous sound detection could be substantial. By making ASD systems more robust to distribution shifts, the method could improve the reliability and accuracy of sound detection in real-world scenarios. This matters most in applications such as industrial monitoring, where detecting anomalies in machine sounds can prevent equipment failures and reduce downtime. The retain-not-denoise strategy could also find applications in other areas of audio processing, such as speech recognition and environmental sound classification, where maintaining the integrity of mixed sound sources is crucial.
In summary, the retain-not-denoise strategy addresses a known limitation of current anomalous sound detection systems by improving their robustness to distribution shifts. The research by Saengthong and colleagues highlights the importance of preserving mixture information and offers a promising direction for future developments in audio processing technology.



