Researchers from Australia’s national science agency CSIRO, Federation University Australia, and RMIT University have developed a groundbreaking method to enhance the detection of audio deepfakes. The new technique, Rehearsal with Auxiliary-Informed Sampling (RAIS), is specifically designed to combat the growing threat of audio deepfakes, which pose significant risks in cybercrime, including bypassing voice-based biometric authentication systems, impersonation, and disinformation.
RAIS determines whether an audio clip is real or artificially generated, and maintains its performance over time as new attack types emerge. This is crucial in a landscape where deepfake technology is advancing rapidly, with newer generation techniques often bearing little resemblance to older ones. The urgency of this development is underscored by recent incidents, such as the AI-cloned voice of Italy’s Defence Minister earlier this year, which was used to request a €1 million ransom from prominent business leaders, convincing some to pay. Such incidents highlight the critical need for robust audio deepfake detectors.
RAIS addresses the challenge of keeping detection systems updated without forgetting previously learned patterns. “We want these detection systems to learn the new deepfakes without having to train the model again from scratch. If you just fine-tune on the new samples, it will cause the model to forget the older deepfakes it knew before,” explained joint author Dr. Kristen Moore from CSIRO’s Data61. RAIS solves this by automatically selecting and storing a small, diverse set of past examples, including subtle audio traits that humans may overlook. This helps the AI learn new deepfake styles without forgetting old ones.
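The rehearsal idea described above can be sketched in a few lines. This is a minimal illustration rather than the authors’ implementation: a fixed-size memory buffer retains a sample of past examples (here via reservoir sampling, a common baseline) and mixes them into each new training batch so earlier deepfake styles keep being revisited. The clip names and buffer sizes are invented for the example.

```python
import random

class RehearsalBuffer:
    """Fixed-size memory of past training examples (reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.memory = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        """Offer one example; every example seen so far stays equally likely to be kept."""
        self.seen += 1
        if len(self.memory) < self.capacity:
            self.memory.append(example)
        else:
            # Replace a stored example with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.memory[j] = example

    def rehearsal_batch(self, k):
        """Draw up to k stored examples to mix into the current batch."""
        return self.rng.sample(self.memory, min(k, len(self.memory)))

# Training on one new "experience" (a batch of previously unseen deepfake types):
buffer = RehearsalBuffer(capacity=100)
new_experience = [(f"clip_{i}.wav", "fake" if i % 2 else "real") for i in range(500)]
for example in new_experience:
    batch = [example] + buffer.rehearsal_batch(4)  # new sample + rehearsed old ones
    # model.train_step(batch)  # fine-tune on the mixed batch (model omitted here)
    buffer.add(example)
```

Because old examples ride along in every batch, fine-tuning on new attacks no longer overwrites what the model learned about earlier ones.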
RAIS employs a smart selection process powered by a network that generates ‘auxiliary labels’ for each audio sample. These labels help identify a diverse, representative set of audio samples to retain and rehearse. By incorporating extra labels beyond the simple ‘fake’ or ‘real’ tags, RAIS ensures a richer mix of training data, improving its ability to remember and adapt over time. In evaluations, the method outperformed alternatives, achieving the lowest average error rate of 1.95 percent across a sequence of five experiences (successive training stages, each introducing new deepfake types). It remains effective even with a small memory buffer and is designed to maintain accuracy as attacks become more sophisticated. The code for RAIS is available on GitHub, ensuring accessibility for further development and implementation.
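The auxiliary-label idea can be illustrated with a simplified selection routine. This sketch is a stand-in, not the paper’s actual algorithm: each candidate carries a hypothetical auxiliary label (imagine a vocoder family or speaker cluster emitted by a labelling network) alongside its real/fake tag, and the memory is filled round-robin across (tag, auxiliary-label) groups so the retained set stays diverse rather than dominated by one attack style.

```python
from collections import defaultdict
from itertools import cycle

def auxiliary_informed_sample(candidates, budget):
    """Pick up to `budget` samples, balanced across (tag, aux_label) groups.

    candidates: list of (clip_id, tag, aux_label) tuples, where `tag` is
    'real'/'fake' and `aux_label` is a hypothetical auxiliary label.
    """
    groups = defaultdict(list)
    for clip_id, tag, aux in candidates:
        groups[(tag, aux)].append((clip_id, tag, aux))

    selected = []
    # Round-robin over groups: take one example from each group in turn
    # until the memory budget is exhausted or all groups are empty.
    for key in cycle(sorted(groups)):
        if len(selected) >= budget or not any(groups.values()):
            break
        if groups[key]:
            selected.append(groups[key].pop(0))
    return selected

# Invented candidate pool: aux labels stand in for generator families.
pool = [
    ("a.wav", "fake", "vocoder_A"), ("b.wav", "fake", "vocoder_A"),
    ("c.wav", "fake", "vocoder_B"), ("d.wav", "real", "studio"),
    ("e.wav", "real", "phone"),     ("f.wav", "fake", "vocoder_B"),
]
memory = auxiliary_informed_sample(pool, budget=4)
# With a budget of 4, one clip from each of the four groups is retained.
```

A buffer chosen only by ‘real’ vs ‘fake’ could end up storing many near-duplicates of one generator; grouping on the extra labels is what keeps a small buffer representative.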
“The audio deepfakes are evolving rapidly, and traditional detection methods can’t keep up,” said Falih Gozi Febrinanto, a recent PhD graduate of Federation University Australia. “RAIS helps the model retain what it has learned and adapt to new attacks. Overall, it reduces the risk of forgetting and enhances its ability to detect deepfakes.” Dr. Moore added, “Our approach not only boosts detection performance but also makes continual learning practical for real-world applications. By capturing the full diversity of audio signals, RAIS sets a new standard for efficiency and reliability.”
This development could significantly shape the future of cybersecurity and authentication systems. As deepfake technology continues to evolve, the ability to detect and mitigate these threats becomes increasingly critical. RAIS represents a step forward in this ongoing battle, offering a more reliable and adaptable solution to the challenges posed by audio deepfakes. The full paper, “Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection,” provides further insights into this innovative approach and its potential applications.