RoCo: Robust Code for Fast and Effective Proactive Defense against Voice Cloning Attack
Abstract
Recent progress in neural speech synthesis has enabled voice cloning systems that can imitate a target speaker’s
voice from only a few seconds of audio, posing severe security threats such as phishing, misinformation, and privacy
breaches. Proactive defenses that inject adversarial perturbations into speech signals have been studied to mitigate
these threats, but existing methods often suffer from slow defended speech generation and weak robustness against modern
speech enhancement techniques. To overcome these challenges, we propose RoCo, a codec-based proactive defense
that generates a dedicated perturbation code and integrates it with the latent codes of the original speech extracted by a
neural codec model. RoCo enables fast defended speech generation while maintaining robustness to perturbation removal.
Experiments on state-of-the-art zero-shot synthesis models demonstrate that RoCo consistently achieves higher defense
success rates and stronger robustness than prior methods, including resilience against speech enhancement.
Model Overview
RoCo framework: The original speech is reconstructed into defended speech using a codec-based synthesis model. During reconstruction, the perturbation code is optimized with the STE method using the Target Loss, and optimization stops once a predefined threshold is reached. At the embedding stage, the SNR Loss is applied to preserve naturalness.
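
The optimization loop described in the caption can be illustrated with a toy sketch. It assumes that "STE" denotes the straight-through estimator, and it is not the RoCo implementation: the integer rounding quantizer, the mean-squared "target loss", the additive mixing of codes, and all names (`quantize`, `optimize_perturbation`, `snr_db`) are hypothetical stand-ins chosen for illustration. The SNR helper is shown on plain vectors; how the SNR Loss is actually computed at the embedding stage is not specified here.

```python
import math


def snr_db(clean, defended):
    """Signal-to-noise ratio in dB, treating (defended - clean) as the noise."""
    signal = sum(c * c for c in clean)
    noise = sum((d - c) ** 2 for c, d in zip(clean, defended))
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


def quantize(z):
    """Non-differentiable step: snap each value to the nearest integer code.

    Assumption: a real codec would look up codebook entries instead.
    """
    return [float(round(v)) for v in z]


def target_loss(mixed, target):
    """Hypothetical target loss: mean squared distance to a target vector."""
    return sum((m - t) ** 2 for m, t in zip(mixed, target)) / len(mixed)


def optimize_perturbation(codes, target, lr=0.1, threshold=0.5, max_steps=500):
    """Toy straight-through optimization of a discrete perturbation code.

    The forward pass uses the quantized perturbation; the backward pass
    copies the gradient with respect to the quantized value onto the
    continuous parameters, as if quantize() were the identity function.
    The loop stops as soon as the loss falls to the threshold.
    """
    n = len(codes)
    z = [0.0] * n  # continuous parameters behind the perturbation code
    for _ in range(max_steps):
        q = quantize(z)
        mixed = [c + p for c, p in zip(codes, q)]
        loss = target_loss(mixed, target)
        if loss <= threshold:  # predefined stopping threshold
            return q, loss
        # Gradient of the loss w.r.t. q, reused directly as the gradient w.r.t. z.
        grad = [2.0 * (m - t) / n for m, t in zip(mixed, target)]
        z = [v - lr * g for v, g in zip(z, grad)]
    q = quantize(z)
    mixed = [c + p for c, p in zip(codes, q)]
    return q, target_loss(mixed, target)


if __name__ == "__main__":
    codes = [3.0, -1.0, 4.0, 0.0]   # stand-in for latent codes of the original speech
    target = [5.0, 1.0, 2.0, 0.0]   # stand-in for the target the loss pulls toward
    q, loss = optimize_perturbation(codes, target)
    print("perturbation code:", q, "loss:", loss)
    print("SNR (dB):", snr_db(codes, [c + p for c, p in zip(codes, q)]))
```

In this toy setting the continuous parameters drift while the rounded code stays fixed, then the code jumps by one unit once a parameter crosses a rounding boundary. A quantity such as the negative of `snr_db` could serve as a penalty that discourages large perturbations, which is one plausible reading of how an SNR term preserves naturalness.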