RoCo: Robust Code for Fast and Effective Proactive Defense against Voice Cloning Attack

Abstract

Recent progress in neural speech synthesis has enabled voice cloning systems that can imitate a target speaker’s voice from only a few seconds of audio, posing severe security threats such as phishing, misinformation, and privacy breaches. Proactive defenses that inject adversarial perturbations into speech signals have been studied to mitigate these threats, but existing methods often suffer from slow defended speech generation and weak robustness against modern speech enhancement techniques. To overcome these challenges, we propose RoCo, a codec-based proactive defense that generates a dedicated perturbation code and integrates it with the latent codes of the original speech extracted by a neural codec model. RoCo enables fast defended speech generation while maintaining robustness to perturbation removal. Experiments on state-of-the-art zero-shot synthesis models demonstrate that RoCo consistently achieves higher defense success rates and stronger robustness than prior methods, including resilience against speech enhancement.

Model Overview

Model Overview Image

RoCo framework: The original speech is reconstructed into defended speech using a codec-based synthesis model. During reconstruction, the perturbation code is optimized with the STE method using the Target Loss, and its optimization stops once a predefined threshold is reached. At the embedding stage, the SNR Loss is applied to preserve naturalness.
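
The optimization loop described in the caption can be illustrated with a toy sketch. It is not the RoCo implementation: the codebook, the squared-distance stand-in for the Target Loss, the learning rate, and the names `quantize`, `target_loss`, `snr_db`, and `optimize_perturbation` are all hypothetical and chosen only for illustration. The sketch shows the two ideas the caption names: a straight-through estimator (STE), which copies the gradient at the quantized code back to the continuous latent as if quantization were the identity, and an early stop once the loss reaches a predefined threshold. The `snr_db` helper is the generic signal-to-noise ratio formula; how RoCo applies its SNR Loss at the embedding stage is not specified here.

```python
import math

# Hypothetical toy codebook: each code index maps to a 2-D embedding.
CODEBOOK = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]


def quantize(z):
    """Return (index, embedding) of the codebook entry nearest to z."""
    idx = min(
        range(len(CODEBOOK)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(z, CODEBOOK[i])),
    )
    return idx, CODEBOOK[idx]


def target_loss(e, target):
    """Squared distance to a target embedding (assumed stand-in for the Target Loss)."""
    return sum((a - b) ** 2 for a, b in zip(e, target))


def snr_db(clean, defended):
    """Generic signal-to-noise ratio in dB between a clean and a defended signal."""
    signal = sum(c * c for c in clean)
    noise = sum((c - d) ** 2 for c, d in zip(clean, defended))
    return float("inf") if noise == 0 else 10.0 * math.log10(signal / noise)


def optimize_perturbation(z0, target, threshold=1e-6, lr=0.1, max_steps=200):
    """Move a continuous latent z until its quantized code meets the threshold."""
    z = list(z0)
    for step in range(max_steps):
        idx, e = quantize(z)
        loss = target_loss(e, target)
        if loss <= threshold:
            # Early stop: the predefined threshold has been reached.
            return idx, step
        # STE: the gradient of the loss w.r.t. the quantized embedding e
        # is applied directly to z, treating quantize() as the identity.
        grad = [2.0 * (a - b) for a, b in zip(e, target)]
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return quantize(z)[0], max_steps


code_index, steps = optimize_perturbation((-0.9, -0.7), target=(1.0, 1.0))
```

In this toy setting the latent starts nearest to code 0 and is pushed across the quantization boundaries until it selects the code whose embedding equals the target, at which point the loop stops. In a real system the gradient would come from an automatic differentiation framework and the loss from a speaker or synthesis model rather than a fixed target vector.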

AVC Samples

Type Real voice Defense voice
Defended voice compared with real voice
Fake voice generated using defended audio
Defended voice after speech enhancement
Fake voice generated after speech enhancement

SV2TTS Samples

Type Real voice Defense voice
Defended voice compared with real voice
Fake voice generated using defended audio
Defended voice after speech enhancement
Fake voice generated after speech enhancement

YourTTS Samples

Type Real voice Defense voice
Defended voice compared with real voice
Fake voice generated using defended audio
Defended voice after speech enhancement
Fake voice generated after speech enhancement