RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations

Abstract

With the advancement of AI-based speech synthesis technologies such as Deep Voice, the risk of voice spoofing attacks, such as voice phishing and fake news created through unauthorized use of others' voices, is increasing significantly. Existing defenses that inject adversarial perturbations directly into audio signals show limited effectiveness, as these perturbations are easily neutralized by speech enhancement techniques. To address this limitation, we propose RoVo (Robust Voice), a novel proactive defense method that injects adversarial perturbations into high-dimensional embedding vectors of audio signals and then reconstructs them into protected speech. This approach effectively mitigates speech synthesis attacks and demonstrates strong resilience even against speech enhancement models, which represent a secondary threat.
In extensive experiments, RoVo improved the Defense Success Rate (DSR) by more than 70% compared to unprotected speech, consistently across four state-of-the-art speech synthesis models. Notably, RoVo achieved a DSR of up to 99.5% against a commercial speaker-verification API, effectively neutralizing speech synthesis attacks. Furthermore, RoVo's perturbations remained robust under strong speech enhancement conditions, clearly outperforming traditional methods. A user study further confirmed that RoVo maintains both the naturalness and usability of protected speech, underscoring its effectiveness in complex and evolving threat scenarios.

Model Overview

Overview of the proposed adversarial embedding defense mechanism. The system protects against unauthorized voice synthesis and downstream misuse such as vishing, fake news, and fraudulent verification, even in the presence of noise removal or speech enhancement techniques.

The architecture of RoVo, the proposed voice protection framework. RoVo applies adversarial perturbations directly to the embedding space generated by the Neural Codec Encoder. The perturbed embeddings are processed by the Neural Codec Decoder to reconstruct the protected voice while maintaining naturalness.
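
The encode, perturb, decode flow described above can be illustrated with a small toy sketch. This is not RoVo's implementation: the page does not specify the codec, the loss, or the optimizer, so everything here is an assumption made for illustration. The linear `encode`/`decode` functions stand in for the Neural Codec Encoder and Decoder, `speaker_features` stands in for a speaker-identity extractor, and the bounded sign-gradient ascent with its `eps`, `step`, and `iters` parameters is a generic adversarial-perturbation recipe rather than the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME, LATENT, SPK = 64, 16, 8  # toy sizes (assumed, not from the paper)

# Hypothetical stand-ins: random linear maps instead of a trained neural codec.
E = rng.standard_normal((LATENT, FRAME)) / np.sqrt(FRAME)  # toy "encoder"
D = np.linalg.pinv(E)                                      # toy "decoder"
S = rng.standard_normal((SPK, FRAME)) / np.sqrt(FRAME)     # toy speaker extractor


def encode(x):
    """Map a waveform frame to its embedding (toy codec encoder)."""
    return E @ x


def decode(z):
    """Reconstruct a waveform frame from an embedding (toy codec decoder)."""
    return D @ z


def speaker_features(x):
    """Toy speaker-identity features of a waveform frame."""
    return S @ x


def protect(x, eps=0.05, step=0.01, iters=50):
    """Perturb the embedding of x within an L-infinity ball of radius eps
    so that the decoded audio's speaker features move away from the original."""
    z = encode(x)
    M = S @ D  # embedding -> speaker features of the decoded audio
    target = M @ z
    # Random start: the gradient of the squared distance is zero at delta = 0.
    delta = rng.uniform(-eps, eps, size=z.shape)
    for _ in range(iters):
        diff = M @ (z + delta) - target
        grad = 2.0 * M.T @ diff  # gradient of ||diff||^2 with respect to delta
        delta = np.clip(delta + step * np.sign(grad), -eps, eps)
    return decode(z + delta), delta


# Usage on a synthetic sine frame (placeholder for real speech).
x = np.sin(2 * np.pi * 5 * np.arange(FRAME) / FRAME)
protected, delta = protect(x)
clean = decode(encode(x))
shift = np.linalg.norm(speaker_features(protected) - speaker_features(clean))
print("max |delta|:", np.max(np.abs(delta)))
print("speaker-feature shift:", shift)
```

The perturbation budget `eps` is applied to the embedding rather than to the waveform, which is the point of the sketch: the change reaches the audio only through the decoder. In a real system the decoder and speaker extractor are nonlinear networks, so the gradient would come from automatic differentiation instead of the closed form used here.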

AVC Samples

Type Real voice Defense voice
Defended voice compared with real voice
Fake voice generated using defended audio
Defended voice after speech enhancement
Fake voice generated after speech enhancement

RTVC Samples

Type Real voice Defense voice
Defended voice compared with real voice
Fake voice generated using defended audio
Defended voice after speech enhancement
Fake voice generated after speech enhancement

* The "Fake voice generated after speech enhancement" samples demonstrate that the defense continued to hinder realistic fake audio generation even after noise reduction was applied to the defended voice.

YourTTS Samples

Type Real voice Defense voice
Defended voice compared with real voice
Fake voice generated using defended audio
Defended voice after speech enhancement
Fake voice generated after speech enhancement