
We construct white-box targeted attacks on modern speech enhancement (SE). Under psychoacoustic masking and an ℓ₂ budget, we shape a perturbation δ so that the enhanced output ŝ conveys an attacker-chosen phrase while remaining hard to hear. We compare predictive SE (direct map, complex mask) to diffusion-based SE and study how sampler stochasticity and design choices affect robustness.
The same pair is shown across all models for like-for-like comparison.
References: S_attacker (clean target phrase), S_user (user's clean speech), Y_user (user's noisy observation), delta (the adversarial perturbation added to Y_user, played on its own), and Y_user+delta (the attacked mixture).
Output: S_hat (the SE result for the attacked mixture).
How to listen:
(1) Establish the baseline by playing S_attacker and Y_user to learn the target words and the user’s voice.
(2) For any experimental setting (λ masking tolerance, ε ℓ₂ radius) and model (Direct Map, Masking/CRM, Diffusion), compare S_hat to S_attacker to judge targeted success — focus on the words/semantics, not timbre.
(3) Compare Y_user+delta vs. Y_user to assess audibility — δ should be hard to hear if masking holds; optionally play delta alone to hear residual energy.
Use the arrows to switch pairs; the same pair is shown across all experiments below.
Notation:
• λ — masking tolerance (dB): relaxes the psychoacoustic mask so more δ energy can pass; larger λ usually makes attacks easier but more audible.
• ε — global ℓ₂ radius: total energy budget for δ; larger ε permits stronger perturbations.
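The two constraints above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `project_l2` and `masking_excess_db` are hypothetical, and the per-bin masking threshold is assumed to be precomputed by some psychoacoustic model that is not shown here. The first function rescales δ onto the ℓ₂ ball of radius ε; the second reports, per frequency bin, how many dB the perturbation exceeds the mask after relaxing it by λ.

```python
import math
from typing import List, Sequence


def project_l2(delta: Sequence[float], eps: float) -> List[float]:
    """Return delta unchanged if ||delta||_2 <= eps, else rescale it to norm eps."""
    norm = math.sqrt(sum(x * x for x in delta))
    if norm <= eps:
        return list(delta)
    scale = eps / norm
    return [x * scale for x in delta]


def masking_excess_db(
    delta_db: Sequence[float],      # perturbation level per bin, in dB
    threshold_db: Sequence[float],  # assumed precomputed masking threshold per bin, in dB
    lam_db: float,                  # masking tolerance lambda, in dB
) -> List[float]:
    """Per-bin excess (dB) of the perturbation over the mask relaxed by lam_db."""
    return [max(d - (t + lam_db), 0.0) for d, t in zip(delta_db, threshold_db)]


if __name__ == "__main__":
    # A perturbation of norm 5 projected onto a ball of radius 1.
    print(project_l2([3.0, 4.0], 1.0))
    # Bin 0 is below the relaxed mask, bin 1 exceeds it.
    print(masking_excess_db([10.0, 30.0], [20.0, 20.0], 5.0))
```

Under these assumptions, a larger ε leaves more of δ untouched by the projection, and a larger λ shrinks the excess toward zero, which matches the trade-off described in the notation: easier attacks, but more audible ones.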
[Audio samples: reference clips S_attacker, S_user, and Y_user for each pair, followed by S_hat, delta, and Y_user+delta for each model and experimental setting.]
Diffusion-SE ablations: each row changes exactly one parameter and keeps all others at their defaults, which isolates that parameter's effect.
Defaults: NFE=25, σ_max=0.5, stochastic noise schedule.
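The one-parameter-at-a-time protocol can be sketched as a small configuration generator. Only the default values come from this page; the key names, the helper `one_at_a_time`, and every alternative value in `ALTERNATIVES` are hypothetical placeholders chosen for illustration, not the settings actually evaluated.

```python
from typing import Any, Dict, List

# Defaults as stated on this page (key names are an assumption).
DEFAULTS: Dict[str, Any] = {"nfe": 25, "sigma_max": 0.5, "schedule": "stochastic"}

# Hypothetical alternatives, for illustration only.
ALTERNATIVES: Dict[str, List[Any]] = {
    "nfe": [10, 50],
    "sigma_max": [0.25, 1.0],
    "schedule": ["deterministic"],
}


def one_at_a_time(
    defaults: Dict[str, Any], alternatives: Dict[str, List[Any]]
) -> List[Dict[str, Any]]:
    """Build one config per alternative value, changing a single key each time."""
    rows: List[Dict[str, Any]] = []
    for name, values in alternatives.items():
        if name not in defaults:
            raise KeyError(f"unknown parameter: {name}")
        for value in values:
            if value == defaults[name]:
                continue  # skip rows that would equal the default config
            row = dict(defaults)  # copy so the defaults stay untouched
            row[name] = value
            rows.append(row)
    return rows


if __name__ == "__main__":
    for row in one_at_a_time(DEFAULTS, ALTERNATIVES):
        print(row)
```

Each generated row differs from `DEFAULTS` in exactly one entry, so any change in attack success between a row and the default run can be attributed to that single parameter.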
[Audio samples: reference clips S_attacker, S_user, and Y_user for each pair, followed by S_hat, delta, and Y_user+delta for each ablation row.]
Paper: https://arxiv.org/abs/2509.21087 · Code: https://github.com/sp-uhh/se-adversarial-attack
@misc{makarov2025modernspeechenhancementsystems,
  title         = {Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?},
  author        = {Rostislav Makarov and Lea Schönherr and Timo Gerkmann},
  year          = {2025},
  eprint        = {2509.21087},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2509.21087},
}