Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?


Rostislav Makarov · Lea Schönherr · Timo Gerkmann
Signal Processing (SP), University of Hamburg, Germany · CISPA Helmholtz Center for Information Security, Germany

Overview

We construct white-box targeted attacks on modern speech enhancement (SE). Under psychoacoustic masking and an ℓ₂ budget, we shape a perturbation δ so that the enhanced output ŝ conveys an attacker-chosen phrase while remaining hard to hear. We compare predictive SE (direct map, complex mask) to diffusion-based SE and study how sampler stochasticity and design choices affect robustness.
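
To make the attack setup concrete, here is a minimal sketch of a targeted attack under an ℓ₂ budget, written as projected gradient descent. It is an illustration only: the linear map `W` stands in for the SE model, the psychoacoustic masking term is left out, and the names `project_l2` and `targeted_attack` are hypothetical rather than taken from the released code.

```python
import numpy as np

def project_l2(delta, eps):
    """Scale delta back onto the l2 ball of radius eps if it lies outside."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def targeted_attack(W, y_user, s_attacker, eps, steps=200):
    """Projected gradient descent on ||W (y_user + delta) - s_attacker||^2.

    W is a toy linear 'enhancer' (an assumption for illustration); a real
    SE network would be differentiated with an autograd framework instead.
    """
    # Step size 1/L, with L = 2 * sigma_max(W)^2 the gradient's Lipschitz constant.
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)
    delta = np.zeros_like(y_user)
    for _ in range(steps):
        residual = W @ (y_user + delta) - s_attacker
        grad = 2.0 * W.T @ residual          # gradient w.r.t. delta
        delta = project_l2(delta - lr * grad, eps)
    return delta

rng = np.random.default_rng(0)
n = 64
W = 0.8 * np.eye(n) + 0.01 * rng.standard_normal((n, n))
y_user = rng.standard_normal(n)
s_attacker = rng.standard_normal(n)
eps = 3.0

delta = targeted_attack(W, y_user, s_attacker, eps)
loss_before = float(np.sum((W @ y_user - s_attacker) ** 2))
loss_after = float(np.sum((W @ (y_user + delta) - s_attacker) ** 2))
print(np.linalg.norm(delta), loss_before, loss_after)
```

The projection step is what enforces the ε budget: every update is pulled back into the ball, so the perturbation can approach the target only as far as ε allows.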

Audio demos

The same pair is shown across all models for like-for-like comparison.
References: S_attacker (clean target phrase), S_user (user's clean speech), Y_user (user's noisy observation), delta (the adversarial perturbation added to Y_user, played on its own), and Y_user+delta (attacked mixture).
Output: S_hat, the SE result for the attacked mixture.
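
As a rough numerical companion to listening, the relationship between these signals can be sketched as follows. The sine waves are synthetic placeholders (assumptions, not the demo audio), and `snr_db` is a hypothetical helper. A plain energy ratio does not model psychoacoustic masking, so it only hints at audibility.

```python
import numpy as np

def snr_db(reference, perturbation):
    """Energy of the reference relative to the perturbation, in dB."""
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(perturbation ** 2))

sr = 16000                                    # assumed sample rate in Hz
t = np.arange(sr) / sr                        # one second of samples
y_user = 0.5 * np.sin(2 * np.pi * 220 * t)    # placeholder for Y_user
delta = 0.05 * np.sin(2 * np.pi * 3000 * t)   # placeholder for delta
y_attacked = y_user + delta                   # Y_user+delta, the SE input

# A 10x amplitude ratio corresponds to a 20 dB energy ratio.
print(round(snr_db(y_user, delta), 3))
```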
How to listen:
(1) Establish the baseline by playing S_attacker and Y_user to learn the target words and the user’s voice.
(2) For any experimental setting (λ masking tolerance, ε ℓ₂ radius) and model (Direct Map, Masking/CRM, Diffusion), compare S_hat to S_attacker to judge targeted success — focus on the words/semantics, not timbre.
(3) Compare Y_user+delta vs. Y_user to assess audibility — δ should be hard to hear if masking holds; optionally play delta alone to hear residual energy.
Use the arrows to switch pairs; the same pair is shown across all experiments below.
Notation:
• λ — masking tolerance (dB): relaxes the psychoacoustic mask so more δ energy can pass; larger λ usually makes attacks easier but more audible.
• ε — global ℓ₂ radius: total energy budget for δ; larger ε permits stronger perturbations.
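
One way to picture the role of λ is as an offset added to a per-bin masking threshold before checking how far δ exceeds it. The sketch below assumes a hinge-style penalty over time-frequency bins in dB. The function name `masking_violation` and the toy numbers are illustrative assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def masking_violation(delta_db, threshold_db, lam_db):
    """Mean amount (in dB) by which delta exceeds the relaxed threshold.

    Bins where delta stays below threshold + lambda contribute zero.
    """
    excess = delta_db - (threshold_db + lam_db)
    return float(np.mean(np.maximum(excess, 0.0)))

# Toy 2x2 "spectrogram" levels in dB (made-up values for illustration).
threshold_db = np.array([[30.0, 20.0], [10.0, 25.0]])
delta_db = np.array([[35.0, 15.0], [30.0, 25.0]])

for lam in (0.0, 10.0, 20.0):
    print(lam, masking_violation(delta_db, threshold_db, lam))
```

With these numbers the penalty shrinks as λ grows and vanishes at λ=20, which mirrors the note above: a larger tolerance lets more δ energy through unpenalized, at the cost of audibility.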

Reference (per pair): spectrograms and audio for S_attacker, S_user, and Y_user.

Direct Map
Each setting below shows spectrograms and audio for S_hat, delta, and Y_user+delta, per pair.
• Plain optimization (λ=-, ε=∞)
• λ=-, ε=10
• λ=40, ε=10
• λ=20, ε=10
• λ=10, ε=10
• λ=0, ε=10
• λ=20, ε=∞
• λ=20, ε=20
• λ=20, ε=15
• λ=20, ε=10
• λ=20, ε=6
• λ=20, ε=3

CRM
Each setting below shows spectrograms and audio for S_hat, delta, and Y_user+delta, per pair.
• Plain optimization (λ=-, ε=∞)
• λ=40, ε=10
• λ=20, ε=10
• λ=0, ε=10

Diffusion
Each setting below shows spectrograms and audio for S_hat, delta, and Y_user+delta, per pair.
• Plain optimization (λ=-, ε=∞)
• λ=40, ε=10
• λ=20, ε=10
• λ=10, ε=10


Diffusion Ablation

Each Diffusion-SE ablation changes exactly one parameter and keeps the others at their defaults, which isolates the effect of that parameter.
Defaults: NFE=25, σ_max=0.5, stochastic noise schedule.
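
The one-at-a-time design can be written down as a small configuration sketch. The class `DiffusionConfig` and its field names are hypothetical and only mirror the defaults and variants listed on this page. They are not the configuration format of the released code.

```python
from dataclasses import dataclass, fields, replace

@dataclass(frozen=True)
class DiffusionConfig:
    nfe: int = 25               # number of function evaluations at inference
    sigma_max: float = 0.5      # maximum noise level (changing it needs retraining)
    fixed_noise: bool = False   # False = stochastic noise schedule (default)

DEFAULT = DiffusionConfig()

# Each ablation overrides exactly one field of the default configuration.
ABLATIONS = {
    "Fixed noise schedule": replace(DEFAULT, fixed_noise=True),
    "Evaluation with NFE=15": replace(DEFAULT, nfe=15),
    "Evaluation with NFE=35": replace(DEFAULT, nfe=35),
    "Retrain with sigma_max=0.3": replace(DEFAULT, sigma_max=0.3),
    "Retrain with sigma_max=0.7": replace(DEFAULT, sigma_max=0.7),
}

def changed_fields(cfg, base=DEFAULT):
    """Names of the fields in which cfg differs from the base configuration."""
    return [f.name for f in fields(base) if getattr(cfg, f.name) != getattr(base, f.name)]

for name, cfg in ABLATIONS.items():
    print(name, changed_fields(cfg))
```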

Reference (per pair): spectrograms and audio for S_attacker, S_user, and Y_user.

Diffusion
Each setting below shows spectrograms and audio for S_hat, delta, and Y_user+delta, per pair.
• Plain optimization (λ=-, ε=∞)
• Fixed Noise Schedule (λ=-, ε=∞)
• Evaluation with NFEs=15 (λ=-, ε=∞)
• Evaluation with NFEs=35 (λ=-, ε=∞)
• Retrain with σ_max=0.3 (λ=-, ε=∞)
• Retrain with σ_max=0.7 (λ=-, ε=∞)


Citation

Paper: https://arxiv.org/abs/2509.21087 · Code: https://github.com/sp-uhh/se-adversarial-attack

@misc{makarov2025modernspeechenhancementsystems,
  title        = {Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?},
  author       = {Rostislav Makarov and Lea Schönherr and Timo Gerkmann},
  year         = {2025},
  eprint       = {2509.21087},
  archivePrefix= {arXiv},
  primaryClass = {eess.AS},
  url          = {https://arxiv.org/abs/2509.21087},
}