Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?


Rostislav Makarov · Lea Schönherr · Timo Gerkmann
Signal Processing (SP), University of Hamburg, Germany · CISPA Helmholtz Center for Information Security, Germany

Introduction

Speech enhancement (SE) is built into everyday tools. Phones and laptops use it to clean calls, earbuds suppress background noise, conferencing platforms rely on it for intelligible meetings, and studios and assistive technologies use it to clarify recorded speech. These systems do more than pass audio through - they reconstruct speech from noisy microphone signals.

This page demonstrates a concrete risk that follows from this reconstruction step. If a very small, carefully shaped signal is added to the microphone input, the enhanced output can be nudged toward attacker-chosen words, while most listeners hear little or no change in the input itself. The added signal is designed to sit mostly below what human hearing can detect (psychoacoustic masking) and to remain very low in overall energy. In practice, the perturbation is easy for the model to pick up but hard for a person to notice.


How It Works

The microphone records the user’s speech in noisy conditions - this is Yuser. An attacker adds a tiny perturbation δ (delta), crafted to be masked by the existing sound scene, producing Yuser + δ. The SE system enhances this attacked mix and outputs Ŝ. The intent is that Ŝ leans toward an attacker’s phrase Sattacker rather than the user’s original words.

Diagram showing Suser + noise → Yuser; add δ → Yuser + δ; enhancement → Ŝ closer to Sattacker.
Caption: A small, masked perturbation (δ) added before enhancement can steer the model’s output.

Short example

SE applied to noisy input (no attack): Yuser → apply SE system → enhanced output (no attack)

Adversarial steering: Yuser + δ → apply SE system → enhanced output (with an attack). The perturbation δ can also be played on its own.

What You Can Hear

More audio examples can be found at the end of the page.


Approach and Findings

We add a very small signal, δ, before enhancement and keep it tightly constrained. Two simple controls describe this constraint: a masking tolerance (λ) that allows a limited amount of energy to slip under the hearing mask, and an energy budget (ε) that caps the total strength of δ. We iteratively adjust δ so that, after enhancement, the model’s output Ŝ aligns more closely with a chosen phrase Sattacker, while most of δ remains hidden under masking and never exceeds the budget.
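The iterative adjustment under an energy budget can be sketched as projected gradient descent. This is a minimal illustration under stated assumptions, not the method from the paper: the SE model is replaced by a hypothetical fixed linear map `W` so that the gradient can be written by hand, the signals are random vectors, ε is treated as a plain ℓ₂ norm bound, and the masking constraint is left out. The names `project_l2` and `steer` are made up for this example.

```python
import numpy as np


def project_l2(delta, eps):
    """Scale delta back onto the l2 ball of radius eps if it lies outside."""
    norm = np.linalg.norm(delta)
    if norm > eps:
        return delta * (eps / norm)
    return delta


def steer(W, y_user, s_attacker, eps, steps=200, lr=0.05):
    """Projected gradient descent on delta for a toy linear enhancer.

    W is a hypothetical stand-in for the SE model: S_hat = W @ (y_user + delta).
    The loss is the squared distance between S_hat and the attacker's target.
    """
    delta = np.zeros_like(y_user)
    for _ in range(steps):
        residual = W @ (y_user + delta) - s_attacker
        grad = 2.0 * W.T @ residual          # gradient of ||residual||^2 w.r.t. delta
        delta = project_l2(delta - lr * grad, eps)  # step, then enforce the budget
    return delta


# Synthetic data only: random vectors stand in for audio.
rng = np.random.default_rng(0)
n = 64
W = np.eye(n) + 0.1 * rng.standard_normal((n, n)) / np.sqrt(n)
y_user = rng.standard_normal(n)
s_attacker = rng.standard_normal(n)

for eps in (0.5, 2.0, 8.0):
    delta = steer(W, y_user, s_attacker, eps)
    dist = np.linalg.norm(W @ (y_user + delta) - s_attacker)
    print(f"eps={eps:4.1f}  ||delta||={np.linalg.norm(delta):.3f}  distance to target={dist:.3f}")
```

In this toy setting a larger ε lets the output move closer to the target, which mirrors the budget trade-off described here. A real attack would differentiate through the neural SE model instead of using a hand-written gradient.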

In controlled tests, the effect is audible in the output and barely noticeable in the input when budgets are small. As λ or ε increases, steering becomes stronger and also more audible. This is a trade-off: small settings demonstrate the risk quietly, while larger settings make the effect clearer but harder to hide. To study this behavior, we trained three models representing two classes of speech enhancement.

Predictive approach. A common approach is to regress the clean speech estimate directly from the noisy input. This is usually done via:

Generative approach. Common approaches to generative modeling are GANs and diffusion models. Here, we focus on diffusion:

Key results


Audio Demos

This gallery presents four attacker–user pairs. You can switch the active pair (1–4) using the arrow controls in each setting’s header. All files within a setting update to the selected pair.

References: Sattacker (clean reference of the attacker’s target phrase), Suser (user’s clean speech), and Yuser (user’s noisy observation).
Adversarial artifacts: δ (the perturbation alone) and Yuser + δ (the attacked mix fed to the model).
Output: ŜYuser (the enhanced result for the attacked mix).

What You Can Explore
• Open any experiment label (e.g., “λ=20, ε=10”) and play ŜYuser to hear how strongly the words align with the attacker’s phrase at that setting. Compare with the reference audio Sattacker.
• Compare Yuser + δ with Yuser to gauge audibility; if masking is effective, the difference is subtle.
• Solo δ to hear exactly what was added.
• Compare the different models (DM, CRM, Diff) at the same settings.
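Listening is the main tool here, but the comparison of Yuser + δ against Yuser can also be given a rough number. The sketch below computes a broadband signal-to-noise ratio between the input and the perturbation. It is an illustrative helper, not a metric reported on this page: the function name `perturbation_snr_db` and the synthetic signals are assumptions, and a broadband SNR ignores masking, so it only complements listening.

```python
import numpy as np


def perturbation_snr_db(y_user, delta):
    """Energy of the input relative to the perturbation, in dB.

    Higher values mean a quieter delta. Assumes delta is not all zeros.
    """
    signal_energy = np.sum(np.square(y_user))
    noise_energy = np.sum(np.square(delta))
    return 10.0 * np.log10(signal_energy / noise_energy)


# Synthetic stand-ins for one second of audio at 16 kHz (assumed rate).
rng = np.random.default_rng(0)
y_user = rng.standard_normal(16000)
delta = 0.01 * rng.standard_normal(16000)

print(f"SNR of Yuser relative to delta: {perturbation_snr_db(y_user, delta):.1f} dB")
```

Two perturbations with the same SNR can differ a lot in audibility, depending on how well each one hides under the masking threshold of the surrounding sound.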

Notation
• λ — Masking tolerance (dB). Relaxing the mask lets more of δ pass; larger λ usually strengthens steering but raises audibility.
• ε — Global ℓ₂ radius (total energy budget) for δ; larger ε permits stronger perturbations.
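The role of λ can be illustrated with a small check of how far a perturbation exceeds a relaxed masking threshold. This is a sketch under assumptions: the per-bin masking threshold is taken as given (a real one would come from a psychoacoustic model of Yuser), λ is read as an additive offset in dB on that threshold, and the function name `masking_excess_db` and the example values are hypothetical.

```python
import numpy as np


def masking_excess_db(delta_power, mask_threshold_power, lam_db):
    """Per-bin amount (in dB) by which delta exceeds the threshold relaxed by lam_db.

    A value of 0 means the bin stays under the relaxed mask.
    """
    tiny = 1e-12  # avoids log of zero
    delta_db = 10.0 * np.log10(delta_power + tiny)
    allowed_db = 10.0 * np.log10(mask_threshold_power + tiny) + lam_db
    return np.maximum(delta_db - allowed_db, 0.0)


# Hypothetical power values for three frequency bins.
mask_threshold_power = np.array([1.0, 1.0, 1.0])
delta_power = np.array([0.1, 10.0, 1000.0])

for lam_db in (0.0, 20.0, 40.0):
    excess = masking_excess_db(delta_power, mask_threshold_power, lam_db)
    print(f"lambda={lam_db:4.1f} dB  excess per bin (dB): {np.round(excess, 2)}")
```

Summing this excess over bins gives a penalty that an optimizer could minimize alongside the steering loss. Raising λ shrinks the penalty for the same δ, which matches the behavior described for larger λ above.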
Each of the four pairs has a reference spectrogram with audio for Sattacker, Suser, and Yuser. For every listed model and pair, each setting below has an experiment spectrogram with audio for the enhanced output ŜYuser, δ, and Yuser + δ.

All methods
• Plain optimization (λ=–, ε=∞): DM, CRM, Diff

Sorted by energy ε
• λ=20, ε=3: DM
• λ=20, ε=6: DM
• λ=0, ε=10: DM, CRM
• λ=10, ε=10: DM, Diff
• λ=20, ε=10: DM, CRM, Diff
• λ=40, ε=10: DM, CRM, Diff
• λ=–, ε=10: DM
• λ=20, ε=15: DM
• λ=20, ε=20: DM
• λ=20, ε=∞: DM

Diffusion Ablation

Each diffusion-SE ablation changes one parameter at a time and keeps the others at their defaults (NFE=25, σ_max=0.5, stochastic noise schedule), which isolates the effect of that parameter.
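The one-parameter-at-a-time design can be written down as a small configuration grid. The sketch below is illustrative only: the default values and row labels are taken from this page, while the key names (`nfe`, `sigma_max`, `noise_schedule`), the value `"fixed"`, and the helper `build_configs` are assumptions and do not come from the released code.

```python
# Defaults as stated on this page; key names are hypothetical.
DEFAULTS = {"nfe": 25, "sigma_max": 0.5, "noise_schedule": "stochastic"}

# One row per ablation, each overriding at most one default.
ABLATIONS = [
    ("Plain optimization", {}),
    ("Fixed Noise Schedule", {"noise_schedule": "fixed"}),
    ("Evaluation with NFEs=15", {"nfe": 15}),
    ("Evaluation with NFEs=35", {"nfe": 35}),
    ("Retrain with sigma_max=0.3", {"sigma_max": 0.3}),
    ("Retrain with sigma_max=0.7", {"sigma_max": 0.7}),
]


def build_configs(defaults, ablations):
    """Return one full config per row, rejecting rows that change several knobs."""
    configs = {}
    for name, override in ablations:
        if len(override) > 1:
            raise ValueError(f"{name!r} changes more than one parameter")
        configs[name] = {**defaults, **override}
    return configs


for name, cfg in build_configs(DEFAULTS, ABLATIONS).items():
    print(f"{name}: {cfg}")
```

Keeping the overrides separate from the defaults makes it easy to see which single setting differs in each row.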

Each of the four pairs has a reference spectrogram with audio for Sattacker, Suser, and Yuser. For each pair, every setting below has an experiment spectrogram with audio for the enhanced output ŜYuser, δ, and Yuser + δ.

Diffusion
• Plain optimization (λ=–, ε=∞): Diff
• Fixed Noise Schedule (λ=–, ε=∞): Diff
• Evaluation with NFEs=15 (λ=–, ε=∞): Diff
• Evaluation with NFEs=35 (λ=–, ε=∞): Diff
• Retrain with σ_max=0.3 (λ=–, ε=∞): Diff
• Retrain with σ_max=0.7 (λ=–, ε=∞): Diff

Citation

Paper: https://arxiv.org/abs/2509.21087 · Code: https://github.com/sp-uhh/se-adversarial-attack

@misc{makarov2025modernspeechenhancementsystems,
  title        = {Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?},
  author       = {Rostislav Makarov and Lea Schönherr and Timo Gerkmann},
  year         = {2025},
  eprint       = {2509.21087},
  archivePrefix= {arXiv},
  primaryClass = {eess.AS},
  url          = {https://arxiv.org/abs/2509.21087},
}