Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

Rostislav Makarov · Lea Schönherr · Timo Gerkmann

Signal Processing (SP), University of Hamburg, Germany · CISPA Helmholtz Center for Information Security, Germany

Introduction

Speech enhancement (SE) is built into everyday tools. Phones and laptops use it to clean calls, earbuds suppress background noise, conferencing platforms rely on it for intelligible meetings, and studios and assistive technologies use it to clarify recorded speech. These systems do more than pass audio through: they reconstruct speech from noisy microphone signals.

This page demonstrates a concrete risk that follows from that flexibility. If a very small, carefully shaped signal is added to the microphone input, the enhanced output can be nudged toward attacker-chosen words, while most listeners hear little or no change in the input itself. The added signal is designed to sit mostly under what human hearing tends to ignore (psychoacoustic masking) and to remain very low in overall energy. In practice, it becomes easy for the model to use but hard for a person to notice.

How It Works

The microphone records the user’s speech in noisy conditions; this signal is Y^user. An attacker adds a tiny perturbation δ (delta), crafted to be masked by the existing sound scene, producing Y^user + δ. The SE system enhances this attacked mix and outputs Ŝ. The intent is that Ŝ leans toward an attacker’s phrase S^attacker rather than the user’s original words.

Diagram showing S_user + noise → Y^user; add δ → Y^user + δ; enhancement → Ŝ closer to S^attacker. — Caption: A small, masked perturbation (δ) added before enhancement can steer the model’s output.

Short example

SE applied to noisy input (no attack):

Y^user → apply SE system → Ŝ^{Y^user} (no attack)

Adversarial steering:

Y^user + δ → apply SE system → Ŝ^{Y^user+δ} (with an attack)

Files

Y^user — Noisy microphone input (model input without attack).
Ŝ^{Y^user} (no attack) — The model’s enhanced output for Y^user.
Y^user + δ — Attacked mix (input plus the perturbation).
δ — The perturbation.
Ŝ^{Y^user+δ} (with an attack) — The model’s enhanced output for Y^user + δ.

What You Can Hear

In Ŝ^{Y^user+δ}, notice words semantically drifting toward S^attacker.
Y^user + δ should remain very close to the original Y^user.
δ alone shows the tiny perturbation that was added.

More audio examples can be found at the end of the page.

Approach and Findings

We add a very small signal, δ, before enhancement and keep it tightly constrained. Two simple controls describe this constraint: a masking tolerance (λ) that allows a limited amount of energy to slip under the hearing mask, and an energy budget (ε) that caps the total strength of δ. We iteratively adjust δ so that, after enhancement, the model’s output Ŝ aligns more closely with a chosen phrase S^attacker, while most of δ remains hidden under masking and never exceeds the budget.
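The iterative adjustment of δ can be sketched as projected gradient descent. This is a minimal illustration under stated assumptions, not the authors' implementation: the enhancer is a hypothetical fixed linear map `W` (so the gradient has a closed form), the loss is a plain squared error to the target, and only the energy budget ε is enforced. The masking constraint is left out here.

```python
import numpy as np

def project_l2(delta, eps):
    """Scale delta back onto the l2 ball of radius eps if it lies outside."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def toy_attack(W, y_user, s_attacker, eps, step=0.05, iters=200):
    """Projected gradient descent on 0.5 * ||W (y + delta) - s_attacker||^2."""
    delta = np.zeros_like(y_user)
    for _ in range(iters):
        residual = W @ (y_user + delta) - s_attacker
        grad = W.T @ residual            # gradient of the loss w.r.t. delta
        delta = project_l2(delta - step * grad, eps)
    return delta

rng = np.random.default_rng(0)
n = 16
W = np.eye(n) * 0.9                      # hypothetical stand-in for the SE model
y_user = rng.standard_normal(n)
s_attacker = rng.standard_normal(n)

delta = toy_attack(W, y_user, s_attacker, eps=1.0)
loss_before = 0.5 * np.sum((W @ y_user - s_attacker) ** 2)
loss_after = 0.5 * np.sum((W @ (y_user + delta) - s_attacker) ** 2)
```

In this toy setting the output moves toward the target while δ stays inside the ε ball. A real attack would replace `W` with the trained network and obtain the gradient by automatic differentiation.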

In controlled tests, the effect is audible in the output and barely noticeable in the input when budgets are small. As λ or ε increase, steering becomes stronger and also more audible. This is a trade-off: small settings demonstrate the risk quietly, whereas larger settings make the effect clearer but harder to hide. To study this behavior, we trained three models representing two classes of speech enhancement.

Predictive approach. A common approach is to regress directly from the noisy input to a clean speech estimate. This is usually done in one of two ways:

DM (Direct Mapping): a neural network maps the noisy spectrogram Y directly to a clean estimate Ŝ.
CRM (Masking): a mask network predicts a bounded Complex Ratio Mask and applies it multiplicatively to the input.
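The masking variant can be illustrated with a short sketch. The raw network outputs below are random placeholders, and bounding the mask magnitude with tanh is one common choice assumed here for illustration; the source does not say how the models on this page bound their masks. The function names are hypothetical.

```python
import numpy as np

def bounded_crm(raw_real, raw_imag, bound=1.0):
    """Build a complex mask whose magnitude is squashed to at most `bound`."""
    raw = raw_real + 1j * raw_imag
    mag = np.abs(raw)
    # tanh compresses the magnitude; the phase of the raw output is kept.
    scale = bound * np.tanh(mag) / np.maximum(mag, 1e-12)
    return raw * scale

def apply_crm(Y, mask):
    """CRM masking: elementwise complex multiplication with the noisy STFT."""
    return mask * Y

rng = np.random.default_rng(0)
shape = (4, 5)  # (frequency bins, frames), toy size
Y = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mask = bounded_crm(3 * rng.standard_normal(shape), 3 * rng.standard_normal(shape))
S_hat = apply_crm(Y, mask)
```

With `bound=1.0`, no output bin in this sketch can exceed the magnitude of the corresponding input bin, whereas a direct-mapping network produces Ŝ without such a multiplicative tie to Y.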

Generative approach. Common choices for generative modeling are GANs and diffusion models. Here, we focus on diffusion:

Diff (Diffusion-based speech enhancement): iteratively denoises a signal conditioned on the noisy input.

Key results

Predictive speech enhancement (direct mapping, complex ratio masking) is comparatively easy to steer under small energy budgets ε.
Diffusion-based speech enhancement is more resistant; at similar budgets, target alignment weakens and the perturbation hides better.
Introducing stochasticity in the reverse diffusion further improves robustness; removing randomness or reducing the number of sampling steps makes attacks easier.

Audio Demos

This gallery presents four attacker–user pairs. You can switch the active pair (1–4) using the arrow controls in each setting’s header; all files within a setting update to the selected pair.

References: S^attacker (clean reference of the attacker’s target phrase), S^user (the user’s clean speech), and Y^user (the user’s noisy observation).
Adversarial artifacts: δ (the perturbation alone) and Y^user + δ (the attacked mix fed to the model).
Output: Ŝ^{Y^user+δ} (the enhanced result for the attacked mix).

What You Can Explore
• Open any experimental label (e.g., “λ=20, ε=10”) and play Ŝ^{Y^user+δ} to hear how strongly the words align with the attacker’s phrase at that setting. Compare with the reference audio S^attacker.
• Compare Y^user + δ with Y^user to gauge audibility; if masking is effective, the difference is subtle.
• Solo δ to hear exactly what was added.
• Compare the models (DM, CRM, Diff) at the same settings.
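A rough numeric companion to the listening comparison is the signal-to-noise ratio of the input relative to the perturbation. The sketch below assumes time-domain arrays and uses a hypothetical helper name. SNR is only a crude proxy: it ignores psychoacoustic masking, so two perturbations with the same SNR can differ in audibility.

```python
import numpy as np

def perturbation_snr_db(y_user, delta):
    """SNR in dB of the unattacked input relative to the perturbation."""
    signal_energy = np.sum(np.abs(y_user) ** 2)
    noise_energy = np.sum(np.abs(delta) ** 2)
    return 10.0 * np.log10(signal_energy / noise_energy)

y_user = np.ones(100)
delta = 0.1 * np.ones(100)
snr = perturbation_snr_db(y_user, delta)  # amplitude ratio of 10 gives 20 dB
```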

Notation
• λ — Masking tolerance (dB). Relaxing the mask lets more of δ pass; larger λ usually strengthens steering but raises audibility.
• ε — Global ℓ₂ radius (total energy budget) for δ; larger ε permits stronger perturbations.
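One way to read the masking tolerance is as a per-bin ceiling on the level of δ. The sketch below is an illustrative interpretation, not the exact constraint used in the paper: `mask_db` is a hypothetical array of masking thresholds in dB (in practice it would come from a psychoacoustic model, which is not computed here), and any bin of δ whose level exceeds `mask_db + lam_db` is attenuated down to that ceiling.

```python
import numpy as np

def clip_to_mask(delta_stft, mask_db, lam_db):
    """Shrink bins of delta whose level (dB) exceeds mask_db + lam_db."""
    level_db = 20.0 * np.log10(np.maximum(np.abs(delta_stft), 1e-12))
    excess_db = np.maximum(level_db - (mask_db + lam_db), 0.0)
    # Attenuate only the offending bins; the phase is unchanged.
    return delta_stft * 10.0 ** (-excess_db / 20.0)

delta_stft = np.array([[1.0 + 0.0j, 0.1 + 0.0j]])  # levels: 0 dB and -20 dB
mask_db = np.array([[-30.0, -30.0]])               # hypothetical thresholds
clipped = clip_to_mask(delta_stft, mask_db, lam_db=20.0)  # ceiling: -10 dB
```

Here the first bin is pulled down from 0 dB to the -10 dB ceiling, while the second bin already sits below it and passes unchanged. Raising `lam_db` lifts the ceiling and lets more of δ through.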

Reference audio for each pair (1–4): S^attacker, S^user, Y^user.

All methods

Plain optimization (λ=–, ε=∞) · Models: DM, CRM, Diff
Audio for each model and pair (1–4): Ŝ^{Y^user+δ} and Y^user + δ.

Sorted by energy ε

Each setting below provides Ŝ^{Y^user+δ} and Y^user + δ for every listed model and each pair (1–4).

λ=20, ε=3 · Models: DM
λ=20, ε=6 · Models: DM
λ=0, ε=10 · Models: DM, CRM
λ=10, ε=10 · Models: DM, Diff
λ=20, ε=10 · Models: DM, CRM, Diff
λ=40, ε=10 · Models: DM, CRM, Diff
λ=–, ε=10 · Models: DM
λ=20, ε=15 · Models: DM
λ=20, ε=20 · Models: DM
λ=20, ε=∞ · Models: DM

Diffusion Ablation

The diffusion-SE ablations change one parameter per row and keep all others at their defaults, which isolates the effect of each parameter. Defaults: NFE=25 (number of function evaluations), σ_max=0.5, stochastic noise schedule.
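The one-parameter-at-a-time design can be written down as a small configuration sketch. The class and field names are hypothetical and chosen only for illustration; the values are the defaults stated above and the variants offered in the gallery.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class DiffusionConfig:
    nfe: int = 25                       # number of function evaluations
    sigma_max: float = 0.5
    noise_schedule: str = "stochastic"

DEFAULT = DiffusionConfig()

# Each ablation overrides exactly one field of the default configuration.
ABLATIONS = {
    "Fixed noise schedule": replace(DEFAULT, noise_schedule="fixed"),
    "NFE=15": replace(DEFAULT, nfe=15),
    "NFE=35": replace(DEFAULT, nfe=35),
    "sigma_max=0.3 (retrained)": replace(DEFAULT, sigma_max=0.3),
    "sigma_max=0.7 (retrained)": replace(DEFAULT, sigma_max=0.7),
}

def changed_fields(cfg, base=DEFAULT):
    """Return the names of the fields where cfg differs from base."""
    return [k for k, v in vars(cfg).items() if getattr(base, k) != v]
```

Calling `changed_fields` on any entry of `ABLATIONS` returns a single field name, which is a quick check that no row mixes two changes.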

Reference audio for each pair (1–4): S^attacker, S^user, Y^user.

Diffusion

Each row below provides Ŝ^{Y^user+δ} and Y^user + δ for the Diff model and each pair (1–4).

Plain optimization (λ=–, ε=∞)
Fixed noise schedule (λ=–, ε=∞)
Evaluation with NFE=15 (λ=–, ε=∞)
Evaluation with NFE=35 (λ=–, ε=∞)
Retrain with σ_max=0.3 (λ=–, ε=∞)
Retrain with σ_max=0.7 (λ=–, ε=∞)

Citation

Paper: https://arxiv.org/abs/2509.21087 · Code: https://github.com/sp-uhh/se-adversarial-attack

@misc{makarov2025modernspeechenhancementsystems,
  title        = {Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?},
  author       = {Rostislav Makarov and Lea Schönherr and Timo Gerkmann},
  year         = {2025},
  eprint       = {2509.21087},
  archivePrefix= {arXiv},
  primaryClass = {eess.AS},
  url          = {https://arxiv.org/abs/2509.21087},
}