Sattacker
Suser
Yuser
Speech enhancement (SE) is built into everyday tools. Phones and laptops use it to clean calls, earbuds suppress background noise, conferencing platforms rely on it for intelligible meetings, and studios and assistive technologies use it to clarify recorded speech. These systems do more than pass audio through - they reconstruct speech from noisy microphone signals.
This page demonstrates a concrete risk that follows from that flexibility. If a very small, carefully shaped signal is added to the microphone input, the enhanced output can be nudged toward attacker-chosen words, while most listeners hear little or no change in the input itself. The added signal is designed to sit mostly under what human hearing tends to ignore (psychoacoustic masking) and to remain very low in overall energy. In practice, it becomes easy for the model to use but hard for a person to notice.
The microphone records the user’s speech in noisy conditions - this is Yuser. An attacker adds a tiny perturbation δ (delta), crafted to be masked by the existing sound scene, producing Yuser + δ. The SE system enhances this attacked mix and outputs Ŝ. The intent is that Ŝ leans toward an attacker’s phrase Sattacker rather than the user’s original words.
SE applied to noisy input (no attack):
Adversarial steering:
More audio examples can be found at the end of the page.
We add a very small signal, δ, before enhancement and keep it tightly constrained. Two simple controls describe this constraint: a masking tolerance (λ) that allows a limited amount of energy to slip under the hearing mask, and an energy budget (ε) that caps the total strength of δ. We iteratively adjust δ so that, after enhancement, the model’s output Ŝ aligns more closely with a chosen phrase Sattacker, while most of δ remains hidden under masking and never exceeds the budget.
In controlled tests, the effect is audible in the output and barely noticeable in the input when budgets are small. As λ or ε increase, steering becomes stronger and also more audible. This is a trade-off: small settings demonstrate the risk quietly, but larger settings make the effect clearer but harder to hide. To study this behavior, we trained three models representing two classes of speech enhancement.
Predictive approach. It is a common approach to directly regress noisy input to the clean speech estimatation. Usually can be done via:
Sattacker
Suser
Yuser
Sattacker
Suser
Yuser
Sattacker
Suser
Yuser
Sattacker
Suser
Yuser
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
Diffusion-SE ablations change one parameter at a time with fixed defaults elsewhere. Defaults: NFE=25, σ_max=0.5, stochastic noise schedule. We only change one knob per row to isolate effects.
Sattacker
Suser
Yuser
Sattacker
Suser
Yuser
Sattacker
Suser
Yuser
Sattacker
Suser
Yuser
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
ŜYuser+δ
δ
Yuser + δ
Paper: https://arxiv.org/abs/2509.21087 · Code: https://github.com/sp-uhh/se-adversarial-attack
@misc{makarov2025modernspeechenhancementsystems,
title = {Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?},
author = {Rostislav Makarov and Lea Schönherr and Timo Gerkmann},
year = {2025},
eprint = {2509.21087},
archivePrefix= {arXiv},
primaryClass = {eess.AS},
url = {https://arxiv.org/abs/2509.21087},
}