Investigating Training Objectives for Generative Speech Enhancement
Generative speech enhancement has recently shown promising advancements in improving speech quality in noisy environments. Multiple diffusion-based frameworks exist, each employing distinct training objectives and learning techniques. This paper aims to explain the differences between these frameworks by focusing our investigation on score-based generative models and the Schrödinger bridge. We conduct a series of comprehensive experiments to compare their performance and highlight differing training behaviors. Furthermore, we propose a novel perceptual loss function tailored to the Schrödinger bridge framework, demonstrating enhanced performance and improved perceptual quality of the enhanced speech signals. All experimental code and pre-trained models are publicly available to facilitate further research and development in this domain.
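For readers unfamiliar with the two families of training objectives compared in the paper, the snippet below is a minimal, illustrative sketch: a sigma-weighted denoising score matching loss (the kind used by SGMSE-style models) and a plain data-prediction loss (a common choice in Schrödinger bridge variants). The function and argument names (`model`, `x0`, `y`, `sigma`, `x_t`, `t`) are placeholders, and the exact perturbation kernels, weightings, and the perceptual loss proposed in the paper differ from this simplified version.

```python
import torch

def score_matching_loss(model, x0, y, sigma):
    """Illustrative denoising score matching loss (not the paper's exact objective).

    x0:    clean speech features, shape (B, ...)
    y:     noisy speech used as conditioning, same shape as x0
    sigma: noise level at the sampled diffusion time, broadcastable to x0
    Note: SGMSE's OUVE SDE also shifts the mean of the perturbation kernel
    towards y; that mean term is omitted here for brevity.
    """
    z = torch.randn_like(x0)
    x_t = x0 + sigma * z                 # perturbed state
    score = model(x_t, y, sigma)         # network predicts the score of x_t
    target = -z / sigma                  # score of the Gaussian perturbation kernel
    return ((sigma * (score - target)) ** 2).mean()  # sigma-weighted MSE

def data_prediction_loss(model, x0, y, x_t, t):
    """Illustrative data-prediction loss: the network directly estimates the
    clean target x0 from the bridge state x_t (as in many Schrödinger bridge
    speech enhancement variants)."""
    x0_hat = model(x_t, y, t)
    return ((x0_hat - x0) ** 2).mean()
```

The two sketches highlight the practical difference studied in the paper: score-based models regress a quantity tied to the injected noise, whereas the data-prediction formulation regresses the clean signal itself, which also makes it straightforward to attach additional perceptual terms.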
We evaluate the performance of the proposed models on the VoiceBank-DEMAND (VB-DMD) dataset. The results are summarized in the table below. Models M1-M4 are based on score-based generative models for speech enhancement (SGMSE), while M5-M8 are based on the Schrödinger bridge framework.
Model | POLQA | PESQ | SI-SDR [dB] | ESTOI | DNSMOS
---|---|---|---|---|---
Noisy | 3.11 | 1.97 | 8.4 | 0.79 | 3.09 |
Conv-TasNet | 3.56 | 2.63 | 19.1 | 0.85 | 3.37 |
MetricGAN+ | 3.72 | 3.13 | 8.5 | 0.83 | 3.37 |
SE-MAMBA | 4.33 | 3.56 | 19.7 | 0.89 | 3.58 |
SGMSE+ | 3.95 | 2.93 | 17.3 | 0.87 | 3.56 |
M1 | 3.93 | 2.84 | 17.7 | 0.86 | 3.54 |
M2 | 3.96 | 2.90 | 18.0 | 0.86 | 3.55 |
M3 | 3.86 | 2.77 | 17.8 | 0.86 | 3.51 |
M4 (EDM2) | 3.87 | 2.87 | 18.0 | 0.86 | 3.54 |
M5 | 4.15 | 2.91 | 19.4 | 0.88 | 3.59 |
M6 | 4.15 | 3.70 | 8.3 | 0.86 | 3.44 |
M7 | 4.25 | 3.50 | 14.1 | 0.87 | 3.55 |
M8 | 4.20 | 3.44 | 15.3 | 0.87 | 3.58 |
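As a reference for the metrics reported above, SI-SDR (scale-invariant signal-to-distortion ratio, in dB) can be computed from time-domain signals as in the sketch below. This is one common implementation, not the paper's evaluation code; mean removal and the `eps` guard are implementation choices.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB (higher is better).
    Both inputs are 1-D time-domain signals of equal length."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference           # scaled reference ("target" component)
    noise = estimate - target            # residual distortion
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))
```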
Audio examples are provided for the noisy input, the clean reference, the baselines (Conv-TasNet, MetricGAN+, SE-MAMBA, SGMSE+), and models M1-M8.
@article{richter2024investigating,
title={Investigating Training Objectives for Generative Speech Enhancement},
author={Julius Richter and Danilo de Oliveira and Timo Gerkmann},
journal={arXiv preprint arXiv:2409.10753},
year={2024}
}