Investigating Training Objectives for Generative Speech Enhancement

Julius Richter, Danilo de Oliveira, Timo Gerkmann

Generative speech enhancement has recently shown promising advancements in improving speech quality in noisy environments. Multiple diffusion-based frameworks exist, each employing distinct training objectives and learning techniques. This paper aims to explain the differences between these frameworks by focusing on score-based generative models and the Schrödinger bridge. We conduct a series of comprehensive experiments to compare their performance and highlight differing training behaviors. Furthermore, we propose a novel perceptual loss function tailored to the Schrödinger bridge framework, which improves both performance and the perceptual quality of the enhanced speech signals. All experimental code and pre-trained models are publicly available to facilitate further research and development in this domain.

Results on VB-DMD (matched)

We evaluate the proposed models on the VoiceBank-Demand (VB-DMD) dataset. The results are summarized in the table below. Models M1-M4 are based on score-based generative models for speech enhancement (SGMSE), while M5-M8 are based on the Schrödinger bridge framework.

**Table: Speech Enhancement Performance on VB-DMD.** Values indicate the mean of the metrics.
Model	POLQA	PESQ	SI-SDR [dB]	ESTOI	DNSMOS
Noisy	3.11	1.97	8.4	0.79	3.09
Conv-TasNet	3.56	2.63	19.1	0.85	3.37
MetricGAN+	3.72	3.13	8.5	0.83	3.37
SE-MAMBA	4.33	3.56	19.7	0.89	3.58
SGMSE+	3.95	2.93	17.3	0.87	3.56
M1	3.93	2.84	17.7	0.86	3.54
M2	3.96	2.90	18.0	0.86	3.55
M3	3.86	2.77	17.8	0.86	3.51
M4 (EDM2)	3.87	2.87	18.0	0.86	3.54
M5	4.15	2.91	19.4	0.88	3.59
M6	4.15	3.70	8.3	0.86	3.44
M7	4.25	3.50	14.1	0.87	3.55
M8	4.20	3.44	15.3	0.87	3.58
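
For readers unfamiliar with the SI-SDR column, the following is a minimal sketch of how scale-invariant signal-to-distortion ratio is commonly computed. It is not taken from the authors' evaluation code: the function name `si_sdr`, the `eps` stabilizer, and the choice to skip mean removal are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals (illustrative sketch).

    Assumptions: both signals have the same length and are already
    time-aligned; no mean removal is applied; `eps` only guards
    against division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference

    # Everything in the estimate that is not the scaled reference
    # counts as distortion.
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)
    enhanced = clean + 0.1 * rng.standard_normal(16000)

    # Rescaling the estimate leaves the score (almost exactly) unchanged.
    print(si_sdr(enhanced, clean))
    print(si_sdr(3.0 * enhanced, clean))
```

Because the estimate is projected onto the reference before the ratio is taken, a model that changes the overall output level is not penalized. This is worth keeping in mind when comparing the SI-SDR column with the perceptual metrics in the table.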

Results on EARS-WHAM (mismatched)

**Table: Speech Enhancement Performance on EARS-WHAM.** Values indicate the mean of the metrics.
Model	POLQA	PESQ	SI-SDR [dB]	ESTOI	DNSMOS
Noisy	1.82	1.25	6.0	0.49	2.74
M1	2.35	1.85	6.8	0.60	3.67
M2	2.49	1.90	6.8	0.61	3.74
M3	1.12	1.05	-3.4	0.33	2.11
M5	2.67	1.86	7.9	0.63	3.72
M6	2.58	2.33	1.8	0.60	3.19
M7	2.86	2.27	5.2	0.62	3.53
M8	2.61	2.05	5.9	0.64	3.74

Citation

@article{richter2024investigating,
    title={Investigating Training Objectives for Generative Speech Enhancement},
    author={Julius Richter and Danilo de Oliveira and Timo Gerkmann},
    journal={arXiv preprint arXiv:2409.10753},
    year={2024}
}
