EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation
Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo
Gerkmann
Abstract
We release the EARS (Expressive Anechoic Recordings of Speech) dataset, a high-quality
speech dataset comprising 107 speakers from diverse backgrounds, totaling more than 100 hours of clean, anechoic
speech data. The dataset covers a wide range of speaking styles, including emotional speech, different
reading styles, non-verbal sounds, and conversational freeform speech. We benchmark various methods for speech
enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
In addition, we conduct a listening test with 20 participants for the speech enhancement task, where a generative
method is preferred. We introduce a blind test set that allows for automatic online evaluation of uploaded data.
Dataset download links and the automatic evaluation server can be found online.
EARS Dataset
The EARS dataset is characterized by its scale, diversity, and high recording quality. In Table 1, we list
characteristics of the EARS dataset in comparison to other speech datasets.
Table 1: Speech datasets. In contrast to existing datasets, the EARS dataset is of higher
recording quality, larger, and more diverse. † contains files with limited bandwidth.
EARS contains 100 h of anechoic speech recordings at 48 kHz from over 100 English speakers with high demographic
diversity. The dataset spans the full range of human speech: reading tasks in seven styles (regular, loud, whisper,
fast, slow, high pitch, and low pitch), emotional reading and freeform speech in 22 different emotional styles,
unconstrained freeform and conversational speech, and non-verbal sounds such as laughter or coughing. We provide
transcriptions of the reading portion and speaker metadata (gender, age, race, first language).
Audio Examples
Here we present a few audio examples from the EARS dataset.
p002/emo_adoration_sentences.wav
p008/emo_contentment_sentences.wav
p010/emo_cuteness_sentences.wav
p011/emo_anger_sentences.wav
p012/rainbow_05_whisper.wav
p014/rainbow_04_loud.wav
p016/rainbow_03_regular.wav
p017/rainbow_08_fast.wav
p018/vegetative_eating.wav
p019/vegetative_yawning.wav
p020/freeform_speech_01.wav
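For orientation, here is a minimal sketch of loading one of the recordings listed above with the soundfile library. The local root directory is a hypothetical placeholder; the per-speaker folders and file names follow the examples shown here.

from pathlib import Path

import soundfile as sf

EARS_ROOT = Path("EARS")  # hypothetical local path to the extracted dataset

# File names follow the per-speaker layout shown above, e.g. p012/rainbow_05_whisper.wav.
wav_path = EARS_ROOT / "p012" / "rainbow_05_whisper.wav"
audio, sr = sf.read(wav_path)  # floating-point samples in [-1, 1]

assert sr == 48000, f"expected 48 kHz fullband audio, got {sr} Hz"
print(f"{wav_path.name}: {len(audio) / sr:.1f} s at {sr} Hz")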
Benchmarks
The EARS dataset enables various speech processing tasks to be evaluated in a controlled and comparable way. Here, we
present benchmarks for speech enhancement and dereverberation.
EARS-WHAM
For the task of speech enhancement, we construct the EARS-WHAM dataset by mixing speech from the EARS dataset
with real noise recordings from the WHAM! dataset [7]. More details can
be found in the paper.
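As a concrete illustration of this construction, the sketch below mixes one clean EARS utterance with a noise clip at a randomly drawn signal-to-noise ratio. The noise file path and the SNR range are illustrative assumptions; the actual splits, SNR distribution, and any resampling used for EARS-WHAM are specified in the paper.

import numpy as np
import soundfile as sf

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at the given signal-to-noise ratio (in dB)."""
    if noise.ndim > 1:
        noise = noise[:, 0]          # use one channel of a multi-channel recording
    if len(noise) < len(speech):     # tile short noise clips to the speech length
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2) + 1e-12
    # Gain that scales the noise to the desired SNR relative to the speech.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

speech, sr = sf.read("EARS/p016/rainbow_03_regular.wav")   # clean EARS utterance at 48 kHz
noise, sr_noise = sf.read("wham_noise/example_noise.wav")  # hypothetical WHAM! noise path
assert sr == sr_noise, "resample the noise to 48 kHz first"

snr_db = np.random.default_rng(0).uniform(0.0, 20.0)  # illustrative SNR range, not the paper's
sf.write("noisy_example.wav", mix_at_snr(speech, noise, snr_db), sr)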
Results
Table 2: Results on EARS-WHAM. Values indicate the mean of the metrics over the test set.
The best results are highlighted in bold.
Here we present audio examples for the speech enhancement task. Below we show the noisy input, files processed by
Conv-TasNet [8],
CDiffuSE [9],
Demucs [10],
SGMSE+ [11],
and the clean ground truth.
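To give a sense of how such examples can be scored against the clean ground truth, here is a sketch that computes two commonly used instrumental metrics, SI-SDR and wideband PESQ, for a single enhanced file. The file paths are placeholders, and the exact metric set and implementations used for the benchmark are those described in the paper.

import numpy as np
import soundfile as sf
from librosa import resample
from pesq import pesq   # pip install pesq

def si_sdr(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = alpha * reference
    return 10 * np.log10(np.sum(target**2) / (np.sum((estimate - target) ** 2) + 1e-12))

clean, sr = sf.read("clean.wav")        # clean reference at 48 kHz (placeholder path)
enhanced, _ = sf.read("enhanced.wav")   # model output (placeholder path)
n = min(len(clean), len(enhanced))      # guard against small length mismatches
clean, enhanced = clean[:n], enhanced[:n]

print(f"SI-SDR: {si_sdr(clean, enhanced):.2f} dB")

# Wideband PESQ is defined for 16 kHz signals, so downsample before scoring.
clean_16k = resample(clean, orig_sr=sr, target_sr=16000)
enhanced_16k = resample(enhanced, orig_sr=sr, target_sr=16000)
print(f"PESQ: {pesq(16000, clean_16k, enhanced_16k, 'wb'):.2f}")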
Blind Test Set
We create a blind test set for which we publish only the noisy audio files but not the clean ground truth. It
contains 743 files (2 h) from six speakers (3 male, 3 female) who are not part of the EARS dataset, mixed with noise
recorded specifically for this test set.
Results
Table 3: Results for the blind test. Values indicate the mean of the metrics over the test
set. The best results are highlighted in bold.
Here we present audio examples for the blind test set. Below we show the noisy input and files processed by
Conv-TasNet [8],
CDiffuSE [9],
Demucs [10],
and SGMSE+ [11].
This demo showcases the denoising capabilities of SGMSE+ [11] trained using the EARS-WHAM dataset.
The red frame represents the noisy input audio, while the green frame indicates the enhanced, noise-reduced output.
Dereverberation (EARS-Reverb)
For the task of dereverberation, we use real recorded room impulse responses (RIRs) from multiple public datasets
[12, 13, 14, 15, 16, 17, 18]. We generate
reverberant speech by convolving the clean speech with these RIRs. More details can be found in the
paper.
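The sketch below illustrates this generation step: a clean utterance is convolved with a measured RIR to obtain the reverberant input. Both file paths are hypothetical, and any normalization or alignment of the direct path follows the paper; here we only peak-normalize before writing to disk.

import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, sr = sf.read("EARS/p010/emo_cuteness_sentences.wav")  # clean speech at 48 kHz
rir, sr_rir = sf.read("rir_dataset/room_a/rir_01.wav")        # hypothetical RIR path
assert sr == sr_rir, "resample the RIR to 48 kHz first"
if rir.ndim > 1:
    rir = rir[:, 0]  # use a single channel of a multi-channel RIR

# Full convolution with the RIR, trimmed back to the original utterance length.
reverberant = fftconvolve(speech, rir, mode="full")[: len(speech)]

# Simple peak normalization to avoid clipping when writing to disk.
reverberant /= max(np.max(np.abs(reverberant)), 1e-12)
sf.write("reverberant_example.wav", reverberant, sr)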
Results
Table 4: Results on EARS-Reverb. Values indicate the mean of the metrics over the test set. The
best results are highlighted in bold.
Here we present audio examples for the dereverberation task. Below we show the reverberant input, files processed
by SGMSE+ [11], and the clean ground truth.
If you use the dataset or any derivative of it, please cite our
paper:
@inproceedings{richter2024ears,
title={{EARS}: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation},
author={Julius Richter and Yi-Chiao Wu and Steven Krenn and Simon Welker and Bunlong Lay and Shinji Watanabe and Alexander Richard and Timo Gerkmann},
booktitle={ISCA Interspeech},
pages={4873--4877},
year={2024}
}
References
[1] H. Dubey, A. Aazami, V. Gopal, B. Naderi, S. Braun, R. Cutler, H. Gamper, M. Golestaneh, and R. Aichner, “ICASSP 2023 deep noise suppression challenge,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2023.
[2] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 5206–5210.
[3] K. Ito and L. Johnson, “The LJ Speech Dataset,” 2017. [Online]. Available: https://keithito.com/LJ-Speech-Dataset/
[4] J. S. Garofolo, “TIMIT acoustic phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
[5] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019. [Online]. Available: https://datashare.ed.ac.uk/handle/10283/3443
[6] J. S. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete - Linguistic Data Consortium,” 1993. [Online]. Available: https://catalog.ldc.upenn.edu/LDC93s6a
[7] G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. L. Roux, “WHAM!: Extending speech separation to noisy environments,” in ISCA Interspeech, 2019, pp. 1368–1372.
[8] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[9] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional diffusion probabilistic model for speech enhancement,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 7402–7406.
[10] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2023.
[11] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.
[12] J. Eaton, N. D. Gaubitch, A. H. Moore, and P. A. Naylor, “Estimation of room acoustic parameters: The ACE challenge,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 10, pp. 1681–1693, 2016.
[13] M. Jeub, M. Schäfer, and P. Vary, “A binaural room impulse response database for the evaluation of dereverberation algorithms,” in IEEE International Conference on Digital Signal Processing, 2009.
[14] K. Prawda, S. J. Schlecht, and V. Välimäki, “Robust selection of clean swept-sine measurements in non-stationary noise,” The Journal of the Acoustical Society of America, vol. 151, no. 3, pp. 2117–2126, 2022.
[15] D. Fejgin, W. Middelberg, and S. Doclo, “BRUDEX database: Binaural room impulse responses with uniformly distributed external microphones,” in Proc. ITG Conference on Speech Communication, 2023, pp. 126–130.
[16] D. Di Carlo, P. Tandeitnik, C. Foy, N. Bertin, A. Deleforge, and S. Gannot, “dEchorate: A calibrated room impulse response dataset for echo-aware signal processing,” EURASIP Journal on Audio, Speech, and Music Processing, 2021.
[17] S. V. Amengual Gari, B. Sahin, D. Eddy, and M. Kob, “Open database of spatial room impulse responses at Detmold University of Music,” in Audio Engineering Society Convention 149, 2020.
[18] “A Sonic Palimpsest: Revisiting Chatham Historic Dockyards.” [Online]. Available: https://research.kent.ac.uk/sonic-palimpsest/impulse-responses/