A flow-based full-band general audio codec with high perceptual quality
Simon Welker, Matthew Le, Ricky T. Q. Chen, Wei-Ning Hsu, Timo Gerkmann, Alexander Richard, Yi-Chiao Wu
We propose FlowDec [1], a neural full-band audio codec for general audio sampled at 48 kHz that combines non-adversarial codec training with a stochastic postfilter based on a novel conditional flow matching method. Compared to the prior work ScoreDec, which is based on score matching, we generalize from speech to general audio and move from 24 kbit/s to as low as 4 kbit/s, while improving output quality and reducing the required postfilter DNN evaluations from 60 to 6 without any fine-tuning or distillation techniques. We provide theoretical insights and geometric intuitions for our approach in comparison to ScoreDec as well as another recent work that uses flow matching, and conduct ablation studies on our proposed components. We show that FlowDec is a competitive alternative to the recent GAN-dominated stream of neural codecs, achieving FAD scores better than those of the established GAN-based codec DAC and listening test scores that are on par, and producing qualitatively more natural reconstructions for speech and harmonic structures in music.
Below we present audio examples from FlowDec compared to the official 44.1 kHz DAC [2], the retrained 48 kHz DAC-75 [1], ScoreDec [3], EnCodec's 48 kHz and 24 kHz models [4], Multi-Band Diffusion at 24 kHz [5], and the clean target.
For FlowDec, DAC, and DAC-75, we show examples at both a higher bitrate (7.0 to 7.75 kbps) and a lower bitrate (4.3 to 4.5 kbps). The examples are grouped into Speech, Music, and Audio files.
Clean:
FlowDec-75m (7.5 kbps) [1]:
FlowDec-75m (4.5 kbps) [1]:
DAC-75 (7.5 kbps) [1]:
DAC-75 (4.5 kbps) [1]:
Official DAC (44.1 kHz, 7.75 kbps) [2]:
Official DAC (44.1 kHz, 4.30 kbps) [2]:
ScoreDec (7.5 kbps) [3]:
EnCodec (48 kHz, 6.0 kbps) [4]:
EnCodec (24 kHz, 6.0 kbps) [4]:
Multi-Band Diffusion (24 kHz, 6.0 kbps) [5]:
Below we present audio examples from a low-feature-rate variant of FlowDec (FlowDec-25s) and a retrained DAC (DAC-25). These operate at a feature rate of 25 Hz instead of around 75 Hz, and are aimed at generative audio tasks where low feature rates are desirable for long-range sequence modeling.
Clean:
FlowDec-25s (4.0 kbps) [1]:
DAC-25 (4.0 kbps) [1]:
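To make the relation between bitrate and feature rate concrete, the following minimal sketch computes how many bits each feature frame carries for the configurations listed above. It assumes that 1 kbps means 1000 bits per second and that the "around 75 Hz" feature rate is exactly 75 Hz; the helper name `bits_per_frame` is hypothetical, and the actual codebook layout of the models is not stated on this page.

```python
def bits_per_frame(bitrate_kbps: float, feature_rate_hz: float) -> float:
    """Bits carried by one feature frame (assumes 1 kbps = 1000 bit/s)."""
    return bitrate_kbps * 1000.0 / feature_rate_hz


# (bitrate in kbps, feature rate in Hz), taken from the labels above.
# The 75 Hz value is an approximation ("around 75 Hz" in the text).
configs = {
    "FlowDec-75m @ 7.5 kbps": (7.5, 75.0),
    "FlowDec-75m @ 4.5 kbps": (4.5, 75.0),
    "FlowDec-25s @ 4.0 kbps": (4.0, 25.0),
}

for name, (kbps, hz) in configs.items():
    print(f"{name}: {bits_per_frame(kbps, hz):.0f} bits per frame")
```

Under these assumptions, the 25 Hz variant packs more bits into each frame than the 75 Hz variants while producing a third as many frames per second, which is what makes it attractive for long-range sequence modeling.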
The table below reports metrics at 7.5 kbps for FlowDec and ScoreDec at 50 and 6 NFE (neural network evaluations), alongside the DAC baselines. FlowDec shows almost no drop in performance at 6 NFE compared to 50 NFE, while ScoreDec degrades severely: its FAD x100 rises from 5.73 to 145.30 and its SI-SDR falls from 7.50 to -27.23. We used FlowDec at 6 NFE for all other evaluations and for the audio examples shown above.
Method | FAD x100 | SI-SDR | fwSSNR |
---|---|---|---|
FlowDec (NFE = 50) | 1.34 | 7.41 | 15.65 |
FlowDec (NFE = 6) | 1.62 | 7.55 | 15.46 |
ScoreDec (NFE = 50) | 5.73 | 7.50 | 14.45 |
ScoreDec (NFE = 6) | 145.30 | -27.23 | 3.15 |
DAC-75 | 4.15 | 9.23 | 16.21 |
DAC 44.1kHz (7.75 kbps) | 6.00 | 9.30 | 16.46 |
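
The effect of reducing the NFE from 50 to 6 can be read off the table directly. The short sketch below recomputes the per-metric change for FlowDec and ScoreDec from the values reported above; the dictionary layout and the helper name `delta_50_to_6` are illustrative choices, not part of any released code. Lower is better for FAD, and higher is better for SI-SDR and fwSSNR.

```python
# Values copied from the table above: (FAD x100, SI-SDR, fwSSNR).
METRIC_NAMES = ("FAD x100", "SI-SDR", "fwSSNR")
metrics = {
    "FlowDec": {50: (1.34, 7.41, 15.65), 6: (1.62, 7.55, 15.46)},
    "ScoreDec": {50: (5.73, 7.50, 14.45), 6: (145.30, -27.23, 3.15)},
}


def delta_50_to_6(model: str) -> tuple:
    """Change in each metric when going from 50 NFE to 6 NFE."""
    at_50, at_6 = metrics[model][50], metrics[model][6]
    return tuple(round(b - a, 2) for a, b in zip(at_50, at_6))


for model in metrics:
    changes = ", ".join(
        f"{name}: {d:+.2f}" for name, d in zip(METRIC_NAMES, delta_50_to_6(model))
    )
    print(f"{model} (50 -> 6 NFE): {changes}")
```

For FlowDec, every metric moves by less than 0.3, whereas ScoreDec's FAD x100 increases by more than 139 and its SI-SDR drops by more than 34.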
If you use our models, methods, or any derivatives thereof, please cite our paper:
@inproceedings{
welker2025flowdec,
title={FlowDec: A flow-based full-band general audio codec with high perceptual quality},
author={Simon Welker and Matthew Le and Ricky T. Q. Chen and Wei-Ning Hsu and Timo Gerkmann and Alexander Richard and Yi-Chiao Wu},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=uxDFlPGRLX}
}