ArrayDPS-Refine: Generative Refinement of Predictive Multi-channel Speech Enhancement

Zhongweiyang Xu*, Ashutosh Pandey†, Juan Azcarreta†, Zhaoheng Ni†, Sanjeel Parekh†, Buye Xu†

* University of Illinois Urbana-Champaign, † Meta Reality Lab

Brief Overview

Multi-channel speech enhancement aims to extract clean speech from multi-channel noisy mixtures. Most existing deep learning methods are predictive and are therefore sensitive to non-linear distortions introduced by the neural network architecture and to strong environmental noise. Motivated by ArrayDPS for unsupervised multi-channel source separation, we propose ArrayDPS-Refine, which refines a predictive model's enhanced speech using a clean-speech diffusion prior in a training-free, generative, and array-agnostic manner. ArrayDPS-Refine first estimates the noise spatial covariance matrix (SCM) from the predictive model's enhanced speech, and then uses the estimated noise SCM for diffusion posterior sampling. The method directly refines any predictive model without retraining. We show that ArrayDPS-Refine consistently improves multiple popular predictive models (including state-of-the-art waveform-domain and STFT-domain models) on both perceptual and intelligibility metrics.
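The SCM estimation step above can be sketched in NumPy. This is a minimal illustration under our own assumptions (a multi-channel noise residual obtained by subtracting the enhanced speech from the mixture, with the enhanced speech assumed available per channel, and outer products averaged over time frames); the paper's exact estimator may differ.

```python
import numpy as np

def estimate_noise_scm(mix_stft, enhanced_stft):
    """Estimate a per-frequency noise spatial covariance matrix (SCM).

    mix_stft:      (channels, frames, freqs) complex STFT of the noisy mixture
    enhanced_stft: (channels, frames, freqs) complex STFT of the predictive
                   model's enhanced speech (assumed available per channel)

    Returns a (freqs, channels, channels) Hermitian noise SCM.
    """
    # Residual noise estimate: mixture minus enhanced speech.
    noise = mix_stft - enhanced_stft             # (C, T, F)
    n = np.transpose(noise, (2, 1, 0))           # (F, T, C)
    # Average the rank-1 outer products n n^H over time frames.
    return np.einsum('ftc,ftd->fcd', n, n.conj()) / n.shape[1]
```

In practice a small diagonal loading term is often added before inverting such an SCM, since the time average can be ill-conditioned for short signals.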

Figure 1 below shows the ArrayDPS-Refine pipeline. A predictive model first estimates the clean speech. The estimated clean speech and the noisy mixtures are then used to estimate the noise spatial covariance matrix (SCM). The estimated noise SCM is then used for diffusion posterior sampling, which samples the final clean speech output.

Figure 1: ArrayDPS-Refine Pipeline
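In the posterior-sampling stage of the pipeline, the estimated noise SCM defines a complex-Gaussian likelihood for the mixture given a candidate clean-speech image, and the gradient of that likelihood steers the diffusion sampler. Below is a generic sketch of the likelihood gradient (our own illustration; the actual guidance weighting and sampling schedule in ArrayDPS-Refine may differ).

```python
import numpy as np

def gaussian_loglik_grad(y, s, scm_inv):
    """Per-bin gradient (w.r.t. conj(s)) of the complex-Gaussian log-likelihood
    log N(y; s, Phi) = -(y - s)^H Phi^{-1} (y - s) + const.

    y:       (frames, freqs, channels) mixture STFT
    s:       (frames, freqs, channels) candidate clean-speech image STFT
    scm_inv: (freqs, channels, channels) inverse of the estimated noise SCM
    """
    r = y - s                                    # residual noise estimate
    # Apply Phi^{-1} to the residual at every time-frequency bin.
    return np.einsum('fcd,tfd->tfc', scm_inv, r)
```

With an identity SCM this reduces to the plain residual, so the SCM's role is to whiten the guidance across channels and frequencies.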

Results

Audio Demo (Headphone or Earphone Highly Recommended!)

Here are some audio samples. For each model below, we provide the noisy mixture, the clean reference, the predictive model's output, and the ArrayDPS-Refine output for several values of ξ:


TADRN [1] and TADRN Refined

Methods sample 1 sample 2 sample 3
Mixture
Clean
TADRN
Refined TADRN (ξ=0.4)
Refined TADRN (ξ=0.6)
Refined TADRN (ξ=0.8)
Refined TADRN (ξ=1.0)
Refined TADRN (ξ=1.2)

FaSNet-TAC [2] and FaSNet-TAC Refined

Methods sample 1 sample 2 sample 3
Mixture
Clean
FaSNet-TAC
Refined FaSNet-TAC (ξ=0.4)
Refined FaSNet-TAC (ξ=0.6)
Refined FaSNet-TAC (ξ=0.8)
Refined FaSNet-TAC (ξ=1.0)
Refined FaSNet-TAC (ξ=1.2)

USES2 [3] and USES2 Refined

Methods sample 1 sample 2 sample 3
Mixture
Clean
USES2
Refined USES2 (ξ=0.4)
Refined USES2 (ξ=0.6)
Refined USES2 (ξ=0.8)
Refined USES2 (ξ=1.0)
Refined USES2 (ξ=1.2)

References

[1] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. Wang, “Time-domain ad-hoc array speech enhancement using a triple-path network,” in Proc. Interspeech 2022, 2022, pp. 729–733.

[2] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6394–6398.

[3] W. Zhang, J.-w. Jung, and Y. Qian, “Improving design of input condition invariant speech enhancement,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10696–10700.