Multi-channel speech enhancement aims to extract clean speech from multi-channel noisy mixtures. Most existing deep learning methods are predictive and therefore sensitive to non-linear distortions introduced by the network architecture and to strong environmental noise. Motivated by ArrayDPS for unsupervised multi-channel source separation, we propose ArrayDPS-Refine, which refines a predictive model's enhanced speech using a clean-speech diffusion prior in a training-free, generative, and array-agnostic manner. ArrayDPS-Refine first estimates the noise spatial covariance matrix (SCM) from the predictive model's enhanced speech and the noisy mixtures, and then uses the estimated noise SCM for diffusion posterior sampling. The method directly refines any predictive model without retraining. We show that ArrayDPS-Refine consistently improves multiple popular predictive models (including state-of-the-art waveform-domain and STFT-domain models) on both perceptual and intelligibility metrics.
Figure 1 below shows the ArrayDPS-Refine pipeline. A predictive model first estimates the clean speech. The estimated clean speech and the noisy mixtures are then used to estimate the noise spatial covariance matrix (SCM). The estimated noise SCM is then used for diffusion posterior sampling, which produces the final clean speech output.
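The noise-SCM estimation step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the predictive model's enhanced speech is available at every microphone (e.g. after projecting the estimate back to the array), treats the per-channel residual between the mixture and the enhanced speech as noise, and averages the outer products over time frames to obtain one SCM per frequency bin. The function name and array layout are our own choices for the example.

```python
import numpy as np

def estimate_noise_scm(mixture_stft: np.ndarray, enhanced_stft: np.ndarray) -> np.ndarray:
    """Estimate per-frequency noise spatial covariance matrices (SCMs).

    mixture_stft:  complex array of shape (C, F, T) -- multi-channel mixture STFT
    enhanced_stft: complex array of shape (C, F, T) -- enhanced speech at each channel
                   (assumption: the single-channel estimate has been projected to all mics)
    Returns:       complex array of shape (F, C, C), one Hermitian SCM per frequency bin.
    """
    # Residual between mixture and enhanced speech, treated as the noise component.
    noise = mixture_stft - enhanced_stft            # shape (C, F, T)
    num_frames = noise.shape[-1]
    # Average the spatial outer product n(f, t) n(f, t)^H over time frames.
    scm = np.einsum('cft,dft->fcd', noise, np.conj(noise)) / num_frames
    return scm
```

The resulting SCMs are Hermitian positive semi-definite by construction, which is the form a Gaussian noise model in diffusion posterior sampling would expect.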
Here are some audio samples. For each sample, the following demos are provided (ξ denotes the guidance-weight hyperparameter used in diffusion posterior sampling):
TADRN [1] and TADRN Refined
| Methods | sample 1 | sample 2 | sample 3 |
|---|---|---|---|
| Mixture | |||
| Clean | |||
| TADRN | |||
| Refined TADRN (ξ=0.4) | |||
| Refined TADRN (ξ=0.6) | |||
| Refined TADRN (ξ=0.8) | |||
| Refined TADRN (ξ=1.0) | |||
| Refined TADRN (ξ=1.2) | |||
FaSNet-TAC [2] and FaSNet-TAC Refined
| Methods | sample 1 | sample 2 | sample 3 |
|---|---|---|---|
| Mixture | |||
| Clean | |||
| FaSNet-TAC | |||
| Refined FaSNet-TAC (ξ=0.4) | |||
| Refined FaSNet-TAC (ξ=0.6) | |||
| Refined FaSNet-TAC (ξ=0.8) | |||
| Refined FaSNet-TAC (ξ=1.0) | |||
| Refined FaSNet-TAC (ξ=1.2) | |||
USES2 [3] and USES2 Refined
| Methods | sample 1 | sample 2 | sample 3 |
|---|---|---|---|
| Mixture | |||
| Clean | |||
| USES2 | |||
| Refined USES2 (ξ=0.4) | |||
| Refined USES2 (ξ=0.6) | |||
| Refined USES2 (ξ=0.8) | |||
| Refined USES2 (ξ=1.0) | |||
| Refined USES2 (ξ=1.2) | |||
[1] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. Wang, “Time-domain ad-hoc array speech enhancement using a triple-path network,” in Proc. Interspeech 2022, 2022, pp. 729–733.
[2] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6394–6398.
[3] W. Zhang, J.-w. Jung, and Y. Qian, “Improving design of input condition invariant speech enhancement,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10696–10700.