We propose Uni-ArrayDPS, a novel diffusion-based refinement framework for unified multi-channel speech enhancement and separation. Existing methods for multi-channel speech enhancement/separation are mostly discriminative and highly effective at producing high-SNR outputs. However, they can still generate unnatural speech with non-linear distortions introduced by the neural network and its regression-based objective. To address this issue, Uni-ArrayDPS refines the outputs of any strong discriminative model using a speech diffusion prior. Uni-ArrayDPS is generative, array-agnostic, and training-free, and supports both enhancement and separation. Given a discriminative model's enhanced/separated speech, we use it, together with the noisy mixtures, to estimate the noise spatial covariance matrix (SCM). This SCM then provides the likelihood required for diffusion posterior sampling of the clean speech source(s). Uni-ArrayDPS requires only a pre-trained clean-speech diffusion model as a prior and no additional training or fine-tuning, allowing it to generalize directly across tasks (enhancement/separation), microphone array geometries, and discriminative model backbones. Extensive experiments show that Uni-ArrayDPS consistently improves a wide range of discriminative models on both enhancement and separation, and it also achieves strong results on a real-world dataset.
Figure 1 below shows the Uni-ArrayDPS pipeline. A discriminative model first estimates the clean speech sources. These estimates and the noisy mixtures are then used to estimate the noise spatial covariance matrix (SCM). The estimated noise SCM drives diffusion posterior sampling, which produces the final clean speech outputs.
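As a rough illustration of the SCM-estimation and likelihood steps above, the NumPy sketch below treats the residual between the mixture and a per-channel discriminative estimate as noise, forms a per-frequency noise SCM, and evaluates the complex-Gaussian log-likelihood that would guide posterior sampling. The shapes, names (`Y`, `S_hat`, `Phi_n`), and per-channel-estimate assumption are illustrative, not the paper's actual implementation.

```python
import numpy as np

def estimate_noise_scm(Y, S_hat, eps=1e-8):
    """Estimate the noise spatial covariance matrix (SCM) per frequency.

    Y:     mixture STFT, shape (C, F, T) = (channels, freq bins, frames)
    S_hat: discriminative estimate of the clean source at each mic, (C, F, T)
    Returns Phi_n of shape (F, C, C), with Phi_n[f] ≈ E_t[ n(f,t) n(f,t)^H ].
    """
    N = Y - S_hat                       # residual treated as noise
    C, F, T = N.shape
    n = N.transpose(1, 0, 2)            # (F, C, T)
    Phi = np.einsum('fct,fdt->fcd', n, n.conj()) / T
    Phi += eps * np.eye(C)[None]        # diagonal loading for stability
    return Phi

def gaussian_loglik(Y, S, Phi_n):
    """log p(Y | S) under a zero-mean complex Gaussian noise model with
    per-frequency SCM Phi_n, up to additive constants."""
    N = (Y - S).transpose(1, 0, 2)      # (F, C, T)
    Phi_inv = np.linalg.inv(Phi_n)      # (F, C, C)
    quad = np.einsum('fct,fcd,fdt->', N.conj(), Phi_inv, N).real
    _, logabsdet = np.linalg.slogdet(Phi_n)
    T = Y.shape[-1]
    return -(quad + T * logabsdet.sum())
```

In a diffusion posterior sampling loop, the gradient of such a log-likelihood with respect to the current clean-speech estimate would be combined with the diffusion prior's score at each reverse step; the sketch only covers the likelihood side.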
Three sets of demos are provided: simulated enhancement, separation, and RealMan enhancement.
Speech Enhancement Demo (Simulated)
| Methods | sample_0000 | sample_0001 | sample_0002 |
|---|---|---|---|
| Noisy | |||
| Clean | |||
| FaSNet-TAC [2] | |||
| Refined FaSNet-TAC (α=1) | |||
| Refined FaSNet-TAC (α=0.5) | |||
| TADRN [1] | |||
| Refined TADRN (α=1) | |||
| Refined TADRN (α=0.5) | |||
| USES [3] | |||
| Refined USES (α=1) | |||
| Refined USES (α=0.5) | |||
| SpatialNet [4] | |||
| Refined SpatialNet (α=1) | |||
| Refined SpatialNet (α=0.5) |
Speech Separation Demo
| Methods | sample_0000 | sample_0001 | sample_0002 |
|---|---|---|---|
| Mixture (ch0) | |||
| Clean (2 sources) | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| FaSNet-TAC [2] | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| Refined FaSNet-TAC (α=1) | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| Refined FaSNet-TAC (α=0.5) | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| USES [3] | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| Refined USES (α=1) | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| Refined USES (α=0.5) | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| SpatialNet [4] | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| Refined SpatialNet (α=1) | src1<br>src2 | src1<br>src2 | src1<br>src2 |
| Refined SpatialNet (α=0.5) | src1<br>src2 | src1<br>src2 | src1<br>src2 |
RealMan Enhancement Demo
| Methods | sample 0 | sample 1 | sample 2 |
|---|---|---|---|
| Noisy | |||
| Clean | |||
| FaSNet-TAC [2] | |||
| Refined FaSNet-TAC (α=1) | |||
| Refined FaSNet-TAC (α=0.5) | |||
| TADRN [1] | |||
| Refined TADRN (α=1) | |||
| Refined TADRN (α=0.5) | |||
| USES [3] | |||
| Refined USES (α=1) | |||
| Refined USES (α=0.5) | |||
| SpatialNet [4] | |||
| Refined SpatialNet (α=1) | |||
| Refined SpatialNet (α=0.5) |
[1] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. Wang, “Time-domain ad-hoc array speech enhancement using a triple-path network,” in Proc. Interspeech 2022, 2022, pp. 729–733.
[2] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6394–6398.
[3] W. Zhang, J.-w. Jung, and Y. Qian, “Improving design of input condition invariant speech enhancement,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10696–10700.
[4] C. Quan and X. Li, “SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1310–1323, 2024.