Unified Diffusion Refinement for Multi-Channel Speech Enhancement and Separation

Zhongweiyang Xu*, Ashutosh Pandey†, Juan Azcarreta†, Zhaoheng Ni†, Sanjeel Parekh†, Buye Xu†, Romit Roy Choudhury*

* University of Illinois Urbana-Champaign, † Meta Reality Labs

Brief Overview

We propose Uni-ArrayDPS, a novel diffusion-based refinement framework for unified multi-channel speech enhancement and separation. Existing methods for multi-channel speech enhancement/separation are mostly discriminative and are highly effective at producing high-SNR outputs. However, they can still generate unnatural speech with non-linear distortions introduced by the neural network and its regression-based training objectives. To address this issue, Uni-ArrayDPS refines the outputs of any strong discriminative model using a clean-speech diffusion prior. Uni-ArrayDPS is generative, array-agnostic, and training-free, and supports both enhancement and separation. Given a discriminative model's enhanced/separated speech, we use it, together with the noisy mixtures, to estimate the noise spatial covariance matrix (SCM). This SCM then defines the likelihood required for diffusion posterior sampling of the clean speech source(s). Because Uni-ArrayDPS needs only a pre-trained clean-speech diffusion model as its prior, with no additional training or fine-tuning, it generalizes directly across tasks (enhancement/separation), microphone array geometries, and discriminative model backbones. Extensive experiments show that Uni-ArrayDPS consistently improves a wide range of discriminative models on both enhancement and separation tasks, and it also achieves strong results on a real-world dataset (RealMan).
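
To make the refinement recipe concrete, here is a minimal numpy sketch of the noise-SCM estimation step described above. This is not the authors' released code: the channel-wise subtraction (which assumes the speech estimate is available at every microphone) and the diagonal loading are our simplifying assumptions.

```python
import numpy as np

def estimate_noise_scm(Y, S_hat, eps=1e-6):
    """Estimate a per-frequency noise spatial covariance matrix (SCM).

    Y     : (M, F, T) complex STFT of the M-channel noisy mixture.
    S_hat : (M, F, T) complex STFT of the discriminative model's speech
            estimate (assumed here to be available at every channel).
    Returns Phi_n : (F, M, M), one Hermitian SCM per frequency bin.
    """
    N = Y - S_hat                                  # residual noise estimate
    M, F, T = N.shape
    # Average the outer products N(f, t) N(f, t)^H over time frames.
    Phi_n = np.einsum('mft,nft->fmn', N, N.conj()) / T
    # Diagonal loading (our choice) keeps each SCM well-conditioned,
    # so its inverse can be used in the likelihood term.
    Phi_n = Phi_n + eps * np.eye(M)[None, :, :]
    return Phi_n
```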

Figure 1 below shows our Uni-ArrayDPS pipeline. A discriminative model first estimates the clean speech sources. These estimates, together with the noisy mixtures, are then used to estimate the noise spatial covariance matrix (SCM). The estimated noise SCM drives diffusion posterior sampling, which produces the final clean speech outputs.

Figure 1: Uni-ArrayDPS Pipeline
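
And here is a schematic PyTorch view of how that SCM enters the diffusion posterior sampling stage of the pipeline in Figure 1. This is only a sketch: `denoiser`, `prior_step`, `timesteps`, and the guidance weight `zeta` are illustrative placeholders, and we simplify the source's spatial mapping to the array to identity; the paper's actual sampler and noise schedule may differ.

```python
import torch

def dps_refine(Y, Phi_n_inv, denoiser, prior_step, timesteps, zeta=0.5):
    """Schematic diffusion-posterior-sampling loop (single-source case).

    Y          : (M, F, T) complex mixture STFT.
    Phi_n_inv  : (F, M, M) inverse of the estimated noise SCM.
    denoiser   : pretrained clean-speech diffusion prior; x0_hat = denoiser(x, t).
    prior_step : one reverse-diffusion update of the prior,
                 x_prev = prior_step(x, x0_hat, t).
    zeta       : likelihood guidance weight (illustrative name).
    """
    _, F, T = Y.shape
    x = torch.randn(F, T, dtype=Y.dtype)           # start from prior noise
    for t in timesteps:                            # e.g., reversed(range(num_steps))
        x = x.detach().requires_grad_(True)
        x0_hat = denoiser(x, t)                    # clean-speech estimate at step t
        # Residual between the mixture and the current speech estimate;
        # the source's spatial mapping is simplified to identity (broadcast).
        resid = Y - x0_hat.unsqueeze(0)
        # Gaussian negative log-likelihood under the estimated noise SCM:
        #   sum over (f, t) of resid(f,t)^H Phi_n(f)^{-1} resid(f,t)
        nll = torch.einsum('mft,fmn,nft->', resid.conj(), Phi_n_inv, resid).real
        (grad,) = torch.autograd.grad(nll, x)
        # Prior update plus likelihood guidance: the DPS step.
        x = prior_step(x, x0_hat, t) - zeta * grad
    return x.detach()
```

The guidance weight here is plausibly related to the α/ξ knobs swept in the demos below, though we do not know the paper's exact parameterization.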

Audio Demos (Headphones or Earphones Highly Recommended!)

Three sets of demos are provided below: simulated enhancement, separation, and RealMan enhancement, followed by additional per-model demos that sweep the parameter ξ for TADRN, FaSNet-TAC, and USES2.


Speech Enhancement Demo (Simulated)

Methods sample_0000 sample_0001 sample_0002
Noisy
Clean
FaSNet-TAC [2]
Refined FaSNet-TAC (α=1)
Refined FaSNet-TAC (α=0.5)
TADRN [1]
Refined TADRN (α=1)
Refined TADRN (α=0.5)
USES [3]
Refined USES (α=1)
Refined USES (α=0.5)
SpatialNet [4]
Refined SpatialNet (α=1)
Refined SpatialNet (α=0.5)

Speech Separation Demo

Methods sample_0000 sample_0001 sample_0002 (each separated result contains two clips per sample: src1 and src2)
Mixture (ch0)
Clean (2 sources)
FaSNet-TAC [2]
Refined FaSNet-TAC (α=1)
Refined FaSNet-TAC (α=0.5)
USES [3]
Refined USES (α=1)
Refined USES (α=0.5)
SpatialNet [4]
Refined SpatialNet (α=1)
Refined SpatialNet (α=0.5)

RealMan Enhancement Demo

Methods sample 0 sample 1 sample 2
Noisy
Clean
FaSNet-TAC [2]
Refined FaSNet-TAC (α=1)
Refined FaSNet-TAC (α=0.5)
TADRN [1]
Refined TADRN (α=1)
Refined TADRN (α=0.5)
USES [3]
Refined USES (α=1)
Refined USES (α=0.5)
SpatialNet [4]
Refined SpatialNet (α=1)
Refined SpatialNet (α=0.5)

TADRN and TADRN Refined

Methods sample 1 sample 2 sample 3
Mixture
Clean
TADRN
Refined TADRN (ξ=0.4)
Refined TADRN (ξ=0.6)
Refined TADRN (ξ=0.8)
Refined TADRN (ξ=1.0)
Refined TADRN (ξ=1.2)

FaSNet-TAC and FaSNet-TAC Refined

Methods sample 1 sample 2 sample 3
Mixture
Clean
FaSNet-TAC
Refined FaSNet-TAC (ξ=0.4)
Refined FaSNet-TAC (ξ=0.6)
Refined FaSNet-TAC (ξ=0.8)
Refined FaSNet-TAC (ξ=1.0)
Refined FaSNet-TAC (ξ=1.2)

USES2 and USES2 Refined

Methods sample 1 sample 2 sample 3
Mixture
Clean
USES2
Refined USES2 (ξ=0.4)
Refined USES2 (ξ=0.6)
Refined USES2 (ξ=0.8)
Refined USES2 (ξ=1.0)
Refined USES2 (ξ=1.2)

References

[1] A. Pandey, B. Xu, A. Kumar, J. Donley, P. Calamia, and D. Wang, “Time-domain ad-hoc array speech enhancement using a triple-path network,” in Proc. Interspeech, 2022, pp. 729–733.

[2] Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in Proc. IEEE ICASSP, 2020, pp. 6394–6398.

[3] W. Zhang, J.-w. Jung, and Y. Qian, “Improving design of input condition invariant speech enhancement,” in Proc. IEEE ICASSP, 2024, pp. 10696–10700.

[4] C. Quan and X. Li, “SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1310–1323, 2024.