SpatialCodec: Neural Spatial Speech Coding

[code][pdf]

Zhongweiyang Xu*, Yong Xu‡, Vinay Kothapally‡, Heming Wang♯, Muqiao Yang†, Dong Yu‡

* University of Illinois Urbana-Champaign, † Carnegie Mellon University, ‡ Tencent AI Lab, Bellevue, USA, ♯ The Ohio State University

Brief Overview

In this work, we address the challenge of efficiently compressing microphone-array recordings while preserving crucial spatial cues. We propose a neural spatial audio coding framework that achieves high compression ratios by combining a single-channel sub-band codec with a spatial codec. Our approach has two phases: first, a neural sub-band codec encodes the reference channel at a low bit rate; second, a SpatialCodec captures the relative spatial information needed for accurate multichannel reconstruction at the decoder. We also propose several novel evaluation metrics that measure how well spatial cues are preserved. This project pioneers the application of neural networks to multi-microphone speech coding and demonstrates solid performance. The framework has broad potential across domains such as telecommunication and immersive audio systems, and we hope it motivates further exploration of this research area. We train and evaluate our framework on synthesized reverberant speech, which contains rich spatial information.

Figure 1 below shows our two-branch framework. The first branch codes a reference-channel audio signal, while the second branch codes the spatial information. Assume we want to code M channels. The first branch takes the reference channel's STFT, encodes it, and then reconstructs the STFT; it is trained on single-channel recordings with a combination of reconstruction and adversarial losses. The second branch (SpatialCodec) takes both spatial and spectral features (the spatial covariance matrix) as input, encodes them, and then predicts (M-1) complex ratio filters. These filters are applied to the first branch's output STFT to reconstruct the remaining (M-1) channels' STFTs, which together with the first branch's output give an M-channel reconstruction. The second branch is trained with a pure SNR loss, assuming the original reference-channel audio is available at the decoder; at inference, however, the complex ratio filters are applied to the reconstructed reference channel.

Figure 1: Our Two-Branch Pipeline, Training and Inference
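To make this decoding step concrete, here is a minimal sketch, assuming per-time-frequency-bin filters and the tensor shapes given in the comments (the actual SpatialCodec may use filters that span neighboring time-frequency bins; the function names are ours for illustration):

```python
import torch

def reconstruct_channels(ref_stft: torch.Tensor, crf: torch.Tensor) -> torch.Tensor:
    """Apply per-bin complex ratio filters to the reference-channel STFT.

    ref_stft: (B, F, T) complex STFT of the reference channel
              (original during training, decoded during inference).
    crf:      (B, M-1, F, T) complex ratio filters predicted by SpatialCodec.
    Returns:  (B, M-1, F, T) complex STFTs of the non-reference channels.
    """
    return crf * ref_stft.unsqueeze(1)  # broadcast over the channel axis

def snr_loss(est: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SNR in dB between time-domain estimates and targets.

    est, target: (B, M-1, N) real-valued waveforms (after inverse STFT).
    """
    noise = target - est
    snr = 10.0 * torch.log10(
        target.pow(2).sum(dim=-1) / (noise.pow(2).sum(dim=-1) + eps) + eps
    )
    return -snr.mean()
```

At inference, the same `reconstruct_channels` call is made, but with the STFT of the decoded reference channel in place of the original.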

Results

The methods we evaluate are as follows (we use M = 8 channels):

- Reverberated Clean (Groundtruth): the uncoded multichannel recording.
- OPUS12 / OPUS6 (Channel Independent): the Opus codec applied to each channel independently at two bit rates.
- ENCODEC / HIFICODEC / SUB-BAND CODEC (Channel Independent): single-channel neural codecs applied to each channel independently.
- MIMO E2E: an end-to-end multichannel (multiple-input, multiple-output) codec baseline.
- ENCODEC+SpatialCodec / HIFICODEC+SpatialCodec / SUB-BAND CODEC+SpatialCodec (OURS): our two-branch framework with different codecs for the reference channel.

The metrics we use measure both reference-channel reconstruction quality and how well spatial cues are preserved, e.g., through beamforming toward the ground-truth DOA, DOA estimation with MUSIC [1], and relative transfer functions [2].
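As one hedged illustration of how spatial-cue preservation can be quantified (the precise metric definitions are given in the paper), the sketch below estimates the relative transfer function (RTF) of a channel with respect to the reference via the classical cross-power-spectrum ratio [2], and compares coded against ground-truth RTFs with a Hermitian angle; the function names and shapes are our own:

```python
import numpy as np

def estimate_rtf(stft_m: np.ndarray, stft_ref: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Estimate the relative transfer function of channel m w.r.t. the reference.

    stft_m, stft_ref: (F, T) complex STFTs of one channel and the reference.
    Returns: (F,) RTF, i.e. the time-averaged cross-power spectrum divided
    by the reference auto-power spectrum.
    """
    cross = np.mean(stft_m * np.conj(stft_ref), axis=-1)
    auto = np.mean(np.abs(stft_ref) ** 2, axis=-1)
    return cross / (auto + eps)

def rtf_angle_error(rtf_est: np.ndarray, rtf_true: np.ndarray) -> float:
    """Hermitian angle (radians) between estimated and ground-truth RTFs."""
    num = np.abs(np.vdot(rtf_true, rtf_est))  # vdot conjugates the first argument
    den = np.linalg.norm(rtf_true) * np.linalg.norm(rtf_est) + 1e-8
    return float(np.arccos(np.clip(num / den, 0.0, 1.0)))
```

A smaller angle error means the coded signal keeps inter-channel level and phase relations closer to the ground truth.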

Audio Demo

Here are some audio samples. For each multi-channel sample, we first demo each channel independently. Then we beamform toward the ground-truth DOA and play the beamforming results. Last, we show a spatial feature visualization of the sample.
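For reference, the beamforming step can be approximated with a simple frequency-domain delay-and-sum beamformer steered toward the known DOA, as in the sketch below; the array geometry, sampling rate, and far-field plane-wave assumption are illustrative assumptions on our part, not necessarily the demo's exact setup:

```python
import numpy as np

def delay_and_sum(stft: np.ndarray, doa_deg: float, mic_pos: np.ndarray,
                  fs: int = 16000, n_fft: int = 512, c: float = 343.0) -> np.ndarray:
    """Frequency-domain delay-and-sum beamformer steered toward a known DOA.

    stft:    (M, F, T) complex multichannel STFT with F = n_fft // 2 + 1.
    doa_deg: azimuth of the source in degrees (far-field plane wave assumed).
    mic_pos: (M, 2) microphone xy-coordinates in meters (assumed geometry).
    Returns: (F, T) beamformed STFT.
    """
    # Unit vector pointing from the array toward the source.
    azimuth = np.deg2rad(doa_deg)
    direction = np.array([np.cos(azimuth), np.sin(azimuth)])
    # Relative per-microphone delays for a plane wave from that direction;
    # the sign convention depends on the geometry definition, flip if needed.
    delays = -(mic_pos @ direction) / c                                  # (M,)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)                           # (F,)
    steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])    # (M, F)
    # Compensate the delays and average across microphones.
    return np.mean(np.conj(steering)[:, :, None] * stft, axis=0)
```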


Sample 1: Spatial Feature Visualization

Figure 2: Spatial Feature Visualization

Sample 1: All-Channel Outputs

Methods C1 C2 C3 C4 C5 C6 C7 C8
Reverberated Clean (Groundtruth)
OPUS12 (Channel Independent)
OPUS6 (Channel Independent)
HIFICODEC (Channel Independent)
SUB-BAND CODEC (Channel Independent)
MIMO E2E
ENCODEC+SpatialCodec (OURS)
HIFICODEC+SpatialCodec (OURS)
SUB-BAND CODEC+SpatialCodec (OURS)

Sample 1: Beamforming Outputs

Methods Beamformed Result
Reverberated Clean (Groundtruth)
OPUS12 (Channel Independent)
OPUS6 (Channel Independent)
ENCODEC (Channel Independent)
HIFICODEC (Channel Independent)
SUB-BAND CODEC (Channel Independent)
MIMO E2E
ENCODEC+SpatialCodec (OURS)
HIFICODEC+SpatialCodec (OURS)
SUB-BAND CODEC+SpatialCodec (OURS)

Sample 2: Spatial Feature Visualization

Figure 3: Spatial Feature Visualization

Sample 2: All-Channel Outputs

Methods C1 C2 C3 C4 C5 C6 C7 C8
Reverberated Clean (Groundtruth)
OPUS12 (Channel Independent)
OPUS6 (Channel Independent)
HIFICODEC (Channel Independent)
SUB-BAND CODEC (Channel Independent)
MIMO E2E
ENCODEC+SpatialCodec (OURS)
HIFICODEC+SpatialCodec (OURS)
SUB-BAND CODEC+SpatialCodec (OURS)

Sample 2: Beamforming Outputs

Methods Beamformed Result
Reverberated Clean (Groundtruth)
OPUS12 (Channel Independent)
OPUS6 (Channel Independent)
ENCODEC (Channel Independent)
HIFICODEC (Channel Independent)
SUB-BAND CODEC (Channel Independent)
MIMO E2E
ENCODEC+SpatialCodec (OURS)
HIFICODEC+SpatialCodec (OURS)
SUB-BAND CODEC+SpatialCodec (OURS)

Sample 3: Spatial Feature Visualization

Figure 4: Spatial Feature Visualization

Sample 3: All-Channel Outputs

Methods C1 C2 C3 C4 C5 C6 C7 C8
Reverberated Clean (Groundtruth)
OPUS12 (Channel Independent)
OPUS6 (Channel Independent)
HIFICODEC (Channel Independent)
SUB-BAND CODEC (Channel Independent)
MIMO E2E
ENCODEC+SpatialCodec (OURS)
HIFICODEC+SpatialCodec (OURS)
SUB-BAND CODEC+SpatialCodec (OURS)

Sample 3: Beamforming Outputs

Methods Beamformed Result
Reverberated Clean (Groundtruth)
OPUS12 (Channel Independent)
OPUS6 (Channel Independent)
ENCODEC (Channel Independent)
HIFICODEC (Channel Independent)
SUB-BAND CODEC (Channel Independent)
MIMO E2E
ENCODEC+SpatialCodec (OURS)
HIFICODEC+SpatialCodec (OURS)
SUB-BAND CODEC+SpatialCodec (OURS)

Sample 4: Spatial Feature Visualization

Figure 5: Spatial Feature Visualization

Sample 4: All-Channel Outputs

Methods C1 C2 C3 C4 C5 C6 C7 C8
Reverberated Clean (Groundtruth)
OPUS12 (Channel Independent)
OPUS6 (Channel Independent)
HIFICODEC (Channel Independent)
SUB-BAND CODEC (Channel Independent)
MIMO E2E
ENCODEC+SpatialCodec (OURS)
HIFICODEC+SpatialCodec (OURS)
SUB-BAND CODEC+SpatialCodec (OURS)

Sample 4: Beamforming Outputs

Methods Beamformed Result
Reverberated Clean (Groundtruth)
OPUS12 (Channel Independent)
OPUS6 (Channel Independent)
ENCODEC (Channel Independent)
HIFICODEC (Channel Independent)
SUB-BAND CODEC (Channel Independent)
MIMO E2E
ENCODEC+SpatialCodec (OURS)
HIFICODEC+SpatialCodec (OURS)
SUB-BAND CODEC+SpatialCodec (OURS)

References

[1] Ralph O. Schmidt, "Multiple emitter location and signal parameter estimation," IEEE Transactions on Antennas and Propagation, 1986.

[2] Ryan Corey and Andrew Singer, "Relative Transfer Function Estimation from Speech Keywords," pp. 238–247, June 2018.