In this work, we address the challenge of efficiently compressing microphone-array recordings while preserving crucial spatial cues. We propose a neural spatial audio coding framework that achieves high compression ratios by combining a single-channel sub-band codec with a spatial codec. Our approach comprises two phases: first, a neural sub-band codec encodes the reference channel at low bit rates; second, a SpatialCodec captures relative spatial information for accurate multichannel reconstruction at the decoder. We also propose several novel evaluation metrics that measure how well spatial cues are preserved. This project pioneers the application of neural networks to multi-microphone speech coding and demonstrates promising performance. The framework has broad potential across domains such as telecommunication and immersive audio systems, and we hope it motivates further exploration in this research area. We train and evaluate the framework on synthesized reverberant speech, which contains rich spatial information.
Figure 1 below shows our dual-branch framework. The first branch codes the reference-channel audio, while the second branch codes the spatial information. Assume we want to code M channels. The first branch takes the reference channel's STFT, encodes it, and reconstructs the STFT; it is trained on single-channel recordings with a combination of reconstruction and adversarial losses. The second branch (SpatialCodec) takes both spatial and spectral features (the spatial covariance matrix) as input, encodes them, and reconstructs (M-1) complex ratio filters. These filters are applied to the first branch's output STFT to reconstruct the remaining (M-1) channels' STFTs, which together with the first branch's output yield an M-channel reconstruction. The second branch is trained with a pure SNR loss, under the assumption that the original reference-channel audio is available at the decoder; at inference time, however, the complex ratio filters are applied to the reconstructed reference channel.
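To make the two branches concrete, here is a minimal PyTorch sketch of the operations described above: the spatial covariance input feature, the application of the (M-1) complex ratio filters to the reference-channel STFT, and the SNR training loss. The tensor layouts, the filter context length of 2K+1 frames, and the helper names (`spatial_covariance`, `apply_crf`, `snr_loss`) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


def spatial_covariance(stft: torch.Tensor) -> torch.Tensor:
    """Time-averaged spatial covariance matrix, the SpatialCodec input feature.

    stft: (M, F, T) complex multichannel STFT -> (F, M, M) covariance per bin.
    """
    x = stft.permute(1, 0, 2)  # (F, M, T)
    return torch.einsum("fmt,fnt->fmn", x, x.conj()) / stft.shape[-1]


def apply_crf(ref_stft: torch.Tensor, crf: torch.Tensor) -> torch.Tensor:
    """Predict the (M-1) non-reference channels from the reference STFT.

    ref_stft: (F, T) complex reference-channel STFT.
    crf:      (M-1, F, T, 2K+1) complex ratio filters, one filter of 2K+1
              time taps per channel and time-frequency bin (our assumed layout).
    """
    K = (crf.shape[-1] - 1) // 2
    padded = F.pad(ref_stft, (K, K))           # zero-pad the time axis
    context = padded.unfold(-1, 2 * K + 1, 1)  # (F, T, 2K+1) local time context
    return torch.einsum("ftk,mftk->mft", context, crf)  # (M-1, F, T)


def snr_loss(est: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SNR in dB between time-domain estimate and target, mean over channels."""
    noise = target - est
    snr = 10 * torch.log10((target.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -snr.mean()
```

Note that during training `ref_stft` is the clean reference-channel STFT, while at inference it is the first branch's decoded STFT, so reference-channel coding artifacts propagate into all reconstructed channels.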
The methods we evaluate are as follows (we use M = 8 channels):
The metrics we use are as follows:
Here are some audio samples. For each multi-channel sample, we first demo each channel independently. Then we beamform toward the ground-truth DOA and play the beamformed results. Finally, we show the spatial feature visualization of the sample.
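Since the ground-truth DOA is available from the room simulation, the beamformed demos can be produced with a simple far-field delay-and-sum beamformer steered at that direction. The NumPy sketch below is a minimal illustration under a far-field plane-wave assumption; the function name, argument layout, and delay sign convention are our assumptions, and the demo pipeline may use a different beamformer.

```python
import numpy as np


def delay_and_sum(stft, mic_pos, doa, fs=16000, n_fft=512, c=343.0):
    """Steer a far-field delay-and-sum beamformer toward a known DOA.

    stft:    (M, F, T) complex multichannel STFT, F = n_fft // 2 + 1
    mic_pos: (M, 3) microphone coordinates in meters
    doa:     (3,) unit vector pointing from the array toward the source
    returns: (F, T) complex beamformed STFT
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)  # (F,) bin center frequencies
    delays = -(mic_pos @ doa) / c               # (M,) per-mic arrival delays
    # Far-field steering vector: the expected phase of each channel.
    steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])  # (M, F)
    # Phase-align every channel to the look direction, then average.
    return (np.conj(steering)[:, :, None] * stft).mean(axis=0)
```

Listening to the beamformed ground truth next to each codec's beamformed output is a quick audible test of spatial fidelity: a codec that corrupts inter-channel phase will defocus the beamformer and let in more reverberation.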
Sample 1: Spatial Feature Visualization
Sample 1: All-Channel Outputs
Methods | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
Reverberant Clean (Ground Truth) | ||||||||
OPUS12 (Channel Independent) | ||||||||
OPUS6 (Channel Independent) | ||||||||
HIFICODEC (Channel Independent) | ||||||||
SUB-BAND CODEC (Channel Independent) | ||||||||
MIMO E2E | ||||||||
ENCODEC+SpatialCodec (OURS) | ||||||||
HIFICODEC+SpatialCodec (OURS) | ||||||||
SUB-BAND CODEC+SpatialCodec (OURS) | ||||||||
Sample 1: Beamforming Outputs
Methods | Beamformed Result |
---|---|
Reverberant Clean (Ground Truth) | |
OPUS12 (Channel Independent) | |
OPUS6 (Channel Independent) | |
ENCODEC (Channel Independent) | |
HIFICODEC (Channel Independent) | |
SUB-BAND CODEC (Channel Independent) | |
MIMO E2E | |
ENCODEC+SpatialCodec (OURS) | |
HIFICODEC+SpatialCodec (OURS) | |
SUB-BAND CODEC+SpatialCodec (OURS) | |
Sample 2: Spatial Feature Visualization
Sample 2: All-Channel Outputs
Methods | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
Reverberant Clean (Ground Truth) | ||||||||
OPUS12 (Channel Independent) | ||||||||
OPUS6 (Channel Independent) | ||||||||
HIFICODEC (Channel Independent) | ||||||||
SUB-BAND CODEC (Channel Independent) | ||||||||
MIMO E2E | ||||||||
ENCODEC+SpatialCodec (OURS) | ||||||||
HIFICODEC+SpatialCodec (OURS) | ||||||||
SUB-BAND CODEC+SpatialCodec (OURS) | ||||||||
Sample 2: Beamforming Outputs
Methods | Beamformed Result |
---|---|
Reverberant Clean (Ground Truth) | |
OPUS12 (Channel Independent) | |
OPUS6 (Channel Independent) | |
ENCODEC (Channel Independent) | |
HIFICODEC (Channel Independent) | |
SUB-BAND CODEC (Channel Independent) | |
MIMO E2E | |
ENCODEC+SpatialCodec (OURS) | |
HIFICODEC+SpatialCodec (OURS) | |
SUB-BAND CODEC+SpatialCodec (OURS) | |
Sample 3: Spatial Feature Visualization
Sample 3: All-Channel Outputs
Methods | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
Reverberant Clean (Ground Truth) | ||||||||
OPUS12 (Channel Independent) | ||||||||
OPUS6 (Channel Independent) | ||||||||
HIFICODEC (Channel Independent) | ||||||||
SUB-BAND CODEC (Channel Independent) | ||||||||
MIMO E2E | ||||||||
ENCODEC+SpatialCodec (OURS) | ||||||||
HIFICODEC+SpatialCodec (OURS) | ||||||||
SUB-BAND CODEC+SpatialCodec (OURS) | ||||||||
Sample 3: Beamforming Outputs
Methods | Beamformed Result |
---|---|
Reverberant Clean (Ground Truth) | |
OPUS12 (Channel Independent) | |
OPUS6 (Channel Independent) | |
ENCODEC (Channel Independent) | |
HIFICODEC (Channel Independent) | |
SUB-BAND CODEC (Channel Independent) | |
MIMO E2E | |
ENCODEC+SpatialCodec (OURS) | |
HIFICODEC+SpatialCodec (OURS) | |
SUB-BAND CODEC+SpatialCodec (OURS) | |
Sample 4: Spatial Feature Visualization
Sample 4: All-Channel Outputs
Methods | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
Reverberant Clean (Ground Truth) | ||||||||
OPUS12 (Channel Independent) | ||||||||
OPUS6 (Channel Independent) | ||||||||
HIFICODEC (Channel Independent) | ||||||||
SUB-BAND CODEC (Channel Independent) | ||||||||
MIMO E2E | ||||||||
ENCODEC+SpatialCodec (OURS) | ||||||||
HIFICODEC+SpatialCodec (OURS) | ||||||||
SUB-BAND CODEC+SpatialCodec (OURS) | ||||||||
Sample 4: Beamforming Outputs
Methods | Beamformed Result |
---|---|
Reverberant Clean (Ground Truth) | |
OPUS12 (Channel Independent) | |
OPUS6 (Channel Independent) | |
ENCODEC (Channel Independent) | |
HIFICODEC (Channel Independent) | |
SUB-BAND CODEC (Channel Independent) | |
MIMO E2E | |
ENCODEC+SpatialCodec (OURS) | |
HIFICODEC+SpatialCodec (OURS) | |
SUB-BAND CODEC+SpatialCodec (OURS) | |