In this work, we address the challenge of efficiently compressing microphone-array recordings while preserving crucial spatial cues. We propose a neural spatial audio coding framework that achieves high compression ratios by combining a single-channel sub-band codec with a spatial codec. Our approach comprises two phases: first, a neural sub-band codec encodes the reference channel at low bit rates; second, a SpatialCodec captures relative spatial information for accurate multichannel reconstruction at the decoder. We also propose several novel evaluation metrics that measure how well spatial cues are preserved. This project pioneers the application of neural networks to multi-microphone speech coding and demonstrates promising performance. The framework has broad potential across domains such as telecommunication and immersive audio systems, and we hope it motivates further exploration in this research area. We train and evaluate the framework on synthesized reverberant speech, which contains rich spatial information.
Figure 1 below shows our dual-branch framework. The first branch codes the reference-channel audio, while the second branch codes the spatial information. Assume we want to code M channels. The first branch takes the reference channel's STFT, encodes it, and reconstructs the STFT; it is trained on single-channel recordings with a combination of reconstruction and adversarial losses. The second branch (SpatialCodec) takes both spatial and spectral features (the spatial covariance matrix) as input, encodes them, and reconstructs (M-1) complex ratio filters. These filters are applied to the first branch's output STFT to reconstruct the remaining (M-1) channels' STFTs, which together with the first branch's output yield an M-channel reconstruction. The second branch is trained with a pure SNR loss, under the assumption that the original reference-channel audio is available at the decoder; at inference time, however, the complex ratio filters are applied to the reconstructed reference channel.
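To make the two branches concrete, here is a minimal PyTorch sketch of the operations described above: the spatial covariance input feature, the application of the (M-1) complex ratio filters to the reference-channel STFT, and the SNR training loss. The tensor layouts, the filter context length of 2K+1 frames, and the helper names (`spatial_covariance`, `apply_crf`, `snr_loss`) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


def spatial_covariance(stft: torch.Tensor) -> torch.Tensor:
    """Time-averaged spatial covariance matrix, the SpatialCodec input feature.

    stft: (M, F, T) complex multichannel STFT -> (F, M, M) covariance per bin.
    """
    x = stft.permute(1, 0, 2)  # (F, M, T)
    return torch.einsum("fmt,fnt->fmn", x, x.conj()) / stft.shape[-1]


def apply_crf(ref_stft: torch.Tensor, crf: torch.Tensor) -> torch.Tensor:
    """Predict the (M-1) non-reference channels from the reference STFT.

    ref_stft: (F, T) complex reference-channel STFT.
    crf:      (M-1, F, T, 2K+1) complex ratio filters, one filter of 2K+1
              time taps per channel and time-frequency bin (our assumed layout).
    """
    K = (crf.shape[-1] - 1) // 2
    padded = F.pad(ref_stft, (K, K))           # zero-pad the time axis
    context = padded.unfold(-1, 2 * K + 1, 1)  # (F, T, 2K+1) local time context
    return torch.einsum("ftk,mftk->mft", context, crf)  # (M-1, F, T)


def snr_loss(est: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative SNR in dB between time-domain estimate and target, mean over channels."""
    noise = target - est
    snr = 10 * torch.log10((target.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -snr.mean()
```

Note that during training `ref_stft` is the clean reference-channel STFT, while at inference it is the first branch's decoded STFT, so reference-channel coding artifacts propagate into all reconstructed channels.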
The methods we evaluate are as follows (we use M = 8 channels):
The metrics we use are as follows:
Here are some audio samples. For each multi-channel sample, we first demo each channel independently. Then we beamform toward the ground-truth DOA and play the beamformed results. Finally, we show the spatial feature visualization of the sample.
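Since the ground-truth DOA is available from the room simulation, the beamformed demos can be produced with a simple far-field delay-and-sum beamformer steered at that direction. The NumPy sketch below is a minimal illustration under a far-field plane-wave assumption; the function name, argument layout, and delay sign convention are our assumptions, and the demo pipeline may use a different beamformer.

```python
import numpy as np


def delay_and_sum(stft, mic_pos, doa, fs=16000, n_fft=512, c=343.0):
    """Steer a far-field delay-and-sum beamformer toward a known DOA.

    stft:    (M, F, T) complex multichannel STFT, F = n_fft // 2 + 1
    mic_pos: (M, 3) microphone coordinates in meters
    doa:     (3,) unit vector pointing from the array toward the source
    returns: (F, T) complex beamformed STFT
    """
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)  # (F,) bin center frequencies
    delays = -(mic_pos @ doa) / c               # (M,) per-mic arrival delays
    # Far-field steering vector: the expected phase of each channel.
    steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])  # (M, F)
    # Phase-align every channel to the look direction, then average.
    return (np.conj(steering)[:, :, None] * stft).mean(axis=0)
```

Listening to the beamformed ground truth next to each codec's beamformed output is a quick audible test of spatial fidelity: a codec that corrupts inter-channel phase will defocus the beamformer and let in more reverberation.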
Sample 1: Spatial Feature Visualization
Sample 1: All-Channel Outputs
Methods | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
Reverberant Clean (Ground Truth) | ||||||||
OPUS12 (Channel Independent) | ||||||||
OPUS6 (Channel Independent) | ||||||||
HIFICODEC (Channel Independent) | ||||||||
SUB-BAND CODEC (Channel Independent) | ||||||||
MIMO E2E | ||||||||
ENCODEC+SpatialCodec (OURS) | ||||||||
HIFICODEC+SpatialCodec (OURS) | ||||||||
SUB-BAND CODEC+SpatialCodec (OURS) | ||||||||
Sample 1: Beamforming Outputs
Methods | Beamformed Result |
---|---|
Reverberant Clean (Ground Truth) | |
OPUS12 (Channel Independent) | |
OPUS6 (Channel Independent) | |
ENCODEC (Channel Independent) | |
HIFICODEC (Channel Independent) | |
SUB-BAND CODEC (Channel Independent) | |
MIMO E2E | |
ENCODEC+SpatialCodec (OURS) | |
HIFICODEC+SpatialCodec (OURS) | |
SUB-BAND CODEC+SpatialCodec (OURS) | |
Sample 2: Spatial Feature Visualization
Sample 2: All-Channel Outputs
Methods | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
Reverberant Clean (Ground Truth) | ||||||||
OPUS12 (Channel Independent) | ||||||||
OPUS6 (Channel Independent) | ||||||||
HIFICODEC (Channel Independent) | ||||||||
SUB-BAND CODEC (Channel Independent) | ||||||||
MIMO E2E | ||||||||
ENCODEC+SpatialCodec (OURS) | ||||||||
HIFICODEC+SpatialCodec (OURS) | ||||||||
SUB-BAND CODEC+SpatialCodec (OURS) | ||||||||
Sample 2: Beamforming Outputs
Methods | Beamformed Result |
---|---|
Reverberant Clean (Ground Truth) | |
OPUS12 (Channel Independent) | |
OPUS6 (Channel Independent) | |
ENCODEC (Channel Independent) | |
HIFICODEC (Channel Independent) | |
SUB-BAND CODEC (Channel Independent) | |
MIMO E2E | |
ENCODEC+SpatialCodec (OURS) | |
HIFICODEC+SpatialCodec (OURS) | |
SUB-BAND CODEC+SpatialCodec (OURS) | |
Sample 3: Spatial Feature Visualization
Sample 3: All-Channel Outputs
Methods | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
Reverberant Clean (Ground Truth) | ||||||||
OPUS12 (Channel Independent) | ||||||||
OPUS6 (Channel Independent) | ||||||||
HIFICODEC (Channel Independent) | ||||||||
SUB-BAND CODEC (Channel Independent) | ||||||||
MIMO E2E | ||||||||
ENCODEC+SpatialCodec (OURS) | ||||||||
HIFICODEC+SpatialCodec (OURS) | ||||||||
SUB-BAND CODEC+SpatialCodec (OURS) | ||||||||
Sample 3: Beamforming Outputs
Methods | Beamformed Result |
---|---|
Reverberant Clean (Ground Truth) | |
OPUS12 (Channel Independent) | |
OPUS6 (Channel Independent) | |
ENCODEC (Channel Independent) | |
HIFICODEC (Channel Independent) | |
SUB-BAND CODEC (Channel Independent) | |
MIMO E2E | |
ENCODEC+SpatialCodec (OURS) | |
HIFICODEC+SpatialCodec (OURS) | |
SUB-BAND CODEC+SpatialCodec (OURS) | |
Sample 4: Spatial Feature Visualization
Sample 4: All-Channel Outputs
Methods | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
---|---|---|---|---|---|---|---|---|
Reverberant Clean (Ground Truth) | ||||||||
OPUS12 (Channel Independent) | ||||||||
OPUS6 (Channel Independent) | ||||||||
HIFICODEC (Channel Independent) | ||||||||
SUB-BAND CODEC (Channel Independent) | ||||||||
MIMO E2E | ||||||||
ENCODEC+SpatialCodec (OURS) | ||||||||
HIFICODEC+SpatialCodec (OURS) | ||||||||
SUB-BAND CODEC+SpatialCodec (OURS) | ||||||||
Sample 4: Beamforming Outputs
Methods | Beamformed Result |
---|---|
Reverberant Clean (Ground Truth) | |
OPUS12 (Channel Independent) | |
OPUS6 (Channel Independent) | |
ENCODEC (Channel Independent) | |
HIFICODEC (Channel Independent) | |
SUB-BAND CODEC (Channel Independent) | |
MIMO E2E | |
ENCODEC+SpatialCodec (OURS) | |
HIFICODEC+SpatialCodec (OURS) | |
SUB-BAND CODEC+SpatialCodec (OURS) | |