Multi-Source Music Generation with Latent Diffusion

[code] [pdf]

Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, Romit Roy Choudhury

University of Illinois Urbana-Champaign

Brief Overview

Most music generative models generate a single music mixture. To allow more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) [1] models a music mixture as a set of instrument sources (piano, drums, bass, and guitar), so that a composition is generated by generating its individual instruments. Despite these capabilities, MSDM is unable to generate songs with rich melodies and often generates empty sounds. In addition, diffusion in the waveform domain introduces significant Gaussian noise artifacts, which compromise audio quality. In response, we introduce a multi-source latent diffusion model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a "source latent", and our diffusion model then models these source latents jointly. This approach significantly improves both total and partial generation by leveraging the VAE's latent compression and noise robustness. The compressed source latents also make generation more efficient. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, demonstrating its practical applicability in music generation systems.
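For readers unfamiliar with the metric, FAD is commonly defined as the Fréchet distance between two Gaussians fitted to audio embeddings, one from reference audio and one from generated audio. The symbols below are notation introduced here for illustration (mean and covariance of reference embeddings, \mu_r and \Sigma_r, and of generated embeddings, \mu_g and \Sigma_g); the embedding model used for the reported scores is not stated on this page.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values indicate that the generated audio's embedding statistics are closer to those of the reference set.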

Figure 1: MSLDM architecture, training and inference
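The data flow described above can be sketched without any deep learning framework. This is a toy illustration only, not the authors' implementation: `toy_encode` and `toy_decode` are hypothetical stand-ins for the VAE encoder and decoder, `STRIDE` is an assumed compression factor, and the diffusion model that would operate on the stacked latents is omitted. It also assumes one reading of the text, namely that the same encoder is applied to every source.

```python
import random

SOURCES = ("bass", "drums", "guitar", "piano")
STRIDE = 4  # hypothetical compression factor, not taken from the paper


def toy_encode(wave, stride=STRIDE):
    """Stand-in for a VAE encoder: average each block of `stride` samples."""
    return [sum(wave[i:i + stride]) / stride for i in range(0, len(wave), stride)]


def toy_decode(latent, stride=STRIDE):
    """Stand-in for a VAE decoder: repeat each latent value `stride` times."""
    return [z for z in latent for _ in range(stride)]


def encode_sources(waves):
    """Encode every source with the same encoder and stack the source latents."""
    return [toy_encode(waves[name]) for name in SOURCES]


def decode_to_mixture(joint_latent):
    """Decode each source latent separately, then sum the sources into a mixture."""
    decoded = {name: toy_decode(z) for name, z in zip(SOURCES, joint_latent)}
    mixture = [sum(samples) for samples in zip(*decoded.values())]
    return decoded, mixture


random.seed(0)
# Four toy "waveforms" of 16 samples each, one per instrument source.
waves = {name: [random.uniform(-1.0, 1.0) for _ in range(16)] for name in SOURCES}

# The joint object a latent diffusion model would be trained on:
# 4 sources x 4 latent frames instead of 4 sources x 16 samples.
joint_latent = encode_sources(waves)
decoded, mixture = decode_to_mixture(joint_latent)
```

In this sketch the diffusion model would see `joint_latent` as a single stacked input, which is what lets it model dependencies between instruments while working on a sequence `STRIDE` times shorter than the waveform.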

Audio Demo

Here are some audio samples. The model configurations are listed below:


Total Generation Sample 1

Methods Bass Drums Piano Guitar Mixture
MSDM [1]
ISLDM
MSLDM (Ours)
MSLDM-Large (Ours)
MixLDM

Total Generation Sample 2

Methods Bass Drums Piano Guitar Mixture
MSDM [1]
ISLDM
MSLDM (Ours)
MSLDM-Large (Ours)
MixLDM

Partial Generation: conditioned on piano and guitar, generating bass and drums

Methods Condition Generated Mixture
MSDM [1]
ISLDM
MSLDM
MSLDM-Large

Partial Generation: conditioned on drums and guitar, generating bass and piano

Methods Condition Generated Mixture
MSDM [1]
ISLDM
MSLDM
MSLDM-Large

Partial Generation: conditioned on bass and guitar, generating drums and piano

Methods Condition Generated Mixture
MSDM [1]
ISLDM
MSLDM
MSLDM-Large

References

[1] Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, Emanuele Rodolà, “Multi-Source Diffusion Models for Simultaneous Music Generation and Separation”, 2024.