Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion

Technion - Israel Institute of Technology

Abstract

Zero-shot editing of signals using large pre-trained models has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, which use DDPM inversion on pre-trained diffusion models. The first, adopted from the image domain, allows text-based editing. The second is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody.


Video Overview

For people in a hurry. Images were generated by DALL-E 2 and Copilot.


1. Samples of Editing

We present samples of audio editing using our proposed methods. The samples are organized into two sections: text-based editing and unsupervised editing.

  1.1. Samples of Text-Based Editing

# | Source Prompt | Target Prompt | Original Audio | Edited Audio | Edit Tstart
1 | A recording of a sneaky jazz song. | A recording of a tense classical music score. | [audio] | [audio] | 110
2 | A recording of a hard rock song. | A recording of a jazz song. | [audio] | [audio] | 100
3 | A recording of a happy upbeat classical music piece. | A recording of a happy upbeat arcade game soundtrack. | [audio] | [audio] | 100
4 | Trumpets playing alongside a piano, bass and drums in an upbeat old-timey cool jazz song. | A banjo playing alongside a piano, bass and drums in an upbeat old-timey cool country song. | [audio] | [audio] | 90
5 | (none) | A recording of a dark techno song. | [audio] | [audio] | 90
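
To make the role of Tstart more concrete, the following is a minimal toy sketch of DDPM inversion followed by regeneration. It assumes that Tstart is the timestep from which the reverse process is run, that the noise maps are extracted from independently noised copies of the signal, and that the per-step variance is β_t. The linear `eps_model`, the additive `prompt_shift` used as a stand-in for a text prompt, and all function names are hypothetical and are not taken from the official implementation.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)  # abar[t-1] holds the cumulative product for step t = 1..T

def eps_model(x, t, prompt_shift):
    """Hypothetical noise predictor; a real one is a pre-trained network."""
    return 0.1 * x + prompt_shift

def posterior_mean(x_t, t, prompt_shift):
    """DDPM mean of x_{t-1} given x_t (t is 1-indexed)."""
    eps = eps_model(x_t, t, prompt_shift)
    return (x_t - betas[t - 1] / np.sqrt(1.0 - abar[t - 1]) * eps) / np.sqrt(alphas[t - 1])

def invert(x0, t_start, prompt_shift, rng):
    """Extract noise maps z_t that reproduce x0 under the source prompt."""
    xs = {0: x0}
    for t in range(1, t_start + 1):
        # Each x_t is noised independently from x0 (assumed construction).
        noise = rng.standard_normal(x0.shape)
        xs[t] = np.sqrt(abar[t - 1]) * x0 + np.sqrt(1.0 - abar[t - 1]) * noise
    zs = {
        t: (xs[t - 1] - posterior_mean(xs[t], t, prompt_shift)) / np.sqrt(betas[t - 1])
        for t in range(1, t_start + 1)
    }
    return xs[t_start], zs

def generate(x_start, zs, t_start, prompt_shift):
    """Run the reverse process from t_start, reusing the extracted noise maps."""
    x = x_start
    for t in range(t_start, 0, -1):
        x = posterior_mean(x, t, prompt_shift) + np.sqrt(betas[t - 1]) * zs[t]
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)      # stand-in for an audio latent
t_start = 30                      # plays the role of Tstart
x_start, zs = invert(x0, t_start, prompt_shift=0.0, rng=rng)
reconstruction = generate(x_start, zs, t_start, prompt_shift=0.0)  # same "prompt"
edited = generate(x_start, zs, t_start, prompt_shift=0.5)          # changed "prompt"
```

In this toy setting, regenerating with the same conditioning recovers the input up to floating-point error, while changing the conditioning yields a different output that still reuses the extracted noise maps. A larger Tstart leaves more reverse steps for the new conditioning to act on.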


1.2. Samples of Unsupervised Editing

For the unsupervised editing, we split the samples into two sections.
The first section (Strength changes) shows how applying the same direction with increasing strength progressively changes the audio.
The second section (PC direction changes) shows how subtracting or adding a direction removes or adds a concept.
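
As a rough illustration of these two kinds of edits, the sketch below shifts a vector along a fixed direction with different strengths and signs. It is a toy under stated assumptions: the random arrays, the function name `apply_direction`, and the value of `gamma` are hypothetical, and the directions here are drawn at random rather than discovered from a pre-trained diffusion model.

```python
import numpy as np

def apply_direction(x, direction, strength):
    """Shift x along the unit-normalized direction by the given strength."""
    unit = direction / np.linalg.norm(direction)
    return x + strength * unit

rng = np.random.default_rng(0)
x = rng.standard_normal(8)    # stand-in for the representation being edited
pc = rng.standard_normal(8)   # stand-in for one editing direction (a PC)

# Strength sweep, mirroring the "+PC" and "+2PC" columns.
sweep = [apply_direction(x, pc, s) for s in (0.0, 1.0, 2.0)]

# Signed edit, mirroring the "-γPC" and "+γPC" columns (gamma is hypothetical).
gamma = 1.5
minus = apply_direction(x, pc, -gamma)
plus = apply_direction(x, pc, gamma)
```

The sweep moves the same starting point further along one direction, while the signed pair moves it symmetrically to either side of the original.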

  1.2.1. Various Samples (Strength changes)

# | Inversion Prompt | Original Audio | Edited Audio +PC | Edited Audio +2PC | PC Interpretation | Edit Parameters
1 | A high quality recording of flutes and a trumpet playing. | [audio] | [audio] | [audio] | Melody change | t'∈[200, -1]; specific t=80 used; PCs 1+2+3
2 | A recording of a calm country song. | [audio] | [audio] | [audio] | Remove singer | t'∈[150, -1]; specific t=115 used; PCs 1+2+3
3 | (none) | [audio] | [audio] | [audio] | Just drums | t'∈[150, -1]; specific t=80 used; PCs 1+2+3
4 | A recording of a scary classical music piece. | [audio] | [audio] | [audio] | Melody change | t'∈[150, 50]; specific t=95 used; PCs 1+2+3
5 | A trumpet and a saxophone playing a cool jazz melody, with an accompaniment of a piano, bass and drums. | [audio] | [audio] | [audio] | Melody change | t'∈[135, 95]; PCs 1+2+3


  1.2.2. Various Samples (PC direction changes)

# | Inversion Prompt | Edited Audio -γPC | Original Audio | Edited Audio +γPC | PC Interpretation | Edit Parameters
1 | A high quality recording of a man singing and drums, guitar and bass playing a song, and later a woman is singing. | [audio] | [audio] | [audio] | Lead Guitar/Singers emphasis | t'∈[115, 80]; PC #1
2 | A high quality recording of a man singing and drums, guitar and bass playing a song, and later a woman is singing. | [audio] | [audio] | [audio] | Singers/Drums emphasis | t'∈[115, 80]; PC #2
3 | A high quality recording of a man singing with a rock band accompaniment. | [audio] | [audio] | [audio] | Drum-beats style | t'∈[200, -1]; specific t=80 used; PC #1



2. Comparisons to Other Methods

2.1. Comparisons of Text-Based Editing

  2.1.1. Music Samples

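# | Source Prompt | Target Prompt | Original Audio | Ours | SDEdit Tstart=100 | SDEdit Tstart=70 | SDEdit Tstart=40 | MusicGen | DDIM Inversion
1 | A recording of a rock song. | A recording of Arabic music. | [audio] | [audio] (Tstart=110) | [audio] | [audio] | [audio] | [audio] | [audio]
2 | A recording of an upbeat rock song. | A recording of an arcade game soundtrack. | [audio] | [audio] (Tstart=100) | [audio] | [audio] | [audio] | [audio] | [audio]
3 | (none) | A recording of a dark techno song. | [audio] | [audio] (Tstart=90) | [audio] | [audio] | [audio] | [audio] | [audio]
4 | A high quality recording of wind instruments and strings playing. | A high quality recording of a piano playing. | [audio] | [audio] (Tstart=70) | [audio] | [audio] | [audio] | [audio] | [audio]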


  2.1.2. Audio Samples

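# | Source Prompt | Target Prompt | Original Audio | Ours | SDEdit Tstart=150 | SDEdit Tstart=120 | SDEdit Tstart=100 | SDEdit Tstart=70 | DDIM Inversion
1 | A high quality recording of a cat meowing. | A high quality recording of a dog barking. | [audio] | [audio] (Tstart=150) | [audio] | [audio] | [audio] | [audio] | [audio]
2 | A high quality recording of a dog barking a lot. | A high quality recording of a gun shooting a lot. | [audio] | [audio] (Tstart=100) | [audio] | [audio] | [audio] | [audio] | [audio]
3 | A kid talking loudly. | A rooster crowing. | [audio] | [audio] (Tstart=110) | [audio] | [audio] | [audio] | [audio] | [audio]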


  2.2. Comparisons of Unsupervised Editing

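# | Inversion Prompt | Original Audio | Our Semantic Edit | SDEdit Tstart=115 | SDEdit Tstart=100 | SDEdit Tstart=85 | SDEdit Tstart=70 | Our Edit Parameters
1 | A high quality recording of a man singing and drums, guitar and bass playing a song, and later a woman is singing. | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | t'∈[115, 80]; PC #1
2 | A high quality recording of a man singing with a rock band accompaniment. | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | t'∈[200, -1]; specific t=80 used; PC #1
3 | (none) | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | t'∈[150, -1]; specific t=80 used; PCs 1+2+3
4 | A high quality recording of flutes and a trumpet playing. | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | t'∈[200, -1]; specific t=80 used; PCs 1+2+3
5 | A recording of a calm country song. | [audio] | [audio] | [audio] | [audio] | [audio] | [audio] | t'∈[150, -1]; specific t=115 used; PCs 1+2+3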



Paper

Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
Hila Manor, Tomer Michaeli.
arXiv

BibTeX

@article{manor2024zeroshot,
  title={Zero-Shot Unsupervised and Text-Based Audio Editing Using {DDPM} Inversion},
  author={Manor, Hila and Michaeli, Tomer},
  journal={arXiv preprint arXiv:2402.10009},
  year={2024},
}

More results and further discussion of our method can be found in the supplementary material (included in the paper) and on our supplemental examples page.
Our official code implementation can be found in the GitHub repository.

Code


Acknowledgements

This webpage was originally made by Matan Kleiner with the help of Hila Manor for SinDDM and can be used as a template.
It is inspired by the template originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code for the original template can be found here.
Many features are taken from Bootstrap. All icons are taken from Font Awesome.