Versatile Symbolic Music-for-Music Modeling via Function Alignment

Abstract

Many music AI models learn a map between music content and human-defined labels. However, many annotations, such as chords, can be naturally expressed within the music modality itself, e.g., as sequences of symbolic notes. This observation enables both understanding tasks (e.g., chord recognition) and conditional generation tasks (e.g., chord-conditioned melody generation) to be unified under a music-for-music sequence modeling paradigm. In this work, we propose parameter-efficient solutions for a variety of symbolic music-for-music tasks. The high-level idea is that (1) we use a pretrained Language Model (LM) for both the reference and the target sequence and (2) we link these two LMs via a lightweight adapter. Experiments show that our method achieves superior performance across tasks such as chord recognition, melody generation, and drum track generation.

Description of Demos

Common Settings

All compared models are symbolic music generation models.
All audio samples are rendered with the FluidR3 GM soundfont via Virtual MIDI Synth.
Note velocity, pitch bends, and other MIDI control changes are currently ignored.
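
The last setting can be pictured as a simple filter over MIDI events. The sketch below is only an illustration: the dictionary-based event format, the `simplify_events` helper, and the placeholder velocity are all assumptions, not the preprocessing actually used for these demos.

```python
# Hypothetical event format: each MIDI event is a dict with a "type" key.
# This is an illustrative assumption, not the demo's real preprocessing.
FIXED_VELOCITY = 64  # arbitrary placeholder used once velocity is ignored


def simplify_events(events):
    """Keep only note events and flatten their velocity."""
    simplified = []
    for event in events:
        if event["type"] != "note":
            # Pitch bends and control changes are dropped entirely.
            continue
        note = dict(event)  # copy so the input list is left untouched
        note["velocity"] = FIXED_VELOCITY
        simplified.append(note)
    return simplified


events = [
    {"type": "note", "pitch": 60, "start": 0.0, "end": 1.0, "velocity": 100},
    {"type": "pitch_bend", "time": 0.5, "value": 2048},
    {"type": "control_change", "time": 0.5, "control": 64, "value": 127},
    {"type": "note", "pitch": 64, "start": 1.0, "end": 2.0, "velocity": 30},
]
cleaned = simplify_events(events)
```

After filtering, only pitch and timing distinguish the notes, which is the information the compared models are evaluated on here.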

Compared Models

FA_self: Our proposed PEFT module implemented using the self-attention mechanism (as a special case to link two LMs of the same modality).
FA_cross: Our proposed PEFT module implemented using the cross-attention mechanism (as a general case to link two LMs).
cocomulla: A PEFT module for content-conditioned generation using pre-trained LMs. [Source]
melodyt5: An encoder-decoder transformer model for score generation using the ABC notation. [Source]
assistant: A multi-track MIDI infilling model with an encoder-decoder transformer. [Source]
seq2seq: A smaller transformer encoder-decoder with comparable trainable parameter size, directly trained from scratch.
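
To give a rough feel for what linking two LMs with attention can look like, the sketch below implements a generic gated cross-attention step in plain Python. It is not the FA_cross module itself: the function name, the scalar `gate`, and the omission of learned query/key/value projections are simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_adapter(target_states, reference_states, gate):
    """Add gated attention over reference states to each target state.

    Queries, keys and values are the raw hidden states here; a real
    adapter would apply learned projections first (omitted for brevity).
    """
    dim = len(target_states[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in target_states:
        # Scaled dot-product score against every reference position.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in reference_states]
        weights = softmax(scores)
        # Weighted average of reference states (the attention context).
        context = [sum(w * value[i] for w, value in zip(weights, reference_states))
                   for i in range(dim)]
        # Residual connection: the gate controls how much context is mixed in.
        outputs.append([h + gate * c for h, c in zip(query, context)])
    return outputs


target = [[1.0, 0.0], [0.0, 1.0]]                    # toy target-LM hidden states
reference = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]    # toy reference-LM hidden states
unchanged = cross_attention_adapter(target, reference, gate=0.0)
mixed = cross_attention_adapter(target, reference, gate=0.5)
```

With `gate=0.0` the target states pass through unchanged, so the target LM behaves exactly as it did before the adapter was attached; a nonzero gate lets reference information flow in.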

Drum to Others Generation

The model is given a drum track and generates the full song. The ground truths are from the RWC Pop Music Database.

The first two bars of the full song are also provided as a prompt.
The model can freely choose which General MIDI instruments to use.
Note: The drum condition might be silent at the beginning.
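
The two-bar prompt can be thought of as a slice of the target sequence taken by onset time. The sketch below assumes a hypothetical note format of `(onset_in_beats, pitch)` tuples and a fixed 4/4 meter; both are illustrative assumptions rather than details of the actual data pipeline.

```python
def first_bars(notes, num_bars=2, beats_per_bar=4):
    """Return the notes whose onset falls inside the first `num_bars` bars.

    `notes` is a list of hypothetical (onset_in_beats, pitch) tuples.
    The 4/4 default for `beats_per_bar` is an assumption.
    """
    limit = num_bars * beats_per_bar
    return [note for note in notes if note[0] < limit]


song = [(0.0, 60), (3.5, 62), (7.5, 64), (8.0, 65), (12.0, 67)]
prompt = first_bars(song)
```

Here `prompt` holds the three notes that start before beat 8; the model would continue from that point, conditioned on the drum track.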

Song	Input & Ground Truth		Our Models		Baselines
	Prompt	Ground Truth	FA_self	FA_cross	cocomulla	seq2seq
RM_P090
RM_P008
RM_P005
RM_P001

Others to Drum Generation

The model is given a full song with the drum track removed and must generate the drum track. The ground truths are from the RWC Pop Music Database.

Unlike the other tasks, no drum prompt is provided here; the model generates from the beginning.
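
Building such a condition/target pair amounts to separating percussion from everything else. The sketch below assumes the source files follow the General MIDI convention of placing percussion on channel 10 (zero-based index 9); the dictionary note format and the `split_drums` helper are hypothetical.

```python
GM_DRUM_CHANNEL = 9  # General MIDI percussion: channel 10, zero-based index 9


def split_drums(notes):
    """Split notes into (non-drum notes, drum notes) by MIDI channel."""
    drums = [n for n in notes if n["channel"] == GM_DRUM_CHANNEL]
    others = [n for n in notes if n["channel"] != GM_DRUM_CHANNEL]
    return others, drums


notes = [
    {"channel": 0, "pitch": 60, "start": 0.0},  # e.g. a piano note
    {"channel": 9, "pitch": 36, "start": 0.0},  # GM bass drum
    {"channel": 1, "pitch": 43, "start": 0.5},  # e.g. a bass note
    {"channel": 9, "pitch": 38, "start": 1.0},  # GM snare
]
condition, target = split_drums(notes)
```

In this task `condition` would be the model input and `target` the drum track to generate; swapping the two gives the setup for the reverse, drum-conditioned task.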

Song	Input & Ground Truth		Our Models		Baselines
	Prompt	Ground Truth	FA_self	FA_cross	cocomulla	seq2seq	assistant
RM_P005
RM_P003

Chord to Melody Generation

The model is given a chord sequence and generates the melody (typically monophonic). The ground truths are from the Nottingham Dataset.

The first two bars of the melody are also provided as a prompt.
Chords are played by the piano and the melody is played by the saxophone.

Song	Input & Ground Truth		Our Models		Baselines
	Prompt	Ground Truth	FA_self	FA_cross	cocomulla	seq2seq	melodyt5
jigs108
ashover28

Melody to Chord Generation

The model is given a monophonic melody and generates the chord sequence. The ground truths are from the Nottingham Dataset.

The first two bars of the chord sequence are also provided as a prompt.
Chords are played by the piano and the melody is played by the saxophone.

Song	Input & Ground Truth		Our Models		Baselines
	Prompt	Ground Truth	FA_self	FA_cross	cocomulla	seq2seq	melodyt5
waltzes5
waltzes30