
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework

AI Generated · 🤗 Daily Papers · Speech and Audio · Music Generation · 🏢 Tencent AI Lab

2501.08809
Sida Tian et al.
🤗 2025-01-16

↗ arXiv ↗ Hugging Face

TL;DR
#

Current AI music generation struggles to control musical emotion and to guarantee high-quality output, and existing methods often lack flexibility in the types of input they accept. This research tackles these challenges by introducing XMusic, a novel framework for generalized and controllable symbolic music generation.

XMusic uses a two-part system. The first part, XProjector, translates diverse input prompts (images, videos, text, tags, humming) into symbolic musical elements such as genre, emotion, rhythm, and notes. The second part, XComposer, generates music from these elements with a Transformer-based Generator, while its Selector identifies the highest-quality results via multi-task learning. The framework is trained on XMIDI, a new large-scale, high-quality dataset, and demonstrates significant improvements over current state-of-the-art methods in both objective and subjective evaluations.
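To make the two-stage flow concrete, here is a minimal sketch of how such a pipeline could be wired together. All class and method names (`parse`, `sample`, `quality_score`, the candidate count) are illustrative assumptions rather than the paper's actual API.

```python
# Hedged sketch of the XMusic two-stage flow: parse prompt -> generate candidates
# -> keep the best-scoring one. Names and signatures are assumptions.

class XMusicPipeline:
    def __init__(self, projector, generator, selector, num_candidates=8):
        self.projector = projector          # XProjector: prompt -> symbolic elements
        self.generator = generator          # XComposer Generator (Transformer decoder)
        self.selector = selector            # XComposer Selector (multi-task quality model)
        self.num_candidates = num_candidates

    def compose(self, prompt, prompt_type):
        # 1) Parse the multi-modal prompt into control signals
        #    (emotion, genre, rhythm, notes), depending on the prompt type.
        elements = self.projector.parse(prompt, prompt_type)

        # 2) Sample several candidate token sequences conditioned on those elements.
        candidates = [self.generator.sample(elements) for _ in range(self.num_candidates)]

        # 3) Score every candidate with the Selector and return the best one.
        scores = [self.selector.quality_score(tokens) for tokens in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        return candidates[best]
```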

Key Takeaways
#

Why does it matter?
#

This paper is important because it addresses a critical gap in AI-generated music: controllability and quality. It introduces a novel framework and dataset that will likely advance research in symbolic music generation, enabling the creation of more sophisticated and emotionally expressive music. The multi-modal approach and the large-scale dataset also provide valuable resources for researchers in related areas. This work could spur innovations in interactive music systems, virtual reality, and game design.


Visual Insights
#

🔼 XMusic is a symbolic music generation framework with two main components: XProjector and XComposer. XProjector takes multi-modal inputs (images, videos, text, tags, humming) and translates them into symbolic musical elements like emotion, genre, rhythm, and notes. These elements act as control signals for XComposer. XComposer consists of a Generator, which produces music based on these control signals, and a Selector, which uses a multi-task learning approach (quality, emotion, and genre assessment) to select the highest-quality output. The Generator is trained on a large dataset, XMIDI, that includes emotion and genre labels.

Figure 1: The architectural overview of our XMusic framework. It contains two essential components: XProjector and XComposer. XProjector parses various input prompts into specific symbolic music elements. These elements then serve as control signals, guiding the music generation process within the Generator of XComposer. Additionally, XComposer includes a Selector that evaluates and identifies high-quality generated music. The Generator is trained on our large-scale dataset, XMIDI, which includes precise emotion and genre labels.
| Input Prompt | Analyzed Content | Activated Elements |
|---|---|---|
| video | emotion, rhythm | $\{\mathcal{P}^{E}, \mathcal{P}^{R}\}$ |
| image | emotion | $\{\mathcal{P}^{E}\}$ |
| text | emotion | $\{\mathcal{P}^{E}\}$ |
| tag | emotion / genre | $\{\mathcal{P}^{E}\}$ / $\{\mathcal{P}^{G}\}$ |
| humming | notes, rhythm | $\{\mathcal{P}^{R}, \mathcal{P}^{N}\}$ |

🔼 This table shows the correlation between different types of input prompts (video, image, text, tag, humming) and the corresponding symbolic music elements that are activated in the XProjector module. For each prompt type, the table specifies which symbolic music elements (emotions, rhythms, notes) are primarily influenced and subsequently used to guide the music generation process. This illustrates how XMusic handles the multi-modal input and translates it into a unified representation suitable for music generation.

TABLE I: Mapping table of the input prompts and the activated elements.
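Read as code, Table I is a simple lookup from prompt type to the element families it activates. A minimal sketch (the element names mirror the table; the helper function is hypothetical):

```python
# Element families activated by each prompt type, mirroring Table I.
ACTIVATED_ELEMENTS = {
    "video":   {"emotion", "rhythm"},   # {P^E, P^R}
    "image":   {"emotion"},             # {P^E}
    "text":    {"emotion"},             # {P^E}
    "tag":     {"emotion", "genre"},    # {P^E} or {P^G}, depending on the tag
    "humming": {"rhythm", "notes"},     # {P^R, P^N}
}

def activated_elements(prompt_type: str) -> set:
    """Return the symbolic-music element families a prompt type controls."""
    return ACTIVATED_ELEMENTS[prompt_type]
```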

In-depth insights
#

XMusic Framework
#

The XMusic framework presents a novel approach to symbolic music generation, emphasizing generalization and controllability. It cleverly integrates a multi-modal prompt parser (XProjector) to handle diverse inputs (images, text, humming, etc.), transforming them into a unified symbolic music representation. This representation, enhanced from previous work, allows for fine-grained control over various musical elements (emotion, genre, rhythm). The framework’s XComposer further refines the generation process, employing a generator and selector for high-quality output and efficient filtering. The inclusion of a large-scale, annotated dataset (XMIDI) significantly contributes to its success. The multi-task learning approach in the Selector ensures high-quality results, showcasing a significant leap forward in AI-generated music.

Multimodal Prompt Parsing
#

The concept of “Multimodal Prompt Parsing” in the context of AI music generation is crucial. It tackles the challenge of translating diverse input types—images, text, audio, etc.—into a unified representation understandable by the music generation model. The core innovation lies in creating a “projection space” that maps heterogeneous data onto common musical elements like emotion, genre, rhythm, and notes. This approach elegantly solves the problem of incompatible data formats. For example, temporal information from a video is projected onto rhythm elements, while emotional cues from images are mapped to emotional attributes. This unified representation allows the model to generate music that coherently integrates all aspects of the user prompt. A significant aspect is the method’s versatility, enabling flexible and nuanced control over the generated music. The system’s ability to understand and combine disparate inputs reflects an advanced understanding of multimodal fusion, going beyond simplistic concatenation.
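As a rough illustration of that projection idea, the sketch below maps a video's emotional and temporal cues, and an image's emotional cues, onto the shared element space; the classifier arguments and the scene-cut heuristic are assumptions for illustration, not the paper's method.

```python
# Hypothetical projection of two modalities into the shared element space.

def project_video(frames, fps, emotion_classifier, scene_cut_times):
    """Emotional cues -> emotion element; temporal structure -> rhythm element."""
    emotion = emotion_classifier(frames)                        # e.g. "exciting"
    duration_min = len(frames) / fps / 60.0
    cut_rate = len(scene_cut_times) / max(duration_min, 1e-6)   # cuts per minute
    return {"emotion": emotion, "rhythm": {"density": cut_rate}}

def project_image(image, emotion_classifier):
    """A still image contributes only an emotion element."""
    return {"emotion": emotion_classifier(image)}
```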

Symbolic Music Representation
#

The choice of symbolic music representation significantly impacts the model’s ability to generate high-quality and controllable music. MIDI-like representations, encoding music as a sequence of events, are popular but struggle with rhythm modeling. Image-like representations, using matrices like piano rolls, offer advantages for capturing harmonic information but lack the temporal expressiveness of MIDI. The paper’s proposed representation improves upon existing methods by incorporating enhanced Compound Words, adding crucial tokens for instrument specification, emotional control (genre and emotion tags), and finer rhythmic details (beat density and strength). This hybrid approach aims to combine the benefits of both MIDI-like and image-like approaches while addressing their respective limitations, ultimately enabling more nuanced and precise control over music generation.
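To picture what an "enhanced Compound Word" event could carry, here is a hedged sketch of one grouped event; the field names follow the token types discussed in this review (and in Table II below), while the exact grouping used in the paper may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompoundEvent:
    """Illustrative grouping of one enhanced compound-word event (not the paper's exact schema)."""
    family: str                       # which token family the event belongs to
    # metric-side fields
    tempo: Optional[int] = None
    position_bar: Optional[int] = None
    chord: Optional[str] = None
    density: Optional[int] = None     # new: beat density (rhythm control)
    strength: Optional[int] = None    # new: beat strength (rhythm control)
    # note-side fields
    pitch: Optional[int] = None
    duration: Optional[int] = None
    velocity: Optional[int] = None
    instrument: Optional[int] = None  # new: program token for multi-track output
    # sequence-level controls
    emotion: Optional[str] = None     # new: emotion tag
    genre: Optional[str] = None       # new: genre tag
```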

XMIDI Dataset
#

The creation of the XMIDI dataset is a significant contribution to the field of AI-generated music. Its size (108,023 MIDI files) is unprecedented, dwarfing previous datasets and providing a much-needed resource for training robust music generation models. The precise emotion and genre annotations are crucial, enabling researchers to train models capable of generating music with specific emotional and stylistic characteristics. The dataset’s public availability is commendable, fostering broader collaboration and advancement in the field. However, further investigation into the dataset’s potential biases and the annotation methodology’s reliability would enhance its value and provide a deeper understanding of its strengths and limitations.

Future Directions
#

Future research should prioritize expanding the XMusic framework to encompass a wider array of input modalities, such as human motion capture data, depth information, and more nuanced textual descriptions. This would enhance the system’s versatility and allow for a more fine-grained control over the generated music’s emotional and stylistic characteristics. Additionally, improving the XMIDI dataset by incorporating more diverse musical genres, instruments, and emotional expressions is crucial. A larger and more representative dataset would significantly enhance the model’s ability to generalize to unseen musical styles and prompts. Furthermore, investigating methods to improve the interpretability and explainability of the model’s decisions is essential. Understanding how the model processes different types of inputs and generates specific musical features would contribute significantly to its wider acceptance and application. Lastly, exploring techniques for real-time generation and interactive music composition would greatly extend the potential applications of XMusic, from composing adaptive soundtracks for videos to enabling collaborative music creation in real-time.

More visual insights
#

More on figures

🔼 Figure 2 illustrates the XMusic framework, a system for generating high-quality symbolic music from various prompts. Panel (a) shows the diverse input types: images, videos, text, tags, and humming. The XProjector component (b) processes these prompts, translating them into symbolic music elements (emotions, genres, rhythms, and notes) within a common ‘Projection Space’. Next, the XComposer component’s Generator (c) uses these elements to generate the actual music, employing a Transformer Decoder to create sequential musical representations. Finally, the Selector (d) assesses the generated music’s quality via a multi-task learning process incorporating emotion recognition, genre recognition, and overall quality evaluation.

Figure 2: Illustration of the proposed XMusic, which supports flexible (a) X-Prompts to guide the generation of high-quality symbolic music. The XProjector analyzes these prompts, mapping them to symbolic music elements within the (b) Projection Space. Subsequently, the (c) Generator of XComposer transforms these symbolic music elements into token sequences based on our enhanced representation. It employs a Transformer Decoder as the generative model to predict successive events iteratively, thereby creating complete musical compositions. Finally, the (d) Selector of XComposer utilizes a Transformer Encoder to encode the complete token sequences and employs a multi-task learning scheme to evaluate the quality of the generated music.
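A minimal PyTorch-style sketch of such a Selector, assuming a shared Transformer encoder feeding three classification heads (quality, emotion, genre). Layer sizes and the loss weights are illustrative rather than the paper's settings; the 11 emotion and 6 genre classes follow XMIDI.

```python
import torch.nn as nn

class Selector(nn.Module):
    """Shared Transformer encoder with quality / emotion / genre heads (sketch)."""
    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8,
                 n_emotions=11, n_genres=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.quality_head = nn.Linear(d_model, 2)           # high vs. low quality
        self.emotion_head = nn.Linear(d_model, n_emotions)  # 11 XMIDI emotion classes
        self.genre_head = nn.Linear(d_model, n_genres)      # 6 XMIDI genre classes

    def forward(self, token_ids):                           # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids)).mean(dim=1)  # mean-pooled sequence feature
        return self.quality_head(h), self.emotion_head(h), self.genre_head(h)

def multitask_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three cross-entropy losses (weights are illustrative)."""
    ce = nn.CrossEntropyLoss()
    return sum(w * ce(o, t) for w, o, t in zip(weights, outputs, targets))
```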

🔼 Figure 3 illustrates the differences between our proposed symbolic music representation and the Compound Word (CP) representation introduced in a previous work [10]. The core enhancement in our representation is the inclusion of new token types (shown in dotted boxes) which provide explicit control over musical elements, namely: genre, emotion, and instrument type. These additional tokens improve the model’s ability to generate diverse and controlled symbolic music. The addition of the ‘Tag’ token, encapsulating ‘Emotion’ and ‘Genre’, allows for high-level semantic control over style and emotional expression. The ‘Instrument’ token, further divided into ‘Program’, facilitates the creation of multi-track compositions. The improvements to the ‘Rhythm’ token allows more precise control over rhythmic elements, namely ‘density’ and ‘strength’.

Figure 3: Comparison between our representation and Compound Word (CP) [10] representation. The dotted boxes represent our new tokens in comparison with those of the CP representation.

🔼 Figure 4 presents a visual summary of the XMIDI dataset’s key statistical properties. It consists of three subfigures: (a) illustrates the distribution of the 11 emotion classes, showing the relative frequency of each emotion in the dataset. (b) shows the distribution of the 6 genre classes, similarly displaying their proportions. (c) illustrates the distribution of music lengths across the dataset, indicating the frequency of different song durations. This figure provides a concise overview of the dataset’s characteristics in terms of emotional variety, musical genres and song lengths.

Figure 4: Data statistics of our XMIDI dataset.
More on tables
| Token Type | Vocabulary Size (CP [10]) | Vocabulary Size (Ours) | Embedding Size (CP [10]) | Embedding Size (Ours) |
|---|---|---|---|---|
| [instrument] | - | 17 (+1) | - | 64 |
| [tempo] | 58 (+2) | 65 (+2) | 128 | 256 |
| [position/bar] | 17 (+1) | 33 (+1) | 64 | 256 |
| [chord] | 133 (+2) | 133 (+2) | 256 | 256 |
| [pitch] | 86 (+1) | 256 (+1) | 512 | 1024 |
| [duration] | 17 (+1) | 32 (+1) | 128 | 512 |
| [velocity] | 24 (+1) | 44 (+1) | 128 | 512 |
| [family] | 3 | 5 | 32 | 64 |
| [density] | - | 33 (+1) | - | 128 |
| [strength] | - | 37 (+1) | - | 128 |
| [emotion] | - | 11 (+1) | - | 64 |
| [genre] | - | 6 (+1) | - | 64 |
| total | 338 (+8) | 672 (+13) | - | - |

🔼 This table details the specific implementation choices made in the design of the symbolic music representation used in the XMusic model. It compares the vocabulary size, embedding size, and token types in the proposed representation with those used in the Compound Word (CP) model from a previous work. This allows the reader to better understand the differences in the approach and scale of these two models.

TABLE II: Comparison of the implementation details between the CP and our proposed representation.
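One way to read the "Ours" columns: each token type keeps its own vocabulary and embedding width. A hedged sketch of such per-type embedding tables (the "(+N)" special tokens from the table are omitted here, and how the paper combines the embeddings is an assumption):

```python
import torch.nn as nn

# Per-token-type vocabulary and embedding sizes from the "Ours" columns of Table II.
OURS = {
    "instrument": (17, 64),    "tempo":   (65, 256),   "position_bar": (33, 256),
    "chord":      (133, 256),  "pitch":   (256, 1024), "duration":     (32, 512),
    "velocity":   (44, 512),   "family":  (5, 64),     "density":      (33, 128),
    "strength":   (37, 128),   "emotion": (11, 64),    "genre":        (6, 64),
}

# One embedding table per token type; their outputs would be concatenated
# (or projected) into the Transformer input, which is an assumption here.
embeddings = nn.ModuleDict({
    name: nn.Embedding(vocab, dim) for name, (vocab, dim) in OURS.items()
})
```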
| Dataset | Emotion Type | Genre Type | Size (Songs) |
|---|---|---|---|
| MIREX-like [55] | 5 classes | multiple | 193 |
| VGMIDI [56] | valence | video game | 95 |
| EMOPIA [11] | Russell's 4Q | pop | 387 |
| ELMG [12] | 2 classes | multiple | 11,528 |
| XMIDI (Ours) | 11 classes | 6 classes | 108,023 |

🔼 This table compares the characteristics of several existing datasets of emotion-labeled MIDI files with the new XMIDI dataset introduced in this paper. It shows the number of emotion classes, genre types (or whether genre information is included at all), and the total number of MIDI files included in each dataset. This comparison highlights the scale and comprehensiveness of the XMIDI dataset relative to prior work, particularly its significantly larger size and more detailed annotation (e.g., 11 emotion classes vs. fewer in others).

TABLE III: Comparison between existing emotion-labeled MIDI datasets and the proposed XMIDI dataset.
| Method | PCE ↓ | GS ↑ | EBR ↓ |
|---|---|---|---|
| (a) Unconditioned | | | |
| CP [10] | 2.6025 | 0.9990 | 0.0273 |
| EMOPIA [11] | 2.6756 | 0.9989 | 0.1197 |
| XMusic (Ours) | 2.5174 | 0.9992 | 0.0045 |
| (b) Video-conditioned | | | |
| CMT [19] | 2.7290 | 0.6698 | 0.0321 |
| XMusic (Ours) | 2.6161 | 0.9983 | 0.0078 |

🔼 This table presents an objective comparison of XMusic with state-of-the-art symbolic music generation methods across three key metrics: Pitch Class Histogram Entropy (PCE), Grooving Pattern Similarity (GS), and Empty Beat Rate (EBR). Lower PCE and EBR scores indicate better tonality and rhythm, while a higher GS score suggests stronger rhythmic structure. The table is divided into two sections: unconditioned and video-conditioned methods, showcasing XMusic’s superior performance in both scenarios.

TABLE IV: Objective comparison with state-of-the-art symbolic music generation methods.
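For intuition, two of the three metrics are easy to compute from a note list; a hedged sketch follows (the exact definitions, beat resolution, and any smoothing used in the paper may differ):

```python
import math
from collections import Counter

def pitch_class_entropy(pitches):
    """Entropy of the 12-bin pitch-class histogram (lower suggests clearer tonality)."""
    counts = Counter(p % 12 for p in pitches)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def empty_beat_rate(note_onsets_in_beats, num_beats):
    """Fraction of beats that contain no note onset (lower suggests steadier rhythm)."""
    occupied = {int(onset) for onset in note_onsets_in_beats}
    return 1.0 - len(occupied) / num_beats

# Toy example: a short C-major fragment over 8 beats
print(pitch_class_entropy([60, 62, 64, 65, 67, 60, 64, 67]))   # ~2.25 bits
print(empty_beat_rate([0.0, 1.0, 2.5, 3.0], num_beats=8))      # 0.5
```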
| Method | Richness | Correctness | Structuredness | Emotion-Matching | Rhythm-Matching | Overall Rank |
|---|---|---|---|---|---|---|
| (a) Unconditioned | | | | | | |
| CP [10] | 2.2323 | 2.1161 | 2.0871 | - | - | 2.1452 |
| EMOPIA [11] | 2.2387 | 2.4807 | 2.3452 | - | - | 2.3549 |
| XMusic (Ours) | 1.5290 | 1.4032 | 1.5677 | - | - | 1.5000 |
| (b) Video-conditioned | | | | | | |
| CMT [19] | 1.7129 | 1.7484 | 1.6742 | 1.6452 | 1.6807 | 1.6923 |
| XMusic (Ours) | 1.2871 | 1.2516 | 1.3258 | 1.3548 | 1.3194 | 1.3077 |
| (c) Text-conditioned | | | | | | |
| BART-base [62] | 2.3871 | 2.4806 | 2.2226 | 2.6258 | - | 2.4290 |
| GPT-4 [22] | 2.1355 | 2.0065 | 2.1903 | 1.8613 | - | 2.0484 |
| XMusic (Ours) | 1.4774 | 1.5129 | 1.5871 | 1.5129 | - | 1.5226 |
| (d) Image-conditioned | | | | | | |
| Synesthesia [63] | 1.5548 | 1.8065 | 1.7484 | 1.7936 | - | 1.7258 |
| XMusic (Ours) | 1.4452 | 1.1935 | 1.2516 | 1.2064 | - | 1.2742 |

🔼 This table presents a subjective evaluation comparing the performance of different symbolic music generation methods. Three state-of-the-art methods and the proposed XMusic framework are compared across four categories of music generation (unconditioned, video-conditioned, text-conditioned, and image-conditioned), with human participants rating the generated music samples based on metrics such as richness, correctness, structuredness, emotion matching, and rhythm matching. The average ranking for each method within each category provides a comprehensive comparison of their overall quality and strengths across different generation paradigms.

TABLE V: Subjective comparison with state-of-the-art symbolic music generation methods.
| Classes | EMOPIA [11] | XMusic (Ours) |
|---|---|---|
| Positive Valence (PV) ↑ | 38% | 76% |
| Negative Valence (NV) ↑ | 38% | 70% |
| Positive Arousal (PA) ↑ | 39% | 66% |
| Negative Arousal (NA) ↑ | 51% | 84% |

🔼 This table presents a subjective comparison of XMusic with the state-of-the-art emotion-conditioned symbolic music generation method, EMOPIA. It shows the percentage of correctly classified music pieces based on four emotion categories: Positive Valence (PV), Negative Valence (NV), Positive Arousal (PA), and Negative Arousal (NA). These categories are derived from a combination of EMOPIA’s emotion quadrants and XMusic’s more granular emotion labels, allowing for a more meaningful comparison despite the difference in granularity.

TABLE VI: Subjective comparison with state-of-the-art emotion-conditioned symbolic music generation method.
| Setting | Richness | Correctness | Structuredness | Emotion-Matching | Rhythm-Matching | Overall Rank |
|---|---|---|---|---|---|---|
| (a) Selector | | | | | | |
| without (✗) Selector | 1.5957 | 1.5878 | 1.5484 | 1.5348 | - | 1.5667 |
| with (✓) Selector | 1.4043 | 1.4122 | 1.4516 | 1.4652 | - | 1.4333 |
| (b) Emotion Control | | | | | | |
| No control | 2.2097 | 2.2903 | 2.2000 | 2.3548 | 2.3000 | 2.2710 |
| Music-level | 2.0839 | 2.0065 | 2.2000 | 2.1129 | 2.1484 | 2.1103 |
| Bar-level | 1.7064 | 1.7032 | 1.6000 | 1.5323 | 1.5516 | 1.6187 |
| (c) Music Representation | | | | | | |
| CP [10] | 3.2323 | 3.2942 | 3.7844 | - | - | 3.4370 |
| CP+Tag | 2.8179 | 2.9640 | 2.5991 | - | - | 2.7937 |
| CP+Tag+Instr | 2.4308 | 2.1195 | 2.1022 | - | - | 2.2175 |
| CP+Tag+Instr+Rhythm (Ours) | 1.5190 | 1.6223 | 1.5143 | - | - | 1.5519 |
| (d) Data Sampling | | | | | | |
| Undersampling | 2.3813 | 2.4746 | 2.1271 | - | - | 2.3277 |
| Oversampling | 2.2069 | 2.0604 | 2.3667 | - | - | 2.2113 |
| Original XMIDI | 1.4118 | 1.4650 | 1.5062 | - | - | 1.4610 |

🔼 This table presents the results of an ablation study on the XMusic model. It systematically removes or modifies different components of the model to assess their individual contributions to overall performance. Specifically, it examines the impact of the Selector, emotion control level (music-level vs. bar-level), music representation (comparing different token schemes), and data sampling strategies (undersampling and oversampling) on the subjective and objective evaluation metrics. The subjective metrics used include Richness, Correctness, Structuredness, Emotion-Matching, Rhythm-Matching, and Overall Rank, all based on human evaluations. The table provides a detailed quantitative analysis of how each modification affects the performance, allowing for a better understanding of the key components within the XMusic framework.

TABLE VII: Ablation analysis.
| Dataset | Method | Richness | Correctness | Structuredness | Overall Rank |
|---|---|---|---|---|---|
| AILabs1k7 [10] | CP [10] | 2.1521 | 2.1053 | 2.0276 | 2.0950 |
| | EMOPIA [11] | 2.2567 | 2.2318 | 2.2382 | 2.2422 |
| | XMusic (Ours) | 1.5912 | 1.6629 | 1.7342 | 1.6628 |
| AILabs1k7 [10] + EMOPIA [11] | CP [10] | 2.1016 | 2.0651 | 2.0177 | 2.0615 |
| | EMOPIA [11] | 2.2607 | 2.1889 | 2.2956 | 2.2484 |
| | XMusic (Ours) | 1.6377 | 1.7460 | 1.6867 | 1.6901 |

🔼 Table VIII presents a comparison of the performance of XMusic against state-of-the-art symbolic music generation methods on publicly available datasets. Specifically, it shows a subjective evaluation of the quality of music generated by XMusic compared to CP [10] and EMOPIA [11] across metrics such as Richness, Correctness, and Structuredness. The evaluation covers two dataset settings: AILabs1k7 alone and AILabs1k7 combined with EMOPIA.

TABLE VIII: Comparison with state-of-the-art symbolic music generation methods on public datasets.
| Setting | PCE ↓ | GS ↑ | EBR ↓ |
|---|---|---|---|
| w/o (✗) Selector | 2.5806 | 0.9991 | 0.0097 |
| with (✓) Selector | 2.5083 | 0.9994 | 0.0042 |

🔼 This table presents the results of an ablation study evaluating the impact of the proposed Selector on the overall performance of the XMusic model. It compares the objective metrics (PCE, GS, EBR) achieved by the XMusic model with and without the Selector. This demonstrates the effectiveness of the Selector in enhancing the quality of generated symbolic music.

TABLE IX: Objective ablation study on the proposed Selector.
| Classification Head: Quality | Emotion | Genre | Accuracy |
|---|---|---|---|
| | | | 83.2% |
| | | | 90.1% |
| | | | 94.8% |

🔼 This table presents an ablation study evaluating the impact of the multi-task learning scheme on the Selector component of the XMusic model. It shows how the accuracy of the quality assessment task changes when different combinations of sub-tasks (quality, emotion, and genre recognition) are included in the training process. This demonstrates the contribution of each sub-task to the overall performance of the Selector in identifying high-quality music.

TABLE X: Ablation study on the multi-task learning scheme of the Selector.
| $(\lambda_{1}, \lambda_{2})$ | (0, 1) | (1, 0) | (1, 1) | (1, 2) | (1, 3) | (2, 1) | (3, 1) |
|---|---|---|---|---|---|---|---|
| Accuracy (%) | 80.2 | 77.9 | 83.7 | 85.2 | 84.8 | 82.5 | 81.7 |

🔼 This table presents the results of an ablation study investigating the impact of different weighting factors (λ1 and λ2) on the accuracy of image emotion recognition. The study varied the weights assigned to two different models used for emotion classification (ResNet and CLIP) in the XProjector component of the XMusic system. The table shows how changes in these weights affect the overall accuracy of image emotion recognition, providing insights into the relative contributions of each model.

TABLE XI: Ablation analysis on the weighting factors $\lambda_{1}$ and $\lambda_{2}$ of the image emotion recognition task.
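The caption above indicates that two image-emotion models (ResNet and CLIP) are combined with weights λ1 and λ2. A minimal sketch of that weighted fusion, assuming both branches output probabilities over the same emotion classes; which weight belongs to which branch, and the renormalization step, are assumptions here.

```python
import numpy as np

def fuse_emotion_predictions(p_branch_a, p_branch_b, lambda1=1.0, lambda2=2.0):
    """Weighted fusion of two emotion-probability vectors (sketch).

    (lambda1, lambda2) = (1, 2) is the best-performing setting in TABLE XI.
    """
    fused = lambda1 * np.asarray(p_branch_a) + lambda2 * np.asarray(p_branch_b)
    return fused / fused.sum()

# Toy example with three emotion classes
print(fuse_emotion_predictions([0.6, 0.3, 0.1], [0.2, 0.7, 0.1]))
```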
