TL;DR#
Speech Language Models (SLMs) are expensive to train: they demand large-scale datasets and substantial compute, which restricts research and development to well-funded labs. Existing scaling laws also paint a pessimistic picture of the resources needed to train high-quality SLMs.
This paper introduces Slam, a recipe for training high-quality SLMs on a single GPU in 24 hours. The recipe combines empirical analysis of model initialization and architecture, synthetic training data, and preference optimization to maximize performance. Empirical results show that the recipe scales well with additional compute, performing on par with leading SLMs at a fraction of the cost, and it offers a more optimistic view of SLM feasibility than existing scaling laws.
Key Takeaways#
Why does it matter?#
This paper matters for SLM research because it offers a practical recipe for training high-quality models with limited resources. It challenges pessimistic views of SLM feasibility and encourages further exploration and refinement of scaling laws. The open-sourced code, models, and training recipes democratize access and open new avenues for innovation in speech and audio processing.
Visual Insights#
🔼 This figure compares the Topic-StoryCloze performance of various Speech Language Models (SLMs) against the amount of computational resources (measured in FLOPs) used for training. The size of each circle in the plot corresponds to the size of the respective SLM, providing a visual representation of the model’s scale. The plot shows that even with increased compute, performance gains level off and there are significant differences between SLMs.
Figure 1: Comparing Topic-StoryCloze performance of different SLMs as a function of training compute. Model size is indicated by the size of the circle.
Model | Div. data | Syn. data | sBLIMP | sSC | tSC | GenPPL
---|---|---|---|---|---|---
OPT-125M | ✗ | ✓ | 55.28 | 55.46 | 75.18 | 96.8
OPT-125M | ✓ | ✓ | 55.06 | 55.00 | 74.83 | 116.6
OPT-125M | ✗ | ✗ | 55.88 | 54.52 | 70.82 | 160.3
OPT-125M | ✓ | ✗ | 55.65 | 54.78 | 70.18 | 172.7
Qwen-0.5B | ✗ | ✓ | 56.45 | 55.59 | 78.01 | 88.3
Qwen-0.5B | ✓ | ✓ | 56.17 | 55.37 | 77.13 | 101.3
Qwen-0.5B | ✗ | ✗ | 56.60 | 53.50 | 71.14 | 145.4
Qwen-0.5B | ✓ | ✗ | 56.10 | 53.72 | 70.66 | 161.8
🔼 This table presents the results of an experiment investigating the impact of data diversity and synthetic data on the performance of Speech Language Models (SLMs). It compares the performance of SLMs trained using only the Libri-light and LibriSpeech datasets (representing limited diversity) versus models trained with additional diverse datasets. The impact of including synthetic data, specifically sTinyStories, is also examined. Performance is evaluated across several metrics, showing how different data compositions affect the model’s ability to perform various speech tasks.
Table 1: Analysing impact of training data diversity and synthetic data on SLM performance. The default Slam recipe does not use diverse data (only Libri-light and LibriSpeech), but uses the synthetic sTinyStories data.
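For readers who want to see how such a data mixture might be assembled in practice, below is a minimal sketch using HuggingFace `datasets`; the dataset repository names and mixing probabilities are illustrative placeholders, not the paper's actual sources or weights.

```python
# Sketch: mixing speech-token corpora with synthetic sTinyStories-style data.
# Repository names and mixing probabilities are illustrative placeholders.
from datasets import interleave_datasets, load_dataset

librilight   = load_dataset("my-org/librilight-speech-units", split="train")    # hypothetical
librispeech  = load_dataset("my-org/librispeech-speech-units", split="train")   # hypothetical
stinystories = load_dataset("my-org/stinystories-speech-units", split="train")  # hypothetical

# The default Slam recipe keeps only Libri-light/LibriSpeech plus the synthetic data;
# a "diverse" variant would additionally interleave SWC, Tedlium, PeopleSpeech, VoxPopuli, etc.
train_data = interleave_datasets(
    [librilight, librispeech, stinystories],
    probabilities=[0.45, 0.15, 0.40],  # assumed weights, for illustration only
    seed=42,
)
```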
In-depth insights#
SLM Efficiency#
The research focuses on improving the efficiency of Speech Language Models (SLMs), addressing the significant computational demands that hinder broader research and development. The paper introduces “Slam,” a recipe for training high-quality SLMs on a single GPU within 24 hours, a notable achievement compared to existing methods requiring extensive resources. The core idea revolves around optimizing various aspects of the training pipeline, including model initialization, architecture, synthetic data usage, and preference optimization. Empirical analysis demonstrates that Slam can achieve performance on par with leading SLMs but at a fraction of the computational cost. This advancement could democratize SLM research, making it accessible to researchers with limited resources. The paper’s insights challenge pessimistic views on SLM scalability, offering an optimistic outlook on the feasibility of training high-quality SLMs with limited computational resources. The study contributes to the broader field of efficient model training and may inspire further innovations in SLM architectures and training methodologies.
SLAM Recipe#
The ‘Slam recipe’ for training speech language models (SLMs) is essentially a detailed, optimized training guide. It focuses on efficiently training high-quality SLMs within tight resource constraints, such as a single GPU in a day. Key aspects include model initialization strategies, architecture choices, data selection methods (including synthetic data generation), and tailored training objectives (e.g., preference optimization). The goal is to maximize SLM performance while minimizing computational demands; this involves tuning hyperparameters, exploring optimizers and learning rate schedules, and balancing speech and text data. The recipe provides concrete steps and hyperparameter settings that researchers can reproduce and adapt, making SLM research more accessible and scalable. A minimal sketch of what such a training setup might look like in code is given below.
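As a concrete illustration, a minimal single-GPU pre-training setup in the spirit of this recipe might look as follows with HuggingFace `transformers`. The optimizer, scheduler family, and dtype follow the recipe tables later on this page; the numeric hyperparameters marked "assumed" are illustrative placeholders rather than the paper's exact values, and the dataset repository name is hypothetical.

```python
# Minimal sketch of a Slam-style single-GPU pre-training run.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# TWIST-style initialization: reuse a pretrained text LM, swap its vocabulary for speech units.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model.resize_token_embeddings(500)  # ~500 discrete speech units

# Hypothetical tokenized speech-unit dataset with input_ids/labels columns.
train_data = load_dataset("my-org/slam-speech-units", split="train")

args = TrainingArguments(
    output_dir="slam-pretrain",
    per_device_train_batch_size=8,     # assumed
    gradient_accumulation_steps=16,    # assumed
    learning_rate=1e-3,                # assumed
    lr_scheduler_type="cosine",
    warmup_ratio=0.01,                 # assumed
    max_grad_norm=0.5,                 # assumed
    optim="adamw_torch",
    bf16=True,
    max_steps=17_000,                  # assumed: however many steps fit in ~24 GPU-hours
    logging_steps=100,
)

Trainer(model=model, args=args, train_dataset=train_data).train()
```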
DPO benefits#
Direct Preference Optimization (DPO) emerges as a pivotal technique for refining Speech Language Models (SLMs), offering substantial benefits even with limited compute resources. The research indicates that integrating DPO, particularly with synthetically generated data, notably enhances SLM performance. A critical finding is that allocating a small portion of the training budget, as little as 30 minutes, to DPO yields significant improvements compared to models trained solely with next-token prediction. However, excessive DPO training can be detrimental, potentially degrading model performance if it overshadows the initial pre-training phase. This delicate balance suggests that DPO acts as a powerful fine-tuning mechanism, aligning the model’s output more closely with desired preferences. This is done without needing human feedback. Furthermore, employing a repetition penalty during DPO training helps mitigate the risk of the model generating repetitive or nonsensical outputs, a common issue associated with preference optimization methods. The results suggest that DPO improves both modeling and generation performance on the tasks considered.
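To make the preference-optimization step concrete, here is a minimal sketch of the standard DPO objective; the inputs are summed log-probabilities of the chosen and rejected spoken continuations under the trained policy and a frozen reference model. This is the generic DPO loss, not necessarily the paper's exact implementation.

```python
# Generic DPO loss, as a sketch of the preference-optimisation objective.
# policy_*/ref_* are summed log-probabilities of chosen ("preferred") and
# rejected continuations under the trained policy and the frozen reference model.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are log-likelihood ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred continuation to out-score the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```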
Text-to-speech#
Text-to-speech (TTS) plays a crucial role in enhancing SLM performance, particularly when synthetic data is used for pre-training or preference optimization. The research highlights the effectiveness of generating synthetic speech with TTS models, enabling better semantic understanding and improved generative capabilities. Different TTS models, such as single-speaker models chosen for computational efficiency, are leveraged to synthesize datasets like TinyStories. TTS-generated data also strengthens DPO, yielding more natural prosody and better language modelling on the acoustic side. In effect, a TTS model can transform an existing text corpus into target spoken tokens, making it an important factor in improving the performance of the acoustic model. A rough sketch of this pipeline is given after this paragraph.
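Below is a rough sketch of this text-to-spoken-tokens pipeline; `tts_model.synthesize` and `speech_tokenizer.encode` are hypothetical placeholders standing in for a single-speaker TTS model and a speech-unit tokenizer (e.g., a HuBERT-style quantizer), not calls from a specific library.

```python
# Sketch: turn a text corpus (e.g., TinyStories) into discrete speech-unit sequences.
# `tts_model` and `speech_tokenizer` are hypothetical placeholder objects.
def build_synthetic_split(stories, tts_model, speech_tokenizer):
    examples = []
    for text in stories:
        waveform, sample_rate = tts_model.synthesize(text)       # hypothetical TTS call
        units = speech_tokenizer.encode(waveform, sample_rate)   # discrete speech units
        examples.append({"text": text, "speech_units": units})
    return examples
```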
Model Tuning#
While the paper does not contain an explicit ‘Model Tuning’ heading, it extensively explores methods for optimizing Speech Language Models (SLMs), and this optimization in effect constitutes model tuning. The study investigates various facets of SLM architecture and training, effectively ’tuning’ the model through different initializations (TWIST, i.e. starting from pretrained text LMs), architecture and model-size choices (e.g., Qwen2.5), and hyperparameters. The optimization also extends to the data: synthetic data generation is used to add diversity and to pack more semantic information into the SLM training data. Finally, the paper applies DPO as a preference-optimization stage on top of pre-training.
More visual insights#
More on figures
🔼 Figure 2 illustrates the impact of TWIST initialization on the performance of various speech language models (SLMs) with similar parameter counts. It compares the validation perplexity (PPL), a measure of how well a model predicts unseen data, for models trained both with and without TWIST initialization. The x-axis represents the estimated FLOPs (floating-point operations), reflecting the computational cost of training, while the y-axis shows the validation PPL. The figure helps to demonstrate whether initializing SLMs with pre-trained text Language Models (like TWIST) improves performance, especially in a computationally constrained setting.
Figure 2: Comparing PPL of different models of similar parameter count, with and without TWIST initialisation.
🔼 This figure compares the validation perplexity (PPL) of speech language models (SLMs) of different sizes and architectures, all initialized with the TWIST method. It shows the impact of model size and architecture on PPL under a fixed computational budget. The x-axis represents estimated FLOPs (floating-point operations), which reflect the computational cost of training; the y-axis shows validation perplexity. The graph illustrates how the choice of underlying text LM affects perplexity under TWIST initialisation.
Figure 3: Comparing PPL of different models under TWIST initialisation.
🔼 This figure compares the validation perplexity (PPL) achieved by the best-performing speech language model (SLM) when trained using different optimization algorithms (AdamW, AdaLomo, AdEMAMix) and learning rate schedules (cosine decay, inverse square root). The goal is to determine the combination that yields the lowest validation perplexity, indicating optimal model performance within the resource constraints of a single GPU.
Figure 4: Comparing validation PPL of our best model with different optimisers and schedulers.
🔼 This figure analyzes the optimal allocation of a 24-hour compute budget between pre-training and Direct Preference Optimization (DPO) for training a speech language model. It shows how different durations dedicated to DPO affect the model’s performance, measured by various metrics. The results help determine the most efficient balance between pre-training and DPO to maximize model performance within the time constraint.
Figure 5: Analysing the optimal part of the 24 hour compute budget that should be used for DPO, with the rest used for pre-training.
More on tables
Model | Compute (GPU days) | Parameters | sBLIMP | sStoryCloze | tStoryCloze | GenPPL | Auto-BLEU
---|---|---|---|---|---|---|---
TWIST-350M Hassid et al. (2024) | 40*V100 | 305M | 56.20 | - | - | 137.3 | 3.46
TWIST-1.3B Hassid et al. (2024) | 160*V100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20
TWIST-7B Hassid et al. (2024) | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06
TWIST-13B Hassid et al. (2024) | ? | 13B | 59.20 | 55.4 | 76.4 | - | -
Scaled Optimal Cuervo and Marxer (2024) | ? | 823M | 61.3 | 56.7 | 78.0 | - | -
Predicted Optimal Cuervo and Marxer (2024) | 1*A5000 | 78M | 56.85 | 54.09 | 70.49 | - | -
TWIST-350M (Original recipe) | 1*A5000 | 305M | 51.52 ±.19 | 53.65 ±.57 | 68.80 ±.47 | 259.2 ±6.7 | 3.26 ±.46
TWIST-350M + sTinyStories | 1*A5000 | 305M | 51.21 ±.26 | 54.17 ±.54 | 72.40 ±.18 | 159.0 ±6.0 | 4.18 ±.24
Slam (-DPO) (ours) | 1*A5000 | 358M | 56.45 ±.17 | 55.59 ±.30 | 78.01 ±.27 | 88.3 ±1.0 | 3.47 ±.17
Slam (ours) | 1*A5000 | 358M | 58.86 ±.20 | 58.04 ±.51 | 82.04 ±.21 | 62.8 ±4.1 | 3.88 ±.11
🔼 This table compares the performance of the proposed SLM training recipe, Slam, against several leading SLMs. It shows key performance metrics (sBLIMP, sStoryCloze, tStoryCloze, and GenPPL) achieved by these models. A key aspect of the comparison is the compute used: Slam reaches competitive performance with significantly less compute. The table also includes a TWIST-350M model trained with the original recipe but using the Slam code and compute budget. The ‘±’ values show the spread across three separate training runs.
Table 2: Comparing slamming to leading SLMs, and predicted optimal performance for the compute. We also consider TWIST-350M using our code and compute budget, but with the original training recipe. ± indicates distance to min/max of 3 seeds.
Model | GPUs | Params | Num tokens | sBLIMP | sStoryCloze | tStoryCloze | GenPPL | Auto-BLEU
---|---|---|---|---|---|---|---|---
*Speech only pre-training* | | | | | | | |
GSLM Lakhotia et al. (2021) | 8*V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | - | -
SyllableLM Baade et al. (2024) | 4*A40 | 300M | 16B | 63.7 | - | 75.4 | - | -
TWIST-350M Hassid et al. (2024) | 8*V100 | 305M | 10.8B | 56.20 | - | - | 137.3 | 3.46
TWIST-1.3B Hassid et al. (2024) | 32*V100 | 1B | 10.8B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20
TWIST-7B Hassid et al. (2024) | 32*V100 | 7B | 36B | 59.00 | 55.3 | 74.1 | 93.74 | 3.06
TWIST-13B Hassid et al. (2024) | 32*V100 | 13B | 36B | 59.20 | 55.4 | 76.4 | - | -
Scaled Optimal Cuervo and Marxer (2024) | ? | 823M | 82B | 61.3 | 56.7 | 78.0 | - | -
Moshi Défossez et al. (2024) | ?*H100 | 7B | ? | 58.9 | 58.7 | 81.8 | - | -
SpiritLM Nguyen et al. (2025) | 64*A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | - | -
*Joint speech-text pre-training / preference optimisation* | | | | | | | |
Zeng et al. (2024) | ? | 9B | 1T | - | 62.4 | 82.9 | - | -
Moshi Défossez et al. (2024) | ?*H100 | 7B | 720B | 58.8 | 60.8 | 83.0 | - | -
SpiritLM Nguyen et al. (2025) | 64*A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | - | -
AlignSLM-1.3B Lin et al. (2024) | 64*A100 | 1B | 10.8B + 158B | 59.8 | 55.0 | 80.0 | - | -
AlignSLM-7B Lin et al. (2024) | 64*A100 | 7B | 36B + 158B | 62.3 | 61.1 | 86.8 | - | -
Slam (-DPO) | 2*A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25
Slam | 1*A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88
Slam (scaled) | 2*A100 | 358M | 16.7B + 9M | 61.11 | 61.30 | 84.18 | 46.6 | 3.75
🔼 This table analyzes how the performance of the Slam model changes when more computational resources are used during training. It compares Slam’s results (trained on a single A5000 GPU for 24 hours, then scaled to two A100 GPUs for 48 hours) against existing state-of-the-art speech language models (SLMs). Key performance metrics such as sBLIMP, StoryCloze, and perplexity are presented. The number of tokens used during training is estimated, reflecting both the pre-training phase and the optional Direct Preference Optimization (DPO) phase (indicated with a ‘+’). The goal is to show how Slam scales efficiently compared to models that require significantly greater computational resources.
Table 3: Analysing the effect of scaling up compute for Slam. Number of tokens refers to total, not necessarily unique, tokens used for training (estimated from the provided information). We separately mark DPO tokens with a +.
Parameter | Value |
---|---|
Text Base Model | Qwen2.5-0.5B |
TWIST initialisation | True |
Data | Librilight + Librispeech + sTinyStories |
Train Time | hours steps |
RoPE theta | |
Context length | |
Per device Batch Size | |
Gradient Accumulation | |
Base Learning Rate | |
Warmup Ratio | |
Optimizer | AdamW |
Learning Rate Scheduler | cosine with min |
Max Grad Norm | |
Dtype | bfloat16 |
🔼 This table details the hyperparameters used for pre-training the speech language model in the Slam recipe, excluding the Direct Preference Optimization (DPO) phase. It lists settings such as the base model used, data sources, training time, various regularization parameters, optimization algorithm, learning rate schedule, and data type. These specifications are crucial for replicating the experiment’s results.
Table 4: Slam (-DPO) Pre Training Recipe
Parameter | Value |
---|---|
Initial Model | Slam (-DPO) |
Data | SpokenSwag with auto-bleu smaller than 0.3 |
Train Time | hour steps |
RoPE theta | |
Context length | |
Per device Batch Size | |
Gradient Accumulation | |
Base Learning Rate | |
Optimizer | AdamW |
Learning Rate Scheduler | inverse sqrt |
Max Grad Norm | |
Dtype | bfloat16 |
DPO |
🔼 This table details the hyperparameters used during the Direct Preference Optimization (DPO) phase of the Slam training recipe. It includes settings for the initial model, dataset used for DPO training, training time allocated for DPO, RoPE theta (the base frequency of the Rotary Position Embedding, which affects how well the model handles long contexts), context length, per-device batch size, gradient accumulation, base learning rate, optimizer used, learning rate scheduler, maximum gradient norm, and data type. Understanding this table is crucial for replicating the Slam DPO training process and analyzing the impact of DPO on final model performance.
Table 5: Slam DPO Training Recipe
Model Name | Number of Params |
---|---|
MobileLLM-125M Liu et al. (2024b) | 106,492,608 |
MobileLLM-350M Liu et al. (2024b) | 315,117,120 |
OPT-125M Zhang et al. (2022) | 87,015,936 |
OPT-350M Zhang et al. (2022) | 305,714,176 |
QWEN2.5-0.5B Yang et al. (2024a) | 358,347,904 |
SmolLM2-135M Allal et al. (2025) | 106,492,608 |
SmolLM2-360M Allal et al. (2025) | 315,117,120 |
Pythia-160M Biderman et al. (2023) | 85,827,072 |
Pythia-410M Biderman et al. (2023) | 303,339,520 |
🔼 This table lists various large language models (LLMs) and their corresponding parameter counts. The original LLMs were adapted to use a vocabulary of only 500 speech units. This table is crucial because it demonstrates the model sizes used after adapting them for speech processing, which is key to understanding the experimental setup and results in the paper.
Table 6: Model names and parameter counts after changing vocabulary to speech only units (500).
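As a rough way to sanity-check these counts, one could load each public text LM checkpoint, shrink its vocabulary to 500 speech units, and count the remaining parameters; a minimal sketch with `transformers` follows (exact numbers may differ slightly depending on weight tying and library version).

```python
# Sketch: reproduce Table 6-style parameter counts after resizing the vocabulary
# to 500 speech units. Counts should roughly match the table, modulo weight tying.
from transformers import AutoModelForCausalLM

for name in ["facebook/opt-125m", "Qwen/Qwen2.5-0.5B"]:
    model = AutoModelForCausalLM.from_pretrained(name)
    model.resize_token_embeddings(500)  # speech-unit vocabulary
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params:,} parameters")
```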
Dataset | Number of Hours | Number of Tokens |
---|---|---|
Libri-Light Kahn et al. (2020) | ||
LibriSpeech Panayotov et al. (2015) | ||
SWC Baumann et al. (2018) | ||
Tedlium Hernandez et al. (2018) | ||
PeopleSpeech Galvez et al. (2021) | ||
VoxPopuli Wang et al. (2021b) | ||
sTinyStories |
🔼 Table 7 presents a detailed breakdown of the training datasets used in the study. It lists the name of each dataset, the number of hours of audio data it contains, and the total number of tokens extracted from it for training. This allows for a comprehensive understanding of the scale and composition of the data used to train the various models described in the paper. The datasets used are Libri-light, LibriSpeech, SWC, Tedlium, PeopleSpeech, VoxPopuli and sTinyStories.
Table 7: Dataset train set sizes that we use.
