TL;DR#
Large language models (LLMs) typically require massive computational resources and datasets, which limits accessibility in resource-constrained settings. This motivates investigating whether reinforcement learning (RL) can improve reasoning in smaller LLMs. The study focuses on a 1.5B-parameter model trained under strict constraints: 4 NVIDIA A40 GPUs within 24 hours. It addresses the question of whether the reasoning ability of such a small LLM can be elevated using an RL-based approach.
The paper adapts the Group Relative Policy Optimization (GRPO) algorithm and curates a compact, high-quality mathematical reasoning dataset, enabling three experiments that explore model behavior and performance. The findings show substantial reasoning gains using only 7,000 samples at a low training cost. The code and datasets are released as open-source resources. The resulting Open-RS models are competitive with much larger baselines; for example, Open-RS3 achieves the highest AIME24 score (46.7), surpassing o1-preview (44.6).
Key Takeaways#
Why does it matter?#
This research offers a cost-effective alternative to large-scale LLM training. It provides insights into the trade-offs between efficiency and stability and lays the foundation for scalable, reasoning-capable LLMs in resource-limited environments. By releasing its code and datasets as open-source resources, it fosters reproducibility and further exploration by the research community.
Visual Insights#
Model | AIME24 | MATH-500 | AMC23 | Minerva | OlympiadBench | Avg. |
---|---|---|---|---|---|---|
**General Models** | | | | | | |
Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
o1-preview | 44.6 | 85.5 | – | – | – | – |
**7B Models** | | | | | | |
Qwen-2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
rStar-Math-7B | 26.7 | 78.4 | 47.5 | – | 47.1 | – |
Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | **39.7** | 43.3 | 50.9 |
**1.5B Models** | | | | | | |
DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
Still-3-1.5B-Preview | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
DeepScaleR-1.5B-Preview | 43.1 | **87.8** | 73.6 | 30.2 | 50.0 | **57.0** |
**Our Models** | | | | | | |
Open-RS1 (100 steps) | 30.0 | 83.8 | 70.0 | 29.0 | **52.4** | 53.0 |
Open-RS2 (50 steps) | 30.0 | 85.4 | **80.0** | 30.5 | **52.4** | 55.7 |
Open-RS3 (50 steps) | **46.7** | 84.4 | 72.5 | 26.8 | 51.3 | 56.3 |
🔼 This table presents a comparison of zero-shot performance across multiple mathematics reasoning benchmarks. Zero-shot means the models were not given any examples before evaluation. Pass@1 refers to the percentage of problems solved correctly on the first try. The table compares the performance of several different large language models (LLMs), including the authors’ models (Open-RS1, Open-RS2, and Open-RS3) and various baselines. The best result for each benchmark is highlighted in bold. Note that some official scores were unavailable for some baselines, indicated by dashes. The table indicates the source of the scores used for comparison.
Table 1: Zero-shot pass@1 performance across benchmarks. Bold indicates the highest score per benchmark. Dashes (–) denote unavailable official scores. Scores for o1-preview are sourced from AI (2024b); others from Zeng et al. (2025); Luo et al. (2025). Our models are evaluated using the lighteval package Fourrier et al. (2023).
In-depth insights#
RL’s Efficiency#
RL’s efficiency in this context centers on achieving substantial reasoning gains with minimal resources. The paper demonstrates that even under strict computational constraints, small LLMs can exhibit rapid improvement. This contrasts with traditional approaches that demand extensive datasets and computational power. The authors highlight the importance of a compact, high-quality dataset curated for mathematical reasoning, enabling efficient learning. Furthermore, the adaptation of the GRPO algorithm, designed to reduce computational overhead by eliminating the need for a separate critic model, contributes to resource efficiency. However, the study also reveals challenges, such as optimization instability and length constraints, that emerge with prolonged training, suggesting a trade-off between efficiency and sustained performance.
GRPO Adaptation#
Adapting GRPO for smaller LLMs involves key considerations. Resource constraints necessitate a balance between exploration and exploitation, with careful tuning of hyperparameters like clipping range and KL penalty. GRPO’s inherent efficiency, eliminating the need for a separate critic model, makes it suitable, but the group size ‘G’ must be optimized to ensure sufficient baseline estimation without excessive computational overhead. Reward design is critical; a rule-based system balancing correctness, efficiency, and structural clarity is preferable to resource-intensive neural reward models. Additionally, techniques like reward shaping and curriculum learning could further enhance the adaptation process for better optimization.
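To make the group-relative idea concrete, here is a minimal sketch of how GRPO-style advantages can be computed from a group of G sampled completions using a simple rule-based reward, with no learned critic. Per Table 5 the paper combines a format reward with a cosine-scaled accuracy reward; the function below is a simplified stand-in, and all names are illustrative rather than taken from the authors' code.

```python
import re
import numpy as np

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward: correctness dominates, format adds a small bonus."""
    # Format bonus: the completion presents its final answer inside \boxed{...}.
    boxed = re.search(r"\\boxed\{([^}]*)\}", completion)
    format_bonus = 0.5 if boxed is not None else 0.0
    # Accuracy term: exact string match against the reference answer.
    predicted = boxed.group(1).strip() if boxed else ""
    accuracy = 2.0 if predicted == gold_answer.strip() else 0.0
    return accuracy + format_bonus

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within the group of G samples,
    so the group mean replaces a separate critic as the baseline."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: G = 6 completions sampled for one prompt whose answer is 42.
completions = ["... \\boxed{42}", "... \\boxed{41}", "no box", "... \\boxed{42}",
               "... \\boxed{7}", "... \\boxed{42}"]
rewards = [rule_based_reward(c, "42") for c in completions]
advantages = group_relative_advantages(rewards)  # positive for better-than-average samples
```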
Data’s Impact#
While the paper does not explicitly have a heading called “Data's Impact,” one can infer the significance of data throughout the study. High-quality data is crucial for effective training, especially in resource-constrained scenarios. The study shows that smaller, well-curated datasets tailored to mathematical reasoning can be surprisingly effective in improving the performance of small LLMs. It highlights that mixing ‘easy’ and ‘hard’ examples can improve learning dynamics. However, data alone is not sufficient; there are trade-offs with model size and optimization strategies. Length constraints during training also influence the reasoning abilities of LLMs, implying that careful consideration is needed to balance data quantity, quality, and training methodology. The cost-effectiveness of the RL-based fine-tuning suggests that, with thoughtfully designed data curation, one can achieve comparable, and in some cases superior, results to larger models trained with far more extensive resources.
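As a toy illustration of such curation, the snippet below mixes hypothetical ‘easy’ and ‘hard’ pools into one shuffled training set with the Hugging Face `datasets` library; the example problems and field names are placeholders, not the paper's actual data.

```python
from datasets import Dataset, concatenate_datasets  # Hugging Face `datasets` library

# Hypothetical pools of curated problems; the real dataset and its fields differ.
easy = Dataset.from_dict({"problem": ["2 + 2 = ?"], "answer": ["4"]})
hard = Dataset.from_dict({"problem": ["How many primes are below 100?"], "answer": ["25"]})

# Mix easy and hard items into one compact training set and shuffle,
# so each RL batch sees a range of difficulties.
mixed = concatenate_datasets([easy, hard]).shuffle(seed=42)
```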
Length Limits#
Length limits present a multifaceted challenge in training language models, particularly smaller ones trained via reinforcement learning. Constrained generation length, as observed, can prematurely truncate reasoning, hindering performance on complex tasks that need extended chains of thought. Balancing generation length is crucial: too short, and solutions are cut off; excessively long, and training can become unstable and drift into irrelevant content. The optimal length must align with task complexity, and length constraints likely interact with data difficulty; simpler datasets may tolerate shorter generations, while harder ones demand expanded limits. A cosine reward is a promising strategy for regulating completion lengths; however, extended length limits are necessary for extremely hard tasks, particularly with multilingual base models. Future solutions might involve multi-stage length schedules that adjust the generation budget dynamically. Further research should explore balancing solution completeness with the risk of instability during prolonged generation.
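To make the length-regulation idea concrete, here is a minimal sketch of a cosine-shaped length reward, assuming illustrative reward ranges rather than the paper's exact constants: short correct solutions earn the most, and incorrect solutions are penalized less harshly as they use more of the length budget.

```python
import math

def cosine_length_reward(is_correct: bool, gen_len: int, max_len: int = 3584,
                         r_correct: tuple[float, float] = (2.0, 1.0),
                         r_wrong: tuple[float, float] = (-1.0, -0.5)) -> float:
    """Cosine interpolation between a reward at length 0 and a reward at max_len.

    With these placeholder constants, short correct solutions score highest,
    while long (possibly truncated) incorrect solutions are penalized least.
    """
    r_at_zero, r_at_max = r_correct if is_correct else r_wrong
    t = min(gen_len, max_len) / max_len              # fraction of the length budget used
    cos_term = 0.5 * (1.0 + math.cos(math.pi * t))   # 1 at length 0, 0 at max_len
    return r_at_max + (r_at_zero - r_at_max) * cos_term

# Example: a correct 800-token solution vs. a wrong 3500-token one.
short_correct = cosine_length_reward(True, 800)    # ~1.88
long_wrong = cosine_length_reward(False, 3500)     # ~-0.50
```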
Scaling Small LLMs#
Scaling small LLMs presents a resource-efficient alternative to large models. The paper investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs under strict constraints, aiming to balance performance gains against limited computational budgets. A critical aspect is data curation: high-quality datasets tailored to mathematical reasoning can minimize training costs. RL-based fine-tuning lets the models refine their decision-making by optimizing for task-specific rewards. Overcoming challenges such as data efficiency, optimization stability, and length constraints is crucial. The study demonstrates that small LLMs can achieve competitive reasoning performance; a key finding is that minimal resources can significantly enhance reasoning, promoting the democratization of advanced AI. The emphasis is on scalable, reasoning-capable models.
More visual insights#
More on tables
| | rStar-Math-7B | Eurus-2-7B-PRIME | Qwen2.5-7B-SimpleRL | Open-RS |
| --- | --- | --- | --- | --- |
| Base Model | Qwen2.5-Math-7B | Qwen2.5-Math-7B | Qwen2.5-Math-7B | DeepSeek-R1-Distill-Qwen-1.5B |
| SFT Data | 7.3M | 230k | 0 | 0 |
| RM Data | 7k | 0 | 0 | 0 |
| RM | None | Eurus-2-7B-SFT | None | None |
| RL Data | 3.647M ×16 | 150k ×4 | 8k ×8 | 7k ×6 |
| Hardware | 10× 8 H100 80GB, 15× 4 A100 40GB | 1× 8 A100 80GB | 4× 6 A100 80GB | 1× 4 A40 48GB |
| Time | – | 72h | 36h | 24h |
| Cost Est. | – | $1088 | $1633 | $42 |
🔼 This table compares the resources and costs required to train several 7-billion-parameter baselines against the authors’ Open-RS model. It highlights the differences in the data used (SFT data, reward-model data, and RL data), the training hardware, the training time, and the estimated cost. Figures are taken either directly from the papers introducing each model or from publicly available GitHub issues in which the authors discuss training specifics and resource constraints.
Table 2: Comparison of data usage and training costs for 7B models. Data are sourced from original papers or GitHub issues addressing the authors’ resource constraints.
| | DeepScaleR-1.5B-Preview | Still-3-1.5B-Preview | Open-RS |
| --- | --- | --- | --- |
| Base Model | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B | DeepSeek-R1-Distill-Qwen-1.5B |
| SFT Data | 0 | 0 | 0 |
| RM Data | 0 | 0 | 0 |
| RM | None | None | None |
| RL Data | 40k ×16 | 30k ×8 | 7k ×6 |
| Hardware | 8× A100 80GB | 1× 8 A100 80GB | 1× 4 A40 48GB |
| Time | 240h | 150h | 24h |
| Cost Est. | $3629 | $2268 | $42 |
🔼 This table compares the resource requirements (data usage and training costs) for training three different 1.5-billion parameter language models. It highlights the significant differences in computational cost and dataset size between existing models and the model proposed in this paper (Open-RS). The comparison is made to demonstrate the efficiency gains achieved by the proposed method, especially considering the constraints of limited resources.
Table 3: Comparison of data usage and training costs for 1.5B models. Data are sourced from original papers or GitHub issues addressing the authors’ resource constraints.
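For intuition on the $42 estimate, here is a back-of-the-envelope check; the per-GPU-hour rate is an assumption for illustration, not a figure from the paper, and actual cloud pricing varies by provider.

```python
gpus = 4                  # NVIDIA A40 48GB, from Table 3
hours = 24                # wall-clock training time, from Table 3
usd_per_gpu_hour = 0.44   # assumed A40 rental rate (illustrative)
print(f"~${gpus * hours * usd_per_gpu_hour:.0f}")  # ~$42
```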
Dataset | # Samples |
---|---|
AIME24 | 30 |
MATH-500 | 500 |
AMC23 | 40 |
Minerva | 272 |
OlympiadBench | 675 |
🔼 This table lists the benchmark datasets used to evaluate the models’ reasoning capabilities, giving each dataset’s name and number of samples. These datasets span a diverse range of mathematical reasoning challenges, drawn from competitions and academic sources and varying in difficulty and problem type, providing a comprehensive evaluation of model performance across different problem characteristics.
Table 4: Benchmark Datasets and Sample Sizes for Evaluation
Parameter | Value |
---|---|
General Settings | |
bf16 | true |
use_vllm | true |
vllm_device | auto |
vllm_enforce_eager | true |
vllm_gpu_memory_utilization | 0.7 |
vllm_max_model_len | 4608 |
do_eval | false |
output_dir | data/OpenRS-GRPO |
overwrite_output_dir | true |
Training Configuration | |
gradient_accumulation_steps | 4 |
gradient_checkpointing | true |
gradient_checkpointing_kwargs | use_reentrant: false |
learning_rate | 1.0e-06 |
lr_scheduler_type | cosine_with_min_lr |
lr_scheduler_kwargs | min_lr_rate: 0.1 |
warmup_ratio | 0.1 |
max_steps | 500 |
num_train_epochs | 1 |
per_device_train_batch_size | 6 |
per_device_eval_batch_size | 6 |
Generation Settings | |
max_prompt_length | 512 |
max_completion_length | 3584 or 4096 |
num_generations | 6 |
temperature | 0.7 |
seed | 42 |
Logging and Saving | |
log_completions | true |
log_level | info |
logging_first_step | true |
logging_steps | 1 |
logging_strategy | steps |
save_strategy | steps |
save_steps | 50 |
report_to | wandb |
Reward Configuration | |
reward_funcs | format, accuracy (cosine) |
reward_weights | 1.0, 2.0 |
Hub Settings | |
hub_model_id | OpenRS-GRPO |
hub_strategy | every_save |
push_to_hub | true |
🔼 This table details the hyperparameters used to configure the Group Relative Policy Optimization (GRPO) trainer during the reinforcement learning process. It’s divided into sections for general settings (overall training parameters), training configuration (optimizer specifics), generation settings (parameters controlling text generation), logging and saving settings (frequency of logging and model checkpoints), and reward configuration (weighting of different reward components). Each parameter’s value is specified, offering a clear view of the training setup.
Table 5: Hyperparameter Setups for GRPO Trainer
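These hyperparameters map naturally onto a TRL-style GRPO training setup. The sketch below shows how a subset of the values from Table 5 might be wired up with `trl`’s `GRPOConfig`/`GRPOTrainer`; exact field availability and reward-function plumbing depend on the `trl` version, the dataset name and `format_reward` are placeholders, and only the base-model identifier comes from the paper’s tables.

```python
# A minimal sketch, assuming a recent `trl` release that ships GRPOConfig/GRPOTrainer.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Placeholder reward: 1.0 if the completion contains a \boxed{...} answer, else 0.0.
    # (The paper combines a format reward with a cosine-scaled accuracy reward.)
    return [1.0 if "\\boxed{" in c else 0.0 for c in completions]

config = GRPOConfig(
    output_dir="data/OpenRS-GRPO",
    bf16=True,
    use_vllm=True,                      # generation served by vLLM
    learning_rate=1.0e-6,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    warmup_ratio=0.1,
    max_steps=500,
    per_device_train_batch_size=6,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    max_prompt_length=512,
    max_completion_length=3584,
    num_generations=6,                  # group size G for GRPO
    temperature=0.7,
    logging_steps=1,
    save_steps=50,
    report_to="wandb",
    seed=42,
)

# Hypothetical dataset id; GRPOTrainer expects a "prompt" column.
train_dataset = load_dataset("your-org/compact-math-7k", split="train")

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    args=config,
    reward_funcs=[format_reward],
    train_dataset=train_dataset,
)
trainer.train()
```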