TL;DR#
Existing Large Reasoning Models often struggle with open-ended, knowledge-intensive tasks because they rely solely on internal knowledge, which leads to inaccuracies. To address this, the paper proposes R1-Searcher, a two-stage outcome-based Reinforcement Learning (RL) approach designed to enhance the search capabilities of Large Language Models (LLMs), allowing them to autonomously invoke external search systems and access additional knowledge during the reasoning process.
The framework uses a Retrieve-Reward to incentivize retrieval operations in the first stage and an Answer-Reward in the second stage to encourage the model to use the retrieved information to answer correctly. R1-Searcher significantly outperforms previous methods, surpassing even the closed-source GPT-4o-mini, with strong results on the HotpotQA and 2WikiMultiHopQA datasets. It also generalizes well to the out-of-domain Bamboogle dataset, outperforming Search-o1.
Key Takeaways#
Why does it matter?#
This paper introduces a novel RL framework to enhance LLMs’ search capabilities, addressing limitations in knowledge-intensive tasks. The significant performance gains over existing methods, including closed-source models, highlight the potential of RL for improving RAG systems and opening new research avenues.
Visual Insights#
🔼 Figure 1 presents a bar chart comparing the performance of R1-Searcher against several other methods across four distinct multi-hop question answering (QA) benchmark datasets: HotpotQA, 2WikiMultiHopQA, Bamboogle, and Musique. The chart displays the performance score achieved by each method on each dataset, and R1-Searcher consistently outperforms the other models, demonstrating state-of-the-art results on all four benchmarks.
Figure 1: Performance comparisons between R1-Searcher and other methods on four multi-hop QA benchmarks. R1-Searcher achieves state-of-the-art performance on each dataset.
| Stage | Dataset | Easy | Medium | Difficult |
|---|---|---|---|---|
| Stage-1 | HotpotQA | - | 200 | - |
| Stage-1 | 2WikiMultiHopQA | - | 150 | - |
| Stage-2 | HotpotQA | - | 2561 | 2000 |
| Stage-2 | 2WikiMultiHopQA | - | 1087 | 2500 |
🔼 This table presents the dataset used for training the R1-Searcher model. It shows the breakdown of the data by stage (Stage-1 and Stage-2), dataset (HotpotQA and 2WikiMultiHopQA), and difficulty level (Easy, Medium, Difficult). The number of samples for each category in each stage is detailed, providing a complete overview of the training data distribution across difficulty levels and datasets.
Table 1: The information of the data used during RL training.
In-depth insights#
RL for Search#
Reinforcement Learning (RL) for search presents a promising avenue for enhancing the capabilities of Large Language Models (LLMs), particularly in scenarios demanding external knowledge integration. This approach addresses the limitations of LLMs that rely primarily on internal knowledge. By incentivizing LLMs to autonomously engage external search systems, RL offers a pathway to overcome inaccuracies and hallucinations, which is crucial for time-sensitive and knowledge-intensive tasks. The key is to train LLMs to use the search tool effectively and to integrate it into their reasoning. By carefully designing the reward function, the model can be progressively guided to invoke the external retrieval system and to incorporate the retrieved documents into its reasoning, as sketched below.
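To make the overall loop concrete, below is a minimal sketch (in Python) of outcome-based RL with a search-capable policy: sample several rollouts per question, score each completed trajectory with a single outcome reward, and update the policy on that scalar signal. The callables `sample_rollouts`, `outcome_reward`, and `policy_update` are assumed placeholders for the components discussed in the following subsections, not the authors' implementation.

```python
from typing import Callable, Iterable, List, Tuple

def train_outcome_rl(
    sample_rollouts: Callable[[str, int], List[str]],
    outcome_reward: Callable[[str, str], float],
    policy_update: Callable[[List[str], List[float]], None],
    train_batches: Iterable[List[Tuple[str, str]]],
    group_size: int = 8,
) -> None:
    """Outcome-based RL skeleton for a search-augmented policy.

    The three callables are assumed interfaces: sample_rollouts(question, n)
    returns n complete reasoning traces (which may pause to call retrieval),
    outcome_reward(trace, gold) scores only the finished trajectory, and
    policy_update(traces, rewards) applies a policy-gradient step
    (e.g., GRPO or Reinforce++).
    """
    for batch in train_batches:          # each batch is a list of (question, gold) pairs
        traces: List[str] = []
        rewards: List[float] = []
        for question, gold in batch:
            # Sampling a group of rollouts per question provides a natural
            # baseline for group-relative policy-gradient updates.
            group = sample_rollouts(question, group_size)
            traces.extend(group)
            rewards.extend(outcome_reward(t, gold) for t in group)
        policy_update(traces, rewards)   # one update step on the scored group
```

Because only the final outcome is scored, no per-step process reward model is needed, which matches the outcome-supervised setup described above.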
R1-Searcher#
R1-Searcher is presented as a novel two-stage outcome-based reinforcement learning (RL) approach designed to enhance search capabilities in Large Language Models (LLMs). It addresses the challenge of LLMs relying heavily on internal knowledge, which can lead to inaccuracies, especially in time-sensitive or knowledge-intensive tasks. The framework enables LLMs to autonomously invoke external search systems to access additional knowledge during reasoning. A key feature is its reliance exclusively on RL, eliminating the need for process rewards or distillation for a cold start, potentially streamlining the training process and improving efficiency. The two-stage approach likely involves an initial phase focused on learning the mechanics of external search integration, followed by a refinement stage that optimizes the use of retrieved information for accurate problem-solving; this represents a significant step towards more deliberative reasoning in LLMs. The method significantly outperforms previous strong RAG methods.
Two-Stage RL#
Two-stage RL is a crucial strategy for complex problem-solving. It allows for a phased approach, where initial stages focus on exploration and learning basic skills, and subsequent stages refine these skills towards a specific goal. This mimics human learning, starting with broad understanding and moving towards specialized expertise. The first stage could involve pre-training, imitation learning, or unsupervised exploration to establish a foundational policy. The second stage involves fine-tuning with reinforcement learning, guiding the policy towards optimal performance. This approach helps prevent premature convergence to suboptimal policies and improves sample efficiency. The design of the reward function is critical, since it determines which behaviors the model is rewarded for and therefore learns to take. Moreover, the reward can differ across stages, so that each phase optimizes a distinct objective, for example rewarding retrieval invocation first and answer correctness later, as sketched below.
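As a concrete illustration, the sketch below implements one plausible version of such a staged outcome reward, assuming Stage-1 scores only the output format and whether retrieval was invoked, while Stage-2 swaps the retrieval bonus for answer correctness (token-level F1 against the gold answer). The tag checks and reward magnitudes are illustrative assumptions rather than the paper's exact implementation.

```python
import re
from collections import Counter

def format_reward(response: str) -> float:
    """Reward well-formed outputs: a final <answer> block and balanced query
    tags (an illustrative check, not the paper's exact formatting rule)."""
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.DOTALL))
    balanced = response.count("<|begin_of_query|>") == response.count("<|end_of_query|>")
    return 0.5 if (has_answer and balanced) else -0.5

def retrieval_reward(response: str) -> float:
    """Stage-1 incentive: credit the model simply for invoking the retriever."""
    return 0.5 if response.count("<|begin_of_query|>") >= 1 else 0.0

def answer_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the predicted and gold answers."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def outcome_reward(response: str, gold: str, stage: int) -> float:
    """Stage-1 encourages retrieval plus correct format; Stage-2 replaces the
    retrieval bonus with an answer reward computed on the final prediction."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    prediction = match.group(1).strip() if match else ""
    if stage == 1:
        return format_reward(response) + retrieval_reward(response)
    return format_reward(response) + answer_f1(prediction, gold)
```

Switching the `stage` argument changes which behavior is reinforced, so early training can focus on learning to call the retriever at all, while later training optimizes for correct final answers.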
RAG-based Rollout#
RAG-based rollout aims to improve how language models use external knowledge. During generation, the model marks its search queries with special tags such as <begin_of_query> and <end_of_query>. When the model emits the <end_of_query> tag, decoding pauses so that the query can be extracted and sent to the retrieval system. The retrieved documents, wrapped in <begin_of_documents> and <end_of_documents> tags, are then inserted back into the context and incorporated into the model’s reasoning process. This makes retrieval a seamless part of the reasoning, enabling the model to make decisions based on retrieved knowledge without disrupting its flow.
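A minimal sketch of this rollout loop is shown below; `generate_until` and `search` are assumed callables standing in for the LLM decoding step and the retrieval system, and the tag strings follow the format shown in the case studies later in this summary.

```python
QUERY_BEGIN, QUERY_END = "<|begin_of_query|>", "<|end_of_query|>"
DOC_BEGIN, DOC_END = "<|begin_of_documents|>", "<|end_of_documents|>"

def rag_rollout(generate_until, search, prompt: str, max_turns: int = 8) -> str:
    """Roll out a reasoning trace, pausing whenever the model closes a query
    so retrieved documents can be spliced back into the context.

    generate_until(context, stop_strings) -> str continues decoding until one
    of the stop strings (or EOS) is produced; search(query) -> list[str]
    returns retrieved passages. Both are assumed interfaces.
    """
    context = prompt
    for _ in range(max_turns):
        # Decoding halts either at the end-of-query tag or at a normal stop.
        chunk = generate_until(context, stop_strings=[QUERY_END])
        context += chunk
        if not chunk.endswith(QUERY_END):
            break  # no pending query: the model has produced its final answer
        # Extract the most recent query and call the external search system.
        query = context.rsplit(QUERY_BEGIN, 1)[-1].removesuffix(QUERY_END).strip()
        docs = search(query)
        # Wrap the retrieved passages in document tags and resume generation,
        # so the next decoding step conditions on the new evidence.
        context += "\n" + DOC_BEGIN + "\n" + "\n".join(docs) + "\n" + DOC_END + "\n"
    return context
```

Capping the loop at `max_turns` guards against rollouts that keep issuing queries without ever producing a final answer.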
Online Search#
Online search represents a crucial frontier for enhancing LLMs, moving beyond static knowledge to real-time information access. Integrating LLMs with online search capabilities holds the potential to address knowledge-intensive tasks, time-sensitive queries, and personalized information needs more effectively. A key challenge lies in enabling LLMs to formulate effective search queries, understand the retrieved information, and synthesize it into coherent and accurate responses. Another challenge is to design training methodologies that allow LLMs to effectively leverage online search without overfitting or memorizing specific search results. There is also a need to ensure the ethical use of online search within LLMs, mitigating the risk of biased information and promoting reliable answers.
More visual insights#
More on figures
🔼 This figure compares the performance of two reinforcement learning algorithms, GRPO and Reinforce++, when training the Llama-3.1-8B-Instruct large language model. It shows the log of the reward, the length of the generated response, and the number of retrievals performed during training for each algorithm. This allows for a visual comparison of the training process and the impact of each algorithm on different aspects of the model’s behavior.
Figure 2: The log of reward, response length, and retrieval numbers for Llama-3.1-8B-Instruct comparison between using GRPO and Reinforce++.
🔼 This figure displays three graphs showing the training progress of the Qwen-2.5-7B-Base language model using different reward metrics for outcome-supervised reinforcement learning. Each graph plots a different aspect of the training process against the training step: the first shows the log of the reward received, the second shows the length of the model’s generated responses, and the third shows the number of retrievals performed. Different line colors represent different reward metrics: EM (Exact Match), CEM (Cover Exact Match), and F1 (F1-score). Analyzing these graphs allows one to see how the choice of reward metric impacts training efficiency and the model’s behavior.
Figure 3: The log of reward, response length, and retrieval numbers for the Qwen-2.5-7B-Base model utilizing different metrics for outcome-supervised reward calculation.
🔼 This figure displays three graphs showing the training process of the Qwen-2.5-7B-Base language model using reinforcement learning. The graphs track the log of reward, response length, and retrieval numbers over training steps. Two different training datasets are used: one with varying difficulty levels (w. Difficult), and another without the most difficult questions (w/o Difficult). The figure aims to illustrate how the difficulty of the training data influences the model’s learning behavior and its reliance on external search (retrieval).
Figure 4: The log of reward, response length, and retrieval numbers for the Qwen-2.5-7B-Base model, trained on datasets of varying difficulty levels.
🔼 This figure presents a comparison of training results for the Qwen-2.5-7B-Base language model using three different training datasets: HotpotQA, 2Wiki, and a mixture of both. It displays graphs showing the change in training reward, response length (number of tokens in the model’s output), and retrieval count (number of times external knowledge was accessed) over the course of training. This visualization allows for the analysis of how different data sources impact the training dynamics and the model’s performance. Differences in reward curves, response length, and retrieval counts across the three datasets indicate varying learning behavior and model efficiency under different conditions.
Figure 5: The log of reward, response length, and retrieval numbers for the Qwen-2.5-7B-Base model trained on different datasets.
🔼 Figure 6 presents a performance comparison on the Bamboogle dataset between models using local search (R1-Searcher with different LLMs as backbones, along with other baselines) and online search (R1-Searcher and Search-o1). The graph displays the performance achieved by each method on the Bamboogle dataset. This comparison highlights the generalization capability of R1-Searcher when adapting to different search environments (local vs. online). Search-o1, a strong baseline specifically designed for online search, is included to put R1-Searcher’s performance in context.
Figure 6: Performance comparison of our models that utilize local search and online search and the baselines on the Bamboogle dataset. Search-o1 utilizes online search, and all other baselines employ local search.
More on tables
| Models | Types | Methods | HotpotQA (ACC_R) | HotpotQA (ACC_L) | 2Wiki (ACC_R) | 2Wiki (ACC_L) | Bamboogle (ACC_R) | Bamboogle (ACC_L) | Musique (ACC_R) | Musique (ACC_L) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT | Zero-Shot | Naive Generation | 0.324 | 0.404 | 0.348 | 0.346 | 0.240 | 0.280 | 0.134 | 0.170 |
| | | Standard RAG | 0.342 | 0.450 | 0.344 | 0.292 | 0.272 | 0.328 | 0.172 | 0.188 |
| | Branching | SuRe | 0.270 | 0.380 | 0.244 | 0.264 | 0.168 | 0.208 | 0.128 | 0.146 |
| | | REPLUG | 0.350 | 0.428 | 0.296 | 0.254 | 0.224 | 0.256 | 0.132 | 0.138 |
| | Summary | LongLLMLingua | 0.358 | 0.450 | 0.324 | 0.316 | 0.248 | 0.288 | 0.150 | 0.172 |
| | | RECOMP | 0.332 | 0.398 | 0.298 | 0.306 | 0.136 | 0.176 | 0.118 | 0.134 |
| | | Selective-Context | 0.366 | 0.442 | 0.350 | 0.290 | 0.240 | 0.288 | 0.152 | 0.172 |
| | Adaptive | SKR | 0.360 | 0.454 | 0.364 | 0.314 | 0.248 | 0.288 | 0.162 | 0.174 |
| | RAG-CoT | Self-Ask | 0.392 | 0.462 | 0.336 | 0.478 | 0.336 | 0.416 | 0.260 | 0.270 |
| | | Iter-RetGen | 0.374 | 0.456 | 0.326 | 0.270 | 0.232 | 0.256 | 0.178 | 0.188 |
| | | IRCoT | 0.434 | 0.308 | 0.492 | 0.114 | 0.272 | 0.184 | 0.192 | 0.214 |
| | Test-Time | CR-Planner | 0.404 | 0.416 | 0.520 | 0.478 | 0.488 | 0.524 | 0.272 | 0.262 |
| | | ReARTeR | 0.468 | 0.506 | 0.554 | 0.534 | 0.496 | 0.544 | 0.296 | 0.302 |
| Llama | Zero-Shot | Naive Generation | 0.208 | 0.268 | 0.326 | 0.254 | 0.144 | 0.168 | 0.068 | 0.096 |
| | | Standard RAG | 0.334 | 0.398 | 0.336 | 0.212 | 0.168 | 0.216 | 0.104 | 0.098 |
| | Branching | SuRe | 0.266 | 0.346 | 0.122 | 0.262 | 0.160 | 0.192 | 0.106 | 0.144 |
| | | REPLUG | 0.290 | 0.348 | 0.334 | 0.204 | 0.168 | 0.232 | 0.078 | 0.090 |
| | Summary | LongLLMLingua | 0.314 | 0.382 | 0.304 | 0.294 | 0.168 | 0.216 | 0.088 | 0.100 |
| | | RECOMP | 0.318 | 0.380 | 0.324 | 0.322 | 0.104 | 0.160 | 0.112 | 0.126 |
| | | Selective-Context | 0.296 | 0.358 | 0.266 | 0.204 | 0.144 | 0.200 | 0.092 | 0.104 |
| | Adaptive | SKR | 0.300 | 0.372 | 0.336 | 0.212 | 0.176 | 0.208 | 0.100 | 0.112 |
| | RAG-CoT | Self-Ask | 0.316 | 0.408 | 0.306 | 0.322 | 0.360 | 0.432 | 0.222 | 0.226 |
| | | Iter-RetGen | 0.302 | 0.362 | 0.310 | 0.224 | 0.144 | 0.176 | 0.084 | 0.084 |
| | | IRCoT | 0.210 | 0.146 | 0.338 | 0.312 | 0.120 | 0.104 | 0.060 | 0.042 |
| | Test-Time | CR-Planner | 0.332 | 0.350 | 0.420 | 0.350 | 0.304 | 0.336 | 0.144 | 0.098 |
| | | ReARTeR | 0.424 | 0.434 | 0.470 | 0.364 | 0.438 | 0.484 | 0.244 | 0.252 |
| | Reasoning | Marco-o1 | 0.352 | 0.348 | 0.442 | 0.184 | 0.224 | 0.200 | 0.134 | 0.104 |
| | | Skywork-o1 | 0.306 | 0.256 | 0.344 | 0.190 | 0.176 | 0.160 | 0.092 | 0.060 |
| Llama | RL | R1-Searcher | 0.648 | 0.746 | 0.594 | 0.628 | 0.504 | 0.544 | 0.254 | 0.282 |
| Qwen | RL-Zero | R1-Searcher | 0.654 | 0.750 | 0.636 | 0.650 | 0.528 | 0.544 | 0.282 | 0.314 |
🔼 This table presents a performance comparison of the R1-Searcher model against various baselines on four distinct multi-hop question answering (QA) benchmarks. The results are shown in terms of accuracy metrics (Cover Exact Match (ACC_R) and LLM-as-Judge (ACC_L)). Baselines include zero-shot methods, retrieval-augmented generation (RAG) methods using different techniques, and test-time scaling methods. The table highlights the superior performance of R1-Searcher across all benchmarks, demonstrating the effectiveness of its approach in enhancing the search capabilities of large language models (LLMs). The abbreviations GPT, Qwen, and Llama refer to specific LLMs used in the baselines.
Table 2: Performance comparisons between R1-Searcher and the baselines on four multi-hop QA benchmarks. The boldface indicates the best performance. GPT, Qwen, and Llama are the abbreviations of GPT-4o-mini, Qwen-2.5-7B-Base, and Llama-3.1-8B-Instruct, respectively.
| Method | HotpotQA (EM) | HotpotQA (CEM) | HotpotQA (F1) | 2Wiki (EM) | 2Wiki (CEM) | 2Wiki (F1) | Bamboogle (EM) | Bamboogle (CEM) | Bamboogle (F1) | Avg (CEM) |
|---|---|---|---|---|---|---|---|---|---|---|
| GRPO | 53.0 | 60.5 | 68.6 | 58.0 | 60.5 | 63.0 | 48.0 | 56.0 | 60.5 | 59.0 |
| Reinforce++ | 58.4 | 64.8 | 70.6 | 57.5 | 61.5 | 62.9 | 44.0 | 50.4 | 57.1 | 58.9 |
🔼 This table presents a comparison of the performance achieved by two different reinforcement learning algorithms, GRPO and Reinforce++, when used to train the Llama-3.1-8B-Instruct language model on three distinct multi-hop question answering (QA) benchmark datasets. It evaluates the effectiveness of each algorithm in improving the model’s ability to answer complex, multi-step questions, showcasing differences in performance metrics across the chosen benchmarks.
Table 3: Performance comparison of Llama-3.1-8B-Instruct trained using GRPO and Reinforce++ on three multi-hop QA benchmarks.
| Method | HotpotQA (EM) | HotpotQA (CEM) | HotpotQA (F1) | 2Wiki (EM) | 2Wiki (CEM) | 2Wiki (F1) | Bamboogle (EM) | Bamboogle (CEM) | Bamboogle (F1) | Avg (CEM) |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-Base-RL | 58.0 | 65.4 | 71.9 | 55.4 | 63.6 | 63.7 | 45.6 | 52.8 | 57.7 | 60.6 |
| Qwen-Base-SFT | 37.0 | 49.5 | 51.3 | 42.5 | 54.5 | 51.3 | 40.8 | 46.4 | 51.0 | 50.1 |
| Llama-Instruct-RL | 58.4 | 64.8 | 70.6 | 55.0 | 59.4 | 61.2 | 44.0 | 50.4 | 57.1 | 58.2 |
| Llama-Instruct-SFT | 36.0 | 47.0 | 50.4 | 38.0 | 51.0 | 48.3 | 39.4 | 46.6 | 48.2 | 48.2 |
🔼 This table presents a performance comparison between models trained using Reinforcement Learning (RL) and Supervised Fine-tuning (SFT). It shows the results of both Qwen-2.5-7B-Base and Llama-3.1-8B-Instruct models on three multi-hop question answering (QA) benchmarks: HotpotQA, 2Wiki, and Bamboogle. The metrics used to evaluate performance are Exact Match (EM), Cover Exact Match (CEM), and F1 score. The table highlights the superior performance of RL-trained models, demonstrating better generalization and retrieval capabilities compared to SFT-trained models.
Table 4: Performance comparison of Qwen-2.5-7B-Base and Llama-3.1-8B-Instruct trained using RL and SFT on three multi-hop QA benchmarks. Qwen-Base and Llama-Instruct are the abbreviations of Qwen-2.5-7B-Base and Llama-3.1-8B-Instruct, respectively.
| Reward Metric | HotpotQA (EM) | HotpotQA (CEM) | HotpotQA (F1) | 2Wiki (EM) | 2Wiki (CEM) | 2Wiki (F1) | Bamboogle (EM) | Bamboogle (CEM) | Bamboogle (F1) | Avg (CEM) |
|---|---|---|---|---|---|---|---|---|---|---|
| EM | 55.0 | 62.0 | 69.3 | 29.0 | 29.0 | 30.0 | 24.8 | 28.0 | 33.2 | 39.7 |
| CEM | 53.4 | 65.0 | 68.8 | 51.8 | 59.2 | 61.7 | 46.4 | 54.4 | 59.0 | 59.5 |
| F1 | 58.0 | 65.4 | 71.9 | 55.4 | 63.6 | 63.7 | 45.6 | 52.8 | 57.7 | 60.6 |
🔼 This table presents a comparison of the performance of the Qwen-2.5-7B-Base language model trained using reinforcement learning with different reward metrics for outcome supervision. The reward metrics compared are Exact Match (EM), Cover Exact Match (CEM), and F1 score, which are common evaluation metrics in question answering tasks. The performance is evaluated on three multi-hop question answering (QA) benchmarks: HotpotQA, 2Wiki, and Bamboogle. The table shows the performance of the model in terms of EM, CEM, and F1 scores on each benchmark. This allows for an assessment of how different reward functions influence the model’s ability to generate accurate answers and the effect they have on various metrics.
Table 5: Performance comparison of the Qwen-2.5-7B-Base model utilizing different metrics for outcome-supervised reward calculation on three multi-hop QA benchmarks.
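For reference, exact match (EM) requires the normalized prediction to equal the gold answer, cover exact match (CEM) only requires the gold answer to appear somewhere inside the prediction, and F1 measures token-level overlap (as in the reward sketch earlier). The normalization below follows a common QA convention and is an assumption rather than the paper's exact preprocessing.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (a common QA normalization; the paper's preprocessing may differ)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """EM: the normalized prediction must equal the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))

def cover_exact_match(prediction: str, gold: str) -> float:
    """CEM: it suffices for the normalized gold answer to appear inside the
    normalized prediction, which tolerates more verbose final answers."""
    return float(normalize(gold) in normalize(prediction))
```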
| Training Data | HotpotQA (EM) | HotpotQA (CEM) | HotpotQA (F1) | 2Wiki (EM) | 2Wiki (CEM) | 2Wiki (F1) | Bamboogle (EM) | Bamboogle (CEM) | Bamboogle (F1) | Avg (CEM) |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o Difficult | 54.8 | 61.8 | 69.3 | 55.4 | 63.6 | 63.7 | 44.8 | 51.2 | 56.9 | 58.8 |
| w. Difficult | 58.0 | 65.4 | 71.9 | 54.8 | 64.2 | 63.8 | 45.6 | 52.8 | 57.7 | 60.8 |
🔼 This table presents a performance comparison of the Qwen-2.5-7B-Base language model trained on datasets with varying difficulty levels. It assesses the model’s performance on three multi-hop question answering benchmarks (HotpotQA, 2Wiki, and Bamboogle). The comparison focuses on evaluating the impact of training data difficulty (easy, medium, hard questions) on the model’s ability to answer complex questions across different datasets.
Table 6: Performance comparison of the Qwen-2.5-7B-Base model trained on datasets of different difficulty levels on three multi-hop QA benchmarks.
| Training Dataset | HotpotQA (EM) | HotpotQA (CEM) | HotpotQA (F1) | 2Wiki (EM) | 2Wiki (CEM) | 2Wiki (F1) | Bamboogle (EM) | Bamboogle (CEM) | Bamboogle (F1) | Avg (CEM) |
|---|---|---|---|---|---|---|---|---|---|---|
| HotpotQA | 53.8 | 59.2 | 67.2 | 46.7 | 54.3 | 54.7 | 44.0 | 50.4 | 55.1 | 54.6 |
| 2Wiki | 46.0 | 50.5 | 58.7 | 45.0 | 47.5 | 48.2 | 31.2 | 32.8 | 39.4 | 43.6 |
| Mixture | 58.0 | 65.4 | 71.9 | 55.4 | 63.6 | 63.7 | 45.6 | 52.8 | 57.7 | 60.6 |
🔼 This table presents a performance comparison of the Qwen-2.5-7B-Base language model trained using reinforcement learning (RL) on different datasets. It shows the impact of data diversity and difficulty on the model’s performance across three multi-hop question answering benchmarks: HotpotQA, 2WikiMultiHopQA, and Bamboogle. The table compares the model’s performance using different training dataset configurations: trained on HotpotQA only, 2WikiMultiHopQA only, and a mixture of both datasets. The performance metrics used are Exact Match (EM), Cover Exact Match (CEM), and F1 score, providing a comprehensive evaluation of the model’s accuracy and ability to retrieve relevant information.
Table 7: Performance comparison of the Qwen-2.5-7B-Base model trained on different datasets.
Question: When did the last king from Britain’s House of Hanover die?

Answer: 20 June 1837

Response (RL) (True): To solve this question, I need to find out when the last king from Britain’s House of Hanover died. Step 1: Identify the name of the last king from the House of Hanover. Step 2: Determine the date of the last king’s death. **Step 1: Identify the name of the last king from the House of Hanover.** I’ll use a search query to find out: <|begin_of_query|> last king from Britain’s House of Hanover<|end_of_query|> <|begin_of_documents|> …
<|end_of_documents|> From the search results, I can see that the last British monarch who reigned in Hanover was William IV. **Step 2: Determine the date of the last king’s death.** I’ll use a search query to find out when William IV died: <|begin_of_query|> William IV of Hanover death date<|end_of_query|> <|begin_of_documents|> …
<|end_of_documents|> From the search results, I can see that William IV died on 20 June 1837. Therefore, the last king from Britain’s House of Hanover died on 20 June 1837.</think> <answer> 20 June 1837 </answer>

Response (SFT) (False): To answer this question, I need to recall the British monarchs who belonged to the House of Hanover and identify the last one. The House of Hanover includes kings like George I, George II, and George III. I need to find out when George III, the last Hanoverian king, passed away. George III ruled from 1760 to 1820. After his death, the throne passed to his son, George IV, who was from the House of Hanover’s successor, the House of Saxe-Coburg and Gotha. I will verify the date of George III’s death to provide the correct answer. <|begin_of_query|> George III death date<|end_of_query|> <|begin_of_documents|> …
<|end_of_documents|> George III died on 29 January 1820. This confirms that he was the last king from Britain’s House of Hanover, and there was no Hanoverian king after him. The answer is 29 January 1820. </think> <answer> 29 January 1820 </answer>
🔼 This table presents a case study using the Bamboogle dataset to compare the performance of two Qwen-2.5-7B-Instruct models trained using reinforcement learning (RL) and supervised fine-tuning (SFT). It highlights the differences in their reasoning processes and ability to answer complex, multi-hop questions requiring external information retrieval. The example question, answer, and the reasoning processes of both models are provided, illustrating how the RL-trained model effectively decomposes the question into subproblems, retrieves relevant information, and arrives at the correct answer, while the SFT-trained model struggles with similar processes, relying more heavily on internal knowledge and potentially resulting in an incorrect answer.
Table 8: Case study from the Bamboogle dataset comparing Qwen-2.5-7B-Instruct models trained with RL and SFT.
Question: Who was president of the United States in the year that Citibank was founded?

Answer: james madison

Response (Qwen-2.5-7B-Base) (True): To solve this question, I need to find out who the president of the United States was in the year that Citibank was founded and then identify that president. Step 1: Determine the year Citibank was founded. I’ll use a search query to find out: <|begin_of_query|> Citibank founding year<|end_of_query|> <|begin_of_documents|> …
<|end_of_documents|> From the search results, I can see that Citibank was founded in 1812. Step 2: Identify the president of the United States in 1812. I’ll use a search query to find out: <|begin_of_query|> president of the United States in 1812<|end_of_query|> <|begin_of_documents|> …
<|end_of_documents|> The search results mention that James Madison was the President of the United States in 1812. Therefore, the president of the United States in the year that Citibank was founded was James Madison.</think> <answer> James Madison </answer>

Response (Llama-3.1-8B-Instruct) (True): Given the information, I don’t have specific knowledge about the president of the United States in the year Citibank was founded. I will perform a search to find the required information. The query to search for is "<|begin_of_query|> president of united states citibank founded<|end_of_query|> <|begin_of_documents|> …
<|end_of_documents|> Based on the search results, I found that Citibank was founded in 1812, and I couldn’t find specific information about the president of the United States in that year. I will perform another search to find the required information. The query to search for is "<|begin_of_query|> president united states 1812<|end_of_query|> <|begin_of_documents|> …
<|end_of_documents|> Based on the search results, I found that the president of the United States in 1812 was James Madison.</think> <answer>James Madison</answer>
🔼 This table presents a comparative case study of two large language models (LLMs), Qwen-2.5-7B-Instruct and Llama-3.1-8B-Instruct, both trained using reinforcement learning (RL). The models are evaluated on their ability to answer a complex question from the Bamboogle dataset. The table shows the different reasoning processes employed by each model, along with the final answers, highlighting the differences in their approaches to problem-solving and demonstrating how RL affects their ability to decompose complex tasks, access external information through search, and arrive at a correct solution.
Table 9: Case study from the Bamboogle dataset comparing Qwen-2.5-7B-Instruct and Llama-3.1-8B-Instruct models trained with RL.
















