
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
· 3104 words · 15 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Université Paris-Saclay
Boosting RL fine-tuning efficiency in LLMs: a simple modification to the KL penalty prioritizes exploration on critical tokens, dramatically improving model performance on arithmetic tasks.
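To make the idea concrete, here is a minimal, hypothetical sketch (not the paper's released code) of a per-token KL penalty that is relaxed on "critical" tokens so the policy can explore them more freely. The function name, the entropy-based criterion for criticality, and the `critical_frac` parameter are all illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def selective_kl_penalty(policy_logits, ref_logits, kl_coef=0.1, critical_frac=0.2):
    """Per-token KL penalty, zeroed out on 'critical' tokens.

    policy_logits, ref_logits: (batch, seq_len, vocab) tensors.
    critical_frac: fraction of tokens (highest policy entropy) treated as critical.
    The entropy-based criterion and all names here are illustrative assumptions.
    """
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(ref_logits, dim=-1)
    # Token-level KL(policy || reference), shape (batch, seq_len).
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1)

    # Treat the highest-entropy positions as critical and drop their KL penalty,
    # so exploration on those tokens is not discouraged.
    entropy = -(log_p.exp() * log_p).sum(dim=-1)
    k = max(1, int(critical_frac * entropy.shape[-1]))
    thresh = entropy.topk(k, dim=-1).values[..., -1:]
    critical = entropy >= thresh

    return torch.where(critical, torch.zeros_like(kl), kl_coef * kl)

# Usage: subtract the returned penalty from per-token rewards before the
# policy-gradient update, as in standard RLHF-style objectives.
```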