
Group Robust Preference Optimization in Reward-free RLHF
· 2045 words · 10 mins
Natural Language Processing · Large Language Models · 🏢 University College London (UCL)
Group Robust Preference Optimization (GRPO) enhances reward-free RLHF by aligning LLMs to diverse group preferences, maximizing worst-case group performance, and significantly improving fairness across groups.
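
To make the "worst-case performance" idea concrete, here is a rough sketch of a group-robust, reward-free alignment objective. It is not taken from the summary above and the paper's exact formulation may differ; it assumes a standard per-group DPO-style loss with preference data \(\mathcal{D}_g\) for each group \(g\), a reference policy \(\pi_{\mathrm{ref}}\), and a temperature \(\beta\):

$$
\min_{\theta} \; \max_{g \in \mathcal{G}} \;
\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}_g}
\left[
 -\log \sigma\!\left(
 \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
 - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
 \right)
\right]
$$

Under this reading, instead of minimizing the average preference loss over all data, the policy is optimized against the group with the highest loss, which is what drives the worst-case (and hence fairness) improvements described above.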