Group Robust Preference Optimization in Reward-free RLHF
2045 words · 10 mins
Natural Language Processing
Large Language Models
🏢 University College London (UCL)
Group Robust Preference Optimization (GRPO) enhances reward-free RLHF by aligning LLMs to diverse group preferences, maximizing worst-case group performance, and significantly improving fairness across groups.
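
The TL;DR describes a min-max idea: instead of fitting the average annotator, the policy is trained against the worst-off preference group. The sketch below illustrates one way such a group-robust, DPO-style objective could look in PyTorch; the function names, the per-group loss aggregation, and the exponentiated-gradient weight update are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch (assumed names and update rule): a group-robust,
# DPO-style objective that up-weights the worst-performing preference groups.
import torch
import torch.nn.functional as F


def dpo_loss(policy_logratio, ref_logratio, beta=0.1):
    """Per-pair DPO loss from log p(chosen|x) - log p(rejected|x) ratios."""
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio))


def group_robust_step(policy_logratios, ref_logratios, group_ids,
                      group_weights, num_groups, eta=0.1, beta=0.1):
    """One step of a worst-case (group-robust) preference objective.

    policy_logratios, ref_logratios: (batch,) log-prob ratios under the
        policy and the frozen reference model.
    group_ids: (batch,) long tensor assigning each preference pair to a group.
    group_weights: (num_groups,) simplex weights (the adversary's state).
    """
    per_pair = dpo_loss(policy_logratios, ref_logratios, beta)  # (batch,)

    # Average the preference loss within each annotator group.
    group_losses = []
    for g in range(num_groups):
        mask = group_ids == g
        group_losses.append(per_pair[mask].mean() if mask.any()
                            else per_pair.sum() * 0.0)
    group_losses = torch.stack(group_losses)  # (num_groups,)

    # Adversary: exponentiated-gradient (mirror ascent) step on the simplex,
    # shifting weight toward the groups with the highest loss.
    with torch.no_grad():
        new_weights = group_weights * torch.exp(eta * group_losses)
        new_weights = new_weights / new_weights.sum()

    # Policy minimizes the weighted loss, approximating the worst-case group.
    policy_loss = (new_weights * group_losses).sum()
    return policy_loss, new_weights


# Toy usage with random log-ratios and three groups.
torch.manual_seed(0)
w = torch.full((3,), 1.0 / 3)
loss, w = group_robust_step(torch.randn(8, requires_grad=True),
                            torch.randn(8), torch.randint(0, 3, (8,)), w, 3)
loss.backward()
```

Because the adversarial weights concentrate on groups with high preference loss, lowering the policy loss improves the worst-case group rather than just the average, which is what drives the fairness gains summarized above.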