Group Robust Preference Optimization in Reward-free RLHF
2045 words · 10 mins
Natural Language Processing
Large Language Models
🏢 University College London (UCL)
Group Robust Preference Optimization (GRPO) enhances reward-free RLHF by aligning LLMs to diverse group preferences, maximizing worst-case group performance, and significantly improving fairness across groups.
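
The TL;DR describes a min-max idea: instead of fitting the average annotator, the policy is trained against the worst-off preference group. The sketch below illustrates one way such a group-robust, DPO-style objective could look in PyTorch; the function names, the per-group loss aggregation, and the exponentiated-gradient weight update are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch (assumed names and update rule): a group-robust,
# DPO-style objective that up-weights the worst-performing preference groups.
import torch
import torch.nn.functional as F


def dpo_loss(policy_logratio, ref_logratio, beta=0.1):
    """Per-pair DPO loss from log p(chosen|x) - log p(rejected|x) ratios."""
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio))


def group_robust_step(policy_logratios, ref_logratios, group_ids,
                      group_weights, num_groups, eta=0.1, beta=0.1):
    """One step of a worst-case (group-robust) preference objective.

    policy_logratios, ref_logratios: (batch,) log-prob ratios under the
        policy and the frozen reference model.
    group_ids: (batch,) long tensor assigning each preference pair to a group.
    group_weights: (num_groups,) simplex weights (the adversary's state).
    """
    per_pair = dpo_loss(policy_logratios, ref_logratios, beta)  # (batch,)

    # Average the preference loss within each annotator group.
    group_losses = []
    for g in range(num_groups):
        mask = group_ids == g
        group_losses.append(per_pair[mask].mean() if mask.any()
                            else per_pair.sum() * 0.0)
    group_losses = torch.stack(group_losses)  # (num_groups,)

    # Adversary: exponentiated-gradient (mirror ascent) step on the simplex,
    # shifting weight toward the groups with the highest loss.
    with torch.no_grad():
        new_weights = group_weights * torch.exp(eta * group_losses)
        new_weights = new_weights / new_weights.sum()

    # Policy minimizes the weighted loss, approximating the worst-case group.
    policy_loss = (new_weights * group_losses).sum()
    return policy_loss, new_weights


# Toy usage with random log-ratios and three groups.
torch.manual_seed(0)
w = torch.full((3,), 1.0 / 3)
loss, w = group_robust_step(torch.randn(8, requires_grad=True),
                            torch.randn(8), torch.randint(0, 3, (8,)), w, 3)
loss.backward()
```

Because the adversarial weights concentrate on groups with high preference loss, lowering the policy loss improves the worst-case group rather than just the average, which is what drives the fairness gains summarized above.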