↓Skip to main content

🏢 Independent / FAR Labs

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

26 September 2024·2257 words·11 mins· loading · loading

AI Generated Machine Learning Reinforcement Learning 🏢 Independent / FAR Labs

RLHF’s KL regularization fails to prevent ‘catastrophic Goodhart’—policies achieving high proxy reward but low actual utility—when reward errors have heavy tails.