Skip to main content

🏢 Independent / FAR Labs

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
·2257 words·11 mins· loading · loading
AI Generated Machine Learning Reinforcement Learning 🏢 Independent / FAR Labs
RLHF’s KL regularization fails to prevent ‘catastrophic Goodhart’—policies achieving high proxy reward but low actual utility—when reward errors have heavy tails.