🏢 Gray Swan AI

Improving Alignment and Robustness with Circuit Breakers

26 September 2024·2515 words·12 mins

AI Theory Safety 🏢 Gray Swan AI

AI systems are made safer by ‘circuit breakers’ that directly control harmful internal representations, significantly improving alignment and robustness against adversarial attacks with minimal impact…