
🏢 Independent

Refusal in Language Models Is Mediated by a Single Direction

26 September 2024·4093 words·20 mins

AI Theory Safety

Refusal in LLMs is, surprisingly, mediated by a single direction in the model’s activation space, and that direction is easy to manipulate.
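To make "a single direction in activation space" concrete, here is a minimal sketch using toy 3-dimensional vectors in plain Python. It assumes the direction is already known. The helper names (`ablate`, `add_direction`) and the numbers are hypothetical illustrations, not code or values from the paper. Removing the component of an activation along the direction, or adding more of it, are the two basic ways such a direction could be manipulated.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    r = unit(direction)
    coeff = dot(activation, r)  # how much of the direction is present
    return [a - coeff * ri for a, ri in zip(activation, r)]


def add_direction(activation, direction, scale):
    """Push `activation` further along `direction` by `scale` units."""
    r = unit(direction)
    return [a + scale * ri for a, ri in zip(activation, r)]


# Toy values, chosen only for illustration (assumption: direction is given).
toy_direction = [3.0, 0.0, 4.0]
toy_activation = [1.0, 2.0, 3.0]

ablated = ablate(toy_activation, toy_direction)
steered = add_direction(toy_activation, toy_direction, scale=2.0)

# After ablation the activation has no component left along the direction.
print(dot(ablated, unit(toy_direction)))
```

In a real model the same projection would be applied to high-dimensional activations rather than to a 3-vector; how and where that is done is not covered by this sketch.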