Refusal in Language Models Is Mediated by a Single Direction
4093 words · 20 mins
AI Theory · Safety · 🏢 Independent
Surprisingly, refusal in LLMs is mediated by a single direction in the model’s activation space, one that is easy to manipulate.