🏢 Independent
Refusal in Language Models Is Mediated by a Single Direction
4093 words · 20 mins
AI Theory
Safety
Surprisingly, refusal in LLMs is mediated by a single, easily manipulated direction in the model’s activation space.