Skip to main content

🏢 University of Oxford

Can sparse autoencoders be used to decompose and interpret steering vectors?
·2017 words·10 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 University of Oxford
Sparse autoencoders fail to accurately decompose and interpret steering vectors due to distribution mismatch and the inability to handle negative feature projections; this paper identifies these issue…
Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction
·2573 words·13 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 University of Oxford
Contrary to common belief, toxicity reduction in language models isn’t simply achieved by dampening toxic neurons; it’s a complex balancing act across multiple neuron groups.