🏢 University of Oxford
Can sparse autoencoders be used to decompose and interpret steering vectors?
·2017 words·10 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 University of Oxford
Sparse autoencoders fail to accurately decompose and interpret steering vectors due to distribution mismatch and the inability to handle negative feature projections; this paper identifies these issue…
Ablation is Not Enough to Emulate DPO: How Neuron Dynamics Drive Toxicity Reduction
·2573 words·13 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 University of Oxford
Contrary to common belief, toxicity reduction in language models isn’t simply achieved by dampening toxic neurons; it’s a complex balancing act across multiple neuron groups.