
Interpretability

I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

24 March 2025·3290 words·16 mins

AI Generated 🤗 Daily Papers AI Theory Interpretability 🏢 AIRI

Sparse autoencoders are used to decode reasoning in LLMs, revealing key features that enhance performance when steered. First mechanistic account of reasoning in LLMs!

Mixture of Experts Made Intrinsically Interpretable

5 March 2025·3052 words·15 mins

AI Generated 🤗 Daily Papers AI Theory Interpretability 🏢 University of Oxford

MoE-X: An intrinsically interpretable Mixture-of-Experts language model that uses sparse, wide networks to enhance transparency.

LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context Memory of Transformers

20 February 2025·1710 words·9 mins

AI Generated 🤗 Daily Papers AI Theory Interpretability 🏢 AIRI

LLMs use punctuation and other seemingly trivial tokens for context memory, and these tokens contribute surprisingly to performance.