
Interpretability

Learning to Understand: Identifying Interactions via the Möbius Transform
·2143 words·11 mins
AI Theory Interpretability 🏢 UC Berkeley
Unlocking complex models’ secrets: a new algorithm identifies input interactions using the Möbius Transform, boosting interpretability with surprising speed and accuracy.
Learning Discrete Concepts in Latent Hierarchical Models
·2302 words·11 mins
AI Theory Interpretability 🏢 Carnegie Mellon University
This paper introduces a novel framework for learning discrete concepts from high-dimensional data, establishing theoretical conditions for identifying underlying hierarchical causal structures and pro…
Interpretable Mesomorphic Networks for Tabular Data
·2985 words·15 mins
Machine Learning Interpretability 🏢 University of Freiburg
Interpretable Mesomorphic Neural Networks (IMNs) achieve accuracy comparable to black-box models while offering free-lunch explainability for tabular data through instance-specific linear models gener…
Interpretable Generalized Additive Models for Datasets with Missing Values
·2769 words·13 mins
Machine Learning Interpretability 🏢 Duke University
M-GAM: Interpretable additive models handling missing data with superior accuracy & sparsity!
Interpretable Concept-Based Memory Reasoning
·2660 words·13 mins
AI Theory Interpretability 🏢 KU Leuven
CMR: A novel Concept-Based Memory Reasoner delivers human-understandable, verifiable AI task predictions by using a neural selection mechanism over a set of human-understandable logic rules, achievin…
Improving Decision Sparsity
·4802 words·23 mins
AI Generated AI Theory Interpretability 🏢 Duke University
Boosting machine learning model interpretability, this paper introduces cluster-based and tree-based Sparse Explanation Values (SEV) for generating more meaningful and credible explanations by optimiz…
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
·6707 words·32 mins
AI Generated AI Theory Interpretability 🏢 Apollo Research
End-to-end sparse autoencoders revolutionize neural network interpretability by learning functionally important features, outperforming traditional methods in efficiency and accuracy.
GraphTrail: Translating GNN Predictions into Human-Interpretable Logical Rules
·2764 words·13 mins
AI Theory Interpretability 🏢 IIT Delhi
GRAPHTRAIL unveils the first end-to-end global GNN explainer, translating black-box GNN predictions into easily interpretable boolean formulas over subgraph concepts, achieving significant improvement…
Finding Transformer Circuits With Edge Pruning
·2284 words·11 mins
Interpretability 🏢 Princeton University
Edge Pruning uses gradient-based pruning of edges between model components to efficiently discover sparse yet accurate computational subgraphs (circuits) in large language models, advancing mechanistic interpretability research.
Explanations that reveal all through the definition of encoding
·1891 words·9 mins
AI Theory Interpretability 🏢 New York University
New method STRIPE-X powerfully detects 'encoding' in AI explanations, a sneaky phenomenon where explanations predict outcomes better than their constituent parts alone would suggest.
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
·4142 words·20 mins
AI Generated AI Theory Interpretability 🏢 UC Berkeley
The chess engine Leela Chess Zero surprisingly uses learned look-ahead, internally representing future optimal moves, which significantly improves its strategic decision-making.
Dual-Perspective Activation: Efficient Channel Denoising via Joint Forward-Backward Criterion for Artificial Neural Networks
·1941 words·10 mins
AI Theory Interpretability 🏢 Zhejiang University
Dual-Perspective Activation (DPA) efficiently denoises ANN channels by jointly using forward and backward propagation criteria, improving sparsity and accuracy.
Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers
·2710 words·13 mins
AI Generated AI Theory Interpretability 🏢 Harvard University
Researchers dissected attention paths in Transformers using statistical mechanics, revealing a task-relevant kernel combination mechanism boosting generalization performance.
Denoising Diffusion Path: Attribution Noise Reduction with An Auxiliary Diffusion Model
·2911 words·14 mins
AI Generated AI Theory Interpretability 🏢 School of Computer Science, Fudan University
Denoising Diffusion Path (DDPath) uses diffusion models to dramatically reduce noise in attribution methods for deep neural networks, leading to clearer explanations and improved quantitative results.
Data-faithful Feature Attribution: Mitigating Unobservable Confounders via Instrumental Variables
·1976 words·10 mins
AI Theory Interpretability 🏢 Zhejiang University
Data-faithful feature attribution tackles misinterpretations from unobservable confounders by using instrumental variables to train confounder-free models, leading to more robust and accurate feature …
Compact Proofs of Model Performance via Mechanistic Interpretability
·4006 words·19 mins
AI Theory Interpretability 🏢 MIT
Researchers developed a novel method using mechanistic interpretability to create compact formal proofs for AI model performance, improving AI safety and reliability.
Causal Dependence Plots
·2526 words·12 mins
AI Theory Interpretability 🏢 London School of Economics
Causal Dependence Plots (CDPs) visualize how machine learning model predictions causally depend on input features, overcoming limitations of existing methods that ignore causal relationships.
Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?
·3015 words·15 mins
AI Theory Interpretability 🏢 ETH Zurich
This paper presents a novel method to make black box neural networks intervenable using only a small validation set with concept labels, improving the effectiveness of concept-based interventions.
B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable
·2514 words·12 mins
AI Theory Interpretability 🏢 Max Planck Institute for Informatics
B-cosification: cheaply transform any pre-trained deep neural network into an inherently interpretable model.
Auditing Local Explanations is Hard
·1271 words·6 mins
AI Theory Interpretability 🏢 University of Tübingen and Tübingen AI Center
Auditing local explanations is surprisingly hard: proving explanation trustworthiness requires far more data than previously thought, especially in high dimensions, challenging current AI explainabil…