
Interpretability

Learning to Understand: Identifying Interactions via the Möbius Transform
·2143 words·11 mins
AI Theory Interpretability 🏢 UC Berkeley
Unlocking complex models’ secrets: a new algorithm identifies input interactions using the Möbius Transform, boosting interpretability with surprising speed and accuracy.
Learning Discrete Concepts in Latent Hierarchical Models
·2302 words·11 mins
AI Theory Interpretability 🏢 Carnegie Mellon University
This paper introduces a novel framework for learning discrete concepts from high-dimensional data, establishing theoretical conditions for identifying underlying hierarchical causal structures and pro…
Interpretable Mesomorphic Networks for Tabular Data
·2985 words·15 mins
Machine Learning Interpretability 🏢 University of Freiburg
Interpretable Mesomorphic Neural Networks (IMNs) achieve accuracy comparable to black-box models while offering free-lunch explainability for tabular data through instance-specific linear models gener…
Interpretable Generalized Additive Models for Datasets with Missing Values
·2769 words·13 mins
Machine Learning Interpretability 🏢 Duke University
M-GAM: Interpretable additive models handling missing data with superior accuracy & sparsity!
Interpretable Concept-Based Memory Reasoning
·2660 words·13 mins
AI Theory Interpretability 🏢 KU Leuven
CMR: A novel Concept-Based Memory Reasoner delivers human-understandable, verifiable AI task predictions by using a neural selection mechanism over a set of human-understandable logic rules, achievin…
Improving Decision Sparsity
·4802 words·23 mins
AI Generated AI Theory Interpretability 🏢 Duke University
Boosting machine learning model interpretability, this paper introduces cluster-based and tree-based Sparse Explanation Values (SEV) for generating more meaningful and credible explanations by optimiz…
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
·6707 words·32 mins
AI Generated AI Theory Interpretability 🏢 Apollo Research
End-to-end sparse autoencoders revolutionize neural network interpretability by learning functionally important features, outperforming traditional methods in efficiency and accuracy.
GraphTrail: Translating GNN Predictions into Human-Interpretable Logical Rules
·2764 words·13 mins
AI Theory Interpretability 🏢 IIT Delhi
GRAPHTRAIL unveils the first end-to-end global GNN explainer, translating black-box GNN predictions into easily interpretable boolean formulas over subgraph concepts, achieving significant improvement…
Finding Transformer Circuits With Edge Pruning
·2284 words·11 mins
Interpretability 🏢 Princeton University
Edge Pruning uses gradient-based pruning of edges between model components to efficiently discover sparse yet accurate computational subgraphs (circuits) in large language models, advancing mechanistic interpretability research.
Explanations that reveal all through the definition of encoding
·1891 words·9 mins
AI Theory Interpretability 🏢 New York University
New method STRIPE-X powerfully detects 'encoding' in AI explanations, a sneaky phenomenon where explanations predict outcomes better than their constituent parts alone would suggest.
Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
·4142 words·20 mins
AI Generated AI Theory Interpretability 🏢 UC Berkeley
The chess engine Leela Chess Zero surprisingly uses learned look-ahead, internally representing future optimal moves, which significantly improves its strategic decision-making.
Dual-Perspective Activation: Efficient Channel Denoising via Joint Forward-Backward Criterion for Artificial Neural Networks
·1941 words·10 mins
AI Theory Interpretability 🏢 Zhejiang University
Dual-Perspective Activation (DPA) efficiently denoises ANN channels by jointly using forward and backward propagation criteria, improving sparsity and accuracy.
Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers
·2710 words·13 mins
AI Generated AI Theory Interpretability 🏢 Harvard University
Researchers dissected attention paths in Transformers using statistical mechanics, revealing a task-relevant kernel combination mechanism boosting generalization performance.
Denoising Diffusion Path: Attribution Noise Reduction with An Auxiliary Diffusion Model
·2911 words·14 mins
AI Generated AI Theory Interpretability 🏢 School of Computer Science, Fudan University
Denoising Diffusion Path (DDPath) uses diffusion models to dramatically reduce noise in attribution methods for deep neural networks, leading to clearer explanations and improved quantitative results.
Data-faithful Feature Attribution: Mitigating Unobservable Confounders via Instrumental Variables
·1976 words·10 mins
AI Theory Interpretability 🏢 Zhejiang University
Data-faithful feature attribution tackles misinterpretations from unobservable confounders by using instrumental variables to train confounder-free models, leading to more robust and accurate feature …
Compact Proofs of Model Performance via Mechanistic Interpretability
·4006 words·19 mins
AI Theory Interpretability 🏢 MIT
Researchers developed a novel method using mechanistic interpretability to create compact formal proofs for AI model performance, improving AI safety and reliability.
Causal Dependence Plots
·2526 words·12 mins
AI Theory Interpretability 🏢 London School of Economics
Causal Dependence Plots (CDPs) visualize how machine learning model predictions causally depend on input features, overcoming limitations of existing methods that ignore causal relationships.
Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?
·3015 words·15 mins
AI Theory Interpretability 🏢 ETH Zurich
This paper presents a novel method to make black box neural networks intervenable using only a small validation set with concept labels, improving the effectiveness of concept-based interventions.
B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable
·2514 words·12 mins
AI Theory Interpretability 🏢 Max Planck Institute for Informatics
B-cosification: cheaply transform any pre-trained deep neural network into an inherently interpretable model.
Auditing Local Explanations is Hard
·1271 words·6 mins
AI Theory Interpretability 🏢 University of Tübingen and Tübingen AI Center
Auditing local explanations is surprisingly hard: proving explanation trustworthiness requires far more data than previously thought, especially in high dimensions, challenging current AI explainabil…