🏢 MIT CSAIL
Theoretical Analysis of Weak-to-Strong Generalization
·1703 words·8 mins·
AI Theory
Generalization
🏢 MIT CSAIL
Strong student models can learn from weaker teachers, even correcting errors and generalizing beyond the teacher’s expertise. This paper provides new theoretical bounds explaining this ‘weak-to-strong generalization’ phenomenon.
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
·2727 words·13 mins·
Natural Language Processing
Large Language Models
🏢 MIT CSAIL
Cross-Layer Attention (CLA) shrinks the Transformer key-value cache by 2×, improving LLMs’ memory efficiency without accuracy loss.
In-Context Symmetries: Self-Supervised Learning through Contextual World Models
·3570 words·17 mins·
Computer Vision
Self-Supervised Learning
🏢 MIT CSAIL
CONTEXTSSL: A novel self-supervised learning algorithm that adapts to task-specific symmetries by using context, achieving significant performance gains over existing methods.
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
·2643 words·13 mins·
AI Applications
Robotics
🏢 MIT CSAIL
Diffusion Forcing merges next-token prediction and full-sequence diffusion for superior sequence generation.
A Theoretical Understanding of Self-Correction through In-context Alignment
·1997 words·10 mins·
Natural Language Processing
Large Language Models
🏢 MIT CSAIL
LLMs improve through self-correction, but the mechanisms are unclear. This paper provides a theoretical framework and empirical evidence demonstrating that self-correction arises from in-context alignment.