
Theoretical Analysis of Weak-to-Strong Generalization
·1703 words·8 mins
AI Theory Generalization 🏢 MIT CSAIL
Strong student models can learn from weaker teachers, even correcting errors and generalizing beyond the teacher’s expertise. This paper provides new theoretical bounds explaining this ‘weak-to-strong…
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
·2727 words·13 mins
Natural Language Processing Large Language Models 🏢 MIT CSAIL
Cross-Layer Attention (CLA) shrinks the Transformer Key-Value cache by 2x, improving LLMs’ memory efficiency without accuracy loss.
In-Context Symmetries: Self-Supervised Learning through Contextual World Models
·3570 words·17 mins
Computer Vision Self-Supervised Learning 🏢 MIT CSAIL
CONTEXTSSL: a novel self-supervised learning algorithm that uses context to adapt to task-specific symmetries, achieving significant performance gains over existing methods.
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
·2643 words·13 mins
AI Applications Robotics 🏢 MIT CSAIL
Diffusion Forcing merges next-token prediction and full-sequence diffusion for superior sequence generation.
A Theoretical Understanding of Self-Correction through In-context Alignment
·1997 words·10 mins
Natural Language Processing Large Language Models 🏢 MIT CSAIL
LLMs improve through self-correction, but the mechanisms are unclear. This paper provides a theoretical framework and empirical evidence demonstrating that self-correction arises from in-context align…