Mixture of Tokens: Continuous MoE through Cross-Example Aggregation
·1989 words·10 mins·
Natural Language Processing
Large Language Models
🏢 University of Warsaw
Mixture of Tokens (MoT) matches state-of-the-art MoE performance while training 3× faster than dense Transformers, by continuously mixing tokens across examples instead of routing each token to a discrete expert.
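To make the "continuous mixing" idea concrete, here is a minimal sketch of an MoT-style layer based on the description above, not the authors' code: tokens from different examples are grouped, each expert receives a softmax-weighted mixture of the group's tokens, processes it once, and the result is redistributed back to the individual tokens with the same weights. The class name, dimensions, and single-group interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfTokensSketch(nn.Module):
    """Illustrative MoT-style layer: continuous cross-example token mixing (not the official implementation)."""

    def __init__(self, d_model: int, n_experts: int, d_ff: int):
        super().__init__()
        self.controller = nn.Linear(d_model, n_experts)  # per-token mixing logits, one per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (group_size, d_model) -- one group of tokens drawn across examples
        weights = torch.softmax(self.controller(x), dim=0)   # (group_size, n_experts), normalized over the group
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mixed = (weights[:, e:e + 1] * x).sum(dim=0)      # one continuous mixture token per expert
            processed = expert(mixed)                          # a single expert forward pass per group
            out = out + weights[:, e:e + 1] * processed        # redistribute the result to the original tokens
        return out

# Usage sketch: a group of 32 tokens, model width 512, 4 experts
layer = MixtureOfTokensSketch(d_model=512, n_experts=4, d_ff=2048)
y = layer(torch.randn(32, 512))  # (32, 512)
```

Because every token contributes a soft weight rather than being routed to a single expert, the operation is fully differentiable and avoids the load-balancing and token-dropping issues of discrete MoE routing.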