Mixture of Tokens: Continuous MoE through Cross-Example Aggregation
·1989 words·10 mins·
Natural Language Processing · Large Language Models · 🏢 University of Warsaw
Mixture of Tokens (MoT) achieves 3x faster LLM training than dense Transformers and matches state-of-the-art MoE performance via continuous token mixing.
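
The one-line summary compresses the core mechanism: instead of routing each token to a single expert, MoT softly mixes a group of tokens drawn from different examples and feeds the mixture to every expert. The sketch below is an illustrative reconstruction of that idea under stated assumptions, not the authors' implementation; the module name, the group-by-position layout, and the weight shapes are choices made here for readability.

```python
# Minimal sketch of continuous token mixing (illustrative, not the paper's code).
# Tokens from different examples are grouped per position, softly aggregated into
# one mixed token per expert, processed, and redistributed with the same weights.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfTokens(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int, group_size: int):
        super().__init__()
        self.group_size = group_size
        # controller produces one mixing weight per expert for each token
        self.controller = nn.Linear(d_model, n_experts)
        # per-expert feed-forward weights, kept as plain parameters for the sketch
        self.experts_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * d_model**-0.5)
        self.experts_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * d_ff**-0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); group tokens across examples at each position
        b, s, d = x.shape
        g = self.group_size
        assert b % g == 0, "batch must be divisible by the group size"
        groups = x.view(b // g, g, s, d).transpose(1, 2)    # (n_groups, seq, g, d)
        weights = F.softmax(self.controller(groups), dim=2) # mix over the g tokens
        # continuous aggregation: one mixed token per (group, position, expert)
        mixed = torch.einsum("nsge,nsgd->nsed", weights, groups)
        hidden = F.relu(torch.einsum("nsed,edf->nsef", mixed, self.experts_in))
        out = torch.einsum("nsef,efd->nsed", hidden, self.experts_out)
        # redistribute each expert's output back to the tokens with the same weights
        y = torch.einsum("nsge,nsed->nsgd", weights, out)
        return y.transpose(1, 2).reshape(b, s, d)
```

Because every token contributes a continuous weight to every expert, the layer is fully differentiable and avoids the discrete routing (and load-balancing losses) of standard MoE, which is what the "continuous MoE" framing in the title refers to.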