Mixture of Tokens: Continuous MoE through Cross-Example Aggregation
·1989 words·10 mins·
Natural Language Processing
Large Language Models
🏢 University of Warsaw
Mixture of Tokens (MoT) matches state-of-the-art MoE performance while training 3× faster than dense Transformers, by continuously mixing tokens across examples instead of routing each token to a discrete expert.
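To make the "continuous mixing" idea concrete, here is a minimal sketch of an MoT-style layer based on the description above, not the authors' code: tokens from different examples are grouped, each expert receives a softmax-weighted mixture of the group's tokens, processes it once, and the result is redistributed back to the individual tokens with the same weights. The class name, dimensions, and single-group interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixtureOfTokensSketch(nn.Module):
    """Illustrative MoT-style layer: continuous cross-example token mixing (not the official implementation)."""

    def __init__(self, d_model: int, n_experts: int, d_ff: int):
        super().__init__()
        self.controller = nn.Linear(d_model, n_experts)  # per-token mixing logits, one per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (group_size, d_model) -- one group of tokens drawn across examples
        weights = torch.softmax(self.controller(x), dim=0)   # (group_size, n_experts), normalized over the group
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mixed = (weights[:, e:e + 1] * x).sum(dim=0)      # one continuous mixture token per expert
            processed = expert(mixed)                          # a single expert forward pass per group
            out = out + weights[:, e:e + 1] * processed        # redistribute the result to the original tokens
        return out

# Usage sketch: a group of 32 tokens, model width 512, 4 experts
layer = MixtureOfTokensSketch(d_model=512, n_experts=4, d_ff=2048)
y = layer(torch.randn(32, 512))  # (32, 512)
```

Because every token contributes a soft weight rather than being routed to a single expert, the operation is fully differentiable and avoids the load-balancing and token-dropping issues of discrete MoE routing.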