🏢 Key Laboratory of Intelligent Information Processing

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

7 January 2025·5398 words·26 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models

LLaVA-Mini achieves comparable performance to state-of-the-art LMMs using only one vision token, drastically reducing computational cost and latency.