🏢 Key Laboratory of Intelligent Information Processing

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
5398 words · 26 mins
AI Generated · 🤗 Daily Papers · Multimodal Learning · Vision-Language Models
LLaVA-Mini achieves performance comparable to state-of-the-art LMMs while using only one vision token, drastically reducing computational cost and latency.