🏢 Key Laboratory of Intelligent Information Processing
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
·5398 words·26 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
LLaVA-Mini achieves performance comparable to state-of-the-art LMMs while using only one vision token, drastically reducing computational cost and latency.