🏢 DLUT

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
5073 words · 24 mins
AI Generated · 🤗 Daily Papers · Multimodal Learning · Vision-Language Models · 🏢 DLUT
EVEv2.0: A novel encoder-free vision-language model outperforms existing approaches by using a divide-and-conquer architecture and a data-efficient training strategy, achieving strong vision-reasoning…