
Vision-Language Models

DiffCLIP: Differential Attention Meets CLIP
·2247 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 KAUST
DiffCLIP: Enhancing CLIP models by integrating differential attention, achieving superior performance with minimal overhead.
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
·5020 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 National University of Singapore
VLMs often disproportionately trust text over visual data, leading to performance drops and safety concerns.
Visual-RFT: Visual Reinforcement Fine-Tuning
·3386 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
Visual-RFT: Enhance LVLMs’ visual reasoning via reinforcement learning with verifiable rewards, achieving strong performance with limited data.
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
·3130 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Microsoft
Phi-4-Mini: compact yet powerful multimodal language models built via a Mixture-of-LoRAs.
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
·3091 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Kuaishou Technology
HAIC improves MLLMs’ human action understanding and generation with high-quality video captions and a new benchmark.
UniTok: A Unified Tokenizer for Visual Generation and Understanding
·3043 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Hong Kong
UniTok: A unified tokenizer bridging the visual generation and understanding gap via multi-codebook quantization, achieving SOTA in MLLMs.
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration
·4130 words·20 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Beijing Jiaotong University
Mobile-Agent-V: Automating mobile tasks using video guidance for efficient, scalable operation, outperforming existing frameworks by 30%.
Evaluating Multimodal Generative AI with Korean Educational Standards
·2108 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NAVER Cloud AI
KoNET: Evaluating multimodal generative AI in Korean using educational standards.
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
·4915 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Google DeepMind
SigLIP 2: Multilingual Vision-Language Encoders with Semantic Understanding, Localization, and Dense Features.
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
·4251 words·20 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Pennsylvania
CoSyn: Code-guided synthetic data for scaling text-rich image understanding, achieving SOTA via targeted multimodal data generation!
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO
·402 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Menlo Research
AlphaMaze enhances LLMs’ spatial intelligence via GRPO, achieving 93% accuracy in maze navigation and showing emergent reasoning.
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
·5226 words·25 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Sydney
RealSyn: A new, scalable multimodal dataset revolutionizes vision-language learning by effectively using interleaved image-text documents.
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
·2594 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
mmMamba: a novel framework creates linear-complexity multimodal models via distillation, drastically improving efficiency without sacrificing performance.
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
·2102 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
HermesFlow seamlessly bridges the understanding-generation gap in MLLMs using a novel Pair-DPO framework and self-play optimization on homologous data, achieving significant performance improvements.
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
·4310 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
HealthGPT: A novel medical vision-language model unifying comprehension and generation via heterogeneous knowledge adaptation, achieving superior performance on various medical tasks.
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
·2430 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Cambridge
ZeroBench: a new visual reasoning benchmark that proves impossible for current large multimodal models, pushing the boundaries of AI visual understanding.
Exploring the Potential of Encoder-free Architectures in 3D LMMs
·3414 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Northwestern Polytechnical University
Encoder-free 3D LMMs rival state-of-the-art encoder-based approaches, achieving results comparable to significantly larger models.
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
·5073 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 DLUT
EVEv2.0: A novel encoder-free vision-language model outperforms existing approaches by using a divide-and-conquer architecture and a data-efficient training strategy, achieving strong vision-reasoning capability.
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
·3420 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
Show-o Turbo dramatically speeds up multimodal understanding and generation by leveraging parallel decoding and consistency distillation, achieving significant performance gains with fewer sampling steps.
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
·5172 words·25 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NVIDIA Research
QLIP: A new visual tokenizer unifying autoregressive multimodal understanding & generation with state-of-the-art reconstruction and zero-shot performance!