Vision-Language Models
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
·2102 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Ola: a novel 7B-parameter omni-modal language model achieving state-of-the-art performance across image, video, and audio tasks via a progressive modality alignment training strategy.
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
·4880 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Rutgers University
VISTA steers LVLMs away from hallucinations by adjusting token rankings during inference, improving visual grounding and semantic coherence.
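To make the re-ranking idea concrete, here is a minimal Python sketch of inference-time token steering: next-token candidates are re-scored by blending the language model's logits with a per-token visual-grounding signal. The `visual_scores` input and the `alpha` blend weight are illustrative assumptions, not necessarily VISTA's exact formulation.

```python
import numpy as np

def steer_next_token(logits: np.ndarray, visual_scores: np.ndarray, alpha: float = 0.5) -> int:
    """Re-rank next-token candidates by blending language-model logits
    with a visual-grounding score per vocabulary entry.

    `visual_scores` is a hypothetical stand-in for however visual evidence
    is measured (e.g. similarity between token embeddings and image features).
    """
    # Normalize both signals so the blend is scale-free.
    lm = (logits - logits.mean()) / (logits.std() + 1e-6)
    vis = (visual_scores - visual_scores.mean()) / (visual_scores.std() + 1e-6)
    # Tokens supported by the image move up the ranking; ungrounded
    # tokens move down, suppressing hallucinated content.
    steered = lm + alpha * vis
    return int(steered.argmax())

# Toy usage: 5-token vocabulary where token 3 is visually supported.
logits = np.array([2.0, 1.5, 0.3, 1.8, -1.0])
visual = np.array([0.1, 0.0, 0.2, 0.9, 0.0])
print(steer_next_token(logits, visual))  # token 3 overtakes token 0
```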
Baichuan-Omni-1.5 Technical Report
·3756 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Baichuan Inc.
Baichuan-Omni-1.5: An open-source omni-modal LLM achieving SOTA performance across multiple modalities.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
·4124 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 DAMO Academy, Alibaba Group
VideoLLaMA3: Vision-centric training yields state-of-the-art image & video understanding!
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
·4361 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
FilmAgent: a multi-agent framework that automates end-to-end virtual film production using LLMs, exceeding single-agent performance through a collaborative workflow.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
·2690 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-Reward: A novel multi-modal reward model boosting Large Vision Language Model performance.
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
·3786 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
New multimodal safety test suite (MSTS) reveals vision-language models’ vulnerabilities and underscores the unique challenges of multimodal inputs.
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
·3561 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
Multimodal LLMs can now evaluate art aesthetics with human-level accuracy using a novel dataset (MM-StyleBench) and prompting method (ArtCoT), significantly improving AI alignment in artistic evaluation.
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
·4505 words·22 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Parameter-Inverted Image Pyramid Networks (PIIP) drastically cut visual model computing costs without sacrificing accuracy by using smaller models for higher-resolution images and larger models for lower-resolution ones.
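A minimal PyTorch sketch of the parameter-inverted idea: the high-resolution input is handled by the smallest branch and the low-resolution input by the largest, with the multi-scale features fused at the end. Branch widths, depths, and the fusion scheme here are made-up stand-ins, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidBranch(nn.Module):
    """One branch: a stack of strided conv blocks standing in for a ViT of a given size."""
    def __init__(self, width: int, depth: int, out_dim: int):
        super().__init__()
        layers, c = [], 3
        for _ in range(depth):
            layers += [nn.Conv2d(c, width, 3, stride=2, padding=1), nn.GELU()]
            c = width
        self.body = nn.Sequential(*layers)
        self.head = nn.Conv2d(width, out_dim, 1)

    def forward(self, x):
        return self.head(self.body(x))

class ParameterInvertedPyramid(nn.Module):
    """Illustrative parameter-inverted pyramid: the HIGH-resolution input goes
    through the SMALLEST branch and the LOW-resolution input through the
    LARGEST one, keeping compute roughly balanced across scales."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.small = PyramidBranch(width=32, depth=2, out_dim=out_dim)    # high res
        self.medium = PyramidBranch(width=64, depth=3, out_dim=out_dim)   # mid res
        self.large = PyramidBranch(width=128, depth=4, out_dim=out_dim)   # low res

    def forward(self, image):
        high = image                                                  # e.g. 448x448
        mid = F.interpolate(image, scale_factor=0.5, mode="bilinear")
        low = F.interpolate(image, scale_factor=0.25, mode="bilinear")
        feats = [self.small(high), self.medium(mid), self.large(low)]
        # Upsample everything to the finest feature map and fuse by summation.
        target = feats[0].shape[-2:]
        return sum(F.interpolate(f, size=target, mode="bilinear") for f in feats)

model = ParameterInvertedPyramid()
print(model(torch.randn(1, 3, 448, 448)).shape)  # torch.Size([1, 256, 112, 112])
```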
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
·22812 words·108 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Würzburg
Centurio: a 100-language LVLM achieves state-of-the-art multilingual performance by strategically incorporating non-English data in training, showing that multilingualism doesn't hinder English proficiency.
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
·2599 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
InfiGUIAgent, a novel multimodal GUI agent, leverages a two-stage training pipeline to achieve advanced reasoning and GUI interaction capabilities, outperforming existing models in benchmarks.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
·4541 words·22 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
Sa2VA marries SAM2 and LLaVA for dense grounded image and video understanding, achieving state-of-the-art results on multiple benchmarks.
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
·5398 words·26 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Key Laboratory of Intelligent Information Processing
LLaVA-Mini achieves comparable performance to state-of-the-art LMMs using only one vision token, drastically reducing computational cost and latency.
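A sketch of how many vision tokens could collapse into one, using a single learned query with cross-attention pooling; the dimensions and module names are assumptions, and LLaVA-Mini's actual compression and modality pre-fusion design may differ.

```python
import torch
import torch.nn as nn

class OneTokenCompressor(nn.Module):
    """Compress a sequence of vision tokens into a single token with
    query-based cross-attention. A generic pooling sketch of the
    'one vision token' idea, not LLaVA-Mini's exact module."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # one learned query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, n_tokens, dim), e.g. 576 CLIP patch tokens.
        b = vision_tokens.size(0)
        q = self.query.expand(b, -1, -1)
        pooled, _ = self.attn(q, vision_tokens, vision_tokens)
        return pooled  # (batch, 1, dim): the single token handed to the LLM

compressor = OneTokenCompressor()
tokens = torch.randn(2, 576, 1024)
print(compressor(tokens).shape)  # torch.Size([2, 1, 1024])
```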
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
·2565 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong
Dispider: A novel system enabling real-time interaction with video LLMs via disentangled perception, decision, and reaction modules for efficient, accurate responses to streaming video.
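A toy sketch of the disentangled design: perception, decision, and reaction run as separate concurrent stages connected by queues, so generating a response never stalls the incoming frame stream. The trigger policy and module internals here are placeholders, not Dispider's actual components.

```python
import queue
import threading

def perception(frames, events: queue.Queue):
    """Continuously encode incoming frames into lightweight events."""
    for t, frame in enumerate(frames):
        events.put({"time": t, "summary": f"features({frame})"})
    events.put(None)  # end of stream

def decision(events: queue.Queue, tasks: queue.Queue):
    """Decide, per event, whether a response is needed right now."""
    while (event := events.get()) is not None:
        if event["time"] % 3 == 0:  # stand-in trigger policy
            tasks.put(event)
    tasks.put(None)

def reaction(tasks: queue.Queue):
    """Generate responses asynchronously so perception never blocks."""
    while (task := tasks.get()) is not None:
        print(f"t={task['time']}: responding to {task['summary']}")

events, tasks = queue.Queue(), queue.Queue()
frames = [f"frame{i}" for i in range(7)]
workers = [
    threading.Thread(target=perception, args=(frames, events)),
    threading.Thread(target=decision, args=(events, tasks)),
    threading.Thread(target=reaction, args=(tasks,)),
]
for w in workers: w.start()
for w in workers: w.join()
```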
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2577 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tencent Youtu Lab
VITA-1.5 achieves near real-time vision and speech interaction by using a novel three-stage training method that progressively integrates speech data into an LLM, enabling fluent conversations.
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
·4036 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 College of Computer Science and Technology, Zhejiang University
New multimodal textbook dataset boosts Vision-Language Model (VLM) performance!
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
·3571 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 DAMO Academy, Alibaba Group
VideoRefer Suite boosts video LLM understanding by introducing a large-scale, high-quality object-level video instruction dataset, a versatile spatial-temporal object encoder model, and a comprehensive benchmark.
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
·5637 words·27 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong, Shenzhen
Multimodal LLMs for medical imaging now generalize better via compositional generalization, leveraging relationships between image features (modality, anatomy, task) to understand unseen images.
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
·3641 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
OS-Genesis: Reverse task synthesis revolutionizes GUI agent training by generating high-quality trajectory data without human supervision, drastically boosting performance on challenging benchmarks.
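A schematic sketch of reverse task synthesis: the agent explores first, and the instruction is derived afterwards from the recorded trajectory. `explore` and `label_with_task` are hypothetical stand-ins; a real pipeline would drive an actual GUI and ask an LLM to write the instruction.

```python
from dataclasses import dataclass

@dataclass
class Step:
    screenshot: str   # placeholder for a real screen capture
    action: str       # e.g. "click(search_box)"

def explore(app: str, n_steps: int) -> list[Step]:
    """Interaction-first exploration: act in the GUI without any task in mind."""
    return [Step(f"{app}_screen_{i}", f"action_{i}") for i in range(n_steps)]

def label_with_task(trajectory: list[Step]) -> str:
    """Reverse task synthesis: derive the instruction AFTER the fact from what
    the agent actually did. A real pipeline would ask an LLM to summarize the
    action sequence; this stub just stitches the actions together."""
    return "Task: perform " + " then ".join(s.action for s in trajectory)

# Explored trajectories become (instruction, trajectory) training pairs
# with no human-written task descriptions in the loop.
traj = explore("settings_app", 3)
print(label_with_task(traj))
```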
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
·3329 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Xi'an Jiaotong University
LaDeCo: a layered approach to automatic graphic design composition, generating high-quality designs by sequentially composing elements into semantic layers.
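As a sketch of layered composition: elements are placed layer by layer, with each layer conditioned on everything already on the canvas. The layer taxonomy and the stubbed placement logic are illustrative assumptions, not LaDeCo's exact pipeline (which would query a multimodal model at each layer).

```python
# Layer names follow a common graphic-design stack; treating them as the
# paper's exact taxonomy is an assumption.
LAYER_ORDER = ["background", "underlay", "logo/image", "text", "embellishment"]

def compose(elements: list[dict]) -> list[dict]:
    """Sequentially place elements layer by layer, letting each layer
    condition on everything already placed (tracked as a running context)."""
    canvas, context = [], []
    for layer in LAYER_ORDER:
        for el in (e for e in elements if e["layer"] == layer):
            # A real system would prompt an LMM here with `context`;
            # this stub just assigns a z-index from the layer order.
            placed = {**el, "z": LAYER_ORDER.index(layer), "conditioned_on": len(context)}
            canvas.append(placed)
            context.append(placed)
    return canvas

design = compose([
    {"name": "photo", "layer": "logo/image"},
    {"name": "bg",    "layer": "background"},
    {"name": "title", "layer": "text"},
])
print([(d["name"], d["z"]) for d in design])  # bg first, then photo, then title
```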