Multimodal Learning
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
·4310 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
HealthGPT: A novel medical vision-language model unifying comprehension and generation via heterogeneous knowledge adaptation, achieving superior performance on various medical tasks.
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
·2430 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Cambridge
ZeroBench, a new visual reasoning benchmark, proves impossible for current large multimodal models, pushing the boundaries of AI visual understanding.
Exploring the Potential of Encoder-free Architectures in 3D LMMs
·3414 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Northwestern Polytechnical University
Encoder-free 3D LMMs rival the state of the art, achieving results comparable to those of significantly larger models.
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
·3464 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Hong Kong University of Science and Technology
ThinkDiff empowers text-to-image diffusion models with multimodal reasoning by aligning vision-language models to an LLM decoder, achieving state-of-the-art results on in-context reasoning benchmarks.
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
·5073 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 DLUT
EVEv2.0: A novel encoder-free vision-language model outperforms existing approaches by using a divide-and-conquer architecture and a data-efficient training strategy, achieving strong vision-reasoning…
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
·3420 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
Show-o Turbo dramatically speeds up multimodal understanding and generation by leveraging parallel decoding and consistency distillation, achieving significant performance gains with fewer sampling st…
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
·5172 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA Research
QLIP: A new visual tokenizer unifying autoregressive multimodal understanding & generation with state-of-the-art reconstruction and zero-shot performance!
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
·2102 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Ola: a novel 7B parameter omni-modal language model achieves state-of-the-art performance across image, video and audio tasks using a progressive modality alignment training strategy.
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
·4880 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Rutgers University
VISTA steers LVLMs away from hallucinations by cleverly adjusting token rankings during inference, improving visual grounding and semantic coherence.
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
·3250 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Singapore University of Technology and Design
GPT models’ multimodal reasoning abilities are tracked over time on challenging visual puzzles, revealing surprisingly steady improvement and cost trade-offs.
Baichuan-Omni-1.5 Technical Report
·3756 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Baichuan Inc.
Baichuan-Omni-1.5: An open-source omni-modal LLM achieving SOTA performance across multiple modalities.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
·4124 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 DAMO Academy, Alibaba Group
VideoLLaMA3: Vision-centric training yields state-of-the-art image & video understanding!
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
·4361 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
FILMAGENT: A multi-agent framework automates end-to-end virtual film production using LLMs, exceeding single-agent performance in a collaborative workflow.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
·4964 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 ByteDance Seed, Tsinghua University
UI-TARS, a novel native GUI agent, achieves state-of-the-art performance by solely using screenshots as input, eliminating the need for complex agent frameworks and expert-designed workflows.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
·2690 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-Reward: A novel multi-modal reward model boosting Large Vision Language Model performance.
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
·3786 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
New multimodal safety test suite (MSTS) reveals vision-language models’ vulnerabilities and underscores the unique challenges of multimodal inputs.
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
·3561 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
Multimodal LLMs can now evaluate art aesthetics with human-level accuracy using a novel dataset (MM-StyleBench) and prompting method (ArtCoT), significantly improving AI alignment in artistic evaluation.
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
·1663 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Cross-Modal Retrieval
🏢 Noah's Ark Lab, Huawei
MMDocIR, a new benchmark dataset, enables better evaluation of multi-modal document retrieval systems by providing page-level and layout-level annotations for diverse long documents and questions.
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
·4505 words·22 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Parameter-Inverted Image Pyramid Networks (PIIP) drastically cut visual model computing costs without sacrificing accuracy by using smaller models for higher-resolution images and larger models for lo…
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
·22812 words·108 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Würzburg
Centurio: a 100-language LVLM achieves state-of-the-art multilingual performance by strategically incorporating non-English data in training, proving that multilingualism doesn’t hinder English profi…