Multimodal Learning

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories
·2632 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhejiang University, China
SegAgent: Improves MLLMs’ pixel understanding by mimicking human annotation, enabling mask refinement without altering output space.
Referring to Any Person
·3096 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 International Digital Economy Academy (IDEA)
Introducing HumanRef, a new dataset, and RexSeek, a multimodal LLM, to improve human-centric referring tasks by addressing limitations of existing methods.
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
·2951 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 Huazhong University of Science & Technology
OmniMamba: Efficient multimodal understanding and generation via SSMs, trained on 2M image-text pairs.
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
·2477 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
GTR: Prevents thought collapse in RL-based VLM agents by process guidance, enhancing performance in complex visual reasoning tasks.
RFLAV: Rolling Flow Matching for Infinite Audio-Video Generation
·2128 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Audio-Visual Learning 🏢 University of Parma
RFLAV: A novel rolling flow matching model for infinite audio-video generation with high quality, synchronization, and temporal coherence.
Video Action Differencing
·3793 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Stanford
VidDiff: Identify subtle action differences in videos for coaching and skill learning.
Should VLMs be Pre-trained with Image Data?
·3469 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Toyota Research Institute
Image data during pre-training can boost Vision-Language Model (VLM) performance, especially when introduced later in the process.
Motion Anything: Any to Motion Generation
·7987 words·38 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 ANU
Motion Anything: Control human motion generation with multimodal conditions like text and music.
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
·2597 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhongguancun Laboratory
VisualSimpleQA: A new benchmark for fine-grained evaluation of visual and linguistic modules in fact-seeking LVLMs.
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
·2549 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 WHU
ProJudge: A benchmark for MLLM-based process judges in scientific reasoning, plus an instruction-tuning dataset that boosts their performance.
DiffCLIP: Differential Attention Meets CLIP
·2247 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 KAUST
DiffCLIP: Enhancing CLIP models by integrating differential attention, achieving superior performance with minimal overhead.
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
·2871 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 Nankai University
ARMOR: Empowers MLLMs with interleaved multimodal generation via asymmetric synergy, using limited resources.
Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
·2695 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Audio-Visual Learning 🏢 Imperial College London
Llama-MTSK: AVSR via Matryoshka LLMs, adapting to computational limits without sacrificing accuracy!
Unified Reward Model for Multimodal Understanding and Generation
·368 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 Fudan University
UNIFIEDREWARD: A unified reward model that enhances multimodal understanding and generation!
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning
·1187 words·6 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Alibaba Group
R1-Omni: RLVR enhances multimodal emotion recognition, boosting reasoning and generalization.
SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
·2729 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Shanghai Artificial Intelligence Laboratory
SURVEYFORGE automates survey generation, improving quality and evaluation.
EgoLife: Towards Egocentric Life Assistant
·3562 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 NTU S-Lab
EgoLife: Ultra-long egocentric dataset & benchmark enabling AI assistants to understand and enhance daily life. Datasets and models released!
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
·5020 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 National University of Singapore
VLMs often disproportionately trust text over visual data, leading to performance drops and safety concerns.
Visual-RFT: Visual Reinforcement Fine-Tuning
·3386 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiaotong University
Visual-RFT: Enhance LVLMs’ visual reasoning via reinforcement learning with verifiable rewards, achieving strong performance with limited data.
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
·1959 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Technology Sydney
VideoUFO: A new user-focused, million-scale dataset that improves text-to-video generation by aligning training data with real user interests and preferences!