Multimodal Reasoning
Video-R1: Reinforcing Video Reasoning in MLLMs
·1632 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 CUHK MMLab
Video-R1: First to explore rule-based RL for video reasoning in MLLMs, enhancing performance on key benchmarks.
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
·2895 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Shanghai AI Laboratory
MLLMs still struggle with spatial reasoning! LEGO-Puzzles benchmark reveals critical deficiencies, paving the way for AI advancement.
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
·3214 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 University of California, Los Angeles
OpenVLThinker: Iteratively refining vision-language models for complex reasoning, bridging the gap to R1-style capabilities.
MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
·3857 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Xi'an Jiaotong University
MAPS solves multimodal scientific problems more effectively by combining multiple specialized agents with Socratic guidance.
OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
·2043 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 OPPO Research Institute
OThink-MR1 enhances MLLM reasoning via dynamic reinforcement learning, achieving remarkable cross-task generalization!
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
·2607 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Nanjing University
TVC mitigates visual forgetting in multimodal LLMs, enhancing reasoning by strategically re-introducing and compressing visual information.
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
·4473 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Stanford University
MicroVQA: A new benchmark to test visual-question-answering in microscopy-based research.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
·3237 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 National University of Singapore
A comprehensive survey of multimodal chain-of-thought (MCoT) reasoning, bridging the gap in existing literature and fostering innovation towards multimodal AGI.
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
·2497 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Harbin Institute of Technology
MPBench: Multimodal benchmark to identify errors in reasoning processes.
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
·2549 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Wuhan University
ProJudge: a benchmark for MLLM-based process judges in scientific reasoning, plus an instruction-tuning dataset to boost their performance.
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcing Learning
·1187 words·6 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Alibaba Group
R1-Omni: RLVR enhances multimodal emotion recognition, boosting reasoning and generalization.
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
·3310 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Johns Hopkins University
R2-T2: Boost multimodal MoE performance by re-routing experts in test-time, no retraining needed!
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
·1916 words·9 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 University of California, Santa Cruz
MMIR: A new benchmark to assess and improve multimodal reasoning models’ ability to detect inconsistencies in real-world content.
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
·4398 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Tsinghua University
video-SALMONN-o1: An open-source audio-visual LLM that enhances video understanding with a novel reasoning-intensive dataset and the pDPO method, achieving significant accuracy gains.
InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
·1563 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Reallm Labs
InfiR: Efficient, small AI models rival larger ones in reasoning, slashing costs and boosting privacy for wider AI use.
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
·3464 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Hong Kong University of Science and Technology
ThinkDiff empowers text-to-image diffusion models with multimodal reasoning by aligning vision-language models to an LLM decoder, achieving state-of-the-art results on in-context reasoning benchmarks.
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
·3250 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Singapore University of Technology and Design
GPT models’ multimodal reasoning abilities are tracked over time on challenging visual puzzles, revealing surprisingly steady improvement and cost trade-offs.
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
·2983 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
Virgo: A new multimodal slow-thinking system that significantly improves MLLM reasoning by fine-tuning with text-based long-form thought data, achieving performance comparable to commercial systems.
Diving into Self-Evolving Training for Multimodal Reasoning
·3292 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Hong Kong University of Science and Technology
M-STAR: a novel self-evolving training framework significantly boosts multimodal reasoning in large models without human annotation, achieving state-of-the-art results.
Progressive Multimodal Reasoning via Active Retrieval
·3576 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
AR-MCTS: a novel framework that boosts multimodal large language model reasoning by actively retrieving key supporting evidence and using Monte Carlo Tree Search for improved path selection and verification.