
Multimodal Learning

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
·2697 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent AI Lab
Insight-V: A multi-agent system enhances multi-modal LLMs’ visual reasoning by generating high-quality long-chain reasoning data and employing a two-stage training pipeline, achieving significant perf…
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
·4473 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Washington
GMAI-VL-5.5M & GMAI-VL: A new multimodal medical dataset and vision-language model achieve state-of-the-art results in various medical tasks.
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
·3767 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong Baptist University
VideoAutoArena automates large multimodal model (LMM) evaluation using simulated users, offering a cost-effective and scalable solution compared to traditional human annotation.
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
·3633 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Vivo AI Lab
BlueLM-V-3B: Algorithm and system co-design enables efficient, real-time multimodal language model deployment on mobile devices.
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
·2224 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Metabrain AGI Lab
Awaker2.5-VL: A novel Mixture-of-Experts architecture stably scales MLLMs, solving multi-task conflict with parameter efficiency and achieving state-of-the-art performance.
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
·2599 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Roblox
SmoothCache: A universal technique boosts Diffusion Transformer inference speed by 8-71% across modalities, without sacrificing quality!
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
·2726 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
LLaVA-o1: A novel vision-language model achieves superior reasoning performance through structured, multi-stage processing and efficient inference-time scaling, surpassing even larger, closed-source m…
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
·4459 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Researchers boost multimodal reasoning in LLMs with Mixed Preference Optimization (MPO) and a large-scale preference dataset (MMPR), significantly improving reasoning accuracy and achieving performance …
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
·4045 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
JanusFlow harmonizes autoregression and rectified flow for unified multimodal understanding and generation, achieving state-of-the-art results on standard benchmarks.
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
·2584 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
VideoGLaMM: a new large multimodal model achieves precise pixel-level visual grounding in videos by seamlessly integrating a dual vision encoder, a spatio-temporal decoder, and a large language model.
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
·2445 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Microsoft Research
LLM2CLIP boosts CLIP’s performance by cleverly integrating LLMs, enabling it to understand longer, more complex image captions and achieving state-of-the-art results across various benchmarks.
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
·3165 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong, Shenzhen
MM-Detect: a novel framework detects contamination in multimodal LLMs, enhancing benchmark reliability by identifying training set leakage and improving performance evaluations.
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
·2197 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Technology Sydney
TIP-I2V: A million-scale dataset provides 1.7 million real user text & image prompts for image-to-video generation, boosting model development and safety.
Inference Optimal VLMs Need Only One Visual Token but Larger Models
·3063 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Inference-optimal Vision-Language Models (VLMs) achieve the best compute-accuracy trade-off by pairing a single visual token with a larger language model.
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
·3766 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 Tsinghua University
AndroidLab, a novel framework, systematically benchmarks Android autonomous agents, improving LLM and LMM success rates on 138 tasks via a unified environment and open-source dataset.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
·3628 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai AI Laboratory
OS-Atlas: A new open-source toolkit and model dramatically improves GUI agent performance by providing a massive dataset and innovative training methods, enabling superior generalization to unseen int…
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays
·3405 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Institute of High Performance Computing (IHPC)
BenchX: A unified benchmark framework reveals surprising MedVLP performance, challenging existing conclusions and advancing research.
Survey of User Interface Design and Interaction Techniques in Generative AI Applications
·3567 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 UC San Diego
This study provides a comprehensive taxonomy of user interface design and interaction techniques in generative AI, offering valuable insights for developers and researchers aiming to enhance user expe…