Multimodal Learning
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
·3130 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Microsoft
Phi-4-Mini: Compact yet powerful multimodal language models via Mixture-of-LoRAs.
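The core idea in the report is to keep the language model frozen and attach modality-specific LoRA adapters that are switched in per input type. A minimal sketch of that pattern for a single linear layer (the class and argument names here are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class MoLoRALinear(nn.Module):
    """A frozen base projection plus one low-rank (LoRA) adapter per modality."""

    def __init__(self, in_dim, out_dim, rank=8, modalities=("vision", "audio")):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        for p in self.base.parameters():
            p.requires_grad_(False)  # the base language model stays frozen
        # one (A, B) low-rank pair per modality; B starts at zero so each
        # adapter is a no-op before training
        self.lora_A = nn.ModuleDict({m: nn.Linear(in_dim, rank, bias=False) for m in modalities})
        self.lora_B = nn.ModuleDict({m: nn.Linear(rank, out_dim, bias=False) for m in modalities})
        for b in self.lora_B.values():
            nn.init.zeros_(b.weight)

    def forward(self, x, modality=None):
        y = self.base(x)
        if modality is not None:  # route through the adapter for the active modality
            y = y + self.lora_B[modality](self.lora_A[modality](x))
        return y

layer = MoLoRALinear(512, 512)
tokens = torch.randn(2, 16, 512)
print(layer(tokens, modality="vision").shape)  # torch.Size([2, 16, 512])
```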
CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments
·1626 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Shenzhen Future Network of Intelligence Institute
CLEA: Enhancing task execution in dynamic environments with a closed-loop embodied agent.
Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions
·3420 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Datasets
🏢 Xiaohongshu Inc.
Qilin: A multimodal dataset with APP-level user sessions for advancing search and recommendation systems.
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
·3091 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Kuaishou Technology
HAIC improves MLLMs’ action understanding and generation with high-quality video captions and a new benchmark, boosting performance on both.
UniTok: A Unified Tokenizer for Visual Generation and Understanding
·3043 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Hong Kong
UniTok: A unified tokenizer bridging the visual generation and understanding gap via multi-codebook quantization, achieving SOTA in MLLMs.
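UniTok's headline mechanism is multi-codebook quantization: the latent is split into chunks and each chunk is quantized against its own codebook, growing the effective vocabulary multiplicatively instead of training one huge codebook. A minimal sketch under assumed sizes (class name and dimensions are illustrative, not UniTok's actual configuration):

```python
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    """Split each latent into chunks and vector-quantize every chunk
    against its own small codebook."""

    def __init__(self, dim=256, num_codebooks=8, codebook_size=4096):
        super().__init__()
        assert dim % num_codebooks == 0
        self.chunk = dim // num_codebooks
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.chunk) for _ in range(num_codebooks)
        )

    def forward(self, z):  # z: (batch, tokens, dim)
        quantized, codes = [], []
        for i, cb in enumerate(self.codebooks):
            zc = z[..., i * self.chunk:(i + 1) * self.chunk]
            # squared L2 distance from each chunk to every codebook entry
            d = (zc.unsqueeze(-2) - cb.weight).pow(2).sum(-1)
            idx = d.argmin(dim=-1)                     # discrete code per chunk
            q = cb(idx)
            # straight-through estimator so gradients flow back to the encoder
            quantized.append(zc + (q - zc).detach())
            codes.append(idx)
        return torch.cat(quantized, dim=-1), torch.stack(codes, dim=-1)

q, codes = MultiCodebookQuantizer()(torch.randn(2, 16, 256))
print(q.shape, codes.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 16, 8])
```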
R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
·3310 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Johns Hopkins University
R2-T2: Boost multimodal MoE performance by re-routing experts at test time, no retraining needed!
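The gist is that a test sample's expert-routing weights can be corrected by borrowing from similar reference samples the model already handles well, with no retraining. A minimal neighborhood-interpolation sketch (the update rule and function names are illustrative; the paper explores several re-routing strategies):

```python
import numpy as np

def rerouted_weights(test_emb, test_routing, ref_embs, ref_routings, k=5, alpha=0.5):
    """Shift a test sample's expert-routing weights toward those of its k
    nearest reference samples (an illustrative interpolation, not the
    paper's exact update)."""
    # cosine similarity between the test embedding and each reference embedding
    sims = ref_embs @ test_emb / (
        np.linalg.norm(ref_embs, axis=1) * np.linalg.norm(test_emb) + 1e-8
    )
    nearest = np.argsort(-sims)[:k]
    neighbor_routing = ref_routings[nearest].mean(axis=0)
    w = (1 - alpha) * test_routing + alpha * neighbor_routing
    return w / w.sum()  # keep it a valid distribution over experts

rng = np.random.default_rng(0)
w = rerouted_weights(
    test_emb=rng.normal(size=64),
    test_routing=np.array([0.7, 0.1, 0.1, 0.1]),
    ref_embs=rng.normal(size=(100, 64)),
    ref_routings=rng.dirichlet(np.ones(4), size=100),
)
print(w)  # re-weighted distribution over 4 experts, no retraining involved
```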
Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration
·4130 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Beijing Jiaotong University
Mobile-Agent-V: Automating mobile tasks using video guidance for efficient, scalable operation, outperforming existing frameworks by 30%.
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
·1916 words·9 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 University of California, Santa Cruz
MMIR: A new benchmark to assess and improve multimodal reasoning models’ ability to detect inconsistencies in real-world content.
Evaluating Multimodal Generative AI with Korean Educational Standards
·2108 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NAVER Cloud AI
KoNET: Evaluating multimodal generative AI in Korean with educational standards.
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
·4915 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
SigLIP 2: Multilingual Vision-Language Encoders with Semantic Understanding, Localization, and Dense Features.
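SigLIP 2 keeps the original SigLIP training signal: a pairwise sigmoid loss over all image-text pairs in the batch rather than a batch-wise softmax. A compact sketch of that loss (embedding sizes and the t, b values below are illustrative):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """Pairwise sigmoid contrastive loss from SigLIP: t and b are the
    learnable temperature and bias; embeddings are L2-normalized."""
    logits = t * img_emb @ txt_emb.T + b        # (N, N) similarity logits
    labels = 2 * torch.eye(logits.size(0)) - 1  # +1 on the diagonal, -1 off it
    return -F.logsigmoid(labels * logits).mean()

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
print(siglip_loss(img, txt, t=torch.tensor(10.0), b=torch.tensor(-10.0)))
```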
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
·4251 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Pennsylvania
CoSyn: Code-guided synthetic data for scaling text-rich image understanding, achieving SOTA via targeted multimodal data generation!
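The premise is that code which renders a text-rich image also knows the exact ground truth inside it, so instruction data can be synthesized with verifiable answers. A toy instance of that loop (CoSyn's real pipeline prompts an LLM to write the rendering code across many document formats; this example is hand-written):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def render_chart_example(path="synthetic_chart.png"):
    """Render a text-rich image from code and emit a grounded QA pair."""
    labels, values = ["Q1", "Q2", "Q3", "Q4"], [120, 150, 90, 180]
    plt.figure(figsize=(4, 3))
    plt.bar(labels, values)
    plt.title("Quarterly Revenue")
    plt.savefig(path, dpi=150, bbox_inches="tight")
    plt.close()
    # because the chart data is generated by code, the answer is known exactly
    return {"image": path,
            "question": "Which quarter had the highest revenue?",
            "answer": labels[values.index(max(values))]}

print(render_chart_example())
```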
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
·2325 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 MAIS, Institute of Automation, Chinese Academy of Sciences, China
PC-Agent: A new hierarchical framework that significantly improves complex task automation on PCs by 32%!
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
·3063 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 National University of Singapore
InterFeedback: LMMs need better human feedback to enhance AI assistants!
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO
·402 words·2 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Menlo Research
AlphaMaze enhances LLMs’ spatial intelligence via GRPO, achieving 93% accuracy in maze navigation and showing emergent reasoning.
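GRPO scores each sampled completion against the statistics of its own sampling group, which removes the need for a learned value model. A minimal sketch of the advantage computation (the binary solved/unsolved maze rewards below are illustrative):

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each completion is normalized
    by the mean and std of its own group of samples for the same prompt."""
    # rewards: (num_prompts, group_size)
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],   # maze solved / not solved per sample
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```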
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
·5226 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Sydney
RealSyn: A new, scalable multimodal dataset revolutionizes vision-language learning by effectively using interleaved image-text documents.
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
·2594 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
mmMamba: a novel framework that creates linear-complexity multimodal models via distillation, drastically improving efficiency without sacrificing performance.
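Distillation here means training a linear-complexity student to reproduce a quadratic-attention teacher's behavior. One generic form of that transfer signal is a temperature-scaled KL on output logits (mmMamba's actual recipe is a staged, layer-level transfer into state-space layers; this sketch only illustrates the idea):

```python
import torch
import torch.nn.functional as F

def logit_distill_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled KL between teacher and student next-token
    distributions -- a generic distillation loss, not mmMamba's exact one."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

teacher = torch.randn(4, 32000)   # quadratic-attention teacher's logits
student = torch.randn(4, 32000)   # linear-complexity student's logits
print(logit_distill_loss(student, teacher))
```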
Magma: A Foundation Model for Multimodal AI Agents
·5533 words·26 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Microsoft Research
Magma: a new foundation model for multimodal AI agents that bridges verbal and spatial intelligence, achieving state-of-the-art performance across tasks including UI navigation and robotic manipulation.
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
·4398 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Tsinghua University
video-SALMONN-o1: An open-source audio-visual LLM enhances video understanding with a novel reasoning-intensive dataset and the pDPO method, achieving significant accuracy gains.
InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
·1563 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Reallm Labs
InfiR: Efficient, small AI models rival larger ones in reasoning, slashing costs and boosting privacy for wider AI use.
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
·2102 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
HermesFlow seamlessly bridges the understanding-generation gap in MLLMs using a novel Pair-DPO framework and self-play optimization on homologous data, achieving significant performance improvements.
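Pair-DPO builds on the standard DPO objective, applied to paired understanding/generation ("homologous") data. A minimal sketch of the underlying DPO loss (the sequence log-probabilities below are placeholders):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective that Pair-DPO builds on: prefer the 'winning'
    response over the 'losing' one, relative to a frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# placeholder sequence log-probs for a batch of two preference pairs
logp_w, logp_l = torch.tensor([-12.3, -9.8]), torch.tensor([-14.1, -11.0])
ref_w, ref_l = torch.tensor([-12.0, -10.2]), torch.tensor([-13.0, -10.8])
print(dpo_loss(logp_w, logp_l, ref_w, ref_l))
```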