
Multimodal Learning

Video-R1: Reinforcing Video Reasoning in MLLMs
·1632 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 CUHK MMLab
Video-R1: First to explore rule-based RL for video reasoning in MLLMs, enhancing performance on key benchmarks.
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
·2964 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Vivo AI Lab
UI-R1 enhances GUI agents’ action prediction using reinforcement learning.
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
·3498 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Zhejiang University
Embodied-Reasoner: Integrates visual search, reasoning, and action for interactive tasks, outperforming existing models in embodied environments.
ViLBench: A Suite for Vision-Language Process Reward Modeling
·373 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 UC Santa Cruz
ViLBench: a benchmark suite for vision-language process reward modeling.
Unified Multimodal Discrete Diffusion
·3324 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Carnegie Mellon University
UniDisc: a unified multimodal discrete diffusion model for joint text and image generation, surpassing autoregressive models in quality & efficiency!
Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
·5958 words·28 mins
AI Generated 🤗 Daily Papers Multimodal Learning Audio-Visual Learning 🏢 Grad. School of AI, POSTECH
New metrics and representation enhance 3D talking head realism by focusing on perceptual lip synchronization.
Scaling Vision Pre-Training to 4K Resolution
·6421 words·31 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 UC Berkeley
PS3 scales CLIP vision pre-training to 4K resolution with near-constant cost, achieving state-of-the-art performance in multi-modal LLMs.
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
·2895 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Shanghai AI Laboratory
MLLMs still struggle with spatial reasoning! The LEGO-Puzzles benchmark exposes critical deficiencies in multi-step spatial reasoning.
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
·3191 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Chinese Academy of Sciences
HAVEN: a new benchmark for analyzing and mitigating hallucination of large multimodal models in video understanding!
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
·4635 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 MBZUAI
Video SimpleQA: A New Benchmark for Factuality Evaluation in Large Video Language Models.
MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
·1703 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Northwestern University
MetaSpatial leverages reinforcement learning to strengthen 3D spatial reasoning in VLMs for metaverse applications.
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
·3612 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
CoMP: continual multimodal pre-training of vision foundation models for better vision-language alignment and support for inputs of arbitrary size.
AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning
·327 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Menlo Research
AlphaSpace enables robotic actions via semantic tokenization and symbolic reasoning, enhancing spatial intelligence in LLMs.
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
·3123 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
Vision-R1: Improves LVLMs via vision-guided reinforcement learning, eliminating the need for human feedback and specialized reward models.
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making
·1218 words·6 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
VLMs can self-improve via text-only training, which outperforms vision-based training for human-centered decision making and offers an efficient path to enhancement.
PVChat: Personalized Video Chat with One-Shot Learning
·4971 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Nanyang Technological University
PVChat: Personalize video understanding with one-shot learning, enabling identity-aware video comprehension.
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
·3214 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 University of California, Los Angeles
OpenVLThinker: Iteratively refining vision-language models for complex reasoning, bridging the gap to R1-style capabilities.
MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
·3857 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Xi'an Jiaotong University
MAPS solves multimodal scientific problems more effectively by combining multiple agents with Socratic guidance.
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
·3338 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Renmin University of China
ETVA evaluates text-to-video alignment via fine-grained question generation and answering.
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
·3300 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Westlake University
VidKV achieves plug-and-play 1.x-bit KV cache quantization for VideoLLMs, maintaining performance without retraining.