
Multimodal Learning

RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy
·3587 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Zhejiang University
RIG: Synergizes reasoning and imagination in an end-to-end generalist policy for embodied agents, improving sample efficiency and generalization.
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
·1493 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 MAUM AI Inc.
KOFFVQA: Objectively evaluates Korean VLMs with a new free-form VQA benchmark, improving evaluation reliability via detailed grading criteria.
MoCha: Towards Movie-Grade Talking Character Synthesis
·2382 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Waterloo
MoCha: Movie-Grade Talking Character Synthesis!
Video-R1: Reinforcing Video Reasoning in MLLMs
·1632 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 CUHK MMLab
Video-R1: First to explore rule-based RL for video reasoning in MLLMs, enhancing performance on key benchmarks.
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
·2964 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Vivo AI Lab
UI-R1 enhances GUI agents’ action prediction using reinforcement learning.
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
·3498 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Zhejiang University
Embodied-Reasoner: Integrates visual search, reasoning, and action for interactive tasks, outperforming existing models in embodied environments.
ViLBench: A Suite for Vision-Language Process Reward Modeling
·373 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 UC Santa Cruz
ViLBench: a suite for vision-language process reward modeling.
Unified Multimodal Discrete Diffusion
·3324 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Carnegie Mellon University
UniDisc: a unified multimodal discrete diffusion model for joint text and image generation, surpassing autoregressive models in quality & efficiency!
Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics
·5958 words·28 mins
AI Generated 🤗 Daily Papers Multimodal Learning Audio-Visual Learning 🏢 Grad. School of AI, POSTECH
New metrics and representation enhance 3D talking head realism by focusing on perceptual lip synchronization.
Scaling Vision Pre-Training to 4K Resolution
·6421 words·31 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 UC Berkeley
PS3 scales CLIP vision pre-training to 4K resolution with near-constant cost, achieving state-of-the-art performance in multi-modal LLMs.
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
·2895 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Shanghai AI Laboratory
MLLMs still struggle with spatial reasoning! The LEGO-Puzzles benchmark reveals critical deficiencies in multi-step spatial reasoning, pointing to where future models need to improve.
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
·3191 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Chinese Academy of Sciences
HAVEN: A new benchmark to tackle the hallucination issue in video understanding of large multimodal models!
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
·4635 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 MBZUAI
Video SimpleQA: A New Benchmark for Factuality Evaluation in Large Video Language Models.
MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
·1703 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Northwestern University
MetaSpatial: RL for 3D Spatial Reasoning in VLMs
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
·3612 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
CoMP: Continually pre-training vision foundation models for better vision-language alignment and support for inputs of arbitrary size.
AlphaSpace: Enabling Robotic Actions through Semantic Tokenization and Symbolic Reasoning
·327 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Menlo Research
AlphaSpace enables robotic actions via semantic tokenization and symbolic reasoning, enhancing spatial intelligence in LLMs.
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
·3123 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
Vision-R1: Improves LVLMs via vision-guided reinforcement learning, eliminating the need for human feedback and specialized reward models.
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making
·1218 words·6 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
VLMs can self-improve via text-only training, outperforming vision-based training on human-centered decision making and offering a more efficient path to model enhancement.
PVChat: Personalized Video Chat with One-Shot Learning
·4971 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Nanyang Technological University
PVChat: Personalize video understanding with one-shot learning, enabling identity-aware video comprehension.
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement
·3214 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 University of California, Los Angeles
OpenVLThinker: Iteratively refining vision-language models for complex reasoning, bridging the gap to R1-style capabilities.