
Multimodal Learning

MAPS: A Multi-Agent Framework Based on Big Seven Personality and Socratic Guidance for Multimodal Scientific Problem Solving
·3857 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Xi'an Jiaotong University
MAPS improves multimodal scientific problem solving by combining multiple agents with Socratic guidance.
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
·3338 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Renmin University of China
ETVA evaluates text-to-video alignment via fine-grained question generation and answering.
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
·3300 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Westlake University
VidKV: Achieves plug-and-play 1.x-bit KV cache quantization for VideoLLMs, maintaining performance without retraining.
OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning
·2043 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 OPPO Research Institute
OThink-MR1 enhances MLLM reasoning via dynamic reinforcement learning, achieving remarkable cross-task generalization!
M3: 3D-Spatial MultiModal Memory
·2710 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 UC San Diego
M3: Gaussian-integrated memory system for multimodal 3D scene understanding with foundation models.
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
·2805 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
ActVLP: Enhancing VLMs through visual-linguistic guidance for superior action-based decision-making in interactive environments.
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
·3142 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
UPME: Peer review for MLLMs, minus human bias!
TULIP: Towards Unified Language-Image Pretraining
·3271 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 UC Berkeley
TULIP enhances image-text pretraining by unifying generative data augmentation with contrastive learning, achieving state-of-the-art performance in visual understanding.
See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias
·3559 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chung-Ang University
BALGRAD mitigates dominant modality bias in vision-language models by reweighting gradients and aligning task directions for balanced learning and improved performance.
MusicInfuser: Making Video Diffusion Listen and Dance
·4650 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Washington
Sync your moves! MusicInfuser adapts video diffusion models to listen and dance to music, preserving style and aligning movement.
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding
·441 words·3 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 UNC-Chapel Hill
MDocAgent: Multi-agent document understanding that integrates text and image for better accuracy.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
·5843 words·28 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhejiang University
Creation-MMBench: A new benchmark for assessing context-aware creative intelligence in MLLMs.
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
·4040 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 NVIDIA
Cosmos-Reason1: Physical AI models that reason and act in the real world, bridging the gap between perception and embodied decision-making.
ViSpeak: Visual Instruction Feedback in Streaming Videos
·4700 words·23 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 Sun Yat-Sen University
ViSpeak: Enables visual instruction feedback in streaming videos, enhancing human-AI interaction.
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
·5687 words·27 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 KAIST AI
Sightation: A BLV-aligned dataset that leverages sighted user feedback to improve VLM-generated diagram descriptions for visually impaired learners.
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
·2607 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Nanjing University
TVC mitigates visual forgetting in multimodal LLMs, enhancing reasoning by strategically re-introducing and compressing visual information.
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
·4473 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Stanford University
MicroVQA: A new benchmark to test visual-question-answering in microscopy-based research.
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
·5931 words·28 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Fudan University
ADR balances visual-language models by adaptively calibrating long-tail data, boosting LLaVA 1.5 by 4.36% without increasing training data volume.
Free-form language-based robotic reasoning and grasping
·1651 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Fondazione Bruno Kessler
FreeGrasp: enabling robots to grasp by interpreting instructions and reasoning about object spatial relationships.
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
·2841 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
DeepPerception enhances MLLMs with cognitive visual perception, achieving superior grounding through knowledge integration & reasoning.