Vision-Language Models
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
·2964 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Vivo AI Lab
UI-R1 enhances GUI agents’ action prediction using reinforcement learning.
ViLBench: A Suite for Vision-Language Process Reward Modeling
·373 words·2 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 UC Santa Cruz
ViLBench: A benchmark suite for vision-language process reward modeling.
Scaling Vision Pre-Training to 4K Resolution
·6421 words·31 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 UC Berkeley
PS3 scales CLIP vision pre-training to 4K resolution with near-constant cost, achieving state-of-the-art performance in multi-modal LLMs.
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
·3191 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Chinese Academy of Sciences
HAVEN: A new benchmark for evaluating, analyzing, and mitigating hallucinations of large multimodal models in video understanding.
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
·4635 words·22 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 MBZUAI
Video SimpleQA: A New Benchmark for Factuality Evaluation in Large Video Language Models.
MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse
·1703 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Northwestern University
MetaSpatial: RL for 3D Spatial Reasoning in VLMs
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
·3612 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University
CoMP continually pre-trains vision foundation models for better vision-language alignment and support for inputs of arbitrary size.
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning
·3123 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences
Vision-R1: Improves LVLMs via vision-guided reinforcement learning, eliminating the need for human feedback and specialized reward models.
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making
·1218 words·6 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
VLMs can self-improve via text-only training, outperforming vision-based training for human-centered decision making and offering a more efficient path to enhancement.
PVChat: Personalized Video Chat with One-Shot Learning
·4971 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Nanyang Technological University
PVChat personalizes video understanding with one-shot learning, enabling identity-aware video comprehension.
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
·3338 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Renmin University of China
ETVA evaluates text-to-video alignment via fine-grained question generation and answering.
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
·3300 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Westlake University
VidKV achieves plug-and-play 1.x-bit KV cache quantization for video LLMs, maintaining performance without retraining.
M3: 3D-Spatial MultiModal Memory
·2710 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 UC San Diego
M3: Gaussian-integrated memory system for multimodal 3D scene understanding with foundation models.
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
·2805 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
ActVLP: Enhancing VLMs through visual-linguistic guidance for superior action-based decision-making in interactive environments.
TULIP: Towards Unified Language-Image Pretraining
·3271 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 UC Berkeley
TULIP enhances image-text pretraining by unifying generative data augmentation with contrastive learning, achieving state-of-the-art performance in visual understanding.
See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias
·3559 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chung-Ang University
BALGRAD mitigates dominant modality bias in vision-language models by reweighting gradients and aligning task directions for balanced learning and improved performance.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
·5843 words·28 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
·5687 words·27 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 KAIST AI
SIGHTATION: A BLV-aligned dataset utilizing sighted user feedback to enhance diagram descriptions generated by VLMs, improving accessibility for visually impaired learners.
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
·5931 words·28 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Fudan University
ADR balances vision-language models by adaptively calibrating long-tail data, boosting LLaVA 1.5 by 4.36% without increasing training data volume.
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
·2841 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
DeepPerception enhances MLLMs with cognitive visual perception, achieving superior grounding through knowledge integration & reasoning.