Vision-Language Models
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
·3551 words·17 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory, Fudan University
Critic-V enhances VLM reasoning accuracy by incorporating a critic model that provides constructive feedback, significantly outperforming existing methods on several benchmarks.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
·5469 words·26 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
ShowUI, a novel vision-language-action model, efficiently manages high-resolution GUI screenshots and diverse task needs via UI-guided token selection and interleaved streaming, achieving state-of-the…
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
·3716 words·18 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Northwestern Polytechnical University
FiCoCo: A unified paradigm that accelerates Multimodal Large Language Model (MLLM) inference, reducing computation by up to 82.4% with minimal performance loss and surpassing state-of-the-art training-free methods.
Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
·2315 words·11 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Kim Jaechul Graduate School of AI, KAIST
Free²Guide: Gradient-free path integral control enhances text-to-video generation using powerful large vision-language models, improving alignment without gradient-based fine-tuning.
UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
·4040 words·19 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Chinese Academy of Sciences
UniPose: A unified multimodal framework for human pose comprehension, generation, and editing, enabling seamless transitions across various modalities and showcasing zero-shot generalization.
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
·3021 words·15 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Integrated Vision and Language Lab, KAIST
SALOVA, a novel video-LLM framework, enhances long-form video comprehension through targeted retrieval. It introduces SceneWalk, a high-quality dataset of densely captioned long videos, and integrates…
Knowledge Transfer Across Modalities with Natural Language Supervision
·7979 words·38 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Turin
Teach AI new visual concepts using only their textual descriptions!
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
·4108 words·20 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Beihang University
VideoEspresso: A new dataset and Hybrid LVLMs framework boost fine-grained video reasoning!
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
·2416 words·12 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Nanjing University
This survey paper offers a comprehensive overview of Multimodal Large Language Model (MLLM) evaluation, systematically categorizing benchmarks and methods, and identifying gaps for future research, th…
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
·3534 words·17 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NTU, Singapore
Large multimodal models’ inner workings are demystified using a novel framework that identifies, interprets, and even steers their internal features, opening the door to safer, more reliable AI.
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
·2697 words·13 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
Insight-V: A multi-agent system enhances multi-modal LLMs’ visual reasoning by generating high-quality long-chain reasoning data and employing a two-stage training pipeline, achieving significant perf…
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
·4473 words·21 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Washington
GMAI-VL-5.5M & GMAI-VL: A new multimodal medical dataset and vision-language model achieve state-of-the-art results in various medical tasks.
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
·3767 words·18 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Baptist University
VideoAutoArena automates large multimodal model (LMM) evaluation using simulated users, offering a cost-effective and scalable solution compared to traditional human annotation.
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
·3633 words·18 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Vivo AI Lab
BlueLM-V-3B: Algorithm and system co-design enables efficient, real-time multimodal language model deployment on mobile devices.
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
·2224 words·11 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Metabrain AGI Lab
Awaker2.5-VL: A novel Mixture-of-Experts architecture stably scales MLLMs, solving multi-task conflict with parameter efficiency and achieving state-of-the-art performance.
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
·2726 words·13 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
LLaVA-o1: A novel visual language model achieves superior reasoning performance through structured, multi-stage processing and efficient inference-time scaling, surpassing even larger, closed-source m…
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
·4459 words·21 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Researchers boost multimodal reasoning in LLMs with Mixed Preference Optimization (MPO) and a large-scale preference dataset (MMPR), significantly improving reasoning accuracy and achieving performance …
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
·4045 words·19 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
JanusFlow harmonizes autoregression and rectified flow for unified multimodal understanding and generation, achieving state-of-the-art results on standard benchmarks.
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
·2584 words·13 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
VideoGLaMM: a new large multimodal model achieves precise pixel-level visual grounding in videos by seamlessly integrating a dual vision encoder, a spatio-temporal decoder, and a large language model.
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation
·2445 words·12 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
LLM2CLIP boosts CLIP’s performance by cleverly integrating LLMs, enabling it to understand longer, more complex image captions and achieving state-of-the-art results across various benchmarks.