Multimodal Learning
Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input
·2871 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Polytechnic of Turin
AIUTA minimizes user input in instance navigation by leveraging agent self-dialogue and dynamic interaction, achieving state-of-the-art performance.
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
·4218 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong
Video-3D LLM masters 3D scene understanding by cleverly fusing video data with 3D positional encoding, achieving state-of-the-art performance.
VLSBench: Unveiling Visual Leakage in Multimodal Safety
·5131 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
VLSBench exposes visual leakage in MLLM safety benchmarks, creating a new, leak-free benchmark to evaluate true multimodal safety.
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
·3277 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 SenseTime Research
SOLAMI: enabling immersive, natural interactions with 3D characters via a unified social vision-language-action model and a novel synthetic multimodal dataset.
On Domain-Specific Post-Training for Multimodal Large Language Models
·4939 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 State Key Laboratory of General Artificial Intelligence, BIGAI
AdaMLLM enhances multimodal LLMs for specific domains via a novel visual instruction synthesizer and a single-stage post-training pipeline, achieving superior performance compared to existing methods.
Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
·3199 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Integrated Vision and Language Lab, KAIST, South Korea
Video-Ma²mba efficiently handles long videos by using State Space Models, achieving linear scaling in memory and time, and employing a novel Multi-Axis Gradient Checkpointing (MA-GC) for significant m…
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
·3026 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NC Research, NCSOFT
VARCO-VISION: A new open-source 14B parameter Korean-English vision-language model excels at bilingual image-text understanding and generation, expanding AI capabilities for low-resource languages.
MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
·4014 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Yonsei University
MaskRIS revolutionizes referring image segmentation by using novel masking and contextual learning to enhance data augmentation, achieving state-of-the-art results.
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
·2978 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
The novel Video-Text Duet interaction format lets a VideoLLM know when to speak, enabling real-time, time-sensitive video comprehension with significantly improved performance.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
·3551 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory, Fudan University
Critic-V enhances VLM reasoning accuracy by incorporating a critic model that provides constructive feedback, significantly outperforming existing methods on several benchmarks.
SketchAgent: Language-Driven Sequential Sketch Generation
·5526 words·26 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 MIT
SketchAgent uses a multimodal LLM to generate dynamic, sequential sketches from textual prompts, enabling collaborative drawing and chat-based editing.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
·5469 words·26 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
ShowUI, a novel vision-language-action model, efficiently manages high-resolution GUI screenshots and diverse task needs via UI-guided token selection and interleaved streaming, achieving state-of-the…
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
·3716 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Northwestern Polytechnical University
FiCoCo: A unified paradigm accelerates Multimodal Large Language Model (MLLM) inference by up to 82.4% with minimal performance loss, surpassing state-of-the-art training-free methods.
Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
·2315 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Kim Jaechul Graduate School of AI, KAIST
Free²Guide: Gradient-free path integral control enhances text-to-video generation using powerful large vision-language models, improving alignment without gradient-based fine-tuning.
UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
·4040 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Chinese Academy of Sciences
UniPose: A unified multimodal framework for human pose comprehension, generation, and editing, enabling seamless transitions across various modalities and showcasing zero-shot generalization.
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
·3021 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Integrated Vision and Language Lab, KAIST
SALOVA, a novel video-LLM framework, enhances long-form video comprehension through targeted retrieval. It introduces SceneWalk, a high-quality dataset of densely-captioned long videos, and integrates…
Knowledge Transfer Across Modalities with Natural Language Supervision
·7979 words·38 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Turin
Teach AI new visual concepts using only their textual descriptions!
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
·4108 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Beihang University
VideoEspresso: A new dataset and Hybrid LVLMs framework boost fine-grained video reasoning!
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
·2416 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Nanjing University
This survey paper offers a comprehensive overview of Multimodal Large Language Model (MLLM) evaluation, systematically categorizing benchmarks and methods, and identifying gaps for future research, th…
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
·3534 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NTU, Singapore
Large multimodal models’ inner workings are demystified using a novel framework that identifies, interprets, and even steers their internal features, opening the door to safer, more reliable AI.