Multimodal Learning

MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation
·4014 words·19 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Yonsei University
MaskRIS revolutionizes referring image segmentation by using novel masking and contextual learning to enhance data augmentation, achieving state-of-the-art results.
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
·2978 words·14 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Peking University
The novel Video-Text Duet interaction format enables VideoLLMs to perform real-time, time-sensitive video comprehension with significantly improved performance.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
·3551 words·17 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Shanghai Artificial Intelligence Laboratory, Fudan University
Critic-V enhances VLM reasoning accuracy by incorporating a critic model that provides constructive feedback, significantly outperforming existing methods on several benchmarks.
SketchAgent: Language-Driven Sequential Sketch Generation
·5526 words·26 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Human-AI Interaction 🏒 MIT
SketchAgent uses a multimodal LLM to generate dynamic, sequential sketches from textual prompts, enabling collaborative drawing and chat-based editing.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
·5469 words·26 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Show Lab, National University of Singapore
ShowUI, a novel vision-language-action model, efficiently manages high-resolution GUI screenshots and diverse task needs via UI-guided token selection and interleaved streaming, achieving state-of-the…
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
·3716 words·18 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Northwestern Polytechnical University
FiCoCo: A unified paradigm accelerates Multimodal Large Language Model (MLLM) inference by up to 82.4% with minimal performance loss, surpassing state-of-the-art training-free methods.
Free²Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models
·2315 words·11 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Kim Jaechul Graduate School of AI, KAIST
FreeΒ²Guide: Gradient-free path integral control enhances text-to-video generation using powerful large vision-language models, improving alignment without gradient-based fine-tuning.
UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing
·4040 words·19 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 University of Chinese Academy of Sciences
UniPose: A unified multimodal framework for human pose comprehension, generation, and editing, enabling seamless transitions across various modalities and showcasing zero-shot generalization.
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
·3021 words·15 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Integrated Vision and Language Lab, KAIST
SALOVA, a novel video-LLM framework, enhances long-form video comprehension through targeted retrieval. It introduces SceneWalk, a high-quality dataset of densely-captioned long videos, and integrates…
Knowledge Transfer Across Modalities with Natural Language Supervision
·7979 words·38 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 University of Turin
Teach AI new visual concepts using only textual descriptions!
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
·4108 words·20 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Beihang University
VideoEspresso: A new dataset and Hybrid LVLMs framework boost fine-grained video reasoning!
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
·2416 words·12 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Nanjing University
This survey paper offers a comprehensive overview of Multimodal Large Language Model (MLLM) evaluation, systematically categorizing benchmarks and methods, and identifying gaps for future research, th…
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
·3534 words·17 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 NTU, Singapore
Large multimodal models’ inner workings are demystified using a novel framework that identifies, interprets, and even steers their internal features, opening the door to safer, more reliable AI.
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
·2697 words·13 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Tencent AI Lab
Insight-V: A multi-agent system enhances multi-modal LLMs’ visual reasoning by generating high-quality long-chain reasoning data and employing a two-stage training pipeline, achieving significant perf…
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
·4473 words·21 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 University of Washington
GMAI-VL-5.5M & GMAI-VL: A new multimodal medical dataset and vision-language model achieve state-of-the-art results across a range of medical tasks.
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
·3767 words·18 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Hong Kong Baptist University
VideoAutoArena automates large multimodal model (LMM) evaluation using simulated users, offering a cost-effective and scalable solution compared to traditional human annotation.
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
·3633 words·18 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Vivo AI Lab
BlueLM-V-3B: Algorithm and system co-design enables efficient, real-time multimodal language model deployment on mobile devices.
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
·2224 words·11 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Metabrain AGI Lab
Awaker2.5-VL: A novel Mixture-of-Experts architecture stably scales MLLMs, solving multi-task conflict with parameter efficiency and achieving state-of-the-art performance.
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
·2599 words·13 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Multimodal Generation 🏒 Roblox
SmoothCache: A universal technique speeds up Diffusion Transformer inference by 8–71% across modalities, without sacrificing quality!
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
·2726 words·13 mins
AI Generated πŸ€— Daily Papers Multimodal Learning Vision-Language Models 🏒 Peking University
LLaVA-o1: A novel visual language model achieves superior reasoning performance through structured, multi-stage processing and efficient inference-time scaling, surpassing even larger, closed-source m…