Multimodal Learning
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
·2671 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
Divot: A novel diffusion-powered video tokenizer enables unified video comprehension & generation with LLMs, surpassing existing methods.
Discriminative Fine-tuning of LVLMs
·4145 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Samsung AI Cambridge
VladVA: A novel training framework converts generative LVLMs into powerful discriminative models, achieving state-of-the-art performance on image-text retrieval and compositionality benchmarks.
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
·5178 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 ByteDance
TokenFlow: One image tokenizer, mastering both visual understanding & generation!
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
·3120 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting visual reasoning in multimodal language models, AURORA leverages novel ‘Perception Tokens’ for improved depth estimation and object counting.
PaliGemma 2: A Family of Versatile VLMs for Transfer
·6035 words·29 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
PaliGemma 2: A family of versatile, open-weight VLMs achieving state-of-the-art results on various transfer tasks by scaling model size and resolution.
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
·7212 words·34 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Innovation Institute, Huawei Noah's Ark Lab
INST-IT boosts multimodal instance understanding by using explicit visual prompts for instruction tuning, achieving significant improvements on various benchmarks.
Personalized Multimodal Large Language Models: A Survey
·599 words·3 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of California, San Diego
This survey reveals the exciting advancements in personalized multimodal large language models (MLLMs), offering a novel taxonomy, highlighting key challenges and applications, ultimately pushing the …
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
·5843 words·28 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Understanding
🏢 CUHK MMLab
AV-Odyssey Bench reveals that current multimodal LLMs struggle with basic audio-visual understanding, prompting the development of a comprehensive benchmark for more effective evaluation.
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
·3550 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
X-Prompt: a novel autoregressive vision-language model achieves universal in-context image generation by efficiently compressing contextual information and using a unified training framework for super…
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
·4300 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA
VLsI: Verbalized Layers-to-Interactions efficiently transfers knowledge from large to small VLMs using layer-wise natural language distillation, achieving significant performance gains without scaling…
Towards Universal Soccer Video Understanding
·2836 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
Soccer video understanding gets a major boost with SoccerReplay-1988, the largest multi-modal dataset, and MatchVision, a new visual-language model achieving state-of-the-art performance on event clas…
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
·5107 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 UC Los Angeles
OmniFlow: a novel generative model masters any-to-any multi-modal generation, outperforming existing models and offering flexible control!
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
·3719 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 South China University of Technology
LSceneLLM boosts large 3D scene understanding by adaptively focusing on task-relevant visual details using LLMs’ visual preferences, surpassing existing methods on multiple benchmarks.
Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input
·2871 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Polytechnic of Turin
AIUTA minimizes user input in instance navigation by leveraging agent self-dialogue and dynamic interaction, achieving state-of-the-art performance.
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
·4218 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong
Video-3D LLM masters 3D scene understanding by cleverly fusing video data with 3D positional encoding, achieving state-of-the-art performance.
VLSBench: Unveiling Visual Leakage in Multimodal Safety
·5131 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
VLSBench exposes visual leakage in MLLM safety benchmarks, creating a new, leak-free benchmark to evaluate true multimodal safety.
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
·3277 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 SenseTime Research
SOLAMI: enabling immersive, natural interactions with 3D characters via a unified social vision-language-action model and a novel synthetic multimodal dataset.
On Domain-Specific Post-Training for Multimodal Large Language Models
·4939 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 State Key Laboratory of General Artificial Intelligence, BIGAI
AdaMLLM enhances multimodal LLMs for specific domains via a novel visual instruction synthesizer and a single-stage post-training pipeline, achieving superior performance compared to existing methods.
Look Every Frame All at Once: Video-Ma²mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
·3199 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Integrated Vision and Language Lab, KAIST, South Korea
Video-Ma²mba efficiently handles long videos by using State Space Models, achieving linear scaling in memory and time, and employing a novel Multi-Axis Gradient Checkpointing (MA-GC) for significant m…
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
·3026 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NC Research, NCSOFT
VARCO-VISION: A new open-source 14B parameter Korean-English vision-language model excels at bilingual image-text understanding and generation, expanding AI capabilities for low-resource languages.