Multimodal Learning
Chimera: Improving Generalist Model with Domain-Specific Experts
·4776 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
Chimera boosts large multimodal models’ performance on specialized tasks by cleverly integrating domain-specific expert models, achieving state-of-the-art results on multiple benchmarks.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
·3252 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Adelaide
State-Adaptive Mixture of Experts (SAME) model excels in generic language-guided visual navigation by consolidating diverse tasks and dynamically adapting to varying instruction granularities.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
·4233 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
MAmmoTH-VL: A novel approach to instruction tuning at scale builds a 12M-sample instruction dataset that elicits chain-of-thought reasoning, yielding state-of-the-art multimodal reasoning capabilities.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
·9241 words·44 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
InternVL 2.5, a new open-source multimodal LLM, surpasses 70% on the MMMU benchmark, rivaling top commercial models through model, data, and test-time scaling strategies.
VisionZip: Longer is Better but Not Necessary in Vision Language Models
·7032 words·34 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 CUHK
VisionZip boosts vision-language model efficiency by intelligently selecting key visual tokens, achieving near-state-of-the-art performance with drastically reduced computational costs.
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
·4750 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GENMAC: Multi-agent collaboration revolutionizes compositional text-to-video generation, achieving state-of-the-art results by iteratively refining videos via specialized agents.
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
·3412 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Florence-VL enhances vision-language models by incorporating a generative vision encoder and a novel depth-breadth fusion architecture, achieving state-of-the-art results on various benchmarks.
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
·2671 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
Divot: A novel diffusion-powered video tokenizer enables unified video comprehension & generation with LLMs, surpassing existing methods.
Discriminative Fine-tuning of LVLMs
·4145 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Samsung AI Cambridge
VladVA: A novel training framework converts generative LVLMs into powerful discriminative models, achieving state-of-the-art performance on image-text retrieval and compositionality benchmarks.
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
·5178 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 ByteDance
TokenFlow: One image tokenizer, mastering both visual understanding & generation!
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
·3120 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting visual reasoning in multimodal language models, AURORA leverages novel ‘Perception Tokens’ for improved depth estimation and object counting.
PaliGemma 2: A Family of Versatile VLMs for Transfer
·6035 words·29 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
PaliGemma 2: A family of versatile, open-weight VLMs achieving state-of-the-art results on various transfer tasks by scaling model size and resolution.
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
·7212 words·34 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Innovation Institute, Huawei Noah's Ark Lab
INST-IT boosts multimodal instance understanding by using explicit visual prompts for instruction tuning, achieving significant improvements on various benchmarks.
Personalized Multimodal Large Language Models: A Survey
·599 words·3 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of California, San Diego
This survey charts advancements in personalized multimodal large language models (MLLMs), offering a novel taxonomy and highlighting key challenges and applications.
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
·5843 words·28 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Understanding
🏢 CUHK MMLab
AV-Odyssey Bench reveals that current multimodal LLMs struggle with basic audio-visual understanding, prompting the development of a comprehensive benchmark for more effective evaluation.
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
·3550 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
X-Prompt: a novel autoregressive vision-language model that achieves universal in-context image generation by efficiently compressing contextual information within a unified training framework.
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
·4300 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA
VLsI: Verbalized Layers-to-Interactions efficiently transfers knowledge from large to small VLMs using layer-wise natural language distillation, achieving significant performance gains without scaling model size.
Towards Universal Soccer Video Understanding
·2836 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
Soccer video understanding gets a major boost with SoccerReplay-1988, the largest multi-modal dataset, and MatchVision, a new visual-language model achieving state-of-the-art performance on event classification.
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
·5107 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 University of California, Los Angeles
OmniFlow: a novel generative model masters any-to-any multi-modal generation, outperforming existing models and offering flexible control!
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
·3719 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 South China University of Technology
LSceneLLM boosts large 3D scene understanding by adaptively focusing on task-relevant visual details using LLMs’ visual preferences, surpassing existing methods on multiple benchmarks.