Multimodal Learning
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
·2671 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
Divot: A novel diffusion-powered video tokenizer enables unified video comprehension & generation with LLMs, surpassing existing methods.
Discriminative Fine-tuning of LVLMs
·4145 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Samsung AI Cambridge
VladVA: A novel training framework converts generative LVLMs into powerful discriminative models, achieving state-of-the-art performance on image-text retrieval and compositionality benchmarks.
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
·5178 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 ByteDance
TokenFlow: One image tokenizer, mastering both visual understanding & generation!
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
·3120 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting visual reasoning in multimodal language models, AURORA leverages novel ‘Perception Tokens’ for improved depth estimation and object counting.
PaliGemma 2: A Family of Versatile VLMs for Transfer
·6035 words·29 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
PaliGemma 2: A family of versatile, open-weight VLMs achieving state-of-the-art results on various transfer tasks by scaling model size and resolution.
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
·7212 words·34 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Innovation Institute, Huawei Noah's Ark Lab
INST-IT boosts multimodal instance understanding by using explicit visual prompts for instruction tuning, achieving significant improvements on various benchmarks.
Personalized Multimodal Large Language Models: A Survey
·599 words·3 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of California, San Diego
This survey reveals the exciting advancements in personalized multimodal large language models (MLLMs), offering a novel taxonomy, highlighting key challenges and applications, ultimately pushing the …
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
·5843 words·28 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Understanding
🏢 CUHK MMLab
AV-Odyssey Bench reveals that current multimodal LLMs struggle with basic audio-visual understanding, prompting the development of a comprehensive benchmark for more effective evaluation.
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
·3550 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
X-Prompt: a novel autoregressive vision-language model achieves universal in-context image generation by efficiently compressing contextual information and using a unified training framework for super…
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
·4300 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA
VLsI: Verbalized Layers-to-Interactions efficiently transfers knowledge from large to small VLMs using layer-wise natural language distillation, achieving significant performance gains without scaling…
Towards Universal Soccer Video Understanding
·2836 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
Soccer video understanding gets a major boost with SoccerReplay-1988, the largest multi-modal dataset, and MatchVision, a new visual-language model achieving state-of-the-art performance on event clas…
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
·5107 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 UC Los Angeles
OmniFlow: a novel generative model masters any-to-any multi-modal generation, outperforming existing models and offering flexible control!
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
·3719 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 South China University of Technology
LSceneLLM boosts large 3D scene understanding by adaptively focusing on task-relevant visual details using LLMs’ visual preferences, surpassing existing methods on multiple benchmarks.
Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input
·2871 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Polytechnic of Turin
AIUTA minimizes user input in instance navigation by leveraging agent self-dialogue and dynamic interaction, achieving state-of-the-art performance.
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
·4218 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong
Video-3D LLM masters 3D scene understanding by cleverly fusing video data with 3D positional encoding, achieving state-of-the-art performance.
VLSBench: Unveiling Visual Leakage in Multimodal Safety
·5131 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
VLSBench exposes visual leakage in MLLM safety benchmarks, creating a new, leak-free benchmark to evaluate true multimodal safety.
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
·3277 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 SenseTime Research
SOLAMI: enabling immersive, natural interactions with 3D characters via a unified social vision-language-action model and a novel synthetic multimodal dataset.
On Domain-Specific Post-Training for Multimodal Large Language Models
·4939 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 State Key Laboratory of General Artificial Intelligence, BIGAI
AdaMLLM enhances multimodal LLMs for specific domains via a novel visual instruction synthesizer and a single-stage post-training pipeline, achieving superior performance compared to existing methods.
Look Every Frame All at Once: Video-Ma²mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
·3199 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Integrated Vision and Language Lab, KAIST, South Korea
Video-Ma²mba efficiently handles long videos by using State Space Models, achieving linear scaling in memory and time, and employing a novel Multi-Axis Gradient Checkpointing (MA-GC) for significant m…
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
·3026 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NC Research, NCSOFT
VARCO-VISION: A new open-source 14B parameter Korean-English vision-language model excels at bilingual image-text understanding and generation, expanding AI capabilities for low-resource languages.