
Vision-Language Models

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3553 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 enhances MLLMs by integrating high-resolution visual details using a hierarchical window transformer.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2901 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
DCE, a novel caption-enhancement engine, leverages visual specialists to generate comprehensive, detailed image descriptions that surpass both LMM-generated and human-annotated captions.
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
·5510 words·26 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Beijing University of Posts and Telecommunications
A new benchmark reveals how well large multimodal models understand and meet real-world human needs.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1887 words·9 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
Apollo LMMs achieve state-of-the-art results on video understanding tasks by exploring and optimizing the design and training of video-LMMs.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3840 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
SynerGen-VL: A simpler, more powerful unified MLLM for image understanding and generation.
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
·5249 words·25 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Microsoft Research
OLA-VLM boosts multimodal LLMs’ visual understanding by distilling knowledge from specialized visual encoders into the LLM’s internal representations during pretraining, achieving significant performance gains.
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
·3111 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Lyra: an efficient, speech-centric framework for omni-cognition that achieves state-of-the-art results across various modalities.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2853 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2, a bilingual medical expert LMM, excels across diverse medical modalities.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
·3931 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Huawei Noah's Ark Lab
ILLUME: A unified multi-modal LLM efficiently integrates visual understanding & generation, achieving competitive performance with significantly less data.
Chimera: Improving Generalist Model with Domain-Specific Experts
·4776 words·23 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Artificial Intelligence Laboratory
Chimera boosts large multimodal models’ performance on specialized tasks by cleverly integrating domain-specific expert models, achieving state-of-the-art results on multiple benchmarks.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
·3252 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Adelaide
State-Adaptive Mixture of Experts (SAME) model excels in generic language-guided visual navigation by consolidating diverse tasks and dynamically adapting to varying instruction granularities.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
·4233 words·20 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
MAmmoTH-VL scales instruction tuning with a 12M-sample dataset that elicits chain-of-thought reasoning, yielding state-of-the-art multimodal reasoning capabilities.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
·9241 words·44 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
InternVL 2.5, a new open-source multimodal LLM, surpasses 70% on the MMMU benchmark, rivaling top commercial models through model, data, and test-time scaling strategies.
VisionZip: Longer is Better but Not Necessary in Vision Language Models
·7032 words·34 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 CUHK
VisionZip boosts vision-language model efficiency by intelligently selecting key visual tokens, achieving near-state-of-the-art performance with drastically reduced computational costs.
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
·4750 words·23 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
GenMAC: Multi-agent collaboration revolutionizes compositional text-to-video generation, achieving state-of-the-art results by iteratively refining videos via specialized agents.
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
·3412 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Microsoft Research
Florence-VL enhances vision-language models by incorporating a generative vision encoder and a novel depth-breadth fusion architecture, achieving state-of-the-art results on various benchmarks.
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
·2671 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent AI Lab
Divot: A novel diffusion-powered video tokenizer enables unified video comprehension & generation with LLMs, surpassing existing methods.
Discriminative Fine-tuning of LVLMs
·4145 words·20 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Samsung AI Cambridge
VladVA: A novel training framework converts generative LVLMs into powerful discriminative models, achieving state-of-the-art performance on image-text retrieval and compositionality benchmarks.
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
·5178 words·25 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 ByteDance
TokenFlow: One image tokenizer, mastering both visual understanding & generation!
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
·3120 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Washington
Boosting visual reasoning in multimodal language models, AURORA leverages novel ‘Perception Tokens’ for improved depth estimation and object counting.