Vision-Language Models

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3509 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai AI Laboratory
Task Preference Optimization (TPO) significantly boosts multimodal large language models’ visual understanding by aligning them with fine-grained visual tasks via learnable task tokens, achieving 14.6…
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2929 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Toronto
MMFactory: A universal framework for vision-language tasks, offering diverse programmatic solutions based on user needs and constraints, outperforming existing methods.
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
·3344 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 AIRI
3DGraphLLM boosts 3D scene understanding by cleverly merging semantic graphs and LLMs, enabling more accurate scene descriptions and outperforming existing methods.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2604 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
MegaPairs synthesizes 26M+ high-quality multimodal retrieval training examples, enabling state-of-the-art zero-shot performance and surpassing existing methods trained on 70x more data.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·3592 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
CrossFlow directly evolves one modality into another using flow matching, achieving state-of-the-art results across a variety of tasks.
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3553 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 enhances MLLMs by integrating high-resolution visual details using a hierarchical window transformer.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2901 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Enhance image captions significantly with DCE, a novel engine leveraging visual specialists to generate comprehensive, detailed descriptions surpassing LMM and human-annotated captions.
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
·5510 words·26 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Beijing University of Posts and Telecommunications
A new benchmark probes how well large multimodal models understand and meet real-world, personalized human needs.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1887 words·9 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
Apollo LMMs achieve SOTA on video understanding tasks by exploring and optimizing the design and training of video-LMMs.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3840 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
SynerGen-VL: A simpler, more powerful unified MLLM for image understanding and generation.
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
·5249 words·25 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Microsoft Research
OLA-VLM boosts multimodal LLMs’ visual understanding by distilling knowledge from specialized visual encoders into the LLM’s internal representations during pretraining, achieving significant performa…
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
·3111 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Lyra: an efficient, speech-centric framework for omni-cognition that achieves state-of-the-art results across multiple modalities.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2853 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2, a bilingual medical expert LMM, excels across diverse medical modalities.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
·3931 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Huawei Noah's Ark Lab
ILLUME: A unified multi-modal LLM efficiently integrates visual understanding & generation, achieving competitive performance with significantly less data.
Chimera: Improving Generalist Model with Domain-Specific Experts
·4776 words·23 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Artificial Intelligence Laboratory
Chimera boosts large multimodal models’ performance on specialized tasks by cleverly integrating domain-specific expert models, achieving state-of-the-art results on multiple benchmarks.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
·3252 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Adelaide
State-Adaptive Mixture of Experts (SAME) model excels in generic language-guided visual navigation by consolidating diverse tasks and dynamically adapting to varying instruction granularities.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
·4233 words·20 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
MAmmoTH-VL: a scalable instruction-tuning approach builds a 12M-example dataset that elicits chain-of-thought reasoning, yielding state-of-the-art multimodal reasoning capabilities.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
·9241 words·44 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
InternVL 2.5, a new open-source multimodal LLM, surpasses 70% on the MMMU benchmark, rivaling top commercial models through model, data, and test-time scaling strategies.
VisionZip: Longer is Better but Not Necessary in Vision Language Models
·7032 words·34 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 CUHK
VisionZip boosts vision-language model efficiency by intelligently selecting key visual tokens, achieving near-state-of-the-art performance with drastically reduced computational costs.
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
·4750 words·23 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
GenMAC: multi-agent collaboration advances compositional text-to-video generation, achieving state-of-the-art results by iteratively refining videos via specialized agents.