Vision-Language Models

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3509 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai AI Laboratory
Task Preference Optimization (TPO) significantly boosts multimodal large language models’ visual understanding by aligning them with fine-grained visual tasks via learnable task tokens, achieving 14.6…
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2929 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Toronto
MMFactory: A universal framework for vision-language tasks, offering diverse programmatic solutions based on user needs and constraints, outperforming existing methods.
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
·3344 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 AIRI
3DGraphLLM boosts 3D scene understanding by cleverly merging semantic graphs and LLMs, enabling more accurate scene descriptions and outperforming existing methods.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2604 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
MegaPairs synthesizes 26M+ high-quality multimodal retrieval training examples, enabling state-of-the-art zero-shot performance and surpassing existing methods trained on 70x more data.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·3592 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
CrossFlow directly evolves one modality into another using flow matching, achieving state-of-the-art results across a variety of tasks.
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3553 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 enhances MLLMs by integrating high-resolution visual details using a hierarchical window transformer.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2901 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Enhance image captions significantly with DCE, a novel engine leveraging visual specialists to generate comprehensive, detailed descriptions surpassing LMM and human-annotated captions.
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
·5510 words·26 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Beijing University of Posts and Telecommunications
A new benchmark probes how well large multimodal models understand and meet real-world, personalized human needs.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1887 words·9 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
Apollo LMMs achieve SOTA on video understanding tasks by exploring and optimizing the design and training of video-LMMs.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3840 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
SynerGen-VL: A simpler, more powerful unified MLLM for image understanding and generation.
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
·5249 words·25 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Microsoft Research
OLA-VLM boosts multimodal LLMs’ visual understanding by distilling knowledge from specialized visual encoders into the LLM’s internal representations during pretraining, achieving significant performa…
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
·3111 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Lyra: an efficient, speech-centric framework for omni-cognition that achieves state-of-the-art results across multiple modalities.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2853 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2, a bilingual medical expert LMM, excels across diverse medical modalities.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
·3931 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Huawei Noah's Ark Lab
ILLUME: A unified multi-modal LLM efficiently integrates visual understanding & generation, achieving competitive performance with significantly less data.
Chimera: Improving Generalist Model with Domain-Specific Experts
·4776 words·23 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Artificial Intelligence Laboratory
Chimera boosts large multimodal models’ performance on specialized tasks by cleverly integrating domain-specific expert models, achieving state-of-the-art results on multiple benchmarks.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
·3252 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Adelaide
State-Adaptive Mixture of Experts (SAME) model excels in generic language-guided visual navigation by consolidating diverse tasks and dynamically adapting to varying instruction granularities.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
·4233 words·20 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
MAmmoTH-VL: a scalable instruction-tuning approach builds a 12M-example dataset that elicits chain-of-thought reasoning, yielding state-of-the-art multimodal reasoning capabilities.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
·9241 words·44 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
InternVL 2.5, a new open-source multimodal LLM, surpasses 70% on the MMMU benchmark, rivaling top commercial models through model, data, and test-time scaling strategies.
VisionZip: Longer is Better but Not Necessary in Vision Language Models
·7032 words·34 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 CUHK
VisionZip boosts vision-language model efficiency by intelligently selecting key visual tokens, achieving near-state-of-the-art performance with drastically reduced computational costs.
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
·4750 words·23 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
GenMAC: multi-agent collaboration advances compositional text-to-video generation, achieving state-of-the-art results by iteratively refining videos via specialized agents.