Multimodal Learning
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3553 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
LLaVA-UHD v2 enhances MLLMs by integrating high-resolution visual details using a hierarchical window transformer.
GUI Agents: A Survey
·360 words·2 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 University of Maryland
A comprehensive survey of GUI agents, categorizing benchmarks, architectures, training methods, and open challenges, providing a unified framework for researchers.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2901 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
DCE, a novel caption-enhancement engine, leverages visual specialists to generate comprehensive, detailed image descriptions that surpass both LMM-generated and human-annotated captions.
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
·5510 words·26 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Beijing University of Posts and Telecommunications
A new benchmark evaluates how well large multimodal models understand and meet real-world, personalized human needs.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1887 words·9 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Meta GenAI
Apollo LMMs achieve state-of-the-art video understanding by systematically exploring and optimizing the design and training of video-LMMs.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3840 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
SynerGen-VL: A simpler, more powerful unified MLLM for image understanding and generation.
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
·5249 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
OLA-VLM boosts multimodal LLMs’ visual understanding by distilling knowledge from specialized visual encoders into the LLM’s internal representations during pretraining, yielding significant performance gains.
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
·3107 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 University of Edinburgh
VMB generates music from videos, images, and text, using description and retrieval bridges to improve quality and controllability.
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
·3111 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
Lyra: an efficient, speech-centric framework for omni-cognition that achieves state-of-the-art results across a wide range of modalities.
GenEx: Generating an Explorable World
·2719 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Johns Hopkins University
GenEx generates explorable 3D worlds from a single image, enabling embodied AI agents to explore and learn.
Multimodal Latent Language Modeling with Next-Token Diffusion
·4442 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 Microsoft Research
LatentLM: a novel multimodal model unifying discrete & continuous data via next-token diffusion, surpassing existing methods in performance & scalability across various tasks.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2853 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2, a bilingual medical expert LMM, excels across diverse medical modalities.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
·3931 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Huawei Noah's Ark Lab
ILLUME: A unified multi-modal LLM efficiently integrates visual understanding & generation, achieving competitive performance with significantly less data.
Chimera: Improving Generalist Model with Domain-Specific Experts
·4776 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
Chimera boosts large multimodal models’ performance on specialized tasks by cleverly integrating domain-specific expert models, achieving state-of-the-art results on multiple benchmarks.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
·3252 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Adelaide
State-Adaptive Mixture of Experts (SAME) model excels in generic language-guided visual navigation by consolidating diverse tasks and dynamically adapting to varying instruction granularities.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
·4233 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
MAmmoTH-VL: A novel approach to instruction tuning at scale creates a 12M-sample dataset that elicits chain-of-thought reasoning, yielding state-of-the-art multimodal reasoning capabilities.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
·9241 words·44 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
InternVL 2.5, a new open-source multimodal LLM, surpasses 70% on the MMMU benchmark, rivaling top commercial models through model, data, and test-time scaling strategies.
VisionZip: Longer is Better but Not Necessary in Vision Language Models
·7032 words·34 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 CUHK
VisionZip boosts vision-language model efficiency by intelligently selecting key visual tokens, achieving near-state-of-the-art performance with drastically reduced computational costs.
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
·4750 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GenMAC: Multi-agent collaboration revolutionizes compositional text-to-video generation, achieving state-of-the-art results by iteratively refining videos via specialized agents.
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
·3412 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Florence-VL enhances vision-language models by incorporating a generative vision encoder and a novel depth-breadth fusion architecture, achieving state-of-the-art results on various benchmarks.