Multimodal Learning

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
·3344 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 AIRI
3DGraphLLM boosts 3D scene understanding by cleverly merging semantic graphs and LLMs, enabling more accurate scene descriptions and outperforming existing methods.
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
·3633 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 Shanghai Jiao Tong University
PC Agent: While you sleep, AI works! This AI system uses human cognition transfer to perform complex digital tasks, exceeding the capabilities of existing digital agents by efficiently learning from human cognition.
Diving into Self-Evolving Training for Multimodal Reasoning
·3292 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Hong Kong University of Science and Technology
M-STAR: a novel self-evolving training framework that significantly boosts multimodal reasoning in large models without human annotation, achieving state-of-the-art results.
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
·1534 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Illinois Urbana-Champaign
MMAudio achieves state-of-the-art video-to-audio synthesis by jointly training on audio-visual and text-audio data, enabling high-quality, semantically and temporally aligned audio generation.
Progressive Multimodal Reasoning via Active Retrieval
·3576 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Gaoling School of Artificial Intelligence, Renmin University of China
AR-MCTS: a novel framework boosting multimodal large language model reasoning by actively retrieving key supporting evidence and using Monte Carlo Tree Search for improved path selection and verification.
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
·2604 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
MegaPairs synthesizes 26M+ high-quality multimodal retrieval training examples, enabling state-of-the-art zero-shot performance and surpassing existing methods trained on 70x more data.
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
·3592 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
CrossFlow: Directly evolve any modality to another using flow matching, achieving state-of-the-art results across various tasks!
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
·3553 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
LLaVA-UHD v2 enhances MLLMs by integrating high-resolution visual details using a hierarchical window transformer.
GUI Agents: A Survey
·360 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 University of Maryland
A comprehensive survey of GUI agents, categorizing benchmarks, architectures, training methods, and open challenges, providing a unified framework for researchers.
Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
·2901 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Enhance image captions significantly with DCE, a novel engine that leverages visual specialists to generate comprehensive, detailed descriptions surpassing LMM-generated and human-annotated captions.
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
·5510 words·26 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Beijing University of Posts and Telecommunications
New benchmark reveals how well AI understands and meets real-world human needs.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
·1887 words·9 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Meta GenAI
Apollo LMMs achieve SOTA on video understanding tasks by exploring and optimizing the design and training of video-LMMs.
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
·3840 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
SynerGen-VL: A simpler, more powerful unified MLLM for image understanding and generation.
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation
·5249 words·25 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Microsoft Research
OLA-VLM boosts multimodal LLMs’ visual understanding by distilling knowledge from specialized visual encoders into the LLM’s internal representations during pretraining, achieving significant performance gains.
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
·3107 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Edinburgh
VMB generates music from videos, images, and text, using description and retrieval bridges to improve quality and controllability.
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
·3111 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Lyra: a speech-centric framework for omni-cognition that achieves state-of-the-art results across various modalities while remaining highly efficient.
GenEx: Generating an Explorable World
·2719 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Embodied AI 🏢 Johns Hopkins University
GenEx generates explorable 3D worlds from a single image, enabling embodied AI agents to explore and learn.
Multimodal Latent Language Modeling with Next-Token Diffusion
·4442 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Microsoft Research
LatentLM: a novel multimodal model unifying discrete & continuous data via next-token diffusion, surpassing existing methods in performance & scalability across various tasks.
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
·2853 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Mohamed Bin Zayed University of Artificial Intelligence
BiMediX2, a bilingual medical expert LMM, excels in diverse medical modalities.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
·3931 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Huawei Noah's Ark Lab
ILLUME: A unified multi-modal LLM efficiently integrates visual understanding & generation, achieving competitive performance with significantly less data.