
Multimodal Learning

Interfacing Foundation Models' Embeddings
·2676 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 UW-Madison
FIND, a lightweight transformer interface, seamlessly aligns foundation models’ embeddings for unified image and dataset-level understanding, enabling generalizable, interleaved performance on segment…
InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction
·2177 words·11 mins
Multimodal Learning Vision-Language Models 🏢 University of Illinois Urbana-Champaign
InterDreamer: Zero-shot text-guided 3D human-object interaction generation without paired data, achieved via decoupled semantic and dynamic modeling, using LLMs and a physics-based world model.
InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint
·2703 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong
InterControl: Zero-shot multi-person interaction generation by precisely controlling every joint using only single-person data.
Instruction-Guided Visual Masking
·3666 words·18 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Instruction-Guided Visual Masking (IVM) boosts multimodal instruction following by precisely focusing models on relevant image regions via visual masking, achieving state-of-the-art results on multipl…
InstructG2I: Synthesizing Images from Multimodal Attributed Graphs
·1973 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Illinois at Urbana-Champaign
INSTRUCTG2I, a novel graph context-conditioned diffusion model, generates images from multimodal attributed graphs, addressing challenges in graph size, dependencies, and controllability.
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
·6925 words·33 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Sorbonne University
Frozen LLMs surprisingly excel at multimodal tasks; this paper reveals that their success stems from an implicit multimodal alignment effect, paving the way for efficient LMMs.
Images that Sound: Composing Images and Sounds on a Single Canvas
·2562 words·13 mins
Multimodal Learning Multimodal Generation 🏢 University of Michigan
Researchers create ‘images that sound’, visual spectrograms looking like natural images and sounding like natural audio, by cleverly composing pre-trained image and audio diffusion models in a shared la…
Identifiable Shared Component Analysis of Unpaired Multimodal Mixtures
·2736 words·13 mins
AI Generated Multimodal Learning Cross-Modal Retrieval 🏢 Oregon State University
Unaligned multimodal mixtures’ shared components are identifiable under mild conditions using a distribution-matching approach, relaxing assumptions of existing methods.
HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid
·2769 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
A humanoid robot learns to rearrange objects using vision and language instructions, achieving remarkable success on diverse tasks in a novel dataset.
How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval
·3595 words·17 mins
Multimodal Learning Cross-Modal Retrieval 🏢 University of Toronto
MolPhenix, a novel multi-modal model, drastically improves zero-shot molecular retrieval by leveraging a pre-trained phenomics model and a novel similarity-aware loss, achieving an 8.1x improvement ov…
How Control Information Influences Multilingual Text Image Generation and Editing?
·2075 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
TextGen enhances multilingual visual text generation and editing by optimizing control information using Fourier analysis and a two-stage framework, achieving state-of-the-art results.
Homology Consistency Constrained Efficient Tuning for Vision-Language Models
·1675 words·8 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
Constraining vision-language model tuning via persistent homology ensures consistent image-text alignment, improving few-shot learning and domain generalization.
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
·2062 words·10 mins
Multimodal Learning Vision-Language Models 🏢 ECE & 2IPAI, Seoul National University
This paper introduces HVFA, a novel OCR-free document understanding framework using MLLMs and multi-scale visual features, achieving superior performance across various document understanding tasks.
Harmonizing Visual Text Comprehension and Generation
·2525 words·12 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 East China Normal University
TextHarmony, a unified multimodal model, harmonizes visual text comprehension and generation, achieving improved performance across benchmarks with minimal parameter increase.
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts
·3130 words·15 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Tsinghua University
GuardT2I: A novel framework defends text-to-image models against adversarial prompts by translating latent guidance embeddings into natural language, enabling effective adversarial prompt detection wi…
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation
·2589 words·13 mins
Multimodal Learning Vision-Language Models 🏢 KAIST
GrounDiT: Training-free spatial grounding for text-to-image generation using Diffusion Transformers and a novel noisy patch transplantation technique for precise object placement.
Grounding Multimodal Large Language Models in Actions
·3629 words·18 mins
AI Generated Multimodal Learning Embodied AI 🏢 Apple
Researchers unveil a unified architecture for grounding multimodal large language models in actions, showing superior performance with learned tokenization for continuous actions and semantic alignment …
Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models
·2293 words·11 mins
Representation Learning Multimodal Learning 🏢 Ningbo Institute of Digital Twin, Eastern Institute of Technology
GEM, a novel framework, uses a bidirectional graph and MLLMs to achieve fine-grained, relation-aware disentanglement in unsupervised representation learning, surpassing existing methods.
GOMAA-Geo: GOal Modality Agnostic Active Geo-localization
·3664 words·18 mins
Multimodal Learning Vision-Language Models 🏢 Department of Computer Science and Engineering, Washington University in St. Louis
GOMAA-Geo, a novel framework, enables efficient and accurate goal localization using aerial imagery, regardless of goal description modality (text or images), demonstrating impressive zero-shot genera…
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
·2396 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
GITA, a novel framework, integrates visual graphs into language models for superior vision-language graph reasoning, outperforming existing LLMs and introducing the first vision-language dataset, GVLQ…