Multimodal Learning
Interfacing Foundation Models' Embeddings
·2676 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 UW-Madison
FIND, a lightweight transformer interface, seamlessly aligns foundation models’ embeddings for unified image and dataset-level understanding, enabling generalizable, interleaved performance on segment…
InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction
·2177 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Illinois Urbana-Champaign
InterDreamer: Zero-shot text-guided 3D human-object interaction generation without paired data, achieved via decoupled semantic and dynamic modeling, using LLMs and a physics-based world model.
InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint
·2703 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong
InterControl: Zero-shot multi-person interaction generation by precisely controlling every joint using only single-person data.
Instruction-Guided Visual Masking
·3666 words·18 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Instruction-Guided Visual Masking (IVM) boosts multimodal instruction following by precisely focusing models on relevant image regions via visual masking, achieving state-of-the-art results on multipl…
InstructG2I: Synthesizing Images from Multimodal Attributed Graphs
·1973 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Illinois at Urbana-Champaign
INSTRUCTG2I, a novel graph context-conditioned diffusion model, generates images from multimodal attributed graphs, addressing challenges in graph size, dependencies, and controllability.
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
·6925 words·33 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Sorbonne University
Frozen LLMs surprisingly excel at multimodal tasks; this paper reveals that their success stems from an implicit multimodal alignment effect, paving the way for efficient LMMs.
Images that Sound: Composing Images and Sounds on a Single Canvas
·2562 words·13 mins·
Multimodal Learning
Multimodal Generation
🏢 University of Michigan
Researchers create ‘images that sound’ (visual spectrograms looking like natural images and sounding like natural audio) by cleverly composing pre-trained image and audio diffusion models in a shared la…
Identifiable Shared Component Analysis of Unpaired Multimodal Mixtures
·2736 words·13 mins·
AI Generated
Multimodal Learning
Cross-Modal Retrieval
🏢 Oregon State University
Unaligned multimodal mixtures’ shared components are identifiable under mild conditions using a distribution-matching approach, relaxing assumptions of existing methods.
HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid
·2769 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
A humanoid robot learns to rearrange objects using vision and language instructions, achieving remarkable success on diverse tasks in a novel dataset.
How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval
·3595 words·17 mins·
Multimodal Learning
Cross-Modal Retrieval
🏢 University of Toronto
MolPhenix, a novel multi-modal model, drastically improves zero-shot molecular retrieval by leveraging a pre-trained phenomics model and a novel similarity-aware loss, achieving an 8.1x improvement ov…
How Control Information Influences Multilingual Text Image Generation and Editing?
·2075 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
TextGen enhances multilingual visual text generation and editing by optimizing control information using Fourier analysis and a two-stage framework, achieving state-of-the-art results.
Homology Consistency Constrained Efficient Tuning for Vision-Language Models
·1675 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
Constraining vision-language model tuning via persistent homology ensures consistent image-text alignment, improving few-shot learning and domain generalization.
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
·2062 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 ECE & IPAI, Seoul National University
This paper introduces HVFA, a novel OCR-free document understanding framework using MLLMs and multi-scale visual features, achieving superior performance across various document understanding tasks.
Harmonizing Visual Text Comprehension and Generation
·2525 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 East China Normal University
TextHarmony, a unified multimodal model, harmonizes visual text comprehension & generation, achieving improved performance across benchmarks with minimal parameter increase.
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts
·3130 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GuardT2I: A novel framework defends text-to-image models against adversarial prompts by translating latent guidance embeddings into natural language, enabling effective adversarial prompt detection wi…
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation
·2589 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 KAIST
GrounDiT: Training-free spatial grounding for text-to-image generation using Diffusion Transformers and a novel noisy patch transplantation technique for precise object placement.
Grounding Multimodal Large Language Models in Actions
·3629 words·18 mins·
AI Generated
Multimodal Learning
Embodied AI
🏢 Apple
Researchers unveil a unified architecture for grounding multimodal large language models in actions, showing superior performance with learned tokenization for continuous actions and semantic alignment…
Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models
·2293 words·11 mins·
Representation Learning
Multimodal Learning
🏢 Ningbo Institute of Digital Twin, Eastern Institute of Technology
GEM, a novel framework, uses a bidirectional graph and MLLMs to achieve fine-grained, relation-aware disentanglement in unsupervised representation learning, surpassing existing methods.
GOMAA-Geo: GOal Modality Agnostic Active Geo-localization
·3664 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 Department of Computer Science and Engineering, Washington University in St. Louis
GOMAA-Geo, a novel framework, enables efficient and accurate goal localization using aerial imagery, regardless of goal description modality (text or images), demonstrating impressive zero-shot genera…
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
·2396 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
GITA, a novel framework, integrates visual graphs into language models for superior vision-language graph reasoning, outperforming existing LLMs and introducing the first vision-language dataset, GVLQ…