Multimodal Learning
Interfacing Foundation Models' Embeddings
·2676 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 UW-Madison
FIND, a lightweight transformer interface, seamlessly aligns foundation models’ embeddings for unified image and dataset-level understanding, enabling generalizable, interleaved performance on segment…
InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction
·2177 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Illinois Urbana-Champaign
InterDreamer: Zero-shot text-guided 3D human-object interaction generation without paired data, achieved via decoupled semantic and dynamic modeling, using LLMs and a physics-based world model.
InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint
·2703 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Chinese University of Hong Kong
InterControl: Zero-shot multi-person interaction generation by precisely controlling every joint using only single-person data.
Instruction-Guided Visual Masking
·3666 words·18 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Instruction-Guided Visual Masking (IVM) boosts multimodal instruction following by precisely focusing models on relevant image regions via visual masking, achieving state-of-the-art results on multipl…
InstructG2I: Synthesizing Images from Multimodal Attributed Graphs
·1973 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Illinois at Urbana-Champaign
INSTRUCTG2I, a novel graph context-conditioned diffusion model, generates images from multimodal attributed graphs, addressing challenges in graph size, dependencies, and controllability.
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
·6925 words·33 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Sorbonne University
Frozen LLMs surprisingly excel at multimodal tasks; this paper reveals that their success stems from an implicit multimodal alignment effect, paving the way for efficient LMMs.
Images that Sound: Composing Images and Sounds on a Single Canvas
·2562 words·13 mins·
Multimodal Learning
Multimodal Generation
🏢 University of Michigan
Researchers create ‘images that sound’ (visual spectrograms looking like natural images and sounding like natural audio) by cleverly composing pre-trained image and audio diffusion models in a shared la…
Identifiable Shared Component Analysis of Unpaired Multimodal Mixtures
·2736 words·13 mins·
AI Generated
Multimodal Learning
Cross-Modal Retrieval
🏢 Oregon State University
Unaligned multimodal mixtures’ shared components are identifiable under mild conditions using a distribution-matching approach, relaxing assumptions of existing methods.
HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid
·2769 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
A humanoid robot learns to rearrange objects using vision and language instructions, achieving remarkable success on diverse tasks in a novel dataset.
How Molecules Impact Cells: Unlocking Contrastive PhenoMolecular Retrieval
·3595 words·17 mins·
Multimodal Learning
Cross-Modal Retrieval
🏢 University of Toronto
MolPhenix, a novel multi-modal model, drastically improves zero-shot molecular retrieval by leveraging a pre-trained phenomics model and a novel similarity-aware loss, achieving an 8.1x improvement ov…
How Control Information Influences Multilingual Text Image Generation and Editing?
·2075 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
TextGen enhances multilingual visual text generation and editing by optimizing control information using Fourier analysis and a two-stage framework, achieving state-of-the-art results.
Homology Consistency Constrained Efficient Tuning for Vision-Language Models
·1675 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
Constraining vision-language model tuning via persistent homology ensures consistent image-text alignment, improving few-shot learning and domain generalization.
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
·2062 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 ECE & IPAI, Seoul National University
This paper introduces HVFA, a novel OCR-free document understanding framework using MLLMs and multi-scale visual features, achieving superior performance across various document understanding tasks.
Harmonizing Visual Text Comprehension and Generation
·2525 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 East China Normal University
TextHarmony, a unified multimodal model, harmonizes visual text comprehension & generation, achieving improved performance across benchmarks with minimal parameter increase.
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts
·3130 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GuardT2I: A novel framework defends text-to-image models against adversarial prompts by translating latent guidance embeddings into natural language, enabling effective adversarial prompt detection wi…
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation
·2589 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 KAIST
GrounDiT: Training-free spatial grounding for text-to-image generation using Diffusion Transformers and a novel noisy patch transplantation technique for precise object placement.
Grounding Multimodal Large Language Models in Actions
·3629 words·18 mins·
AI Generated
Multimodal Learning
Embodied AI
🏢 Apple
Researchers unveil a unified architecture for grounding multimodal large language models in actions, showing superior performance with learned tokenization for continuous actions and semantic alignment…
Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models
·2293 words·11 mins·
Representation Learning
Multimodal Learning
🏢 Ningbo Institute of Digital Twin, Eastern Institute of Technology
GEM, a novel framework, uses a bidirectional graph and MLLMs to achieve fine-grained, relation-aware disentanglement in unsupervised representation learning, surpassing existing methods.
GOMAA-Geo: GOal Modality Agnostic Active Geo-localization
·3664 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 Department of Computer Science and Engineering, Washington University in St. Louis
GOMAA-Geo, a novel framework, enables efficient and accurate goal localization using aerial imagery, regardless of goal description modality (text or images), demonstrating impressive zero-shot genera…
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
·2396 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
GITA, a novel framework, integrates visual graphs into language models for superior vision-language graph reasoning, outperforming existing LLMs and introducing the first vision-language dataset, GVLQ…