Multimodal Learning
GenRL: Multimodal-foundation world models for generalization in embodied agents
·2793 words·14 mins·
Multimodal Learning
Embodied AI
🏢 Ghent University
GenRL: Learn diverse embodied tasks from vision & language, without reward design, using multimodal imagination!
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
·2413 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GenArtist uses a multimodal large language model as an AI agent to unify image generation and editing, achieving state-of-the-art performance by decomposing complex tasks and leveraging a comprehensiv…
GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance
·2182 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 New York University Abu Dhabi
GAMap: Zero-shot object goal navigation excels by using multi-scale geometric-affordance guidance, significantly boosting robot success rates in unseen environments.
G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models
·2323 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 City University of Hong Kong
G3: A novel framework leverages Retrieval-Augmented Generation to achieve highly accurate worldwide image geolocalization, overcoming limitations of existing methods.
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training
·2099 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
G2D: A novel medical VLP framework achieves superior performance in medical image analysis by simultaneously learning global and dense visual features using image-text pairs without extra annotations.
FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
·2650 words·13 mins·
Multimodal Learning
Multimodal Understanding
🏢 Department of Computer Science, Johns Hopkins University
FuseMoE, a novel mixture-of-experts transformer, efficiently fuses diverse and incomplete multimodal data, achieving superior predictive performance via a unique Laplace gating function.
Frustratingly Easy Test-Time Adaptation of Vision-Language Models
·2379 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Trento
Boost VLM performance with ZERO: a simple, fast Test-Time Adaptation method requiring only a single forward pass and exceeding state-of-the-art accuracy!
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
·2541 words·12 mins·
AI Generated
Multimodal Learning
Audio-Visual Learning
🏢 Zhejiang University
FRIEREN: a novel video-to-audio generation network using rectified flow matching achieves state-of-the-art performance by improving audio quality, temporal alignment, and generation efficiency.
FlexCap: Describe Anything in Images in Controllable Detail
·2861 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
FlexCap generates controllable, region-specific image descriptions of varying lengths, achieving state-of-the-art zero-shot visual question answering.
Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
·1871 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 University of North Carolina at Chapel Hill
Flex-MoE: A novel framework flexibly handles arbitrary modality combinations in multimodal learning, even with missing data, achieving robust performance.
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models
·2833 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
FineStyle enables fine-grained controllable style personalization for text-to-image models using a novel concept-oriented data scaling and parameter-efficient adapter tuning, mitigating content leakag…
FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding
·2233 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
FineCLIP boosts fine-grained image understanding by combining real-time self-distillation with semantically rich regional contrastive learning, significantly outperforming existing methods.
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
·3674 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Berkeley
This paper presents a novel RL framework that fine-tunes large vision-language models (VLMs) to become effective decision-making agents. By incorporating chain-of-thought reasoning, the framework enab…
Few-Shot Adversarial Prompt Learning on Vision-Language Models
·3134 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Sydney AI Centre, University of Sydney
Few-shot adversarial prompt learning significantly improves vision-language model robustness by learning adversarially correlated text supervision and a novel training objective that enhances multi-mo…
Federated Learning from Vision-Language Foundation Models: Theoretical Analysis and Method
·1331 words·7 mins·
Multimodal Learning
Vision-Language Models
🏢 ShanghaiTech University
PromptFolio optimizes federated learning of vision-language models by combining global and local prompts, improving generalization and personalization, as proven theoretically and empirically.
Facilitating Multimodal Classification via Dynamically Learning Modality Gap
·1770 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanjing University of Science and Technology
Researchers dynamically integrate contrastive and supervised learning to overcome the modality imbalance problem in multimodal classification, significantly improving model performance.
EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
·2156 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 National University of Singapore
EZ-HOI: Efficient Zero-Shot HOI detection adapts Vision-Language Models (VLMs) for Human-Object Interaction (HOI) tasks using a novel prompt learning framework, achieving state-of-the-art performance …
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning
·2845 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Harvard University
Eye-gaze data boosts medical image-text alignment!
Extending Multi-modal Contrastive Representations
·2089 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
Ex-MCR: Efficiently build unified multi-modal representations by extending, not connecting, pre-trained spaces, achieving superior performance with less paired data and training.
Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following
·2661 words·13 mins·
Multimodal Learning
Embodied AI
🏢 Department of Computer Science and Engineering, Sungkyunkwan University
ExRAP: A novel framework boosts embodied AI’s continual instruction following by cleverly combining environment exploration with LLM-based planning, leading to significantly improved task success and …