Multimodal Learning
GenRL: Multimodal-foundation world models for generalization in embodied agents
·2793 words·14 mins·
Multimodal Learning
Embodied AI
🏢 Ghent University
GenRL: Learn diverse embodied tasks from vision & language, without reward design, using multimodal imagination!
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
·2413 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GenArtist uses a multimodal large language model as an AI agent to unify image generation and editing, achieving state-of-the-art performance by decomposing complex tasks and leveraging a comprehensiv…
GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance
·2182 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 New York University Abu Dhabi
GAMap: Zero-shot object goal navigation excels by using multi-scale geometric-affordance guidance, significantly boosting robot success rates in unseen environments.
G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models
·2323 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 City University of Hong Kong
G3: A novel framework leverages Retrieval-Augmented Generation to achieve highly accurate worldwide image geolocalization, overcoming limitations of existing methods.
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training
·2099 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
G2D: A novel medical VLP framework achieves superior performance in medical image analysis by simultaneously learning global and dense visual features using image-text pairs without extra annotations.
FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion
·2650 words·13 mins·
Multimodal Learning
Multimodal Understanding
🏢 Department of Computer Science, Johns Hopkins University
FuseMoE, a novel mixture-of-experts transformer, efficiently fuses diverse and incomplete multimodal data, achieving superior predictive performance via a unique Laplace gating function.
Frustratingly Easy Test-Time Adaptation of Vision-Language Models
·2379 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Trento
Boost VLM performance with ZERO: a simple, fast Test-Time Adaptation method requiring only a single forward pass and exceeding state-of-the-art accuracy!
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
·2541 words·12 mins·
AI Generated
Multimodal Learning
Audio-Visual Learning
🏢 Zhejiang University
FRIEREN: a novel video-to-audio generation network using rectified flow matching achieves state-of-the-art performance by improving audio quality, temporal alignment, and generation efficiency.
FlexCap: Describe Anything in Images in Controllable Detail
·2861 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
FlexCap generates controllable, region-specific image descriptions of varying lengths, achieving state-of-the-art zero-shot visual question answering.
Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
·1871 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 University of North Carolina at Chapel Hill
Flex-MoE: A novel framework flexibly handles arbitrary modality combinations in multimodal learning, even with missing data, achieving robust performance.
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models
·2833 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
FineStyle enables fine-grained controllable style personalization for text-to-image models using a novel concept-oriented data scaling and parameter-efficient adapter tuning, mitigating content leakag…
FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding
·2233 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
FineCLIP boosts fine-grained image understanding by combining real-time self-distillation with semantically rich regional contrastive learning, significantly outperforming existing methods.
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
·3674 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Berkeley
This paper presents a novel RL framework that fine-tunes large vision-language models (VLMs) to become effective decision-making agents. By incorporating chain-of-thought reasoning, the framework enab…
Few-Shot Adversarial Prompt Learning on Vision-Language Models
·3134 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Sydney AI Centre, University of Sydney
Few-shot adversarial prompt learning significantly improves vision-language model robustness by learning adversarially correlated text supervision and a novel training objective that enhances multi-mo…
Federated Learning from Vision-Language Foundation Models: Theoretical Analysis and Method
·1331 words·7 mins·
Multimodal Learning
Vision-Language Models
🏢 ShanghaiTech University
PromptFolio optimizes federated learning of vision-language models by combining global and local prompts, improving generalization and personalization, as proven theoretically and empirically.
Facilitating Multimodal Classification via Dynamically Learning Modality Gap
·1770 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanjing University of Science and Technology
Researchers dynamically integrate contrastive and supervised learning to overcome the modality imbalance problem in multimodal classification, significantly improving model performance.
EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
·2156 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 National University of Singapore
EZ-HOI: Efficient Zero-Shot HOI detection adapts Vision-Language Models (VLMs) for Human-Object Interaction (HOI) tasks using a novel prompt learning framework, achieving state-of-the-art performance …
Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning
·2845 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Harvard University
Eye-gaze data boosts medical image-text alignment!
Extending Multi-modal Contrastive Representations
·2089 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
Ex-MCR: Efficiently build unified multi-modal representations by extending, not connecting, pre-trained spaces, achieving superior performance with less paired data and training.
Exploratory Retrieval-Augmented Planning For Continual Embodied Instruction Following
·2661 words·13 mins·
Multimodal Learning
Embodied AI
🏢 Department of Computer Science and Engineering, Sungkyunkwan University
ExRAP: A novel framework boosts embodied AI’s continual instruction following by cleverly combining environment exploration with LLM-based planning, leading to significantly improved task success and …