Vision-Language Models
Harmonizing Visual Text Comprehension and Generation
·2525 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 East China Normal University
TextHarmony: a unified multimodal model that harmonizes visual text comprehension and generation, improving performance across benchmarks with a minimal increase in parameters.
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts
·3130 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GuardT2I: A novel framework defends text-to-image models against adversarial prompts by translating latent guidance embeddings into natural language, enabling effective adversarial prompt detection wi…
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation
·2589 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 KAIST
GrounDiT: Training-free spatial grounding for text-to-image generation using Diffusion Transformers and a novel noisy patch transplantation technique for precise object placement.
GOMAA-Geo: GOal Modality Agnostic Active Geo-localization
·3664 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 Department of Computer Science and Engineering, Washington University in St. Louis
GOMAA-Geo, a novel framework, enables efficient and accurate goal localization using aerial imagery, regardless of goal description modality (text or images), demonstrating impressive zero-shot generalization.
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
·2396 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
GITA, a novel framework, integrates visual graphs into language models for superior vision-language graph reasoning, outperforming existing LLMs and introducing GVLQA, the first vision-language graph reasoning dataset.
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
·2413 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GenArtist uses a multimodal large language model as an AI agent to unify image generation and editing, achieving state-of-the-art performance by decomposing complex tasks and leveraging a comprehensiv…
GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance
·2182 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 New York University Abu Dhabi
GAMap: Zero-shot object goal navigation guided by multi-scale geometric-affordance cues, significantly boosting robot success rates in unseen environments.
G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models
·2323 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 City University of Hong Kong
G3: A novel framework leverages Retrieval-Augmented Generation to achieve highly accurate worldwide image geolocalization, overcoming limitations of existing methods.
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training
·2099 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
G2D: a novel medical VLP framework achieves superior performance in medical image analysis by simultaneously learning global and dense visual features using image-text pairs without extra annotations.
Frustratingly Easy Test-Time Adaptation of Vision-Language Models
·2379 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Trento
Boost VLM performance with ZERO: a simple, fast Test-Time Adaptation method requiring only a single forward pass and exceeding state-of-the-art accuracy!
Flexible Context-Driven Sensory Processing in Dynamical Vision Models
·2040 words·10 mins·
Computer Vision
Vision-Language Models
🏢 MIT
Biologically-inspired DCnet neural network flexibly modulates visual processing based on context, outperforming existing models on visual search and attention tasks.
FlexCap: Describe Anything in Images in Controllable Detail
·2861 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
FlexCap generates controllable, region-specific image descriptions of varying lengths, achieving state-of-the-art zero-shot visual question answering.
Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
·1871 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 University of North Carolina at Chapel Hill
Flex-MoE: A novel framework flexibly handles arbitrary modality combinations in multimodal learning, even with missing data, achieving robust performance.
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models
·2833 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
FineStyle enables fine-grained controllable style personalization for text-to-image models using a novel concept-oriented data scaling and parameter-efficient adapter tuning, mitigating content leakage.
FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding
·2233 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
FineCLIP boosts fine-grained image understanding by combining real-time self-distillation with semantically rich regional contrastive learning, significantly outperforming existing methods.
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
·3674 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Berkeley
This paper presents a novel RL framework that fine-tunes large vision-language models (VLMs) to become effective decision-making agents. By incorporating chain-of-thought reasoning, the framework enab…
Few-Shot Adversarial Prompt Learning on Vision-Language Models
·3134 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Sydney AI Centre, University of Sydney
Few-shot adversarial prompt learning significantly improves vision-language model robustness by learning adversarially correlated text supervision and a novel training objective that enhances multi-mo…
Federated Learning from Vision-Language Foundation Models: Theoretical Analysis and Method
·1331 words·7 mins·
Multimodal Learning
Vision-Language Models
🏢 ShanghaiTech University
PromptFolio optimizes federated learning of vision-language models by combining global and local prompts, improving generalization and personalization, as proven theoretically and empirically.
Facilitating Multimodal Classification via Dynamically Learning Modality Gap
·1770 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanjing University of Science and Technology
Researchers dynamically integrate contrastive and supervised learning to overcome the modality imbalance problem in multimodal classification, significantly improving model performance.
EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
·2156 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 National University of Singapore
EZ-HOI adapts Vision-Language Models (VLMs) to zero-shot Human-Object Interaction (HOI) detection using a novel guided prompt learning framework, achieving state-of-the-art performance.