Vision-Language Models
Harmonizing Visual Text Comprehension and Generation
·2525 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 East China Normal University
TextHarmony: a unified multimodal model that harmonizes visual text comprehension and generation, improving performance across benchmarks with a minimal increase in parameters.
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts
·3130 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GuardT2I: A novel framework defends text-to-image models against adversarial prompts by translating latent guidance embeddings into natural language, enabling effective adversarial prompt detection wi…
GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation
·2589 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 KAIST
GrounDiT: Training-free spatial grounding for text-to-image generation using Diffusion Transformers and a novel noisy patch transplantation technique for precise object placement.
GOMAA-Geo: GOal Modality Agnostic Active Geo-localization
·3664 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 Department of Computer Science and Engineering, Washington University in St. Louis
GOMAA-Geo, a novel framework, enables efficient and accurate goal localization using aerial imagery, regardless of goal description modality (text or images), demonstrating impressive zero-shot generalization.
GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning
·2396 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
GITA, a novel framework, integrates visual graphs into language models for superior vision-language graph reasoning, outperforming existing LLMs and introducing GVLQA, the first vision-language graph reasoning dataset.
GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
·2413 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
GenArtist uses a multimodal large language model as an AI agent to unify image generation and editing, achieving state-of-the-art performance by decomposing complex tasks and leveraging a comprehensiv…
GAMap: Zero-Shot Object Goal Navigation with Multi-Scale Geometric-Affordance Guidance
·2182 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 New York University Abu Dhabi
GAMap: Zero-shot object goal navigation guided by multi-scale geometric-affordance cues, significantly boosting robot success rates in unseen environments.
G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models
·2323 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 City University of Hong Kong
G3: A novel framework leverages Retrieval-Augmented Generation to achieve highly accurate worldwide image geolocalization, overcoming limitations of existing methods.
G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training
·2099 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
G2D: a novel medical VLP framework achieves superior performance in medical image analysis by simultaneously learning global and dense visual features using image-text pairs without extra annotations.
Frustratingly Easy Test-Time Adaptation of Vision-Language Models
·2379 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Trento
Boost VLM performance with ZERO: a simple, fast Test-Time Adaptation method requiring only a single forward pass and exceeding state-of-the-art accuracy!
Flexible Context-Driven Sensory Processing in Dynamical Vision Models
·2040 words·10 mins·
Computer Vision
Vision-Language Models
🏢 MIT
Biologically-inspired DCnet neural network flexibly modulates visual processing based on context, outperforming existing models on visual search and attention tasks.
FlexCap: Describe Anything in Images in Controllable Detail
·2861 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
FlexCap generates controllable, region-specific image descriptions of varying lengths, achieving state-of-the-art zero-shot visual question answering.
Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts
·1871 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 University of North Carolina at Chapel Hill
Flex-MoE: A novel framework flexibly handles arbitrary modality combinations in multimodal learning, even with missing data, achieving robust performance.
FineStyle: Fine-grained Controllable Style Personalization for Text-to-image Models
·2833 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
FineStyle enables fine-grained controllable style personalization for text-to-image models using a novel concept-oriented data scaling and parameter-efficient adapter tuning, mitigating content leakage.
FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding
·2233 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Gaoling School of Artificial Intelligence, Renmin University of China
FineCLIP boosts fine-grained image understanding by combining real-time self-distillation with semantically rich regional contrastive learning, significantly outperforming existing methods.
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
·3674 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Berkeley
This paper presents a novel RL framework that fine-tunes large vision-language models (VLMs) to become effective decision-making agents. By incorporating chain-of-thought reasoning, the framework enab…
Few-Shot Adversarial Prompt Learning on Vision-Language Models
·3134 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Sydney AI Centre, University of Sydney
Few-shot adversarial prompt learning significantly improves vision-language model robustness by learning adversarially correlated text supervision and a novel training objective that enhances multi-mo…
Federated Learning from Vision-Language Foundation Models: Theoretical Analysis and Method
·1331 words·7 mins·
Multimodal Learning
Vision-Language Models
🏢 ShanghaiTech University
PromptFolio optimizes federated learning of vision-language models by combining global and local prompts, improving generalization and personalization, as proven theoretically and empirically.
Facilitating Multimodal Classification via Dynamically Learning Modality Gap
·1770 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanjing University of Science and Technology
Researchers dynamically integrate contrastive and supervised learning to overcome the modality imbalance problem in multimodal classification, significantly improving model performance.
EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection
·2156 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 National University of Singapore
EZ-HOI adapts Vision-Language Models (VLMs) to zero-shot Human-Object Interaction (HOI) detection using a novel guided prompt learning framework, achieving state-of-the-art performance.