
Vision-Language Models

Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning
·2845 words·14 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Harvard University
Eye-gaze data boosts medical image-text alignment!
Extending Multi-modal Contrastive Representations
·2089 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Zhejiang University
Ex-MCR efficiently builds unified multi-modal representations by extending, rather than connecting, pre-trained spaces, achieving superior performance with less paired data and training.
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
·2805 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Show Lab, National University of Singapore
EvolveDirector trains competitive text-to-image models using publicly available data by cleverly leveraging large vision-language models to curate and refine training datasets, dramatically reducing d…
Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor
·2050 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Researchers introduce object-aware backdoors in Vision-and-Language Navigation, enabling malicious behavior upon encountering specific objects, demonstrating the vulnerability of real-world AI agents.
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
·3278 words·16 mins
Natural Language Processing Vision-Language Models 🏢 University of Chinese Academy of Sciences
DEVIL: a novel text-to-video evaluation protocol focusing on video dynamics, resulting in more realistic video generation.
Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting
·1899 words·9 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
Frolic: A label-free framework boosts zero-shot vision model accuracy by learning prompt distributions and correcting label bias, achieving state-of-the-art performance across multiple datasets.
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
·2723 words·13 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
DEMO framework enhances text-to-video generation by decomposing text encoding and conditioning into content and motion components, resulting in videos with significantly improved motion dynamics.
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
·3514 words·17 mins
AI Generated Natural Language Processing Vision-Language Models 🏢 UC Los Angeles
Self-Training on Image Comprehension (STIC) significantly boosts Large Vision Language Model (LVLM) performance using unlabeled image data. STIC generates a preference dataset for image descriptions …
Easy Regional Contrastive Learning of Expressive Fashion Representations
·3119 words·15 mins
Multimodal Learning Vision-Language Models 🏢 University of Virginia
E2, a novel regional contrastive learning method, enhances vision-language models for expressive fashion representations by explicitly attending to fashion details with minimal additional parameters, …
Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models
·3018 words·15 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Dual Risk Minimization (DRM) improves fine-tuned zero-shot models’ robustness by combining empirical and worst-case risk minimization, using LLMs to identify core features, achieving state-of-the-art …
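As a rough illustration of the dual-risk idea in this summary, here is a minimal sketch under assumed details, not the paper's actual implementation: the fine-tuning loss can be read as an empirical risk on clean inputs plus a worst-case risk taken over a set of perturbed views; how those views are built from LLM-identified core features is an assumption here.

```python
# Hedged sketch: empirical risk + lambda * worst-case risk over perturbed views.
# The view-generation strategy and the weight lam are illustrative assumptions.
import torch
import torch.nn.functional as F

def dual_risk_loss(model, x, y, perturbed_views, lam=1.0):
    empirical = F.cross_entropy(model(x), y)            # standard ERM term
    worst = torch.stack(                                 # worst case over the views
        [F.cross_entropy(model(v), y) for v in perturbed_views]
    ).max()
    return empirical + lam * worst
```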
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
·2027 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Dual Prototype Evolving (DPE) significantly boosts vision-language model generalization by cumulatively learning multi-modal prototypes from unlabeled test data, outperforming current state-of-the-art…
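To make the idea of cumulatively learning prototypes from unlabeled test data concrete, a minimal sketch follows (assumed details, not the paper's exact recipe): class prototypes start from the text embeddings and are updated by an exponential moving average of confidently classified image features.

```python
# Hedged sketch of test-time prototype evolution; the EMA momentum, confidence
# threshold, and initialization from text embeddings are assumptions.
import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, text_embeddings, momentum=0.99):
        self.protos = F.normalize(text_embeddings.clone(), dim=-1)  # one row per class
        self.momentum = momentum

    @torch.no_grad()
    def classify_and_update(self, image_feat, conf_threshold=0.7):
        feat = F.normalize(image_feat, dim=-1)
        probs = (feat @ self.protos.T).softmax(dim=-1)
        conf, cls = probs.max(dim=-1)
        if conf >= conf_threshold:               # only evolve on confident samples
            self.protos[cls] = F.normalize(
                self.momentum * self.protos[cls] + (1 - self.momentum) * feat, dim=-1
            )
        return probs
```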
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
·3604 words·17 mins
Multimodal Learning Vision-Language Models 🏢 UC Berkeley
DigiRL trains robust in-the-wild device-control agents with autonomous offline-to-online RL, surpassing prior methods.
DiffusionPID: Interpreting Diffusion via Partial Information Decomposition
·5438 words·26 mins
Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
DiffusionPID unveils the secrets of text-to-image diffusion models by decomposing text prompts into unique, redundant, and synergistic components, providing insights into how individual words and thei…
DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion
·2444 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Zhejiang University
DiffPano generates scalable, consistent, and diverse panoramic images from text descriptions and camera poses using a novel spherical epipolar-aware diffusion model.
Dense Connector for MLLMs
·3198 words·16 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Boosting multimodal LLMs, the Dense Connector efficiently integrates multi-layer visual features for significantly enhanced performance.
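A small sketch of what integrating multi-layer visual features can look like in general (illustrative only; the layer choice and fusion used by the actual Dense Connector are not specified here): features from several ViT layers are concatenated along the channel dimension and passed through a single projector into the LLM embedding space.

```python
# Hedged sketch: concatenate features from multiple ViT layers, then project.
# The dimensions and the number of layers are illustrative assumptions.
import torch
import torch.nn as nn

class MultiLayerConnector(nn.Module):
    def __init__(self, vit_dim=1024, num_layers=3, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vit_dim * num_layers, llm_dim)

    def forward(self, layer_features):
        # layer_features: list of (batch, tokens, vit_dim) tensors from different layers
        fused = torch.cat(layer_features, dim=-1)
        return self.proj(fused)
```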
Déjà Vu Memorization in Vision–Language Models
·2200 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Meta
Vision-language models (VLMs) memorize training data, impacting generalization. This paper introduces ‘déjà vu memorization,’ a novel method measuring this, revealing significant memorization even in…
DeiSAM: Segment Anything with Deictic Prompting
·3865 words·19 mins
AI Generated Natural Language Processing Vision-Language Models 🏢 Technical University of Darmstadt
DeiSAM uses large language models and differentiable logic to achieve highly accurate image segmentation using complex, context-dependent descriptions.
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
·2495 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Microsoft Research
DeepStack: Stacking visual tokens boosts LMM efficiency and performance!
Deep Correlated Prompting for Visual Recognition with Missing Modalities
·1823 words·9 mins
Multimodal Learning Vision-Language Models 🏢 College of Intelligence and Computing, Tianjin University
Deep Correlated Prompting enhances large multimodal models’ robustness against missing data by leveraging inter-layer and cross-modality correlations in prompts, achieving superior performance with mi…
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
·2855 words·14 mins
Computer Vision Vision-Language Models 🏢 University of Maryland, College Park
This paper presents a general framework for interpreting Vision Transformer (ViT) components, mapping their contributions to CLIP space for textual interpretation, and introduces a scoring function fo…