
Vision-Language Models

Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning
·2845 words·14 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Harvard University
Eye-gaze data boosts medical image-text alignment!
Extending Multi-modal Contrastive Representations
·2089 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Zhejiang University
Ex-MCR efficiently builds unified multi-modal representations by extending, rather than connecting, pre-trained spaces, achieving superior performance with less paired data and training.
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
·2805 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Show Lab, National University of Singapore
EvolveDirector trains competitive text-to-image models using publicly available data by cleverly leveraging large vision-language models to curate and refine training datasets, dramatically reducing d…
Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor
·2050 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Researchers introduce object-aware backdoors in Vision-and-Language Navigation, enabling malicious behavior upon encountering specific objects, demonstrating the vulnerability of real-world AI agents.
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
·3278 words·16 mins
Natural Language Processing Vision-Language Models 🏢 University of Chinese Academy of Sciences
DEVIL: a novel text-to-video evaluation protocol focusing on video dynamics, resulting in more realistic video generation.
Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting
·1899 words·9 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
Frolic: A label-free framework boosts zero-shot vision model accuracy by learning prompt distributions and correcting label bias, achieving state-of-the-art performance across multiple datasets.
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
·2723 words·13 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
DEMO framework enhances text-to-video generation by decomposing text encoding and conditioning into content and motion components, resulting in videos with significantly improved motion dynamics.
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
·3514 words·17 mins
AI Generated Natural Language Processing Vision-Language Models 🏢 UC Los Angeles
Self-Training on Image Comprehension (STIC) significantly boosts Large Vision Language Model (LVLM) performance using unlabeled image data. STIC generates a preference dataset for image descriptions …
Easy Regional Contrastive Learning of Expressive Fashion Representations
·3119 words·15 mins
Multimodal Learning Vision-Language Models 🏢 University of Virginia
E2, a novel regional contrastive learning method, enhances vision-language models for expressive fashion representations by explicitly attending to fashion details with minimal additional parameters, …
Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models
·3018 words·15 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Dual Risk Minimization (DRM) improves fine-tuned zero-shot models’ robustness by combining empirical and worst-case risk minimization, using LLMs to identify core features, achieving state-of-the-art …
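As a rough illustration of the dual-risk idea in this summary, here is a minimal sketch under assumed details, not the paper's actual implementation: the fine-tuning loss can be read as an empirical risk on clean inputs plus a worst-case risk taken over a set of perturbed views; how those views are built from LLM-identified core features is an assumption here.

```python
# Hedged sketch: empirical risk + lambda * worst-case risk over perturbed views.
# The view-generation strategy and the weight lam are illustrative assumptions.
import torch
import torch.nn.functional as F

def dual_risk_loss(model, x, y, perturbed_views, lam=1.0):
    empirical = F.cross_entropy(model(x), y)            # standard ERM term
    worst = torch.stack(                                 # worst case over the views
        [F.cross_entropy(model(v), y) for v in perturbed_views]
    ).max()
    return empirical + lam * worst
```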
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
·2027 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Dual Prototype Evolving (DPE) significantly boosts vision-language model generalization by cumulatively learning multi-modal prototypes from unlabeled test data, outperforming current state-of-the-art…
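To make the idea of cumulatively learning prototypes from unlabeled test data concrete, a minimal sketch follows (assumed details, not the paper's exact recipe): class prototypes start from the text embeddings and are updated by an exponential moving average of confidently classified image features.

```python
# Hedged sketch of test-time prototype evolution; the EMA momentum, confidence
# threshold, and initialization from text embeddings are assumptions.
import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, text_embeddings, momentum=0.99):
        self.protos = F.normalize(text_embeddings.clone(), dim=-1)  # one row per class
        self.momentum = momentum

    @torch.no_grad()
    def classify_and_update(self, image_feat, conf_threshold=0.7):
        feat = F.normalize(image_feat, dim=-1)
        probs = (feat @ self.protos.T).softmax(dim=-1)
        conf, cls = probs.max(dim=-1)
        if conf >= conf_threshold:               # only evolve on confident samples
            self.protos[cls] = F.normalize(
                self.momentum * self.protos[cls] + (1 - self.momentum) * feat, dim=-1
            )
        return probs
```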
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
·3604 words·17 mins
Multimodal Learning Vision-Language Models 🏢 UC Berkeley
DigiRL trains robust in-the-wild device-control agents with autonomous offline-to-online RL, surpassing prior methods.
DiffusionPID: Interpreting Diffusion via Partial Information Decomposition
·5438 words·26 mins
Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
DiffusionPID unveils the secrets of text-to-image diffusion models by decomposing text prompts into unique, redundant, and synergistic components, providing insights into how individual words and thei…
DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion
·2444 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Zhejiang University
DiffPano generates scalable, consistent, and diverse panoramic images from text descriptions and camera poses using a novel spherical epipolar-aware diffusion model.
Dense Connector for MLLMs
·3198 words·16 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Boosting multimodal LLMs, the Dense Connector efficiently integrates multi-layer visual features for significantly enhanced performance.
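A small sketch of what integrating multi-layer visual features can look like in general (illustrative only; the layer choice and fusion used by the actual Dense Connector are not specified here): features from several ViT layers are concatenated along the channel dimension and passed through a single projector into the LLM embedding space.

```python
# Hedged sketch: concatenate features from multiple ViT layers, then project.
# The dimensions and the number of layers are illustrative assumptions.
import torch
import torch.nn as nn

class MultiLayerConnector(nn.Module):
    def __init__(self, vit_dim=1024, num_layers=3, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vit_dim * num_layers, llm_dim)

    def forward(self, layer_features):
        # layer_features: list of (batch, tokens, vit_dim) tensors from different layers
        fused = torch.cat(layer_features, dim=-1)
        return self.proj(fused)
```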
Déjà Vu Memorization in Vision–Language Models
·2200 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Meta
Vision-language models (VLMs) memorize training data, impacting generalization. This paper introduces ‘déjà vu memorization,’ a novel method measuring this, revealing significant memorization even in…
DeiSAM: Segment Anything with Deictic Prompting
·3865 words·19 mins
AI Generated Natural Language Processing Vision-Language Models 🏢 Technical University of Darmstadt
DeiSAM uses large language models and differentiable logic to achieve highly accurate image segmentation using complex, context-dependent descriptions.
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
·2495 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Microsoft Research
DeepStack: Stacking visual tokens boosts LMM efficiency and performance!
Deep Correlated Prompting for Visual Recognition with Missing Modalities
·1823 words·9 mins
Multimodal Learning Vision-Language Models 🏢 College of Intelligence and Computing, Tianjin University
Deep Correlated Prompting enhances large multimodal models’ robustness against missing data by leveraging inter-layer and cross-modality correlations in prompts, achieving superior performance with mi…
Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
·2855 words·14 mins
Computer Vision Vision-Language Models 🏢 University of Maryland, College Park
This paper presents a general framework for interpreting Vision Transformer (ViT) components, mapping their contributions to CLIP space for textual interpretation, and introduces a scoring function fo…