Vision-Language Models
VideoTetris: Towards Compositional Text-to-Video Generation
·2282 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
VideoTetris: a novel framework enabling compositional text-to-video generation by precisely following complex textual semantics through spatio-temporal compositional diffusion, achieving impressive qu…
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
·2766 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
VideoLLM-MoD boosts online video-language model efficiency by selectively skipping redundant vision token computations, achieving ~42% faster training and ~30% memory savings without sacrificing perfo…
Unveiling the Tapestry of Consistency in Large Vision-Language Models
·2665 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
ConBench: Unveiling Inconsistency in Large Vision-Language Models
Unveiling Encoder-Free Vision-Language Models
·2435 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
EVE, a groundbreaking encoder-free vision-language model, rivals encoder-based counterparts using a fraction of the data and resources, demonstrating efficient, transparent training for pure decoder-o…
UNIT: Unifying Image and Text Recognition in One Vision Encoder
·1581 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 Huawei Noah's Ark Lab
UNIT: One Vision Encoder Unifies Image & Text Recognition!
Unified Lexical Representation for Interpretable Visual-Language Alignment
·1730 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
LexVLA: A novel visual-language alignment framework learns unified lexical representations for improved interpretability and efficient cross-modal retrieval.
Unified Generative and Discriminative Training for Multi-modal Large Language Models
·3972 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
Unified generative-discriminative training boosts multimodal large language models (MLLMs)! Sugar, a novel approach, leverages dynamic sequence alignment and a triple kernel to enhance global and fin…
UniAR: A Unified model for predicting human Attention and Responses on visual content
·2440 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Google Research
UniAR: A unified model predicts human attention and preferences across diverse visual content (images, webpages, designs), achieving state-of-the-art performance and enabling human-centric improvement…
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
·2974 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Uni-Med, a novel unified medical foundation model, tackles multi-task learning challenges by using Connector-MoE to efficiently bridge modalities, achieving competitive performance across six medical …
Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem
·2181 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Princeton University
Vision-language models struggle with multi-object reasoning due to the binding problem; this paper reveals human-like capacity limits in VLMs and proposes solutions.
UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models
·2814 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Institute of Computing Technology, Chinese Academy of Sciences
UMFC: Unsupervised Multi-domain Feature Calibration improves vision-language model transferability by mitigating inherent model biases via a novel, training-free feature calibration method.
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
·3187 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Arizona State University
TripletCLIP boosts CLIP’s compositional reasoning by cleverly generating synthetic hard negative image-text pairs, achieving over 9% absolute improvement on SugarCrepe.
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
·1922 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
TransAgent empowers vision-language models by collaborating with diverse expert agents, achieving state-of-the-art performance in low-shot visual recognition.
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
·1784 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
VL-SAM: Training-free open-ended object detection & segmentation using attention maps as prompts, surpassing previous methods on LVIS and CODA datasets.
TPR: Topology-Preserving Reservoirs for Generalized Zero-Shot Learning
·2613 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Xi'an Jiaotong University
Topology-Preserving Reservoirs (TPR) enhance CLIP’s zero-shot learning by using dual-space alignment and a topology-preserving objective to improve generalization to unseen classes, achieving state…
Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration
·1888 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
UniKE: A unified multimodal editing method achieves superior reliability, generality, and locality by disentangling knowledge into semantic and truthfulness spaces, enabling enhanced collaboration bet…
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
·3139 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Renmin University of China
Stable Diffusion’s text-to-image generation is sped up by 25% by removing text guidance after the initial shape generation, revealing that the [EOS] token is key to early-stage image construction.
Towards Safe Concept Transfer of Multi-Modal Diffusion via Causal Representation Editing
·3866 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
Causal Representation Editing (CRE) improves safe image generation by precisely removing unsafe concepts from diffusion models, enhancing efficiency and flexibility.
Towards Calibrated Robust Fine-Tuning of Vision-Language Models
·3938 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Wisconsin-Madison
Calibrated robust fine-tuning boosts vision-language model accuracy and confidence in out-of-distribution scenarios by using a constrained multimodal contrastive loss and self-distillation.
Toward Semantic Gaze Target Detection
·2529 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Idiap Research Institute
Researchers developed a novel architecture for semantic gaze target detection, achieving state-of-the-art results by simultaneously predicting gaze target localization and semantic label, surpassing e…