Vision-Language Models
VideoTetris: Towards Compositional Text-to-Video Generation
·2282 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
VideoTetris: a novel framework enabling compositional text-to-video generation by precisely following complex textual semantics through spatio-temporal compositional diffusion, achieving impressive qu…
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
·2766 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
VideoLLM-MoD boosts online video-language model efficiency by selectively skipping redundant vision token computations, achieving ~42% faster training and ~30% memory savings without sacrificing perfo…
Unveiling the Tapestry of Consistency in Large Vision-Language Models
·2665 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
ConBench: Unveiling Inconsistency in Large Vision-Language Models
Unveiling Encoder-Free Vision-Language Models
·2435 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
EVE, a groundbreaking encoder-free vision-language model, rivals encoder-based counterparts using a fraction of the data and resources, demonstrating efficient, transparent training for pure decoder-o…
UNIT: Unifying Image and Text Recognition in One Vision Encoder
·1581 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 Huawei Noah's Ark Lab
UNIT: One Vision Encoder Unifies Image & Text Recognition!
Unified Lexical Representation for Interpretable Visual-Language Alignment
·1730 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
LexVLA: A novel visual-language alignment framework learns unified lexical representations for improved interpretability and efficient cross-modal retrieval.
Unified Generative and Discriminative Training for Multi-modal Large Language Models
·3972 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
Unified generative-discriminative training boosts multimodal large language models (MLLMs)! Sugar, a novel approach, leverages dynamic sequence alignment and a triple kernel to enhance global and fin…
UniAR: A Unified model for predicting human Attention and Responses on visual content
·2440 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Google Research
UniAR: A unified model predicts human attention and preferences across diverse visual content (images, webpages, designs), achieving state-of-the-art performance and enabling human-centric improvement…
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
·2974 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Uni-Med, a novel unified medical foundation model, tackles multi-task learning challenges by using Connector-MoE to efficiently bridge modalities, achieving competitive performance across six medical …
Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem
·2181 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Princeton University
Vision-language models struggle with multi-object reasoning due to the binding problem; this paper reveals human-like capacity limits in VLMs and proposes solutions.
UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models
·2814 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Institute of Computing Technology, Chinese Academy of Sciences
UMFC: Unsupervised Multi-domain Feature Calibration improves vision-language model transferability by mitigating inherent model biases via a novel, training-free feature calibration method.
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
·3187 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Arizona State University
TripletCLIP boosts CLIP’s compositional reasoning by cleverly generating synthetic hard negative image-text pairs, achieving over 9% absolute improvement on SugarCrepe.
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
·1922 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
TransAgent empowers vision-language models by collaborating with diverse expert agents, achieving state-of-the-art performance in low-shot visual recognition.
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
·1784 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
VL-SAM: Training-free open-ended object detection & segmentation using attention maps as prompts, surpassing previous methods on LVIS and CODA datasets.
TPR: Topology-Preserving Reservoirs for Generalized Zero-Shot Learning
·2613 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Xi'an Jiaotong University
Topology-Preserving Reservoirs (TPR) enhance CLIP’s zero-shot learning by using dual-space alignment and a topology-preserving objective to improve generalization to unseen classes, achieving state…
Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration
·1888 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
UniKE: A unified multimodal editing method achieves superior reliability, generality, and locality by disentangling knowledge into semantic and truthfulness spaces, enabling enhanced collaboration bet…
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
·3139 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Renmin University of China
Stable Diffusion’s text-to-image generation is sped up by 25% by removing text guidance after the initial shape generation, revealing that the [EOS] token is key to early-stage image construction.
Towards Safe Concept Transfer of Multi-Modal Diffusion via Causal Representation Editing
·3866 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
Causal Representation Editing (CRE) improves safe image generation by precisely removing unsafe concepts from diffusion models, enhancing efficiency and flexibility.
Towards Calibrated Robust Fine-Tuning of Vision-Language Models
·3938 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Wisconsin-Madison
Calibrated robust fine-tuning boosts vision-language model accuracy and confidence in out-of-distribution scenarios by using a constrained multimodal contrastive loss and self-distillation.
Toward Semantic Gaze Target Detection
·2529 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Idiap Research Institute
Researchers developed a novel architecture for semantic gaze target detection, achieving state-of-the-art results by simultaneously predicting gaze target localization and semantic label, surpassing e…