Multimodal Learning
VideoTetris: Towards Compositional Text-to-Video Generation
·2282 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
VideoTetris: a novel framework enabling compositional text-to-video generation by precisely following complex textual semantics through spatio-temporal compositional diffusion, achieving impressive qu…
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
·2766 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
VideoLLM-MoD boosts online video-language model efficiency by selectively skipping redundant vision token computations, achieving ~42% faster training and ~30% memory savings without sacrificing perfo…
VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
·2010 words·10 mins·
Multimodal Learning
Human-AI Interaction
🏢 Microsoft Research
VASA-1: Real-time, lifelike talking faces generated from a single image and audio!
Unveiling the Tapestry of Consistency in Large Vision-Language Models
·2665 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
ConBench: Unveiling Inconsistency in Large Vision-Language Models
Unveiling Encoder-Free Vision-Language Models
·2435 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
EVE, a groundbreaking encoder-free vision-language model, rivals encoder-based counterparts using a fraction of the data and resources, demonstrating efficient, transparent training for pure decoder-o…
Unity by Diversity: Improved Representation Learning for Multimodal VAEs
·3037 words·15 mins·
Multimodal Learning
Multimodal Generation
🏢 ETH Zurich
MMVM VAE enhances multimodal data analysis by using a soft constraint to guide each modality’s latent representation toward a shared aggregate, improving latent representation learning and missing dat…
UNIT: Unifying Image and Text Recognition in One Vision Encoder
·1581 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 Huawei Noah's Ark Lab
UNIT: One Vision Encoder Unifies Image & Text Recognition!
Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
·2851 words·14 mins·
Multimodal Learning
Audio-Visual Learning
🏢 Imperial College
One model to rule them all! This paper introduces Unified Speech Recognition (USR), a single model trained for auditory, visual, and audiovisual speech recognition, achieving state-of-the-art results …
Unified Lexical Representation for Interpretable Visual-Language Alignment
·1730 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
LexVLA: a novel visual-language alignment framework that learns unified lexical representations for improved interpretability and efficient cross-modal retrieval.
Unified Generative and Discriminative Training for Multi-modal Large Language Models
·3972 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
Unified generative-discriminative training boosts multimodal large language models (MLLMs)! Sugar, a novel approach, leverages dynamic sequence alignment and a triple kernel to enhance global and fin…
UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner
·2866 words·14 mins·
AI Generated
Multimodal Learning
Audio-Visual Learning
🏢 Tsinghua University
UniAudio 1.5 uses a novel LLM-driven audio codec to enable frozen LLMs to perform various audio tasks with just a few examples, opening new avenues for efficient few-shot cross-modal learning.
UniAR: A Unified model for predicting human Attention and Responses on visual content
·2440 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Google Research
UniAR: A unified model predicts human attention and preferences across diverse visual content (images, webpages, designs), achieving state-of-the-art performance and enabling human-centric improvement…
Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
·2974 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Uni-Med, a novel unified medical foundation model, tackles multi-task learning challenges by using Connector-MoE to efficiently bridge modalities, achieving competitive performance across six medical …
Understanding the Limits of Vision Language Models Through the Lens of the Binding Problem
·2181 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Princeton University
Vision-language models struggle with multi-object reasoning due to the binding problem; this paper reveals human-like capacity limits in VLMs and proposes solutions.
UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models
·2814 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Institute of Computing Technology, Chinese Academy of Sciences
UMFC: Unsupervised Multi-domain Feature Calibration improves vision-language model transferability by mitigating inherent model biases via a novel, training-free feature calibration method.
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
·3187 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Arizona State University
TripletCLIP boosts CLIP’s compositional reasoning by cleverly generating synthetic hard negative image-text pairs, achieving over 9% absolute improvement on SugarCrepe.
TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
·1922 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
TransAgent empowers vision-language models by collaborating with diverse expert agents, achieving state-of-the-art performance in low-shot visual recognition.
Trajectory Diffusion for ObjectGoal Navigation
·2125 words·10 mins·
Multimodal Learning
Embodied AI
🏢 University of Chinese Academy of Sciences
Trajectory Diffusion (T-Diff) significantly improves object goal navigation by learning sequential planning through trajectory diffusion, resulting in more accurate and efficient navigation.
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
·1784 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
VL-SAM: Training-free open-ended object detection & segmentation using attention maps as prompts, surpassing previous methods on LVIS and CODA datasets.
TPR: Topology-Preserving Reservoirs for Generalized Zero-Shot Learning
·2613 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Xi'an Jiaotong University
Topology-Preserving Reservoirs (TPR) enhance CLIP’s zero-shot learning by using a dual-space alignment and a topology-preserving objective to improve generalization to unseen classes, achieving state…