
Multimodal Learning

Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective
·2085 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Institute of Software Chinese Academy of Sciences
Vision-language model adaptation struggles with misalignment; this paper introduces Causality-Guided Semantic Decoupling and Classification (CDC) to mitigate this, boosting performance.
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
·3943 words·19 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of California San Diego
This paper presents Text-to-Video Human Evaluation (T2VHE), a new protocol for evaluating text-to-video models, improving reliability, reproducibility, and practicality.
ReplaceAnything3D: Text-Guided Object Replacement in 3D Scenes with Compositional Scene Representations
·4073 words·20 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University College London
ReplaceAnything3D (RAM3D) revolutionizes 3D scene editing with a text-guided, multi-view consistent approach for seamlessly replacing or adding 3D objects in complex scenes.
Renovating Names in Open-Vocabulary Segmentation Benchmarks
·4294 words·21 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Bosch IoC Lab
RENOVATE renovates open-vocabulary segmentation benchmarks by automatically improving class names, leading to stronger models and more accurate evaluations.
Referring Human Pose and Mask Estimation In the Wild
·2191 words·11 mins
Multimodal Learning Vision-Language Models 🏢 University of Western Australia
RefHuman: a new dataset and UniPHD model achieve state-of-the-art referring human pose and mask estimation in the wild, using text or positional prompts.
Referencing Where to Focus: Improving Visual Grounding with Referential Query
·2958 words·14 mins
Multimodal Learning Vision-Language Models 🏢 National Key Laboratory of Human-Machine Hybrid Augmented Intelligence
RefFormer boosts visual grounding accuracy by intelligently adapting queries using multi-level image features, effectively guiding the decoder towards the target object.
Recognize Any Regions
·2350 words·12 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of Surrey
RegionSpot efficiently integrates pretrained localization and vision-language models for superior open-world object recognition, achieving significant performance gains with minimal training.
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
·2441 words·12 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Stanford University
RaVL: a novel approach that accurately discovers and effectively mitigates spurious correlations in fine-tuned vision-language models, improving zero-shot classification accuracy.
QUEST: Quadruple Multimodal Contrastive Learning with Constraints and Self-Penalization
·1744 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Beihang University
QUEST: Quadruple Multimodal Contrastive Learning tackles feature suppression by using quaternion embedding to extract unique information while penalizing excessive influence from shared information, achieving…
Q-VLM: Post-training Quantization for Large Vision-Language Models
·2070 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Q-VLM: A novel post-training quantization framework significantly compresses large vision-language models, boosting inference speed without sacrificing accuracy.
Propensity Score Alignment of Unpaired Multimodal Data
·2058 words·10 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of British Columbia
Unlocking multimodal learning’s potential with propensity scores: this novel approach aligns unpaired data across modalities, significantly improving representation learning.
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
·2063 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Strasbourg
PeskaVLP: Hierarchical knowledge augmentation boosts surgical video-language pretraining!
Probabilistic Conformal Distillation for Enhancing Missing Modality Robustness
·3353 words·16 mins
AI Generated Multimodal Learning Multimodal Understanding 🏢 Shanghai Jiao Tong University
Enhance multimodal model robustness against missing data with Probabilistic Conformal Distillation (PCD)! PCD models missing modalities probabilistically, achieving superior performance on multiple benchmarks.
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
·4267 words·21 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Nanjing University
Prism: a novel framework disentangles perception and reasoning in Vision-Language Models (VLMs) for improved model assessment and efficient VLM development.
Preventing Model Collapse in Deep Canonical Correlation Analysis by Noise Regularization
·2437 words·12 mins
Multimodal Learning Representation Learning 🏢 Hong Kong Polytechnic University
Noise Regularization rescues Deep Canonical Correlation Analysis from model collapse!
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
·2633 words·13 mins
Multimodal Learning Vision-Language Models 🏢 University of Oxford
Pre-trained text-to-image diffusion models create highly effective, versatile representations for embodied AI control, surpassing previous methods.
PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation
·3082 words·15 mins
Multimodal Learning Vision-Language Models 🏢 Sun Yat-Sen University
PIVOT-R, a novel primitive-driven waypoint-aware world model, significantly boosts robotic manipulation performance and efficiency via an asynchronous hierarchical executor.
PERIA: Perceive, Reason, Imagine, Act via Holistic Language and Vision Planning for Manipulation
·2974 words·14 mins
Multimodal Learning Vision-Language Models 🏢 College of Intelligence and Computing, Tianjin University
PERIA: Holistic language & vision planning for complex robotic manipulation!
ParallelEdits: Efficient Multi-Aspect Text-Driven Image Editing with Attention Grouping
·3634 words·18 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University at Buffalo
ParallelEdits efficiently edits multiple image aspects simultaneously, guided by text prompts, surpassing sequential methods in speed and accuracy.
Pandora's Box: Towards Building Universal Attackers against Real-World Large Vision-Language Models
·2651 words·13 mins
Multimodal Learning Vision-Language Models 🏢 Peking University
Researchers developed a universal adversarial patch to fool real-world large vision-language models (LVLMs) across multiple tasks, without needing access to internal model details.