
Multimodal Learning

Rethinking Misalignment in Vision-Language Model Adaptation from a Causal Perspective
·2085 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Institute of Software Chinese Academy of Sciences
Vision-language model adaptation struggles with misalignment; this paper introduces Causality-Guided Semantic Decoupling and Classification (CDC) to mitigate this, boosting performance.
Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality
·3943 words·19 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of California San Diego
This paper presents Text-to-Video Human Evaluation (T2VHE), a new protocol for evaluating text-to-video models, improving reliability, reproducibility, and practicality.
ReplaceAnything3D: Text-Guided Object Replacement in 3D Scenes with Compositional Scene Representations
·4073 words·20 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University College London
ReplaceAnything3D (RAM3D) revolutionizes 3D scene editing with a text-guided, multi-view consistent approach for seamlessly replacing or adding 3D objects in complex scenes.
Renovating Names in Open-Vocabulary Segmentation Benchmarks
·4294 words·21 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Bosch IoC Lab
RENOVATE renovates open-vocabulary segmentation benchmarks by automatically improving class names, leading to stronger models and more accurate evaluations.
Referring Human Pose and Mask Estimation In the Wild
·2191 words·11 mins
Multimodal Learning Vision-Language Models 🏢 University of Western Australia
RefHuman: a new dataset and UniPHD model achieve state-of-the-art referring human pose and mask estimation in the wild, using text or positional prompts.
Referencing Where to Focus: Improving Visual Grounding with Referential Query
·2958 words·14 mins
Multimodal Learning Vision-Language Models 🏢 National Key Laboratory of Human-Machine Hybrid Augmented Intelligence
RefFormer boosts visual grounding accuracy by intelligently adapting queries using multi-level image features, effectively guiding the decoder towards the target object.
Recognize Any Regions
·2350 words·12 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of Surrey
RegionSpot efficiently integrates pretrained localization and vision-language models for superior open-world object recognition, achieving significant performance gains with minimal training.
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
·2441 words·12 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Stanford University
RaVL: a novel approach that accurately discovers and effectively mitigates spurious correlations in fine-tuned vision-language models, improving zero-shot classification accuracy.
QUEST: Quadruple Multimodal Contrastive Learning with Constraints and Self-Penalization
·1744 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Beihang University
QUEST: Quadruple Multimodal Contrastive Learning tackles feature suppression by using quaternion embedding to extract unique information while penalizing excessive influence from shared information, achieving…
Q-VLM: Post-training Quantization for Large Vision-Language Models
·2070 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Q-VLM: A novel post-training quantization framework significantly compresses large vision-language models, boosting inference speed without sacrificing accuracy.
Propensity Score Alignment of Unpaired Multimodal Data
·2058 words·10 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of British Columbia
Unlocking multimodal learning’s potential with propensity scores: this novel approach aligns unpaired data across modalities, significantly improving representation learning.
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
·2063 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Strasbourg
PeskaVLP: Hierarchical knowledge augmentation boosts surgical video-language pretraining!
Probabilistic Conformal Distillation for Enhancing Missing Modality Robustness
·3353 words·16 mins
AI Generated Multimodal Learning Multimodal Understanding 🏢 Shanghai Jiao Tong University
Enhance multimodal model robustness against missing data with Probabilistic Conformal Distillation (PCD)! PCD models missing modalities probabilistically, achieving superior performance on multiple benchmarks.
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
·4267 words·21 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Nanjing University
Prism: a novel framework disentangles perception and reasoning in Vision-Language Models (VLMs) for improved model assessment and efficient VLM development.
Preventing Model Collapse in Deep Canonical Correlation Analysis by Noise Regularization
·2437 words·12 mins
Multimodal Learning Representation Learning 🏢 Hong Kong Polytechnic University
Noise Regularization rescues Deep Canonical Correlation Analysis from model collapse!
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
·2633 words·13 mins
Multimodal Learning Vision-Language Models 🏢 University of Oxford
Pre-trained text-to-image diffusion models create highly effective, versatile representations for embodied AI control, surpassing previous methods.
PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation
·3082 words·15 mins
Multimodal Learning Vision-Language Models 🏢 Sun Yat-Sen University
PIVOT-R, a novel primitive-driven waypoint-aware world model, significantly boosts robotic manipulation performance and efficiency via an asynchronous hierarchical executor.
PERIA: Perceive, Reason, Imagine, Act via Holistic Language and Vision Planning for Manipulation
·2974 words·14 mins
Multimodal Learning Vision-Language Models 🏢 College of Intelligence and Computing, Tianjin University
PERIA: Holistic language & vision planning for complex robotic manipulation!
ParallelEdits: Efficient Multi-Aspect Text-Driven Image Editing with Attention Grouping
·3634 words·18 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University at Buffalo
ParallelEdits efficiently edits multiple image aspects simultaneously, guided by text prompts, surpassing sequential methods in speed and accuracy.
Pandora's Box: Towards Building Universal Attackers against Real-World Large Vision-Language Models
·2651 words·13 mins
Multimodal Learning Vision-Language Models 🏢 Peking University
Researchers developed a universal adversarial patch to fool real-world large vision-language models (LVLMs) across multiple tasks, without needing access to internal model details.