
Multimodal Learning

Calibrated Self-Rewarding Vision Language Models
·2260 words·11 mins
Multimodal Learning Vision-Language Models 🏢 UNC Chapel Hill
Calibrated Self-Rewarding (CSR) significantly improves vision-language models by using a novel iterative approach that incorporates visual constraints into the self-rewarding process, reducing hallucinations.
Boosting Vision-Language Models with Transduction
·2950 words·14 mins
Multimodal Learning Vision-Language Models 🏢 UCLouvain
TransCLIP significantly boosts vision-language model accuracy by efficiently integrating transduction, a powerful learning paradigm that leverages the structure of unlabeled data.
Boosting Text-to-Video Generative Model with MLLMs Feedback
·2610 words·13 mins
Multimodal Learning Vision-Language Models 🏢 Microsoft Research
MLLMs enhance text-to-video generation by providing 135k fine-grained video preference annotations (VIDEOPREFER) and a novel reward model (VIDEORM), boosting video quality and alignment.
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models
·4270 words·21 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Virginia Tech
This research introduces a novel framework for post-unlearning in text-to-image generative models, optimizing model updates to ensure both effective forgetting and maintained text-image alignment.
BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping
·3371 words·16 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Tsinghua University
BoostAdapter enhances vision-language model test-time adaptation by combining instance-agnostic historical samples with instance-aware boosting samples for superior out-of-distribution and cross-domain performance.
Black-Box Forgetting
·2445 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Tokyo University of Science
Black-Box Forgetting achieves selective forgetting in large pre-trained models by optimizing input prompts, not model parameters, thus enabling targeted class removal without requiring internal model access.
BendVLM: Test-Time Debiasing of Vision-Language Embeddings
·2604 words·13 mins
Multimodal Learning Vision-Language Models 🏢 MIT
BEND-VLM: A novel, efficient test-time debiasing method for vision-language models, resolving bias without retraining.
AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
·2546 words·12 mins
Multimodal Learning Vision-Language Models 🏢 State Key Laboratory for Novel Software Technology, Nanjing University
AWT: a novel framework that boosts vision-language models’ zero-shot capabilities by augmenting inputs, weighting them dynamically, and leveraging optimal transport to enhance semantic correlations.
AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
·2151 words·11 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Surrey, UK
AV-GS, a novel Audio-Visual Gaussian Splatting model, uses geometry- and material-aware priors to efficiently synthesize realistic binaural audio from a single audio source.
AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
·2151 words·11 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Washington
AV-Cloud: Real-time, high-quality 3D spatial audio rendering synced with visuals, bypassing pre-rendered images for immersive virtual experiences.
Automated Multi-level Preference for MLLMs
·2098 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Baidu Inc.
Automated Multi-level Preference (AMP) framework significantly improves multimodal large language model (MLLM) performance by using multi-level preferences during training, reducing hallucinations.
AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation
·4145 words·20 mins
Multimodal Learning Vision-Language Models 🏢 Sun Yat-Sen University
AttnDreamBooth: A novel approach to text-to-image generation that overcomes limitations of prior methods by separating learning processes, resulting in significantly improved identity preservation and text alignment.
Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication
·2999 words·15 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 UC Los Angeles
Atlas3D enhances text-to-3D generation by integrating physics-based simulations, producing self-supporting 3D models for seamless real-world applications.
Are We on the Right Way for Evaluating Large Vision-Language Models?
·2514 words·12 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
MMStar benchmark tackles flawed LVLM evaluation by focusing on vision-critical samples, minimizing data leakage, and introducing new metrics for fair multi-modal gain assessment.
Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting
·2150 words·11 mins
Multimodal Learning Embodied AI 🏢 MIT
ARCHITECT generates vivid, interactive 3D scenes using hierarchical 2D inpainting.
Any2Policy: Learning Visuomotor Policy with Any-Modality
·1938 words·10 mins
AI Generated Multimodal Learning Embodied AI 🏢 Midea Group
Any2Policy: a unified multi-modal system enabling robots to perform tasks using diverse instruction and observation modalities (text, image, audio, video, point cloud).
Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding
·2713 words·13 mins
Multimodal Learning Multimodal Understanding 🏢 Beijing University of Posts and Telecommunications
Animal-Bench, a new benchmark, comprehensively evaluates multimodal video models for animal-centric video understanding, featuring 13 diverse tasks across 7 animal categories and 819 species.
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
·3208 words·16 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 LTCI, Télécom Paris, Institut Polytechnique De Paris
Leveraging vision-language models, this research introduces a novel unsupervised zero-shot audio captioning method that achieves state-of-the-art performance by aligning audio and image token distributions.
An End-To-End Graph Attention Network Hashing for Cross-Modal Retrieval
·1722 words·9 mins
Multimodal Learning Cross-Modal Retrieval 🏢 Hebei Normal University
EGATH: End-to-End Graph Attention Network Hashing revolutionizes cross-modal retrieval by combining CLIP, transformers, and graph attention networks for superior semantic understanding and hash code generation.
Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization
·1718 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Shenzhen Institute for Advanced Study
New Hallucination-Induced Optimization (HIO) significantly reduces hallucinations in Large Vision-Language Models (LVLMs) by amplifying contrast between correct and incorrect tokens, outperforming existing methods.