
Multimodal Learning

Calibrated Self-Rewarding Vision Language Models
·2260 words·11 mins
Multimodal Learning Vision-Language Models 🏢 UNC Chapel Hill
Calibrated Self-Rewarding (CSR) significantly improves vision-language models by using a novel iterative approach that incorporates visual constraints into the self-rewarding process, reducing hallucinations.
Boosting Vision-Language Models with Transduction
·2950 words·14 mins
Multimodal Learning Vision-Language Models 🏢 UCLouvain
TransCLIP significantly boosts vision-language model accuracy by efficiently integrating transduction, a powerful learning paradigm that leverages the structure of unlabeled data.
Boosting Text-to-Video Generative Model with MLLMs Feedback
·2610 words·13 mins
Multimodal Learning Vision-Language Models 🏢 Microsoft Research
MLLMs enhance text-to-video generation by providing 135k fine-grained video preference annotations (VIDEOPREFER) and a novel reward model (VIDEORM), boosting video quality and alignment.
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models
·4270 words·21 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Virginia Tech
This research introduces a novel framework for post-unlearning in text-to-image generative models, optimizing model updates to ensure both effective forgetting and maintained text-image alignment.
BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping
·3371 words·16 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Tsinghua University
BoostAdapter enhances vision-language model test-time adaptation by combining instance-agnostic historical samples with instance-aware boosting samples for superior out-of-distribution and cross-domain performance.
Black-Box Forgetting
·2445 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Tokyo University of Science
Black-Box Forgetting achieves selective forgetting in large pre-trained models by optimizing input prompts, not model parameters, thus enabling targeted class removal without requiring internal model access.
BendVLM: Test-Time Debiasing of Vision-Language Embeddings
·2604 words·13 mins
Multimodal Learning Vision-Language Models 🏢 MIT
BEND-VLM: A novel, efficient test-time debiasing method for vision-language models, resolving bias without retraining.
AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
·2546 words·12 mins
Multimodal Learning Vision-Language Models 🏢 State Key Laboratory for Novel Software Technology, Nanjing University
AWT: a novel framework that boosts vision-language models’ zero-shot capabilities by augmenting inputs, weighting them dynamically, and leveraging optimal transport to enhance semantic correlations.
AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
·2151 words·11 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Surrey, UK
AV-GS, a novel Audio-Visual Gaussian Splatting model, uses geometry- and material-aware priors to efficiently synthesize realistic binaural audio from a single audio source.
AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
·2151 words·11 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Washington
AV-Cloud: Real-time, high-quality 3D spatial audio rendering synced with visuals, bypassing pre-rendered images for immersive virtual experiences.
Automated Multi-level Preference for MLLMs
·2098 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Baidu Inc.
Automated Multi-level Preference (AMP) framework significantly improves multimodal large language model (MLLM) performance by using multi-level preferences during training, reducing hallucinations.
AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation
·4145 words·20 mins
Multimodal Learning Vision-Language Models 🏢 Sun Yat-Sen University
AttnDreamBooth: A novel approach to text-to-image generation that overcomes limitations of prior methods by separating learning processes, resulting in significantly improved identity preservation and text alignment.
Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication
·2999 words·15 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 UC Los Angeles
Atlas3D enhances text-to-3D generation by integrating physics-based simulations, producing self-supporting 3D models for seamless real-world applications.
Are We on the Right Way for Evaluating Large Vision-Language Models?
·2514 words·12 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
MMStar benchmark tackles flawed LVLM evaluation by focusing on vision-critical samples, minimizing data leakage, and introducing new metrics for fair multi-modal gain assessment.
Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting
·2150 words·11 mins
Multimodal Learning Embodied AI 🏢 MIT
ARCHITECT generates vivid, interactive 3D scenes using hierarchical 2D inpainting.
Any2Policy: Learning Visuomotor Policy with Any-Modality
·1938 words·10 mins
AI Generated Multimodal Learning Embodied AI 🏢 Midea Group
Any2Policy: a unified multi-modal system enabling robots to perform tasks using diverse instruction and observation modalities (text, image, audio, video, point cloud).
Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding
·2713 words·13 mins
Multimodal Learning Multimodal Understanding 🏢 Beijing University of Posts and Telecommunications
Animal-Bench, a new benchmark, comprehensively evaluates multimodal video models for animal-centric video understanding, featuring 13 diverse tasks across 7 animal categories and 819 species.
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
·3208 words·16 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 LTCI, Télécom Paris, Institut Polytechnique De Paris
Leveraging vision-language models, this research introduces a novel unsupervised zero-shot audio captioning method that achieves state-of-the-art performance by aligning audio and image token distributions.
An End-To-End Graph Attention Network Hashing for Cross-Modal Retrieval
·1722 words·9 mins
Multimodal Learning Cross-Modal Retrieval 🏢 Hebei Normal University
EGATH: End-to-End Graph Attention Network Hashing revolutionizes cross-modal retrieval by combining CLIP, transformers, and graph attention networks for superior semantic understanding and hash code generation.
Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization
·1718 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Shenzhen Institute for Advanced Study
New Hallucination-Induced Optimization (HIO) significantly reduces hallucinations in Large Vision-Language Models (LVLMs) by amplifying contrast between correct and incorrect tokens, outperforming existing methods.