Multimodal Learning

Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration
·1888 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Zhejiang University
UniKE: A unified multimodal editing method achieves superior reliability, generality, and locality by disentangling knowledge into semantic and truthfulness spaces, enabling enhanced collaboration between intrinsic and external knowledge.
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
·3139 words·15 mins
Multimodal Learning Vision-Language Models 🏢 Renmin University of China
Stable Diffusion’s text-to-image generation is sped up by 25% by removing text guidance after the initial shape generation, revealing that the [EOS] token is key to early-stage image construction.
Towards Safe Concept Transfer of Multi-Modal Diffusion via Causal Representation Editing
·3866 words·19 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
Causal Representation Editing (CRE) improves safe image generation by precisely removing unsafe concepts from diffusion models, enhancing efficiency and flexibility.
Towards Calibrated Robust Fine-Tuning of Vision-Language Models
·3938 words·19 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of Wisconsin-Madison
Calibrated robust fine-tuning boosts vision-language model accuracy and confidence in out-of-distribution scenarios by using a constrained multimodal contrastive loss and self-distillation.
Toward Semantic Gaze Target Detection
·2529 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Idiap Research Institute
Researchers developed a novel architecture for semantic gaze target detection that achieves state-of-the-art results by jointly predicting the gaze target's location and its semantic label, surpassing existing methods.
Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning
·1671 words·8 mins
Multimodal Learning Sentiment Analysis 🏢 Peking University
Hierarchical Representation Learning Framework (HRLF) significantly improves Multimodal Sentiment Analysis (MSA) accuracy by effectively addressing incomplete data through fine-grained representation learning.
Toward a Stable, Fair, and Comprehensive Evaluation of Object Hallucination in Large Vision-Language Models
·2235 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Harbin Institute of Technology
LeHaCE: a novel framework for evaluating object hallucination in LVLMs, improving evaluation stability and fairness by accounting for instruction-induced image description length variations.
TFG: Unified Training-Free Guidance for Diffusion Models
·3585 words·17 mins
Multimodal Learning Vision-Language Models 🏢 Stanford University
TFG: A unified, training-free framework for boosting diffusion model performance by efficiently searching its algorithm-agnostic design space.
Textual Training for the Hassle-Free Removal of Unwanted Visual Data: Case Studies on OOD and Hateful Image Detection
·2145 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Seoul National University
Hassle-Free Textual Training (HFTT) uses only textual data to effectively remove unwanted visual data from AI training datasets, significantly reducing human annotation needs.
Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection
·2535 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Dept. of Artificial Intelligence, Korea University
Ti-FAD: a novel zero-shot temporal action detection model outperforms state-of-the-art methods by enhancing text-related visual focus and foreground awareness.
Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models
·2675 words·13 mins
Multimodal Learning Vision-Language Models 🏢 School of Computer Science and Engineering, Tianjin University of Technology
Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) significantly improves vision-language model robustness against adversarial attacks by aligning and constraining text-guided attention.
Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model
·2167 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Wuhan University
Text-DiFuse: A novel interactive multi-modal image fusion framework leverages text-modulated diffusion models for superior performance in complex scenarios.
Text-Aware Diffusion for Policy Learning
·3340 words·16 mins
Multimodal Learning Vision-Language Models 🏢 Brown University
Text-Aware Diffusion for Policy Learning (TADPoLe) uses pretrained diffusion models for zero-shot reward generation, enabling natural language-driven policy learning without manual reward design.
Testing Semantic Importance via Betting
·4904 words·24 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Johns Hopkins University
This work presents statistically grounded methods to rank semantic concept importance in black-box models, using conditional independence testing for both global and local interpretations.
Tell What You Hear From What You See - Video to Audio Generation Through Text
·2349 words·12 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Washington
VATT: Text-guided video-to-audio generation, enabling refined audio control via text prompts and improved audio-video compatibility.
Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation
·2449 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Tactile DreamFusion: High-resolution tactile sensing enhances 3D generation, creating realistic geometric details previously unattainable.
Tackling Uncertain Correspondences for Multi-Modal Entity Alignment
·1671 words·8 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
TMEA: A novel approach significantly boosts multi-modal entity alignment accuracy by effectively handling uncertain correspondences between modalities, improving data integration across diverse knowledge graphs.
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
·2422 words·12 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
TabPedia: a novel large vision-language model, achieves superior visual table understanding by seamlessly integrating diverse tasks via a concept synergy mechanism and a new benchmark.
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
·3387 words·16 mins
Multimodal Learning Vision-Language Models 🏢 UC Santa Barbara
T2V-Turbo breaks the quality bottleneck of video consistency models by integrating mixed reward feedback during consistency distillation, enabling high-quality video generation with significantly faster inference.
Synergistic Dual Spatial-aware Generation of Image-to-text and Text-to-image
·2896 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Tianjin University
Synergistic Dual Spatial-aware Generation boosts image-to-text and text-to-image accuracy using a novel 3D scene graph and dual learning framework.