Multimodal Learning

Towards Unified Multimodal Editing with Enhanced Knowledge Collaboration
·1888 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Zhejiang University
UniKE: A unified multimodal editing method achieves superior reliability, generality, and locality by disentangling knowledge into semantic and truthfulness spaces, enabling enhanced collaboration between intrinsic and external knowledge.
Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model
·3139 words·15 mins
Multimodal Learning Vision-Language Models 🏢 Renmin University of China
Stable Diffusion’s text-to-image generation is sped up by 25% by removing text guidance after the initial shape generation, revealing that the [EOS] token is key to early-stage image construction.
Towards Safe Concept Transfer of Multi-Modal Diffusion via Causal Representation Editing
·3866 words·19 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
Causal Representation Editing (CRE) improves safe image generation by precisely removing unsafe concepts from diffusion models, enhancing efficiency and flexibility.
Towards Calibrated Robust Fine-Tuning of Vision-Language Models
·3938 words·19 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of Wisconsin-Madison
Calibrated robust fine-tuning boosts vision-language model accuracy and confidence in out-of-distribution scenarios by using a constrained multimodal contrastive loss and self-distillation.
Toward Semantic Gaze Target Detection
·2529 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Idiap Research Institute
Researchers developed a novel architecture for semantic gaze target detection that achieves state-of-the-art results by jointly predicting the gaze target's location and its semantic label, surpassing existing methods.
Toward Robust Incomplete Multimodal Sentiment Analysis via Hierarchical Representation Learning
·1671 words·8 mins
Multimodal Learning Sentiment Analysis 🏢 Peking University
Hierarchical Representation Learning Framework (HRLF) significantly improves Multimodal Sentiment Analysis (MSA) accuracy by effectively addressing incomplete data through fine-grained representation learning.
Toward a Stable, Fair, and Comprehensive Evaluation of Object Hallucination in Large Vision-Language Models
·2235 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Harbin Institute of Technology
LeHaCE: a novel framework for evaluating object hallucination in LVLMs, improving evaluation stability and fairness by accounting for instruction-induced image description length variations.
TFG: Unified Training-Free Guidance for Diffusion Models
·3585 words·17 mins
Multimodal Learning Vision-Language Models 🏢 Stanford University
TFG: A unified, training-free framework for boosting diffusion model performance by efficiently searching its algorithm-agnostic design space.
Textual Training for the Hassle-Free Removal of Unwanted Visual Data: Case Studies on OOD and Hateful Image Detection
·2145 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Seoul National University
Hassle-Free Textual Training (HFTT) uses only textual data to effectively remove unwanted visual data from AI training datasets, significantly reducing human annotation needs.
Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection
·2535 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Dept. of Artificial Intelligence, Korea University
Ti-FAD: a novel zero-shot temporal action detection model outperforms state-of-the-art methods by enhancing text-related visual focus and foreground awareness.
Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models
·2675 words·13 mins
Multimodal Learning Vision-Language Models 🏢 School of Computer Science and Engineering, Tianjin University of Technology
Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) significantly improves vision-language model robustness against adversarial attacks by aligning and constraining text-guided attention.
Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model
·2167 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Wuhan University
Text-DiFuse: A novel interactive multi-modal image fusion framework leverages text-modulated diffusion models for superior performance in complex scenarios.
Text-Aware Diffusion for Policy Learning
·3340 words·16 mins
Multimodal Learning Vision-Language Models 🏢 Brown University
Text-Aware Diffusion for Policy Learning (TADPoLe) uses pretrained diffusion models for zero-shot reward generation, enabling natural language-driven policy learning without manual reward design.
Testing Semantic Importance via Betting
·4904 words·24 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Johns Hopkins University
This work presents statistically grounded methods to rank semantic concept importance in black-box models, using conditional independence testing for both global and local interpretations.
Tell What You Hear From What You See - Video to Audio Generation Through Text
·2349 words·12 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Washington
VATT: Text-guided video-to-audio generation, enabling refined audio control via text prompts and improved audio-video compatibility.
Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation
·2449 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Tactile DreamFusion: High-resolution tactile sensing enhances 3D generation, creating realistic geometric details previously unattainable.
Tackling Uncertain Correspondences for Multi-Modal Entity Alignment
·1671 words·8 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
TMEA: A novel approach significantly boosts multi-modal entity alignment accuracy by effectively handling uncertain correspondences between modalities, improving data integration across diverse knowledge graphs.
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
·2422 words·12 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
TabPedia: a novel large vision-language model, achieves superior visual table understanding by seamlessly integrating diverse tasks via a concept synergy mechanism and a new benchmark.
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
·3387 words·16 mins
Multimodal Learning Vision-Language Models 🏢 UC Santa Barbara
T2V-Turbo breaks the quality bottleneck of video consistency models by integrating mixed reward feedback during consistency distillation, enabling high-quality video generation with significantly faster inference.
Synergistic Dual Spatial-aware Generation of Image-to-text and Text-to-image
·2896 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Tianjin University
Synergistic Dual Spatial-aware Generation boosts image-to-text and text-to-image accuracy using a novel 3D scene graph and dual learning framework.