
Multimodal Learning

Exploiting Descriptive Completeness Prior for Cross Modal Hashing with Incomplete Labels
·2505 words·12 mins
Multimodal Learning Cross-Modal Retrieval 🏢 Harbin Institute of Technology, Shenzhen
PCRIL, a novel prompt contrastive recovery approach, significantly boosts cross-modal hashing accuracy, especially when dealing with incomplete labels, by progressively identifying promising positive c…
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
·2805 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Show Lab, National University of Singapore
EvolveDirector trains competitive text-to-image models using publicly available data by cleverly leveraging large vision-language models to curate and refine training datasets, dramatically reducing d…
Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor
·2050 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Researchers introduce object-aware backdoors in Vision-and-Language Navigation, enabling malicious behavior upon encountering specific objects, demonstrating the vulnerability of real-world AI agents.
Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting
·1899 words·9 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
Frolic: A label-free framework boosts zero-shot vision model accuracy by learning prompt distributions and correcting label bias, achieving state-of-the-art performance across multiple datasets.
Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning
·2723 words·13 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
DEMO framework enhances text-to-video generation by decomposing text encoding and conditioning into content and motion components, resulting in videos with significantly improved motion dynamics.
Enhancing LLM Reasoning via Vision-Augmented Prompting
·2157 words·11 mins
Multimodal Learning Multimodal Reasoning 🏢 Zhejiang University
Vision-Augmented Prompting (VAP) boosts LLM reasoning by automatically generating images from textual problem descriptions, incorporating visual-spatial clues to significantly improve accuracy across …
Empowering Visible-Infrared Person Re-Identification with Large Foundation Models
·2429 words·12 mins
AI Generated Multimodal Learning Cross-Modal Retrieval 🏢 National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
Large foundation models empower visible-infrared person re-identification by enriching infrared image representations with automatically generated textual descriptions, significantly improving cross-m…
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
·4338 words·21 mins
AI Generated Multimodal Learning Multimodal Reasoning 🏢 Carnegie Mellon University
Emotion-LLaMA: A new multimodal large language model excels at emotion recognition and reasoning, outperforming existing models and leveraging a newly created dataset, MERR.
Easy Regional Contrastive Learning of Expressive Fashion Representations
·3119 words·15 mins
Multimodal Learning Vision-Language Models 🏢 University of Virginia
E2, a novel regional contrastive learning method, enhances vision-language models for expressive fashion representations by explicitly attending to fashion details with minimal additional parameters, …
Dual Risk Minimization: Towards Next-Level Robustness in Fine-tuning Zero-Shot Models
·3018 words·15 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Dual Risk Minimization (DRM) improves fine-tuned zero-shot models’ robustness by combining empirical and worst-case risk minimization, using LLMs to identify core features, achieving state-of-the-art …
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models
·2027 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Dual Prototype Evolving (DPE) significantly boosts vision-language model generalization by cumulatively learning multi-modal prototypes from unlabeled test data, outperforming current state-of-the-art…
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
·3604 words·17 mins
Multimodal Learning Vision-Language Models 🏢 UC Berkeley
DigiRL: Autonomous offline-to-online RL trains robust in-the-wild device-control agents, surpassing prior methods.
Diffusion-Inspired Truncated Sampler for Text-Video Retrieval
·2366 words·12 mins
Multimodal Learning Cross-Modal Retrieval 🏢 Rochester Institute of Technology
Diffusion-Inspired Truncated Sampler (DITS) revolutionizes text-video retrieval by progressively aligning embeddings and enhancing the structure of the CLIP embedding space, achieving state-of-the-art results.
Diffusion PID: Interpreting Diffusion via Partial Information Decomposition
·5438 words·26 mins
Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
DiffusionPID unveils the secrets of text-to-image diffusion models by decomposing text prompts into unique, redundant, and synergistic components, providing insights into how individual words and thei…
DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion
·2444 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Zhejiang University
DiffPano generates scalable, consistent, and diverse panoramic images from text descriptions and camera poses using a novel spherical epipolar-aware diffusion model.
Dense Connector for MLLMs
·3198 words·16 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Boosting multimodal LLMs, the Dense Connector efficiently integrates multi-layer visual features for significantly enhanced performance.
Déjà Vu Memorization in Vision–Language Models
·2200 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Meta
Vision-language models (VLMs) memorize training data, impacting generalization. This paper introduces ‘déjà vu memorization,’ a novel method for measuring this, revealing significant memorization even in…
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs
·2495 words·12 mins
Multimodal Learning Vision-Language Models 🏢 Microsoft Research
DeepStack: Stacking visual tokens boosts LMM efficiency and performance!
Deep Correlated Prompting for Visual Recognition with Missing Modalities
·1823 words·9 mins
Multimodal Learning Vision-Language Models 🏢 College of Intelligence and Computing, Tianjin University
Deep Correlated Prompting enhances large multimodal models’ robustness against missing data by leveraging inter-layer and cross-modality correlations in prompts, achieving superior performance with mi…
DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection
·1673 words·8 mins
Multimodal Learning Audio-Visual Learning 🏢 Tsinghua University
DARNet, a dual attention network for auditory attention detection, surpasses current state-of-the-art models, especially in short decision windows, while using 91% fewer parameters.