
Multimodal Learning

Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
·4509 words·22 mins
Multimodal Learning Vision-Language Models 🏢 Southeast University
This paper presents a novel method to align vision models with human aesthetics in image retrieval, using large language models (LLMs) for query rephrasing and preference-based reinforcement learning.
Aligning Audio-Visual Joint Representations with an Agentic Workflow
·1961 words·10 mins
Multimodal Learning Audio-Visual Learning 🏢 DAMO Academy, Alibaba Group
AVAgent uses an LLM-driven workflow to intelligently align audio and visual data, resulting in improved AV joint representations and state-of-the-art performance on various downstream tasks.
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
·2881 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Apple
Aggregate-and-Adapt Prompt Embedding (AAPE) boosts CLIP’s downstream generalization by distilling textual knowledge from natural language prompts, achieving competitive performance across various vision tasks.
Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare
·2147 words·11 mins
Multimodal Learning Vision-Language Models 🏢 City University of Hong Kong
Compare2Score: A novel IQA model teaches large multimodal models to translate comparative image quality judgments into continuous quality scores, significantly outperforming existing methods.
AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models
·2295 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
AdaNeg dynamically generates negative proxies during testing to improve vision-language model OOD detection, significantly outperforming existing methods on ImageNet.
Accelerating Transformers with Spectrum-Preserving Token Merging
·3201 words·16 mins
Multimodal Learning Vision-Language Models 🏢 UC San Diego
PITOME: a novel token merging method accelerates Transformers by 40-60% while preserving accuracy, prioritizing informative tokens via an energy score.
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
·2216 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Ant Group
Chain-of-Sight accelerates multimodal LLM pre-training by ~73% using a multi-scale visual resampling technique and a novel post-pretrain token scaling strategy, achieving comparable or superior performance.
A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation
·2816 words·14 mins
Multimodal Learning Audio-Visual Learning 🏢 Seoul National University
A single model tackles diverse audiovisual generation tasks using a novel Mixture of Noise Levels approach, resulting in temporally consistent and high-quality outputs.
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks
·2218 words·11 mins
Multimodal Learning Vision-Language Models 🏢 Purdue University
SFID, a novel debiasing method, effectively mitigates bias in vision-language models across various tasks without retraining, improving fairness and efficiency.
A Sober Look at the Robustness of CLIPs to Spurious Features
·4936 words·24 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong Baptist University
CounterAnimal: a new dataset exposes CLIP’s reliance on spurious correlations, challenging its perceived robustness and highlighting the need for more comprehensive evaluation benchmarks in vision-language models.
A Concept-Based Explainability Framework for Large Multimodal Models
·7122 words·34 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Sorbonne Université
CoX-LMM unveils a novel concept-based explainability framework for large multimodal models, extracting semantically grounded multimodal concepts to enhance interpretability.
A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization
·3344 words·16 mins
Multimodal Learning Vision-Language Models 🏢 National Yang Ming Chiao Tung University
Researchers unveil how causal text encoding in text-to-image models leads to information loss and bias, proposing a novel training-free optimization method that significantly improves information balance.
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
·5846 words·28 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Swiss Federal Institute of Technology Lausanne (EPFL)
4M-21 achieves any-to-any predictions across 21 diverse vision modalities using a single model, exceeding prior state-of-the-art performance.
Bifröst: 3D-Aware Image Compositing with Language Instructions
·3407 words·16 mins
Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
Bifröst: A novel 3D-aware framework for instruction-based image compositing, leveraging depth maps and an MLLM for high-fidelity results.