Vision-Language Models
Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization
·1718 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Shenzhen Institute for Advanced Study
Hallucination-Induced Optimization (HIO), a new method, significantly reduces hallucinations in Large Vision-Language Models (LVLMs) by amplifying the contrast between correct and incorrect tokens, outperforming exi…
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
·4509 words·22 mins·
Multimodal Learning
Vision-Language Models
🏢 Southeast University
This paper presents a novel method to align vision models with human aesthetics in image retrieval, using large language models (LLMs) for query rephrasing and preference-based reinforcement learning …
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
·2881 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Apple
Aggregate-and-Adapt Prompt Embedding (AAPE) boosts CLIP’s downstream generalization by distilling textual knowledge from natural language prompts, achieving competitive performance across various visi…
Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models
·2348 words·12 mins·
AI Generated
Natural Language Processing
Vision-Language Models
🏢 Greater Bay Area Institute for Innovation, Hunan University
RAIL, a novel continual learning method for vision-language models, tackles catastrophic forgetting and maintains zero-shot abilities without domain-identity hints or reference data. Using a recursiv…
Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare
·2147 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 City University of Hong Kong
Compare2Score: a novel IQA method that teaches large multimodal models to translate comparative image quality judgments into continuous quality scores, significantly outperforming existing methods.
AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models
·2295 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
AdaNeg dynamically generates negative proxies during testing to improve vision-language model OOD detection, significantly outperforming existing methods on ImageNet.
Accelerating Transformers with Spectrum-Preserving Token Merging
·3201 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 UC San Diego
PITOME: a novel token merging method accelerates Transformers by 40-60% while preserving accuracy, prioritizing informative tokens via an energy score.
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
·2216 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Ant Group
Chain-of-Sight accelerates multimodal LLM pre-training by ~73% using a multi-scale visual resampling technique and a novel post-pretrain token scaling strategy, achieving comparable or superior perfor…
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks
·2218 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Purdue University
SFID, a novel debiasing method, effectively mitigates bias in vision-language models across various tasks without retraining, improving fairness and efficiency.
A Sober Look at the Robustness of CLIPs to Spurious Features
·4936 words·24 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Baptist University
CounterAnimal: a new dataset exposes CLIP’s reliance on spurious correlations, challenging its perceived robustness and highlighting the need for more comprehensive evaluation benchmarks in vision-lan…
A Concept-Based Explainability Framework for Large Multimodal Models
·7122 words·34 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Sorbonne Université
CoX-LMM, a novel concept-based explainability framework for large multimodal models, extracts semantically grounded multimodal concepts to enhance interpretability.
A Cat Is A Cat (Not A Dog!): Unraveling Information Mix-ups in Text-to-Image Encoders through Causal Analysis and Embedding Optimization
·3344 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 National Yang Ming Chiao Tung University
Researchers unveil how causal text encoding in text-to-image models leads to information loss and bias, proposing a novel training-free optimization method that significantly improves information bala…
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
·5846 words·28 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Swiss Federal Institute of Technology Lausanne (EPFL)
4M-21 achieves any-to-any predictions across 21 diverse vision modalities using a single model, exceeding prior state-of-the-art performance.
Bifröst: 3D-Aware Image Compositing with Language Instructions
·3407 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
Bifröst: A novel 3D-aware framework for instruction-based image compositing, leveraging depth maps and an MLLM for high-fidelity results.