Vision-Language Models
ChatCam: Empowering Camera Control through Conversational AI
·1805 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
ChatCam empowers users to control cameras via natural language, using CineGPT for text-conditioned trajectory generation and an Anchor Determinator for precise placement, enabling high-quality video rendering.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
·4503 words·22 mins·
Multimodal Learning
Vision-Language Models
🏢 New York University
Cambrian-1: Open, vision-centric multimodal LLMs achieve state-of-the-art performance using a novel spatial vision aggregator and high-quality data.
CALVIN: Improved Contextual Video Captioning via Instruction Tuning
·2746 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Meta AI
CALVIN: Instruction tuning boosts contextual video captioning, achieving state-of-the-art results!
Calibrated Self-Rewarding Vision Language Models
·2260 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 UNC Chapel Hill
Calibrated Self-Rewarding (CSR) significantly improves vision-language models by using a novel iterative approach that incorporates visual constraints into the self-rewarding process, reducing hallucinations.
Bridge the Modality and Capability Gaps in Vision-Language Model Selection
·3390 words·16 mins·
AI Generated
Natural Language Processing
Vision-Language Models
🏢 State Key Laboratory for Novel Software Technology, Nanjing University
SWAB bridges modality and capability gaps in Vision-Language Model selection using optimal transport, enabling accurate prediction of VLM performance without images.
Boosting Weakly Supervised Referring Image Segmentation via Progressive Comprehension
·5057 words·24 mins·
AI Generated
Natural Language Processing
Vision-Language Models
🏢 City University of Hong Kong
PCNet boosts weakly-supervised referring image segmentation by progressively processing textual cues, mimicking human comprehension, and significantly improving target localization.
Boosting Vision-Language Models with Transduction
·2950 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 UCLouvain
TransCLIP significantly boosts vision-language model accuracy by efficiently integrating transduction, a powerful learning paradigm that leverages the structure of unlabeled data.
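For intuition, "transduction" here means predicting over the whole unlabeled test set jointly instead of one image at a time. Below is a minimal sketch of that general idea, assuming precomputed, L2-normalized CLIP image and class-text embeddings; this is generic label propagation offered as illustration, not TransCLIP's specific objective.

```python
import numpy as np

def transductive_predict(img_emb, txt_emb, k=5, alpha=0.5, n_iters=10):
    """Refine CLIP zero-shot predictions by propagating them over a
    k-NN graph of the unlabeled test images (generic transduction)."""
    # Inductive starting point: per-image class probabilities.
    logits = 100.0 * img_emb @ txt_emb.T                    # (N, C)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Row-normalized k-nearest-neighbor affinity between test images.
    sim = img_emb @ img_emb.T                               # (N, N)
    np.fill_diagonal(sim, -np.inf)
    W = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    W[rows, np.argsort(-sim, axis=1)[:, :k]] = 1.0
    W = np.maximum(W, W.T)                                  # symmetrize
    W /= W.sum(axis=1, keepdims=True)

    # Label propagation: blend each image's guess with its neighbors'.
    Z = probs
    for _ in range(n_iters):
        Z = alpha * (W @ Z) + (1.0 - alpha) * probs
    return Z.argmax(axis=1)
```

The gain comes entirely from the structure of the unlabeled batch: confidently labeled images pull their look-alike neighbors toward the same class.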
Boosting Text-to-Video Generative Model with MLLMs Feedback
·2610 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
MLLMs enhance text-to-video generation by providing 135k fine-grained video preferences (the VIDEOPREFER dataset) and a novel reward model, VIDEORM, boosting video quality and alignment.
Boosting Alignment for Post-Unlearning Text-to-Image Generative Models
·4270 words·21 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Virginia Tech
This research introduces a novel framework for post-unlearning in text-to-image generative models, optimizing model updates to ensure both effective forgetting and maintained text-image alignment.
BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping
·3371 words·16 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
BoostAdapter enhances vision-language model test-time adaptation by combining instance-agnostic historical samples with instance-aware boosting samples for superior out-of-distribution and cross-domain generalization.
Black-Box Forgetting
·2445 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Tokyo University of Science
Black-Box Forgetting achieves selective forgetting in large pre-trained models by optimizing input prompts, not model parameters, thus enabling targeted class removal without requiring internal model access.
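For intuition, here is a minimal sketch of derivative-free prompt search in that spirit: only black-box model outputs are needed. The latent-to-prompt mapping and the objective inside score_fn (e.g., accuracy on classes to forget minus accuracy on classes to retain) are illustrative assumptions, and the paper's actual optimizer is more sophisticated than this simple random search.

```python
import numpy as np

def black_box_prompt_search(score_fn, dim=64, pop=32,
                            sigma=0.1, n_iters=100, seed=0):
    """Derivative-free search over a prompt latent: sample candidates
    around the incumbent, keep the best, never touch model weights."""
    rng = np.random.default_rng(seed)
    best_z = np.zeros(dim)
    best_s = score_fn(best_z)          # one black-box query per candidate
    for _ in range(n_iters):
        cand = best_z + sigma * rng.standard_normal((pop, dim))
        scores = np.array([score_fn(z) for z in cand])
        i = int(scores.argmin())
        if scores[i] < best_s:
            best_z, best_s = cand[i], scores[i]
    return best_z
```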
Beyond Accuracy: Ensuring Correct Predictions With Correct Rationales
·2877 words·14 mins·
AI Generated
Natural Language Processing
Vision-Language Models
🏢 Department of Computer & Information Science, University of Delaware
This research introduces a novel two-phase approach to improve AI model trustworthiness by ensuring both correct predictions and correct rationales. A new dataset with structured rationales and a rat…
BendVLM: Test-Time Debiasing of Vision-Language Embeddings
·2604 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 MIT
BEND-VLM: A novel, efficient test-time debiasing method for vision-language models, resolving bias without retraining.
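One common way to debias embeddings at test time, shown here purely for intuition, is to estimate a spurious-attribute direction from paired text prompts and project it out of every embedding before matching; no retraining is involved. BEND-VLM's actual procedure differs, and all names below are illustrative assumptions.

```python
import numpy as np

def bias_direction(prompts_a, prompts_b):
    """Bias axis from two sets of attribute-prompt embeddings,
    e.g. 'a photo of a man' vs. 'a photo of a woman'."""
    d = prompts_a.mean(axis=0) - prompts_b.mean(axis=0)
    return d / np.linalg.norm(d)

def debias(emb, direction):
    """Remove the bias component, then re-normalize to the unit sphere."""
    emb = emb - np.outer(emb @ direction, direction)
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)
```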
AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
·2546 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 State Key Laboratory for Novel Software Technology, Nanjing University
AWT: a novel framework boosts vision-language models' zero-shot capabilities by augmenting inputs, weighting them dynamically, and leveraging optimal transport to enhance semantic correlations.
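The three steps named in the teaser can be sketched end to end: weight augmented views of an image by the confidence of their own zero-shot predictions, then score each class by the entropic optimal-transport (Sinkhorn) cost between the weighted view set and that class's prompt set. This is a sketch of the recipe under those assumptions (L2-normalized CLIP embeddings), not AWT's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def view_weights(view_emb, txt_emb):
    """Weight augmented views by zero-shot confidence (low entropy)."""
    p = softmax(100.0 * view_emb @ txt_emb.T)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
    return softmax(-entropy)                 # confident views weigh more

def sinkhorn_cost(a, b, C, reg=0.1, n_iters=200):
    """Entropic OT cost between weighted view and prompt sets."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # transport plan
    return (P * C).sum()

def classify(view_emb, class_prompt_embs):
    """class_prompt_embs: list of (M_c, D) prompt embeddings per class."""
    a = view_weights(view_emb, np.vstack(class_prompt_embs))
    costs = []
    for txt in class_prompt_embs:
        C = 1.0 - view_emb @ txt.T           # cosine distance matrix
        b = np.full(len(txt), 1.0 / len(txt))
        costs.append(sinkhorn_cost(a, b, C))
    return int(np.argmin(costs))
```

The predicted class is the one whose prompt set is cheapest to transport the weighted image views onto.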
Automated Multi-level Preference for MLLMs
·2098 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Baidu Inc.
Automated Multi-level Preference (AMP) framework significantly improves multimodal large language model (MLLM) performance by using multi-level preferences during training, reducing hallucinations and…
AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation
·4145 words·20 mins·
Multimodal Learning
Vision-Language Models
🏢 Sun Yat-Sen University
AttnDreamBooth: A novel approach to text-to-image generation that overcomes limitations of prior methods by separating learning processes, resulting in significantly improved identity preservation and text alignment.
Atlas3D: Physically Constrained Self-Supporting Text-to-3D for Simulation and Fabrication
·2999 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 UC Los Angeles
Atlas3D enhances text-to-3D generation by integrating physics-based simulations, producing self-supporting 3D models for seamless real-world applications.
Ask, Attend, Attack: An Effective Decision-Based Black-Box Targeted Attack for Image-to-Text Models
·3219 words·16 mins·
AI Generated
Natural Language Processing
Vision-Language Models
🏢 Xiamen University
This paper introduces AAA, a novel three-stage decision-based black-box targeted attack against image-to-text models. AAA efficiently generates semantically consistent adversarial examples by asking …
Are We on the Right Way for Evaluating Large Vision-Language Models?
·2514 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
MMStar benchmark tackles flawed LVLMs evaluation by focusing on vision-critical samples, minimizing data leakage, and introducing new metrics for fair multi-modal gain assessment.
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
·3208 words·16 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 LTCI, Télécom Paris, Institut Polytechnique de Paris
Leveraging vision-language models, this research introduces a novel unsupervised zero-shot audio captioning method that achieves state-of-the-art performance by aligning audio and image token distributions.