Vision-Language Models
Customized Multiple Clustering via Multi-Modal Subspace Proxy Learning
·1867 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Multi-Sub leverages multi-modal learning to achieve customized multiple clustering, aligning user-defined textual preferences with visual representations via a subspace proxy learning framework.
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
·2211 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 SHI Labs @ Georgia Tech & UIUC
CuMo boosts multimodal LLMs by efficiently integrating co-upcycled Mixture-of-Experts, achieving state-of-the-art performance with minimal extra parameters during inference.
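The core idea behind "co-upcycling" is to seed every expert of a Mixture-of-Experts block from the same pretrained dense MLP before continued training, so experts specialize gradually instead of starting from scratch. The sketch below illustrates that initialization pattern in PyTorch; the class name `UpcycledMoE`, the top-k routing, and the shapes are illustrative assumptions, not CuMo's actual implementation.

```python
# Hypothetical sketch of MLP-to-MoE "upcycling": each expert starts as a copy
# of a pretrained dense MLP, and a learned router mixes the top-k experts per token.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    def __init__(self, dense_mlp: nn.Module, hidden_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Initialize every expert from the same pretrained dense MLP (the "upcycling" step).
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, hidden_dim); dense_mlp must map hidden_dim -> hidden_dim
        gate = F.softmax(self.router(x), dim=-1)       # (tokens, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)   # keep the top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Toy usage: upcycle a dense 2-layer MLP into a 4-expert MoE block.
dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
moe = UpcycledMoE(dense, hidden_dim=64)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```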
Cross-modal Representation Flattening for Multi-modal Domain Generalization
·3259 words·16 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
Cross-Modal Representation Flattening (CMRF) improves multi-modal domain generalization by creating consistent flat loss regions and enhancing knowledge transfer between modalities, outperforming existing methods.
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
·2284 words·11 mins·
Vision-Language Models
🏢 Hong Kong Polytechnic University
Can AI understand humor? A new benchmark, YESBUT, reveals that even state-of-the-art models struggle with the nuanced humor of juxtaposed comics, highlighting the need for improved AI in understanding humor.
Coupled Mamba: Enhanced Multimodal Fusion with Coupled State Space Model
·2541 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Huazhong University of Science and Technology
Coupled Mamba enhances multi-modal fusion with a coupled state space model, boosting both accuracy and efficiency.
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
·3057 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
ControlMLLM: Inject visual prompts into MLLMs via learnable latent variable optimization for training-free referring abilities, supporting box, mask, scribble, and point prompts.
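As a rough illustration of training-free visual prompting via latent optimization (the general mechanism this summary describes), the sketch below optimizes a small additive latent on frozen visual tokens so that attention mass concentrates inside a user-supplied region mask. The function names, the loss, and `attention_fn` are assumptions for illustration, not ControlMLLM's exact procedure.

```python
# Hedged sketch: test-time optimization of a latent offset on visual tokens,
# pushing attention toward a referred region while the model weights stay frozen.
import torch

def optimize_visual_latent(visual_tokens, region_mask, attention_fn, steps=20, lr=0.05):
    """visual_tokens: (n_tokens, dim); region_mask: (n_tokens,) in {0, 1};
    attention_fn: differentiable map from tokens to per-token attention weights."""
    latent = torch.zeros_like(visual_tokens, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        attn = attention_fn(visual_tokens + latent)
        loss = -(attn * region_mask).sum() / attn.sum()  # maximize attention inside the region
        opt.zero_grad()
        loss.backward()
        opt.step()
    return visual_tokens + latent.detach()

# Toy usage with a stand-in attention function over 16 tokens.
tokens = torch.randn(16, 64)
mask = torch.zeros(16); mask[:4] = 1.0
toy_attn = lambda t: torch.softmax(t.sum(dim=-1), dim=0)
print(optimize_visual_latent(tokens, mask, toy_attn).shape)  # torch.Size([16, 64])
```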
Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
·1891 words·9 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 New York University
Symile: A simple model-agnostic approach for learning representations from unlimited modalities, outperforming pairwise CLIP by capturing higher-order information.
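The higher-order signal comes from scoring all modalities jointly rather than pairwise; a multilinear inner product over three embeddings is one way to do this. The sketch below shows such a three-modality contrastive loss; the function names, temperature, and loss layout are illustrative assumptions, not Symile's exact objective.

```python
# Sketch of a three-modality contrastive loss built on a multilinear inner product,
# the kind of joint (higher-order) score that pairwise CLIP-style losses miss.
import torch
import torch.nn.functional as F

def multilinear_inner_product(a, b, c):
    """Elementwise product summed over the feature dim: sum_d a_d * b_d * c_d."""
    return (a * b * c).sum(dim=-1)

def symile_style_loss(za, zb, zc, temperature=0.07):
    """Contrastive loss where za[i] should match the pair (zb[i], zc[i])."""
    za, zb, zc = (F.normalize(z, dim=-1) for z in (za, zb, zc))
    # logits[i, j]: joint score of candidate za[i] against the pair (zb[j], zc[j]).
    logits = multilinear_inner_product(za.unsqueeze(1), zb.unsqueeze(0), zc.unsqueeze(0)) / temperature
    targets = torch.arange(za.shape[0])
    return F.cross_entropy(logits, targets)

# Toy batch of 8 aligned triples with 128-dim embeddings.
za, zb, zc = (torch.randn(8, 128) for _ in range(3))
print(symile_style_loss(za, zb, zc))
```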
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
·3139 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Institute of Automation, CAS
This paper boosts zero-shot OOD detection by introducing a conjugated semantic pool (CSP), improving FPR95 by 7.89%. CSP leverages modified superclass names for superior OOD label identification.
Combining Observational Data and Language for Species Range Estimation
·2627 words·13 mins·
Natural Language Processing
Vision-Language Models
🏢 UMass Amherst
LE-SINR combines Wikipedia species descriptions with citizen science observations to create accurate species range maps, even with limited data, outperforming existing methods.
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
·4007 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
CoMat: Aligning text-to-image diffusion models using image-to-text concept matching for superior text-image alignment.
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
·3116 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Integrated Vision and Language Lab, KAIST, South Korea
CODE combats LMM hallucinations by contrasting self-generated descriptions with visual content during decoding, enhancing response accuracy without retraining.
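Contrastive decoding of this kind typically reweights next-token logits by subtracting a "distractor" distribution from the visually grounded one. The sketch below shows that generic mechanism, contrasting image-conditioned logits against logits conditioned only on a self-generated description; the function name, `alpha`, and the plausibility cutoff are illustrative assumptions rather than CODE's exact decoding rule.

```python
# Minimal contrastive-decoding sketch: amplify tokens the image-conditioned model
# prefers over the description-only model, restricted to plausible candidates.
import torch
import torch.nn.functional as F

def contrastive_next_token(logits_with_image: torch.Tensor,
                           logits_with_description: torch.Tensor,
                           alpha: float = 1.0,
                           plausibility_cutoff: float = 0.1) -> int:
    """Pick the next token from contrasted logits, keeping only tokens the
    image-conditioned model itself finds plausible (adaptive cutoff)."""
    p_img = F.softmax(logits_with_image, dim=-1)
    contrast = (1 + alpha) * logits_with_image - alpha * logits_with_description
    plausible = p_img >= plausibility_cutoff * p_img.max()
    contrast = contrast.masked_fill(~plausible, float("-inf"))
    return int(torch.argmax(contrast))

# Toy example over a 5-token vocabulary.
print(contrastive_next_token(torch.tensor([2.0, 1.0, 0.5, -1.0, 0.0]),
                             torch.tensor([2.5, 0.2, 0.4, -1.0, 0.0])))
```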
Cluster-Learngene: Inheriting Adaptive Clusters for Vision Transformers
·3088 words·15 mins·
AI Generated
Computer Vision
Vision-Language Models
🏢 School of Computer Science and Engineering, Southeast University
Cluster-Learngene efficiently initializes elastic-scale Vision Transformers by adaptively clustering and inheriting key modules from a large ancestry model, saving resources and boosting downstream task performance.
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
·2694 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting multimodal contrastive learning, this research introduces negCLIPLoss and NormSim, novel data selection methods that surpass existing techniques by improving data quality and task relevance.
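For context, the simplest form of CLIP-based data selection ranks image-text pairs by their embedding similarity under a pretrained CLIP model and keeps the top fraction; negCLIPLoss and NormSim refine this baseline with normalized scores and task-aware similarity statistics. The sketch below shows only the baseline ranking step, with random tensors standing in for precomputed CLIP embeddings.

```python
# Baseline CLIP-score filtering sketch: keep the image-text pairs with the
# highest cosine similarity under a pretrained CLIP model (embeddings precomputed).
import torch
import torch.nn.functional as F

def select_top_pairs(image_embs: torch.Tensor, text_embs: torch.Tensor, keep_fraction: float = 0.3):
    """Return indices of the highest-similarity image-text pairs."""
    sims = F.cosine_similarity(image_embs, text_embs, dim=-1)  # one score per pair
    k = max(1, int(keep_fraction * sims.numel()))
    return torch.topk(sims, k).indices

# Toy pool of 1000 pairs with 512-dim embeddings.
idx = select_top_pairs(torch.randn(1000, 512), torch.randn(1000, 512))
print(idx.shape)  # torch.Size([300])
```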
CLIPCEIL: Domain Generalization through CLIP via Channel rEfinement and Image-text aLignment
·3674 words·18 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Brookhaven National Laboratory
CLIPCEIL enhances CLIP’s domain generalization by refining feature channels for domain invariance and aligning image-text embeddings, achieving state-of-the-art performance.
CLIP in Mirror: Disentangling text from visual images through reflection
·4284 words·21 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Beihang University
MirrorCLIP disentangles text from images in CLIP using mirror reflection differences, enhancing robustness against text-visual image confusion.
Classification Done Right for Vision-Language Pre-Training
·1685 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 ByteDance Research
SuperClass, a novel vision-language pre-training method, achieves superior performance on various downstream tasks by directly using tokenized raw text as supervised classification labels, eliminating the need for a text encoder.
CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models
·3973 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 University of New South Wales (UNSW Sydney)
CLAP4CLIP enhances vision-language model continual learning by using probabilistic finetuning, improving performance and uncertainty estimation.
CigTime: Corrective Instruction Generation Through Inverse Motion Editing
·2228 words·11 mins·
Natural Language Processing
Vision-Language Models
🏢 Hong Kong University of Science and Technology
CigTime generates corrective motion instructions from motion pairs using motion editing and large language models, improving on baselines by leveraging motion triplets for fine-tuning.
CIFD: Controlled Information Flow to Enhance Knowledge Distillation
·3139 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Samsung Research
CIFD, a novel knowledge distillation method, drastically cuts training costs while boosting performance, particularly on large datasets, by replacing Teacher Assistants with Rate-Distortion Modules.
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model
·1826 words·9 mins·
Natural Language Processing
Vision-Language Models
🏢 East China Normal University
ChatTracker boosts visual tracking by intelligently using a large language model to refine object descriptions, achieving performance on par with state-of-the-art methods.