Vision-Language Models
Customized Multiple Clustering via Multi-Modal Subspace Proxy Learning
·1867 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Multi-Sub leverages multi-modal learning to achieve customized multiple clustering, aligning user-defined textual preferences with visual representations via a subspace proxy learning framework.
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
·2211 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 SHI Labs @ Georgia Tech & UIUC
CuMo boosts multimodal LLMs by efficiently integrating co-upcycled Mixture-of-Experts, achieving state-of-the-art performance with minimal extra parameters during inference.
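The core idea behind "co-upcycling" is to seed every expert of a Mixture-of-Experts block from the same pretrained dense MLP before continued training, so experts specialize gradually instead of starting from scratch. The sketch below illustrates that initialization pattern in PyTorch; the class name `UpcycledMoE`, the top-k routing, and the shapes are illustrative assumptions, not CuMo's actual implementation.

```python
# Hypothetical sketch of MLP-to-MoE "upcycling": each expert starts as a copy
# of a pretrained dense MLP, and a learned router mixes the top-k experts per token.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    def __init__(self, dense_mlp: nn.Module, hidden_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Initialize every expert from the same pretrained dense MLP (the "upcycling" step).
        self.experts = nn.ModuleList(copy.deepcopy(dense_mlp) for _ in range(num_experts))
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, hidden_dim); dense_mlp must map hidden_dim -> hidden_dim
        gate = F.softmax(self.router(x), dim=-1)       # (tokens, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)   # keep the top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Toy usage: upcycle a dense 2-layer MLP into a 4-expert MoE block.
dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
moe = UpcycledMoE(dense, hidden_dim=64)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```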
Cross-modal Representation Flattening for Multi-modal Domain Generalization
·3259 words·16 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
Cross-Modal Representation Flattening (CMRF) improves multi-modal domain generalization by creating consistent flat loss regions and enhancing knowledge transfer between modalities, outperforming existing methods.
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
·2284 words·11 mins·
Vision-Language Models
🏢 Hong Kong Polytechnic University
Can AI understand humor? A new benchmark, YESBUT, reveals that even state-of-the-art models struggle with the nuanced humor of juxtaposed comics, highlighting the need for improved AI in understanding humor.
Coupled Mamba: Enhanced Multimodal Fusion with Coupled State Space Model
·2541 words·12 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Huazhong University of Science and Technology
Coupled Mamba enhances multi-modal fusion with a coupled state space model, boosting both accuracy and efficiency.
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
·3057 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
ControlMLLM: Inject visual prompts into MLLMs via learnable latent variable optimization for training-free referring abilities, supporting box, mask, scribble, and point prompts.
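As a rough illustration of training-free visual prompting via latent optimization (the general mechanism this summary describes), the sketch below optimizes a small additive latent on frozen visual tokens so that attention mass concentrates inside a user-supplied region mask. The function names, the loss, and `attention_fn` are assumptions for illustration, not ControlMLLM's exact procedure.

```python
# Hedged sketch: test-time optimization of a latent offset on visual tokens,
# pushing attention toward a referred region while the model weights stay frozen.
import torch

def optimize_visual_latent(visual_tokens, region_mask, attention_fn, steps=20, lr=0.05):
    """visual_tokens: (n_tokens, dim); region_mask: (n_tokens,) in {0, 1};
    attention_fn: differentiable map from tokens to per-token attention weights."""
    latent = torch.zeros_like(visual_tokens, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        attn = attention_fn(visual_tokens + latent)
        loss = -(attn * region_mask).sum() / attn.sum()  # maximize attention inside the region
        opt.zero_grad()
        loss.backward()
        opt.step()
    return visual_tokens + latent.detach()

# Toy usage with a stand-in attention function over 16 tokens.
tokens = torch.randn(16, 64)
mask = torch.zeros(16); mask[:4] = 1.0
toy_attn = lambda t: torch.softmax(t.sum(dim=-1), dim=0)
print(optimize_visual_latent(tokens, mask, toy_attn).shape)  # torch.Size([16, 64])
```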
Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
·1891 words·9 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 New York University
Symile: A simple model-agnostic approach for learning representations from unlimited modalities, outperforming pairwise CLIP by capturing higher-order information.
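The higher-order signal comes from scoring all modalities jointly rather than pairwise; a multilinear inner product over three embeddings is one way to do this. The sketch below shows such a three-modality contrastive loss; the function names, temperature, and loss layout are illustrative assumptions, not Symile's exact objective.

```python
# Sketch of a three-modality contrastive loss built on a multilinear inner product,
# the kind of joint (higher-order) score that pairwise CLIP-style losses miss.
import torch
import torch.nn.functional as F

def multilinear_inner_product(a, b, c):
    """Elementwise product summed over the feature dim: sum_d a_d * b_d * c_d."""
    return (a * b * c).sum(dim=-1)

def symile_style_loss(za, zb, zc, temperature=0.07):
    """Contrastive loss where za[i] should match the pair (zb[i], zc[i])."""
    za, zb, zc = (F.normalize(z, dim=-1) for z in (za, zb, zc))
    # logits[i, j]: joint score of candidate za[i] against the pair (zb[j], zc[j]).
    logits = multilinear_inner_product(za.unsqueeze(1), zb.unsqueeze(0), zc.unsqueeze(0)) / temperature
    targets = torch.arange(za.shape[0])
    return F.cross_entropy(logits, targets)

# Toy batch of 8 aligned triples with 128-dim embeddings.
za, zb, zc = (torch.randn(8, 128) for _ in range(3))
print(symile_style_loss(za, zb, zc))
```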
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
·3139 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Institute of Automation, CAS
This paper boosts zero-shot OOD detection by introducing a conjugated semantic pool (CSP), improving FPR95 by 7.89%. CSP leverages modified superclass names for superior OOD label identification.
Combining Observational Data and Language for Species Range Estimation
·2627 words·13 mins·
Natural Language Processing
Vision-Language Models
🏢 UMass Amherst
LE-SINR combines Wikipedia species descriptions with citizen science observations to create accurate species range maps, even with limited data, outperforming existing methods.
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
·4007 words·19 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
CoMat: Aligning text-to-image diffusion models using image-to-text concept matching for superior text-image alignment.
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
·3116 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Integrated Vision and Language Lab, KAIST, South Korea
CODE combats LMM hallucinations by contrasting self-generated descriptions with visual content during decoding, enhancing response accuracy without retraining.
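Contrastive decoding of this kind typically reweights next-token logits by subtracting a "distractor" distribution from the visually grounded one. The sketch below shows that generic mechanism, contrasting image-conditioned logits against logits conditioned only on a self-generated description; the function name, `alpha`, and the plausibility cutoff are illustrative assumptions rather than CODE's exact decoding rule.

```python
# Minimal contrastive-decoding sketch: amplify tokens the image-conditioned model
# prefers over the description-only model, restricted to plausible candidates.
import torch
import torch.nn.functional as F

def contrastive_next_token(logits_with_image: torch.Tensor,
                           logits_with_description: torch.Tensor,
                           alpha: float = 1.0,
                           plausibility_cutoff: float = 0.1) -> int:
    """Pick the next token from contrasted logits, keeping only tokens the
    image-conditioned model itself finds plausible (adaptive cutoff)."""
    p_img = F.softmax(logits_with_image, dim=-1)
    contrast = (1 + alpha) * logits_with_image - alpha * logits_with_description
    plausible = p_img >= plausibility_cutoff * p_img.max()
    contrast = contrast.masked_fill(~plausible, float("-inf"))
    return int(torch.argmax(contrast))

# Toy example over a 5-token vocabulary.
print(contrastive_next_token(torch.tensor([2.0, 1.0, 0.5, -1.0, 0.0]),
                             torch.tensor([2.5, 0.2, 0.4, -1.0, 0.0])))
```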
Cluster-Learngene: Inheriting Adaptive Clusters for Vision Transformers
·3088 words·15 mins·
AI Generated
Computer Vision
Vision-Language Models
🏢 School of Computer Science and Engineering, Southeast University
Cluster-Learngene efficiently initializes elastic-scale Vision Transformers by adaptively clustering and inheriting key modules from a large ancestry model, saving resources and boosting downstream task performance.
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
·2694 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting multimodal contrastive learning, this research introduces negCLIPLoss and NormSim, novel data selection methods that surpass existing techniques by improving data quality and task relevance.
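For context, the simplest form of CLIP-based data selection ranks image-text pairs by their embedding similarity under a pretrained CLIP model and keeps the top fraction; negCLIPLoss and NormSim refine this baseline with normalized scores and task-aware similarity statistics. The sketch below shows only the baseline ranking step, with random tensors standing in for precomputed CLIP embeddings.

```python
# Baseline CLIP-score filtering sketch: keep the image-text pairs with the
# highest cosine similarity under a pretrained CLIP model (embeddings precomputed).
import torch
import torch.nn.functional as F

def select_top_pairs(image_embs: torch.Tensor, text_embs: torch.Tensor, keep_fraction: float = 0.3):
    """Return indices of the highest-similarity image-text pairs."""
    sims = F.cosine_similarity(image_embs, text_embs, dim=-1)  # one score per pair
    k = max(1, int(keep_fraction * sims.numel()))
    return torch.topk(sims, k).indices

# Toy pool of 1000 pairs with 512-dim embeddings.
idx = select_top_pairs(torch.randn(1000, 512), torch.randn(1000, 512))
print(idx.shape)  # torch.Size([300])
```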
CLIPCEIL: Domain Generalization through CLIP via Channel rEfinement and Image-text aLignment
·3674 words·18 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Brookhaven National Laboratory
CLIPCEIL enhances CLIP’s domain generalization by refining feature channels for domain invariance and aligning image-text embeddings, achieving state-of-the-art performance.
CLIP in Mirror: Disentangling text from visual images through reflection
·4284 words·21 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Beihang University
MirrorCLIP disentangles text from images in CLIP using mirror reflection differences, enhancing robustness against text-visual image confusion.
Classification Done Right for Vision-Language Pre-Training
·1685 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 ByteDance Research
SuperClass, a novel vision-language pre-training method, achieves superior performance on various downstream tasks by directly using tokenized raw text as supervised classification labels, eliminating the need for a text encoder.
CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models
·3973 words·19 mins·
Multimodal Learning
Vision-Language Models
🏢 University of New South Wales (UNSW Sydney)
CLAP4CLIP enhances vision-language model continual learning by using probabilistic finetuning, improving performance and uncertainty estimation.
CigTime: Corrective Instruction Generation Through Inverse Motion Editing
·2228 words·11 mins·
Natural Language Processing
Vision-Language Models
🏢 Hong Kong University of Science and Technology
CigTime generates corrective motion instructions from motion pairs using motion editing and large language models, improving on baselines by leveraging motion triplets for fine-tuning.
CIFD: Controlled Information Flow to Enhance Knowledge Distillation
·3139 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Samsung Research
CIFD, a novel knowledge distillation method, drastically cuts training costs while boosting performance, particularly on large datasets, by replacing Teacher Assistants with Rate-Distortion Modules.
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model
·1826 words·9 mins·
Natural Language Processing
Vision-Language Models
🏢 East China Normal University
ChatTracker boosts visual tracking by intelligently using a large language model to refine object descriptions, achieving performance on par with state-of-the-art methods.