
Multimodal Learning

Customized Multiple Clustering via Multi-Modal Subspace Proxy Learning
·1867 words·9 mins·
Multimodal Learning Vision-Language Models 🏢 University of Washington
Multi-Sub leverages multi-modal learning to achieve customized multiple clustering, aligning user-defined textual preferences with visual representations via a subspace proxy learning framework.
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts
·2211 words·11 mins·
Multimodal Learning Vision-Language Models 🏢 SHI Labs @ Georgia Tech & UIUC
CuMo boosts multimodal LLMs by efficiently integrating co-upcycled Mixture-of-Experts, achieving state-of-the-art performance with minimal extra parameters during inference.
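For readers curious what "co-upcycling" looks like in practice, below is a minimal PyTorch sketch of the general upcycling idea behind CuMo: each expert of a sparse top-k MoE block is initialized from a pretrained dense MLP before continued training. Class and parameter names are illustrative, not the paper's implementation.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoE(nn.Module):
    """Sparse MoE block whose experts are all initialized ("upcycled")
    from one pretrained dense MLP, with a top-k router on top."""
    def __init__(self, dense_mlp: nn.Module, hidden_dim: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # Each expert starts as a copy of the pretrained MLP weights.
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(num_experts))
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, hidden_dim)
        gate_logits = self.router(x)            # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```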
Cross-modal Representation Flattening for Multi-modal Domain Generalization
·3259 words·16 mins·
AI Generated Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
Cross-Modal Representation Flattening (CMRF) improves multi-modal domain generalization by creating consistent flat loss regions and enhancing knowledge transfer between modalities, outperforming existing methods.
Coupled Mamba: Enhanced Multimodal Fusion with Coupled State Space Model
·2541 words·12 mins·
AI Generated Multimodal Learning Vision-Language Models 🏢 Huazhong University of Science and Technology
Coupled Mamba enhances multi-modal fusion via a coupled state space model, boosting both accuracy and efficiency.
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
·3057 words·15 mins·
Multimodal Learning Vision-Language Models 🏢 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
ControlMLLM: Inject visual prompts into MLLMs via learnable latent variable optimization for training-free referring abilities, supporting box, mask, scribble, and point prompts.
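As a rough illustration of ControlMLLM's training-free recipe, the sketch below optimizes a small additive latent on the visual tokens at inference time so that text-to-vision attention concentrates on a user-marked region. The `mllm.cross_attention` interface and the energy term are assumptions for illustration; the paper defines its own objective.

```python
import torch

def optimize_visual_latent(mllm, visual_tokens, text_ids, region_mask,
                           steps=20, lr=0.1):
    """Training-free sketch: learn a small additive latent on the visual
    tokens so that cross-attention mass concentrates on `region_mask`
    (a boolean mask over visual-token positions). Illustrative only."""
    latent = torch.zeros_like(visual_tokens, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        # Assumed interface: the model exposes text->vision attention maps.
        attn = mllm.cross_attention(visual_tokens + latent, text_ids)
        # Energy: attention mass that falls outside the referred region.
        loss = attn[..., ~region_mask].sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (visual_tokens + latent).detach()
```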
Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
·1891 words·9 mins·
AI Generated Multimodal Learning Vision-Language Models 🏢 New York University
Symile: A simple model-agnostic approach for learning representations from unlimited modalities, outperforming pairwise CLIP by capturing higher-order information.
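The gist of moving beyond pairwise CLIP objectives can be sketched with a multilinear similarity that scores all modalities jointly. The loss below is a simplified three-modality version in the spirit of Symile (the paper symmetrizes over modalities and supports an arbitrary number of them); shapes and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def symile_style_loss(a, b, c, temperature=0.07):
    """Contrastive loss over three modalities using a multilinear
    (element-wise product, then sum) similarity instead of pairwise dot
    products. a, b, c: (batch, dim) L2-normalized embeddings.
    Simplified sketch; the paper symmetrizes over all modalities."""
    batch = a.shape[0]
    # Score every candidate `a` against each matched (b, c) pair.
    scores = torch.einsum('id,jd,jd->ji', a, b, c) / temperature  # (batch, batch)
    targets = torch.arange(batch, device=a.device)
    return F.cross_entropy(scores, targets)
```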
Continual Audio-Visual Sound Separation
·1511 words·8 mins·
Multimodal Learning Audio-Visual Learning 🏢 University of Texas at Dallas
ContAV-Sep: a novel approach to continual audio-visual sound separation, effectively mitigating catastrophic forgetting and improving model adaptability by preserving cross-modal semantic similarity across tasks.
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
·3139 words·15 mins·
Multimodal Learning Vision-Language Models 🏢 Institute of Automation, CAS
Boosting zero-shot OOD detection accuracy, this paper introduces a conjugated semantic pool (CSP) that improves FPR95 by 7.89%. CSP leverages modified superclass names for superior OOD label identification.
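Conceptually, the semantic-pool idea plugs into standard CLIP-style zero-shot OOD scoring: score an image against the in-distribution class names plus a large pool of extra candidate names, and measure how much probability mass stays on the ID names. The snippet below sketches that scoring step only; the scoring rule is an illustrative variant, and building the conjugated pool itself (the paper's contribution) is not reproduced here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_score(image_feats, id_text_feats, pool_text_feats, temperature=0.01):
    """Zero-shot OOD score with an expanded label pool: softmax over ID
    class names *plus* extra semantic-pool names, then sum the
    probability mass assigned to the ID names. Lower score -> more
    likely OOD. All features are assumed L2-normalized."""
    text_feats = torch.cat([id_text_feats, pool_text_feats], dim=0)
    logits = image_feats @ text_feats.T / temperature
    probs = F.softmax(logits, dim=-1)
    return probs[:, : id_text_feats.shape[0]].sum(dim=-1)
```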
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
·4007 words·19 mins·
AI Generated Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
CoMat: Aligning text-to-image diffusion models using image-to-text concept matching for superior text-image alignment.
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
·3116 words·15 mins·
Multimodal Learning Vision-Language Models 🏢 Integrated Vision and Language Lab, KAIST, South Korea
CODE combats LMM hallucinations by contrasting self-generated descriptions with visual content during decoding, enhancing response accuracy without retraining.
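CODE's decoding-time correction can be approximated with a standard contrastive-decoding update: next-token logits conditioned on the image are contrasted against logits conditioned on the model's own self-generated description, with an adaptive plausibility cutoff. The coefficients below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def contrastive_next_token(logits_visual, logits_desc, alpha=1.0, beta=0.1):
    """logits_visual: next-token logits with the image in context.
    logits_desc: logits with the self-generated description in place of
    the image. Down-weight tokens the description alone would produce,
    restricted to a plausibility set from the visual branch (sketch)."""
    contrast = (1 + alpha) * logits_visual - alpha * logits_desc
    # Adaptive plausibility constraint: keep only tokens whose visual
    # probability is within a factor beta of the best token.
    probs = F.softmax(logits_visual, dim=-1)
    keep = probs >= beta * probs.max(dim=-1, keepdim=True).values
    contrast = contrast.masked_fill(~keep, float('-inf'))
    return torch.argmax(contrast, dim=-1)
```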
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
·2694 words·13 mins·
Multimodal Learning Vision-Language Models 🏢 University of Washington
Boosting multimodal contrastive learning, this research introduces negCLIPLoss and NormSim, novel data selection methods that surpass existing techniques by improving data quality and task relevance.
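As a simplified picture of score-based data selection in this spirit, the sketch below scores each image-text pair by its CLIP similarity minus a batch-level normalization term (the intuition behind negCLIPLoss) and keeps the top-scoring fraction. Function names and the exact normalization are illustrative.

```python
import torch

@torch.no_grad()
def neg_cliploss_scores(img_feats, txt_feats, temperature=0.07):
    """Score each image-text pair by its CLIP similarity minus a
    normalization term built from similarities to the other samples in
    the batch. Features are L2-normalized, shape (batch, dim).
    Higher score -> keep the sample."""
    sims = img_feats @ txt_feats.T / temperature          # (batch, batch)
    pos = sims.diag()
    # Penalize pairs that also match many *other* captions/images.
    norm = 0.5 * (torch.logsumexp(sims, dim=1) + torch.logsumexp(sims, dim=0))
    return pos - norm

def select_top_fraction(scores, fraction=0.3):
    """Keep the indices of the top `fraction` of samples by score."""
    k = max(1, int(fraction * scores.numel()))
    return torch.topk(scores, k).indices
```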
CLIPCEIL: Domain Generalization through CLIP via Channel rEfinement and Image-text aLignment
·3674 words·18 mins·
AI Generated Multimodal Learning Vision-Language Models 🏢 Brookhaven National Laboratory
CLIPCEIL enhances CLIP’s domain generalization by refining feature channels for domain invariance and aligning image-text embeddings, achieving state-of-the-art performance.
CLIP in Mirror: Disentangling text from visual images through reflection
·4284 words·21 mins·
AI Generated Multimodal Learning Vision-Language Models 🏢 Beihang University
MirrorCLIP disentangles text from images in CLIP using mirror reflection differences, enhancing robustness against text-visual image confusion.
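A toy version of the mirror-reflection idea: encode an image and its horizontal flip with CLIP, treat the feature dimensions that change most between the two views as text-driven, and keep only the mirror-consistent ones. This is an illustrative sketch, not the paper's disentangling procedure.

```python
import torch

@torch.no_grad()
def mirror_disentangle(clip_model, image, flipped_image, keep_ratio=0.8):
    """Encode an image and its horizontal mirror; feature dimensions that
    differ most between the two views are treated as text-driven and
    zeroed out, keeping the most mirror-consistent dimensions."""
    f = clip_model.encode_image(image)
    f_m = clip_model.encode_image(flipped_image)
    diff = (f - f_m).abs()
    k = int(keep_ratio * f.shape[-1])
    keep_idx = torch.topk(-diff, k, dim=-1).indices       # smallest differences
    mask = torch.zeros_like(f).scatter_(-1, keep_idx, 1.0)
    return f * mask
```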
Classifier-guided Gradient Modulation for Enhanced Multimodal Learning
·2128 words·10 mins·
Multimodal Learning Multimodal Understanding 🏢 Shanghai AI Lab
Classifier-Guided Gradient Modulation (CGGM) enhances multimodal learning by balancing the training process, considering both gradient magnitude and direction, leading to consistent performance improv…
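The balancing idea can be sketched as a per-modality gradient rescaling driven by auxiliary classifier losses: modalities whose classifiers lag get relatively larger encoder updates. The sketch below handles gradient magnitude only (CGGM also accounts for direction), the coefficient rule is illustrative, and the function would be called between `loss.backward()` and `optimizer.step()`.

```python
import torch.nn as nn

def modulate_modality_gradients(encoders: dict[str, nn.Module],
                                modality_losses: dict, eps=1e-8):
    """encoders: name -> encoder module; modality_losses: name -> scalar
    loss from a per-modality classifier head. Scale each encoder's
    gradients so under-performing (higher-loss) modalities receive
    relatively larger updates. Simplified: magnitude only."""
    mean_loss = sum(modality_losses.values()) / len(modality_losses)
    for name, enc in encoders.items():
        coeff = (modality_losses[name] / (mean_loss + eps)).detach()
        for p in enc.parameters():
            if p.grad is not None:
                p.grad.mul_(coeff)
```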
Classification Done Right for Vision-Language Pre-Training
·1685 words·8 mins·
Multimodal Learning Vision-Language Models 🏢 ByteDance Research
SuperClass, a novel vision-language pre-training method, achieves superior performance on various downstream tasks by directly using tokenized raw text as supervised classification labels, eliminating the need for a separate text encoder.
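The core recipe is easy to sketch: tokenize each caption, turn the token ids into a multi-hot target over the text vocabulary, and train the image encoder plus a linear head with a classification loss, with no text encoder in the loop. The loss below is a plain BCE variant for illustration; the paper uses its own label weighting.

```python
import torch
import torch.nn.functional as F

def superclass_style_loss(image_feats, caption_token_ids, head):
    """image_feats: (batch, dim); caption_token_ids: list of LongTensors
    holding each caption's token ids; head: linear layer mapping image
    features to vocab-sized logits. Build a multi-hot target over the
    text vocabulary and train with BCE. Sketch only."""
    logits = head(image_feats)                    # (batch, vocab_size)
    targets = torch.zeros_like(logits)
    for i, ids in enumerate(caption_token_ids):
        targets[i, ids] = 1.0
    return F.binary_cross_entropy_with_logits(logits, targets)
```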
CLAP4CLIP: Continual Learning with Probabilistic Finetuning for Vision-Language Models
·3973 words·19 mins·
Multimodal Learning Vision-Language Models 🏢 University of New South Wales (UNSW Sydney)
CLAP4CLIP enhances vision-language model continual learning by using probabilistic finetuning, improving performance and uncertainty estimation.
CIFD: Controlled Information Flow to Enhance Knowledge Distillation
·3139 words·15 mins·
Multimodal Learning Vision-Language Models 🏢 Samsung Research
CIFD, a novel knowledge distillation method, drastically cuts training costs while boosting performance, particularly for large datasets, by using Rate-Distortion Modules instead of Teacher Assistants.
ChatCam: Empowering Camera Control through Conversational AI
·1805 words·9 mins·
Multimodal Learning Vision-Language Models 🏢 Hong Kong University of Science and Technology
ChatCam empowers users to control cameras via natural language, using CineGPT for text-conditioned trajectory generation and an Anchor Determinator for precise placement, enabling high-quality video r…
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
·4503 words·22 mins·
Multimodal Learning Vision-Language Models 🏢 New York University
Cambrian-1: Open, vision-centric multimodal LLMs achieve state-of-the-art performance using a novel spatial vision aggregator and high-quality data.
CALVIN: Improved Contextual Video Captioning via Instruction Tuning
·2746 words·13 mins·
AI Generated Multimodal Learning Vision-Language Models 🏢 Meta AI
CALVIN: Instruction tuning boosts contextual video captioning, achieving state-of-the-art results!