🏢 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
·3057 words·15 mins·
Multimodal Learning · Vision-Language Models
ControlMLLM injects visual prompts into MLLMs by optimizing a learnable latent variable, enabling training-free referring abilities and supporting box, mask, scribble, and point prompts.