ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
Multimodal Learning
Vision-Language Models
🏢 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
ControlMLLM: injects visual prompts into MLLMs via learnable latent variable optimization, giving training-free referring abilities that support box, mask, scribble, and point prompts.
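
The core idea can be illustrated with a short sketch: a learnable latent is added to the visual tokens and optimized at test time so that the model's text-to-visual attention concentrates on the region indicated by the visual prompt. The names below (`optimize_latent`, `attention_energy`, `forward_fn`) and the toy attention model are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of training-free latent optimization, assuming a frozen MLLM
# exposes its text-to-visual attention map. All names here are hypothetical.
import torch


def attention_energy(attn, region_mask):
    # Penalize attention mass that falls outside the referred region.
    inside = (attn * region_mask).sum(dim=-1)
    total = attn.sum(dim=-1) + 1e-6
    return (1.0 - inside / total).mean()


def optimize_latent(visual_tokens, region_mask, forward_fn, steps=10, lr=0.01):
    """Optimize a latent added to the visual tokens at inference time.

    forward_fn(tokens) should return an attention map of shape
    (num_text_tokens, num_visual_tokens); the MLLM weights stay frozen.
    """
    latent = torch.zeros_like(visual_tokens, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        attn = forward_fn(visual_tokens + latent)
        loss = attention_energy(attn, region_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (visual_tokens + latent).detach()


# Toy usage with random tensors standing in for a frozen MLLM.
if __name__ == "__main__":
    num_visual, dim, num_text = 256, 64, 8
    visual_tokens = torch.randn(num_visual, dim)
    region_mask = torch.zeros(num_visual)
    region_mask[:32] = 1.0  # e.g. visual tokens covered by a box prompt
    query = torch.randn(num_text, dim)

    def forward_fn(tokens):
        # Stand-in for the model's text-to-visual cross-attention.
        return torch.softmax(query @ tokens.T / dim ** 0.5, dim=-1)

    steered_tokens = optimize_latent(visual_tokens, region_mask, forward_fn)
    print(steered_tokens.shape)
```

Because only the small latent is updated and the model weights are untouched, the same procedure applies to any prompt type (box, mask, scribble, or point) by changing how `region_mask` is constructed.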