ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models
Multimodal Learning
Vision-Language Models
🏢 Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
ControlMLLM: injects visual prompts into MLLMs via learnable latent variable optimization, giving training-free referring abilities that support box, mask, scribble, and point prompts.
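
The core idea can be illustrated with a short sketch: a learnable latent is added to the visual tokens and optimized at test time so that the model's text-to-visual attention concentrates on the region indicated by the visual prompt. The names below (`optimize_latent`, `attention_energy`, `forward_fn`) and the toy attention model are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of training-free latent optimization, assuming a frozen MLLM
# exposes its text-to-visual attention map. All names here are hypothetical.
import torch


def attention_energy(attn, region_mask):
    # Penalize attention mass that falls outside the referred region.
    inside = (attn * region_mask).sum(dim=-1)
    total = attn.sum(dim=-1) + 1e-6
    return (1.0 - inside / total).mean()


def optimize_latent(visual_tokens, region_mask, forward_fn, steps=10, lr=0.01):
    """Optimize a latent added to the visual tokens at inference time.

    forward_fn(tokens) should return an attention map of shape
    (num_text_tokens, num_visual_tokens); the MLLM weights stay frozen.
    """
    latent = torch.zeros_like(visual_tokens, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        attn = forward_fn(visual_tokens + latent)
        loss = attention_energy(attn, region_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (visual_tokens + latent).detach()


# Toy usage with random tensors standing in for a frozen MLLM.
if __name__ == "__main__":
    num_visual, dim, num_text = 256, 64, 8
    visual_tokens = torch.randn(num_visual, dim)
    region_mask = torch.zeros(num_visual)
    region_mask[:32] = 1.0  # e.g. visual tokens covered by a box prompt
    query = torch.randn(num_text, dim)

    def forward_fn(tokens):
        # Stand-in for the model's text-to-visual cross-attention.
        return torch.softmax(query @ tokens.T / dim ** 0.5, dim=-1)

    steered_tokens = optimize_latent(visual_tokens, region_mask, forward_fn)
    print(steered_tokens.shape)
```

Because only the small latent is updated and the model weights are untouched, the same procedure applies to any prompt type (box, mask, scribble, or point) by changing how `region_mask` is constructed.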