
🏢 Tencent Youtu Lab

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2577 words·13 mins·
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models
VITA-1.5 achieves near real-time vision and speech interaction by using a novel three-stage training method that progressively integrates speech data into an LLM, enabling fluent conversations.
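The summary above describes a progressive, staged training recipe: speech capabilities are added to an already-capable vision-language LLM one stage at a time, with earlier modules frozen so new modalities do not degrade existing ones. The sketch below illustrates that general freeze/unfreeze pattern in plain Python; the stage names and module split are assumptions for illustration, not details from VITA-1.5 itself.

```python
# Hypothetical sketch of a three-stage progressive training schedule.
# Stage names and the module decomposition are assumptions, not taken
# from the VITA-1.5 paper; they only illustrate the freeze/unfreeze idea.

STAGES = [
    # Stage 1: align vision to the LLM; speech components untouched.
    {"name": "vision_alignment", "trainable": {"vision_adapter"}},
    # Stage 2: introduce speech understanding while vision stays frozen.
    {"name": "speech_input", "trainable": {"speech_encoder", "speech_adapter"}},
    # Stage 3: tune speech generation and the LLM end to end.
    {"name": "speech_output", "trainable": {"speech_decoder", "llm"}},
]

MODULES = {"vision_adapter", "speech_encoder", "speech_adapter",
           "speech_decoder", "llm"}


def trainable_plan(stages, modules):
    """For each stage, list which modules are updated and which are frozen."""
    plan = []
    for stage in stages:
        frozen = modules - stage["trainable"]
        plan.append((stage["name"], sorted(stage["trainable"]), sorted(frozen)))
    return plan


if __name__ == "__main__":
    for name, trainable, frozen in trainable_plan(STAGES, MODULES):
        print(f"{name}: train {trainable}, freeze {frozen}")
```

The design point the staging captures: by keeping previously trained modules frozen in each new stage, speech data can be integrated without overwriting the vision-language alignment learned earlier.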