🏢 LTCI, Télécom Paris, Institut Polytechnique De Paris

An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
· 3208 words · 16 mins
AI Generated · Multimodal Learning · Vision-Language Models
Leveraging vision-language models, this research introduces a novel unsupervised zero-shot audio captioning method that achieves state-of-the-art performance by aligning audio and image token distributions.