🏢 LTCI, Télécom Paris, Institut Polytechnique De Paris
An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
3208 words · 16 mins
AI Generated
Multimodal Learning
Vision-Language Models
Leveraging vision-language models, this research introduces a novel unsupervised zero-shot audio captioning method that achieves state-of-the-art performance by aligning audio and image token distribu…