Audio-Visual Learning

Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
·2851 words·14 mins
Multimodal Learning Audio-Visual Learning 🏢 Imperial College
One model to rule them all! This paper introduces Unified Speech Recognition (USR), a single model trained for auditory, visual, and audiovisual speech recognition, achieving state-of-the-art results …
UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner
·2866 words·14 mins
AI Generated Multimodal Learning Audio-Visual Learning 🏢 Tsinghua University
UniAudio 1.5 uses a novel LLM-driven audio codec to enable frozen LLMs to perform various audio tasks with just a few examples, opening new avenues for efficient few-shot cross-modal learning.
Tell What You Hear From What You See - Video to Audio Generation Through Text
·2349 words·12 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Washington
VATT: text-guided video-to-audio generation, enabling fine-grained audio control via text prompts and improved audio-visual compatibility.
Mixtures of Experts for Audio-Visual Learning
·2112 words·10 mins
Multimodal Learning Audio-Visual Learning 🏢 Fudan University
AVMoE: a novel parameter-efficient transfer learning approach for audio-visual learning that dynamically allocates expert models (unimodal and cross-modal adapters) based on task demands, achieving superi…
Listenable Maps for Zero-Shot Audio Classifiers
·2601 words·13 mins
Multimodal Learning Audio-Visual Learning 🏢 Fondazione Bruno Kessler
LMAC-ZS: the first decoder-based method for explaining zero-shot audio classifiers, improving transparency and trustworthiness in AI.
Learning Spatially-Aware Language and Audio Embeddings
·3744 words·18 mins
Multimodal Learning Audio-Visual Learning 🏢 Georgia Institute of Technology
ELSA: a new model that learns spatially aware language and audio embeddings, achieving state-of-the-art performance in semantic retrieval and 3D sound source localization.
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
·2541 words·12 mins
AI Generated Multimodal Learning Audio-Visual Learning 🏢 Zhejiang University
FRIEREN: a novel video-to-audio generation network using rectified flow matching achieves state-of-the-art performance by improving audio quality, temporal alignment, and generation efficiency.
DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection
·1673 words·8 mins
Multimodal Learning Audio-Visual Learning 🏢 Tsinghua University
DARNet: a dual attention network for auditory attention detection that surpasses current state-of-the-art models, especially in short decision windows, while using 91% fewer parameters.
Continual Audio-Visual Sound Separation
·1511 words·8 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Texas at Dallas
ContAV-Sep: a novel approach to continual audio-visual sound separation, effectively mitigating catastrophic forgetting and improving model adaptability by preserving cross-modal semantic similarity a…
AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis
·2151 words·11 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Surrey, UK
AV-GS: a novel Audio-Visual Gaussian Splatting model that uses geometry- and material-aware priors to efficiently synthesize realistic binaural audio from a single audio source.
AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
·2151 words·11 mins
Multimodal Learning Audio-Visual Learning 🏢 University of Washington
AV-Cloud: real-time, high-quality 3D spatial audio rendering synced with visuals, bypassing pre-rendered images for immersive virtual experiences.
Aligning Audio-Visual Joint Representations with an Agentic Workflow
·1961 words·10 mins
Multimodal Learning Audio-Visual Learning 🏢 DAMO Academy, Alibaba Group
AVAgent uses an LLM-driven workflow to intelligently align audio and visual data, resulting in improved AV joint representations and state-of-the-art performance on various downstream tasks.
A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation
·2816 words·14 mins
Multimodal Learning Audio-Visual Learning 🏢 Seoul National University
A single model tackles diverse audiovisual generation tasks using a novel Mixture of Noise Levels approach, resulting in temporally consistent and high-quality outputs.