Multimodal Generation

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
·1534 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Illinois Urbana-Champaign
MMAudio achieves state-of-the-art video-to-audio synthesis by jointly training on audio-visual and text-audio data, enabling high-quality, semantically and temporally aligned audio generation.
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
·3107 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Edinburgh
VMB generates music from videos, images, and text, using description and retrieval bridges to improve quality and controllability.
Multimodal Latent Language Modeling with Next-Token Diffusion
·4442 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Microsoft Research
LatentLM is a multimodal model that unifies discrete and continuous data via next-token diffusion, surpassing existing methods in performance and scalability across various tasks.
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
·5107 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 UC Los Angeles
OmniFlow is a generative model for any-to-any multimodal generation that outperforms existing models and offers flexible control.
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
·2599 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Roblox
SmoothCache is a universal technique that speeds up Diffusion Transformer inference by 8-71% across modalities without sacrificing quality.