Multimodal Generation

Unified Multimodal Discrete Diffusion
·3324 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Carnegie Mellon University
UniDisc: a unified multimodal discrete diffusion model for joint text and image generation, surpassing autoregressive models in quality & efficiency!
MusicInfuser: Making Video Diffusion Listen and Dance
·4650 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Washington
Sync your moves! MusicInfuser adapts video diffusion to make models listen and dance to music, preserving style and aligning movement.
FlowTok: Flowing Seamlessly Across Text and Image Tokens
·2984 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 ByteDance Seed
FlowTok: Seamlessly flows across text and image tokens!
Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models
·2980 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Peking University
Uni$\textbf{F}^2$ace: a novel unified multimodal model (UMM) tailored for fine-grained face understanding and generation.
Motion Anything: Any to Motion Generation
·7987 words·38 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 ANU
Motion Anything: control human motion generation with multimodal conditions like text and music.
SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
·2729 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Shanghai Artificial Intelligence Laboratory
SurveyForge automates survey generation, improving quality and evaluation.
VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation
·1959 words·10 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Technology Sydney
VideoUFO: A new user-focused, million-scale dataset that improves text-to-video generation by aligning training data with real user interests and preferences!
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
·1534 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Illinois Urbana-Champaign
MMAudio achieves state-of-the-art video-to-audio synthesis by jointly training on audio-visual and text-audio data, enabling high-quality, semantically and temporally aligned audio generation.
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
·3107 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 University of Edinburgh
VMB generates music from videos, images, and text, using description and retrieval bridges to improve quality and controllability.
Multimodal Latent Language Modeling with Next-Token Diffusion
·4442 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Microsoft Research
LatentLM: a novel multimodal model unifying discrete & continuous data via next-token diffusion, surpassing existing methods in performance & scalability across various tasks.
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
·5107 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 UC Los Angeles
OmniFlow: a novel generative model for any-to-any multimodal generation, outperforming existing models and offering flexible control!
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
·2599 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Roblox
SmoothCache: A universal technique boosts Diffusion Transformer inference speed by 8-71% across modalities, without sacrificing quality!