
Computer Vision

Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
·4089 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance
Video Depth Anything achieves consistent depth estimation for super-long videos by enhancing Depth Anything V2 with a spatial-temporal head and a novel temporal consistency loss, setting a new state-of-the-art.
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
·4649 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Google DeepMind
TokenVerse: Extract & combine visual concepts from multiple images for creative image generation!
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
·3101 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tencent AI Lab
Hunyuan3D 2.0: A groundbreaking open-source system generating high-resolution, textured 3D assets using scalable diffusion models, exceeding state-of-the-art performance.
GPS as a Control Signal for Image Generation
·3156 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Michigan
GPS-guided image generation is here! This paper leverages GPS data to create highly realistic images reflecting specific locations, even reconstructing 3D models from 2D photos.
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
·2205 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Alibaba Group
EMO2 achieves realistic audio-driven avatar video generation with a two-stage framework: first generating hand poses directly from audio, then using a diffusion model to synthesize full-body video.
X-Dyna: Expressive Dynamic Human Image Animation
·3011 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Southern California
X-Dyna: a novel diffusion-based pipeline that generates realistic human image animation zero-shot by integrating a Dynamics-Adapter for dynamic detail preservation, exceeding state-of-the-art methods.
Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions
·2057 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Alibaba Group
Textoon: generating vivid 2D cartoon characters from text descriptions in under a minute, streamlining the animation workflow.
GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor
·2208 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Hong Kong University of Science and Technology
GaussianAvatar-Editor enables photorealistic, text-driven editing of animatable 3D heads, solving motion occlusion and ensuring temporal consistency.
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
·3696 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance Seed
VideoWorld shows AI can learn complex reasoning and planning skills from unlabeled videos alone, achieving professional-level performance in Go and robotics.
SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
·2347 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Yale University
SynthLight: A novel diffusion model relights portraits realistically by learning to re-render synthetic faces, generalizing remarkably well to real photographs.
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
·4248 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Meta
Scaling visual tokenizers dramatically improves image and video generation: by focusing on decoder scaling, this work achieves state-of-the-art results and outperforms existing methods with less computation.
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
·5585 words·27 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 NYU
Boosting diffusion model performance at inference time, this research introduces a framework that goes beyond simply increasing denoising steps, instead searching for better noise candidates during sampling.
CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation
·3330 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Graphics AI Lab, NC Research
CaPa: Carve-n-Paint Synthesis generates hyper-realistic 4K textured meshes in under 30 seconds, setting a new standard for efficient 3D asset creation.
AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
·2125 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Alibaba Tongyi Lab
AnyStory: a unified framework that enables high-fidelity personalized image generation for both single and multiple subjects, addressing the subject-fidelity challenges of existing methods.
RepVideo: Rethinking Cross-Layer Representation for Video Generation
·2785 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
RepVideo enhances text-to-video generation by enriching feature representations, resulting in significantly improved temporal coherence and spatial detail.
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
·2366 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Rochester
Ouroboros-Diffusion: a novel tuning-free long-video generation framework achieving strong content consistency by integrating information across frames via latent sampling, cross-frame attention, and self-recurrent guidance.
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
·3972 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tencent AI Lab
CityDreamer4D generates realistic, unbounded 4D city models by separating dynamic objects (like vehicles) from static elements (buildings, roads), using multiple neural fields for enhanced realism.
GameFactory: Creating New Games with Generative Interactive Videos
·3286 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Hong Kong
GameFactory uses AI to generate entirely new games within diverse, open-domain scenes by learning action controls from a small dataset and transferring them to pre-trained video models.
Do generative video models learn physical principles from watching videos?
·3121 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google DeepMind
Generative video models struggle to understand physics despite producing visually realistic videos; the Physics-IQ benchmark reveals this critical limitation, highlighting the need for improved physical reasoning.
The GAN is dead; long live the GAN! A Modern GAN Baseline
·2531 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Brown University
R3GAN: a modernized GAN baseline that achieves state-of-the-art results with a simple, stable loss function and a modern architecture, debunking the myth that GANs are hard to train.