Computer Vision
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
·2900 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
Researchers significantly enhanced autoregressive image generation by integrating chain-of-thought reasoning strategies, achieving a remarkable +24% improvement on the GenEval benchmark.
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
·4089 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ByteDance
Video Depth Anything achieves consistent depth estimation for super-long videos by enhancing Depth Anything V2 with a spatial-temporal head and a novel temporal consistency loss, setting a new state-o…
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
·4649 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Google DeepMind
TokenVerse: Extract & combine visual concepts from multiple images for creative image generation!
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
·3101 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tencent AI Lab
Hunyuan3D 2.0: A groundbreaking open-source system generating high-resolution, textured 3D assets using scalable diffusion models, exceeding state-of-the-art performance.
GPS as a Control Signal for Image Generation
·3156 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Michigan
GPS-guided image generation is here! This paper leverages GPS data to create highly realistic images reflecting specific locations, even reconstructing 3D models from 2D photos.
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
·2205 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Alibaba Group
EMO2 achieves realistic audio-driven avatar video generation by employing a two-stage framework: first generating hand poses directly from audio and then using a diffusion model to synthesize full-bod…
X-Dyna: Expressive Dynamic Human Image Animation
·3011 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Southern California
X-Dyna: a novel diffusion-based pipeline generates realistic human image animation using a zero-shot approach by integrating a Dynamics-Adapter for dynamic detail preservation, exceeding state-of-the-…
Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions
·2057 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Alibaba Group
Textoon: Generating vivid 2D cartoon characters from text descriptions in under a minute, streamlining animation workflows.
GSTAR: Gaussian Surface Tracking and Reconstruction
·2047 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 ETH Zurich
GSTAR: A novel method achieving photorealistic rendering, accurate reconstruction, and reliable 3D tracking of dynamic scenes with changing topology, even handling surfaces appearing, disappearing, or…
GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor
·2208 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Hong Kong University of Science and Technology
GaussianAvatar-Editor enables photorealistic, text-driven editing of animatable 3D heads, solving motion occlusion and ensuring temporal consistency.
DiffuEraser: A Diffusion Model for Video Inpainting
·2356 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Alibaba Group
DiffuEraser, a novel video inpainting model based on Stable Diffusion, surpasses existing methods by injecting priors and improving temporal consistency for superior results.
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
·3696 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ByteDance Seed
VideoWorld shows AI can learn complex reasoning and planning skills from unlabeled videos alone, achieving professional-level performance in Go and robotics.
SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
·2347 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Yale University
SynthLight: A novel diffusion model relights portraits realistically by learning to re-render synthetic faces, generalizing remarkably well to real photographs.
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
·4248 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Meta
Scaling visual tokenizers dramatically improves image and video generation, achieving state-of-the-art results and outperforming existing methods with fewer computations by focusing on decoder scaling…
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
·5585 words·27 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 NYU
Boosting diffusion model performance at inference time, this research introduces a novel framework that goes beyond simply increasing denoising steps. By cleverly searching for better noise candidates…
CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation
·3330 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Graphics AI Lab, NC Research
CaPa: Carve-n-Paint Synthesis generates hyper-realistic 4K textured meshes in under 30 seconds, setting a new standard for efficient 3D asset creation.
AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
·2125 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Alibaba Tongyi Lab
AnyStory: A unified framework enables high-fidelity personalized image generation for single and multiple subjects, addressing subject fidelity challenges in existing methods.
RepVideo: Rethinking Cross-Layer Representation for Video Generation
·2785 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Nanyang Technological University
RepVideo enhances text-to-video generation by enriching feature representations, resulting in significantly improved temporal coherence and spatial detail.
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
·2366 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of Rochester
Ouroboros-Diffusion: A novel tuning-free long video generation framework achieving unprecedented content consistency by cleverly integrating information across frames via latent sampling, cross-frame…
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
·3972 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tencent AI Lab
CityDreamer4D generates realistic, unbounded 4D city models by cleverly separating dynamic objects (like vehicles) from static elements (buildings, roads), using multiple neural fields for enhanced re…