Computer Vision
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
·2900 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
Researchers significantly enhanced autoregressive image generation by integrating chain-of-thought reasoning strategies, achieving a remarkable +24% improvement on the GenEval benchmark.
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
·4089 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ByteDance
Video Depth Anything achieves consistent depth estimation for super-long videos by enhancing Depth Anything V2 with a spatial-temporal head and a novel temporal consistency loss, setting a new state-o…
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space
·4649 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Google DeepMind
TokenVerse: Extract & combine visual concepts from multiple images for creative image generation!
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
·3101 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tencent AI Lab
Hunyuan3D 2.0: A groundbreaking open-source system generating high-resolution, textured 3D assets using scalable diffusion models, exceeding state-of-the-art performance.
GPS as a Control Signal for Image Generation
·3156 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Michigan
GPS-guided image generation is here! This paper leverages GPS data to create highly realistic images reflecting specific locations, even reconstructing 3D models from 2D photos.
EMO2: End-Effector Guided Audio-Driven Avatar Video Generation
·2205 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Alibaba Group
EMO2 achieves realistic audio-driven avatar video generation by employing a two-stage framework: first generating hand poses directly from audio and then using a diffusion model to synthesize full-bod…
X-Dyna: Expressive Dynamic Human Image Animation
·3011 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Southern California
X-Dyna: a novel diffusion-based pipeline generates realistic human image animation using a zero-shot approach by integrating a Dynamics-Adapter for dynamic detail preservation, exceeding state-of-the-…
Textoon: Generating Vivid 2D Cartoon Characters from Text Descriptions
·2057 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Alibaba Group
Textoon: Generating vivid 2D cartoon characters from text descriptions in under a minute, streamlining animation workflows.
GSTAR: Gaussian Surface Tracking and Reconstruction
·2047 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 ETH Zurich
GSTAR: A novel method achieving photorealistic rendering, accurate reconstruction, and reliable 3D tracking of dynamic scenes with changing topology, even handling surfaces appearing, disappearing, or…
GaussianAvatar-Editor: Photorealistic Animatable Gaussian Head Avatar Editor
·2208 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Hong Kong University of Science and Technology
GaussianAvatar-Editor enables photorealistic, text-driven editing of animatable 3D heads, solving motion occlusion and ensuring temporal consistency.
DiffuEraser: A Diffusion Model for Video Inpainting
·2356 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Alibaba Group
DiffuEraser, a novel video inpainting model based on Stable Diffusion, surpasses existing methods by injecting priors and improving temporal consistency for superior results.
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
·3696 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ByteDance Seed
VideoWorld shows AI can learn complex reasoning and planning skills from unlabeled videos alone, achieving professional-level performance in Go and robotics.
SynthLight: Portrait Relighting with Diffusion Model by Learning to Re-render Synthetic Faces
·2347 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Yale University
SynthLight: A novel diffusion model relights portraits realistically by learning to re-render synthetic faces, generalizing remarkably well to real photographs.
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
·4248 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Meta
Scaling visual tokenizers dramatically improves image and video generation, achieving state-of-the-art results and outperforming existing methods with fewer computations by focusing on decoder scaling…
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
·5585 words·27 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 NYU
Boosting diffusion model performance at inference time, this research introduces a novel framework that goes beyond simply increasing denoising steps. By cleverly searching for better noise candidates…
CaPa: Carve-n-Paint Synthesis for Efficient 4K Textured Mesh Generation
·3330 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Graphics AI Lab, NC Research
CaPa: Carve-n-Paint Synthesis generates hyper-realistic 4K textured meshes in under 30 seconds, setting a new standard for efficient 3D asset creation.
AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
·2125 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Alibaba Tongyi Lab
AnyStory: A unified framework enables high-fidelity personalized image generation for single and multiple subjects, addressing subject fidelity challenges in existing methods.
RepVideo: Rethinking Cross-Layer Representation for Video Generation
·2785 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Nanyang Technological University
RepVideo enhances text-to-video generation by enriching feature representations, resulting in significantly improved temporal coherence and spatial detail.
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
·2366 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of Rochester
Ouroboros-Diffusion: A novel tuning-free long video generation framework achieving unprecedented content consistency by cleverly integrating information across frames via latent sampling, cross-frame…
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
·3972 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tencent AI Lab
CityDreamer4D generates realistic, unbounded 4D city models by cleverly separating dynamic objects (like vehicles) from static elements (buildings, roads), using multiple neural fields for enhanced re…