Computer Vision

Concat-ID: Towards Universal Identity-Preserving Video Synthesis
·2138 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Gaoling School of AI, Renmin University of China
Concat-ID: A universal, scalable framework for identity-preserving video synthesis, balancing consistency and editability.
WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
·1935 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
WideRange4D: A new benchmark & reconstruction method for high-quality 4D scenes with wide-range movements, pushing the boundaries of 4D reconstruction.
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
·2626 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Liverpool
KDTalker: Accurate & efficient audio-driven talking portrait via implicit keypoint-based spatiotemporal diffusion, unlocking diverse & realistic animations.
Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation
·2806 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Hong Kong University of Science and Technology
Reward signals alone suffice to drive fast, photo-realistic text-to-image generation.
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
·5602 words·27 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Apple
MM-Spatial enhances multimodal LLMs with 3D spatial reasoning via a novel dataset and benchmark, improving performance on spatial understanding tasks.
Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
·2576 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Shanghai Artificial Intelligence Laboratory
Infinite Mobility: Procedural generation of high-fidelity articulated objects for scalable embodied AI training.
Edit Transfer: Learning Image Editing via Vision In-Context Relations
·3168 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Communication University of China
Edit Transfer: Learns an image edit from a single example and applies it to new images, surpassing text- and reference-based methods!
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
·3109 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Zhejiang University
DreamRenderer: Taming attribute control in large-scale text-to-image models with a plug-and-play, training-free approach for enhanced content creation.
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
·2181 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
BlobCtrl: Precisely edit images at the element level with a unified, flexible framework, bridging the gap between generation and editing.
MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
·2743 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Zhejiang University
MagicID: ID-consistent & dynamic-preserved video customization via hybrid preference optimization.
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
·3299 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 UCLA
Reflect-DiT: Scaling Text-to-Image Diffusion Transformers via In-Context Reflection!
VGGT: Visual Geometry Grounded Transformer
·3346 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Oxford
VGGT: a fast, end-to-end transformer that infers complete 3D scene attributes from multiple views, outperforming optimization-based methods.
Towards a Unified Copernicus Foundation Model for Earth Vision
·4400 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Scene Understanding 🏢 Technical University of Munich
Unified Copernicus Foundation Model for Earth Vision: A multimodal approach to improve scalability, versatility, and adaptability of EO models.
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
·2617 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Zhejiang University
ReCamMaster: Re-shoots videos via generative rendering, controlling camera movement from a single source, for novel perspectives and enhanced video creation.
MTV-Inpaint: Multi-Task Long Video Inpainting
·3551 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 City University of Hong Kong
MTV-Inpaint: A unified framework for multi-task long video inpainting, enabling versatile object insertion, scene completion, editing, and removal.
Long Context Tuning for Video Generation
·2260 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 The Chinese University of Hong Kong
LCT: Fine-tunes single-shot video diffusion models for coherent multi-shot video generation without extra parameters!
LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds
·2424 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Alibaba Group
LHM: Animatable 3D avatars from a single image in seconds.
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
·2532 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 CUHK MMLab
GoT: Reasoning guides vivid image generation and editing!
ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
·2550 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Westlake University
ETCH: Equivariantly fitting bodies to clothed humans through tightness for better pose and shape accuracy.
Distilling Diversity and Control in Diffusion Models
·4046 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Northeastern University
Distilling diffusion models?💡 This paper shows how to retain the base model's diversity while keeping the distilled model's speed!