Computer Vision

Concat-ID: Towards Universal Identity-Preserving Video Synthesis
·2138 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Gaoling School of AI, Renmin University of China
Concat-ID: A universal, scalable framework for identity-preserving video synthesis, balancing consistency and editability.
WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
·1935 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
WideRange4D: A new benchmark & reconstruction method for high-quality 4D scenes with wide-range movements, pushing the boundaries of 4D reconstruction.
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
·2626 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Liverpool
KDTalker: Accurate & efficient audio-driven talking portrait via implicit keypoint-based spatiotemporal diffusion, unlocking diverse & realistic animations.
Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation
·2806 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Hong Kong University of Science and Technology
Reward signals alone suffice to drive fast, photo-realistic text-to-image generation.
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
·5602 words·27 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Apple
MM-Spatial enhances multimodal LLMs with 3D spatial reasoning via a novel dataset and benchmark, improving performance on spatial understanding tasks.
Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
·2576 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Shanghai Artificial Intelligence Laboratory
Infinite Mobility: Procedural generation of high-fidelity articulated objects for scalable embodied AI training.
Edit Transfer: Learning Image Editing via Vision In-Context Relations
·3168 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Communication University of China
Edit Transfer: Learns an image edit from a single example and applies it to new images, surpassing text- and reference-based methods!
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
·3109 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Zhejiang University
DreamRenderer: Taming attribute control in large-scale text-to-image models with a plug-and-play, training-free approach for enhanced content creation.
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
·2181 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
BlobCtrl: Precisely edit images at the element level with a unified, flexible framework, bridging the gap between generation and editing.
MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
·2743 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Zhejiang University
MagicID: ID-consistent & dynamic-preserved video customization via hybrid preference optimization.
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
·3299 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 UCLA
Reflect-DiT: Scaling Text-to-Image Diffusion Transformers via In-Context Reflection!
VGGT: Visual Geometry Grounded Transformer
·3346 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Oxford
VGGT: a fast, end-to-end transformer that infers complete 3D scene attributes from multiple views, outperforming optimization-based methods.
Towards a Unified Copernicus Foundation Model for Earth Vision
·4400 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Scene Understanding 🏢 Technical University of Munich
Unified Copernicus Foundation Model for Earth Vision: A multimodal approach to improve scalability, versatility, and adaptability of EO models.
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
·2617 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Zhejiang University
ReCamMaster: Re-shoots videos via generative rendering, controlling camera movement from a single source, for novel perspectives and enhanced video creation.
MTV-Inpaint: Multi-Task Long Video Inpainting
·3551 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 City University of Hong Kong
MTV-Inpaint: A unified framework for multi-task long video inpainting, enabling versatile object insertion, scene completion, editing, and removal.
Long Context Tuning for Video Generation
·2260 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 The Chinese University of Hong Kong
LCT: Fine-tunes single-shot video diffusion models for coherent multi-shot video generation without extra parameters!
LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds
·2424 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Alibaba Group
LHM: Animatable 3D avatars from a single image in seconds.
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
·2532 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 CUHK MMLab
GoT: Reasoning guides vivid image generation and editing!
ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
·2550 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Westlake University
ETCH: Equivariantly fitting bodies to clothed humans through tightness for better pose and shape accuracy.
Distilling Diversity and Control in Diffusion Models
·4046 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Northeastern University
Distilling diffusion models?💡 This paper shows how to retain the base model's diversity while keeping the distilled model's speed!