Computer Vision

KMM: Key Frame Mask Mamba for Extended Motion Generation
·2527 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
KMM: Key Frame Mask Mamba generates extended, diverse human motion from text prompts by innovatively masking key frames in the Mamba architecture and using contrastive learning for improved text-motion alignment.
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
·2454 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tencent AI Lab
StdGEN: Generate high-quality, semantically decomposed 3D characters from a single image in minutes, enabling flexible customization for various applications.
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
·4041 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 MIT
SVDQuant boosts 4-bit diffusion models by absorbing outliers via low-rank components, achieving 3.5x memory reduction and 3x speedup on 12B parameter models.
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
·3777 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Toronto
SG-I2V: Zero-shot controllable image-to-video generation using a self-guided approach that leverages pre-trained models for precise object and camera motion control.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
·2474 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google
ReCapture generates videos with novel camera angles from user videos using masked video fine-tuning, preserving scene motion and plausibly hallucinating unseen parts.
GazeGen: Gaze-Driven User Interaction for Visual Content Generation
·2843 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Human-AI Interaction 🏢 Harvard University
GazeGen uses real-time gaze tracking to enable intuitive hands-free visual content creation and editing, setting a new standard for accessible AR/VR interaction.
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
·2263 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
DimensionX generates photorealistic 3D and 4D scenes from a single image via controllable video diffusion, enabling precise manipulation of spatial structure and temporal dynamics.
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
·5135 words·25 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 SSE, CUHKSZ, China
GarVerseLOD introduces a novel dataset and framework for high-fidelity 3D garment reconstruction from a single image, achieving unprecedented robustness via a hierarchical approach and a massive dataset with multiple levels of detail.
Correlation of Object Detection Performance with Visual Saliency and Depth Estimation
·1673 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Object Detection 🏢 Dept. of Artificial Intelligence, University of Malta
Visual saliency boosts object detection accuracy more than depth estimation, especially for larger objects, offering valuable insights for model and dataset improvement.
Training-free Regional Prompting for Diffusion Transformers
·1817 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Training-free Regional Prompting for FLUX boosts compositional text-to-image generation by cleverly manipulating attention mechanisms, achieving fine-grained control without retraining.
How Far is Video Generation from World Model: A Physical Law Perspective
·3657 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance Research
Scaling video generation models doesn’t guarantee they’ll learn physics; this study reveals they prioritize visual cues over true physical understanding.
GenXD: Generating Any 3D and 4D Scenes
·2731 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 National University of Singapore
GenXD: A unified model generating high-quality 3D & 4D scenes from any number of images, advancing the field of dynamic scene generation.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
·3142 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Meta AI
Adaptive Caching (AdaCache) dramatically speeds up video generation with diffusion transformers by cleverly caching and reusing computations, tailoring the process to each video’s complexity and motion.
DreamPolish: Domain Score Distillation With Progressive Geometry Generation
·2197 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
DreamPolish: A new text-to-3D model generates highly detailed 3D objects with polished surfaces and realistic textures using progressive geometry refinement and a novel domain score distillation technique.
Randomized Autoregressive Visual Generation
·4145 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 ByteDance
Randomized Autoregressive Modeling (RAR) sets a new state-of-the-art in image generation by cleverly introducing randomness during training to improve the model’s ability to learn from bidirectional contexts.
Constant Acceleration Flow
·3289 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Korea University
Constant Acceleration Flow (CAF) dramatically speeds up diffusion model generation by using a constant acceleration equation, outperforming state-of-the-art methods with improved accuracy and few-step generation quality.
Learning Video Representations without Natural Videos
·3154 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ShanghaiTech University
High-performing video representation models can be trained using only synthetic videos and images, eliminating the need for large natural video datasets.
In-Context LoRA for Diffusion Transformers
·392 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tongyi Lab
In-Context LoRA empowers existing text-to-image models for high-fidelity multi-image generation by simply concatenating images and using minimal task-specific LoRA tuning.
DELTA: Dense Efficient Long-range 3D Tracking for any video
·3706 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 UMass Amherst
DELTA: A new method efficiently tracks every pixel in 3D space from monocular videos, enabling accurate motion estimation across entire videos with state-of-the-art accuracy and over 8x speed improvement.
HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models
·2152 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
HelloMeme enhances text-to-image models by integrating spatial knitting attentions, enabling high-fidelity meme video generation while preserving model generalization.