Computer Vision
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
·2511 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
VideoGen-of-Thought (VGoT) creates high-quality, multi-shot videos by collaboratively generating scripts, keyframes, and video clips, ensuring narrative consistency and visual coherence.
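The collaborative flow described above can be pictured as a three-stage pipeline. A minimal sketch, assuming placeholder callables for the script, keyframe, and clip generators (the names and signatures are illustrative, not VGoT's actual interfaces):

```python
# Placeholder sketch of a script -> keyframe -> clip pipeline; the three model callables
# and their signatures are assumptions for illustration, not VGoT's real components.
def generate_multishot_video(story_prompt, script_model, keyframe_model, clip_model):
    shots = script_model(story_prompt)            # shot-level script descriptions
    keyframes, clips = [], []
    for shot in shots:
        # condition each keyframe on the previous one to keep identities and style consistent
        prev = keyframes[-1] if keyframes else None
        keyframe = keyframe_model(shot, reference=prev)
        keyframes.append(keyframe)
        clips.append(clip_model(shot, keyframe))  # animate the keyframe into a short clip
    return clips
```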
SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
·4159 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 VinAI Research
SNOOPI supercharges one-step diffusion model distillation with enhanced guidance, achieving state-of-the-art performance by stabilizing training and enabling negative prompt control.
Scaling Image Tokenizers with Grouped Spherical Quantization
·7140 words·34 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Jülich Supercomputing Centre
GSQ-GAN, a novel image tokenizer, achieves superior reconstruction quality with 16x downsampling using grouped spherical quantization, enabling efficient scaling for high-fidelity image generation.
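The grouped spherical quantization named in the summary can be illustrated as: split the latent channels into groups, project each group onto the unit sphere, and snap it to the nearest unit-norm codeword. A rough sketch, where the group layout, codebook shape, and straight-through trick are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def grouped_spherical_quantize(z, codebook):
    """z: (B, C, H, W) latents; codebook: (G, K, C // G) unit-norm codewords per group."""
    B, C, H, W = z.shape
    G, K, d = codebook.shape
    # split channels into G groups and project each group onto the unit sphere
    zg = z.permute(0, 2, 3, 1).reshape(B, H, W, G, d)
    zg = F.normalize(zg, dim=-1)
    # nearest codeword per group by cosine similarity (codewords are unit-norm too)
    sim = torch.einsum("bhwgd,gkd->bhwgk", zg, codebook)
    idx = sim.argmax(dim=-1)                                      # (B, H, W, G)
    q = codebook[torch.arange(G, device=z.device), idx]           # (B, H, W, G, d)
    q = zg + (q - zg).detach()                                    # straight-through estimator
    return q.reshape(B, H, W, C).permute(0, 3, 1, 2), idx
```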
OmniCreator: Self-Supervised Unified Generation with Universal Editing
·5399 words·26 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
OmniCreator: a self-supervised framework unifying image and video generation with universal editing.
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
·3776 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Department of Electrical and Computer Engineering, North South University
VideoLights: a novel framework for joint video highlight detection & moment retrieval, boosts performance via feature refinement, cross-modal & cross-task alignment, achieving state-of-the-art results…
TinyFusion: Diffusion Transformers Learned Shallow
·4225 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 National University of Singapore
TinyFusion, a novel learnable depth pruning method, crafts efficient shallow diffusion transformers with superior post-fine-tuning performance, achieving a 2x speedup with less than 7% of the original…
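The learnable depth pruning can be pictured as a trainable keep/skip gate per transformer block, with only the highest-scoring blocks kept at the end. A toy sketch under that assumption (the sigmoid gating and top-k selection are illustrative, not TinyFusion's actual differentiable scheme):

```python
import torch
import torch.nn as nn

class GatedDepth(nn.Module):
    """Wraps a stack of blocks with learnable keep/skip gates."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.gate_logits = nn.Parameter(torch.zeros(len(blocks)))

    def forward(self, x):
        for block, logit in zip(self.blocks, self.gate_logits):
            g = torch.sigmoid(logit)
            x = (1 - g) * x + g * block(x)   # g -> 0 effectively removes the block
        return x

    def prune(self, keep):
        # keep the blocks with the highest learned gates, preserving their order
        keep_idx = self.gate_logits.topk(keep).indices.sort().values
        return nn.Sequential(*(self.blocks[int(i)] for i in keep_idx))
```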
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
·3884 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Yandex Research
SWITTI: a novel scale-wise transformer achieves 7x faster text-to-image generation than state-of-the-art diffusion models, while maintaining competitive image quality.
Structured 3D Latents for Scalable and Versatile 3D Generation
·4249 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tsinghua University
Unified 3D latent representation (SLAT) enables versatile high-quality 3D asset generation, significantly outperforming existing methods.
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
·4589 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Mohamed Bin Zayed University of Artificial Intelligence
PhysGame benchmark unveils video LLMs’ weaknesses in understanding physical commonsense from gameplay videos, prompting the creation of PhysVLM, a knowledge-enhanced model that outperforms existing models.
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
·2297 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Science and Technology of China
From a single image to a realistic, animatable whole-body talking avatar: a novel pipeline combines diffusion models with a hybrid 3DGS-mesh representation, achieving seamless generalization and precise control.
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
·2333 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 SketchX, CVSSP, University of Surrey
NitroFusion achieves high-fidelity single-step image generation using a dynamic adversarial training approach with a specialized discriminator pool, dramatically improving speed and quality.
Negative Token Merging: Image-based Adversarial Feature Guidance
·2311 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Washington
NegToMe: Image-based adversarial guidance improves image generation diversity and reduces similarity to copyrighted content without training, simply by using images instead of negative text prompts.
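The training-free guidance described above can be sketched as pushing each token of the image being generated away from its best-matching token in the reference image's features; the cosine matching and the push strength alpha below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def negative_token_merge(gen_tokens, ref_tokens, alpha=0.1):
    """gen_tokens: (N, D) features of the image being generated; ref_tokens: (M, D) reference features."""
    sim = F.normalize(gen_tokens, dim=-1) @ F.normalize(ref_tokens, dim=-1).T   # (N, M)
    matched = ref_tokens[sim.argmax(dim=-1)]          # closest reference token for each generated token
    # extrapolate away from the matched reference feature (negative guidance)
    return gen_tokens + alpha * (gen_tokens - matched)
```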
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
·1734 words·9 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 01.AI
Presto: a novel video diffusion model generates 15-second, high-quality videos with unparalleled long-range coherence and rich content, achieved through a segmented cross-attention mechanism and the L…
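A simplified picture of segmented cross-attention: the video token sequence is split along time, and each temporal segment cross-attends only to its own sub-caption. The even split, flattened token shape, and single-head attention in this sketch are simplifying assumptions:

```python
import torch

def segmented_cross_attention(video_tokens, sub_captions, to_q, to_k, to_v):
    """video_tokens: (T, D); sub_captions: list of S tensors of shape (L_i, D); to_* are nn.Linear(D, D)."""
    T, D = video_tokens.shape
    seg_len = T // len(sub_captions)
    outputs = []
    for i, caption in enumerate(sub_captions):
        segment = video_tokens[i * seg_len:(i + 1) * seg_len]       # tokens for this time segment
        q, k, v = to_q(segment), to_k(caption), to_v(caption)
        attn = torch.softmax(q @ k.T / D ** 0.5, dim=-1)            # (seg_len, L_i)
        outputs.append(attn @ v)                                    # each segment sees only its caption
    return torch.cat(outputs, dim=0)
```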
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
·3029 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of Waterloo
VISTA synthesizes long-duration, high-resolution video instruction data, creating VISTA-400K and HRVideoBench to significantly boost video LMM performance.
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
·5447 words·26 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Action Recognition
🏢 Yonsei University
DisCoRD: Rectified flow decodes discrete motion tokens into continuous, natural movement, balancing faithfulness and realism.
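The decoding step, discrete tokens in, continuous motion out, can be pictured as integrating a learned rectified-flow velocity field from noise to motion, conditioned on the token embeddings. A sketch with an assumed Euler solver, step count, and conditioning interface:

```python
import torch

@torch.no_grad()
def rectified_flow_decode(velocity_net, token_emb, motion_shape, steps=16):
    """velocity_net(x, t, cond) predicts velocity; token_emb: (B, L, D) embedded discrete motion tokens."""
    x = torch.randn(motion_shape)                          # start from Gaussian noise
    for i in range(steps):
        t = torch.full((motion_shape[0],), i / steps)      # current flow time in [0, 1)
        v = velocity_net(x, t, token_emb)                  # conditioned velocity prediction
        x = x + v / steps                                  # Euler step along the (near-)straight path
    return x                                               # continuous motion sequence
```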
AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos
·2678 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tsinghua University
AlphaTablets: A novel 3D plane representation enabling accurate, consistent, and flexible 3D planar reconstruction from monocular videos, achieving state-of-the-art results.
Video Depth without Video Models
·3150 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Carnegie Mellon University
RollingDepth: Achieving state-of-the-art video depth estimation without using complex video models, by cleverly extending a single-image depth estimator.
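One way to picture the approach: run the single-image estimator over short overlapping snippets and stitch per-frame predictions with least-squares scale/shift alignment on the overlaps. The window size, stride, and simple blending below are assumptions, not RollingDepth's exact alignment scheme:

```python
import torch

def fit_scale_shift(src, dst):
    """Least-squares s, t so that s * src + t approximates dst (affine-invariant depth alignment)."""
    x, y = src.flatten(), dst.flatten()
    A = torch.stack([x, torch.ones_like(x)], dim=1)
    s, t = torch.linalg.lstsq(A, y.unsqueeze(1)).solution.squeeze()
    return s, t

def rolling_depth(frames, depth_model, window=3, stride=1):
    """frames: list of (C, H, W) tensors; depth_model maps one frame to an (H, W) depth map."""
    preds = [None] * len(frames)
    for start in range(0, len(frames) - window + 1, stride):
        for offset, frame in enumerate(frames[start:start + window]):
            d = depth_model(frame)
            i = start + offset
            if preds[i] is None:
                preds[i] = d
            else:
                s, t = fit_scale_shift(d, preds[i])      # align the new snippet to what is already there
                preds[i] = 0.5 * (preds[i] + s * d + t)  # blend for temporal consistency
    return torch.stack(preds)
```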
Trajectory Attention for Fine-grained Video Motion Control
·4421 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Nanyang Technological University
Trajectory Attention enhances video motion control by injecting trajectory information, improving precision and long-range consistency in video generation.
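A toy version of attention along point trajectories: gather the feature under each trajectory point in every frame and let those per-trajectory tokens attend to one another across time. The (y, x) trajectory format and single-head attention are assumptions:

```python
import torch

def trajectory_attention(feats, trajs, qkv):
    """feats: (T, H, W, D) per-frame features; trajs: (N, T, 2) integer (y, x) points; qkv: nn.Linear(D, 3*D)."""
    T, H, W, D = feats.shape
    # gather the feature under each trajectory point in each frame -> (N, T, D)
    tokens = feats[torch.arange(T), trajs[..., 0], trajs[..., 1]]
    q, k, v = qkv(tokens).chunk(3, dim=-1)
    # tokens on the same trajectory attend to each other across time
    attn = torch.softmax(q @ k.transpose(-1, -2) / D ** 0.5, dim=-1)   # (N, T, T)
    return attn @ v
```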
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
·271 words·2 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Alibaba Group
TeaCache: a training-free method that speeds up video diffusion models by up to 4.41x with minimal quality loss by caching and reusing intermediate outputs across timesteps.
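The caching idea can be sketched as a wrapper that compares the current timestep embedding against the previous one and reuses the cached output when the change is small; the relative-difference test and threshold below are illustrative assumptions, not TeaCache's actual reuse criterion:

```python
import torch

class TimestepEmbeddingCache:
    """Wraps a diffusion backbone and skips recomputation when the timestep embedding barely changes."""
    def __init__(self, model, threshold=0.05):
        self.model = model
        self.threshold = threshold
        self.prev_emb = None
        self.cached_out = None

    @torch.no_grad()
    def __call__(self, x, t_emb, **kwargs):
        if self.prev_emb is not None and self.cached_out is not None:
            rel_change = (t_emb - self.prev_emb).norm() / (self.prev_emb.norm() + 1e-8)
            if rel_change < self.threshold:
                return self.cached_out            # reuse: the output is unlikely to differ much
        out = self.model(x, t_emb, **kwargs)
        self.prev_emb, self.cached_out = t_emb, out
        return out
```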
Open-Sora Plan: Open-Source Large Video Generation Model
·4618 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
Open-Sora Plan introduces an open-source large video generation model that produces long, high-resolution videos from a variety of user inputs.