Video Understanding

Number it: Temporal Grounding Videos like Flipping Manga
·2758 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Southeast University
Boosting video temporal grounding, NumPro empowers Vid-LLMs by adding frame numbers, making temporal localization as easy as flipping through manga.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70-80% accuracy and enabling RPA.
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
·1627 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Alibaba
EgoVid-5M: First high-quality dataset for egocentric video generation, enabling realistic human-centric world simulations.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
·2474 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google
ReCapture generates videos with novel camera angles from user videos using masked video fine-tuning, preserving scene motion and plausibly hallucinating unseen parts.
How Far is Video Generation from World Model: A Physical Law Perspective
·3657 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Bytedance Research
Scaling video generation models doesn’t guarantee they’ll learn physics; this study reveals they prioritize visual cues over true physical understanding.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
·3142 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Meta AI
Adaptive Caching (AdaCache) dramatically speeds up video generation with diffusion transformers by cleverly caching and reusing computations, tailoring the process to each video's complexity and motion.
Learning Video Representations without Natural Videos
·3154 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ShanghaiTech University
High-performing video representation models can be trained using only synthetic videos and images, eliminating the need for large natural video datasets.