Video Understanding

Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
·4089 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance
Video Depth Anything achieves consistent depth estimation for super-long videos by enhancing Depth Anything V2 with a spatial-temporal head and a novel temporal consistency loss, setting a new state-of-the-art in video depth estimation.
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
·3696 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance Seed
VideoWorld shows AI can learn complex reasoning and planning skills from unlabeled videos alone, achieving professional-level performance in Go and robotics.
RepVideo: Rethinking Cross-Layer Representation for Video Generation
·2785 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
RepVideo enhances text-to-video generation by enriching feature representations, resulting in significantly improved temporal coherence and spatial detail.
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
·2366 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Rochester
Ouroboros-Diffusion: A novel tuning-free long video generation framework achieving unprecedented content consistency by cleverly integrating information across frames via latent sampling and cross-frame attention.
GameFactory: Creating New Games with Generative Interactive Videos
·3286 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Hong Kong
GameFactory uses AI to generate entirely new games within diverse, open-domain scenes by learning action controls from a small dataset and transferring them to pre-trained video models.
Do generative video models learn physical principles from watching videos?
·3121 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google DeepMind
Generative video models struggle to understand physics despite producing visually realistic videos; the Physics-IQ benchmark reveals this critical limitation, highlighting the need for improved physical reasoning.
An Empirical Study of Autoregressive Pre-training from Videos
·5733 words·27 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 UC Berkeley
Toto, a new autoregressive video model, achieves competitive performance across various benchmarks by pre-training on over 1 trillion visual tokens, demonstrating the effectiveness of scaling video models.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
·3018 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.
TransPixar: Advancing Text-to-Video Generation with Transparency
·2458 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
·3762 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanjing University
STAR: A novel approach uses text-to-video models for realistic, temporally consistent real-world video super-resolution, improving image quality and detail.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
·3666 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
MotionBench, a new benchmark, reveals that existing video models struggle with fine-grained motion understanding. To address this, the authors propose TE Fusion, a novel architecture that improves motion understanding.
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
·2867 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Multimedia Laboratory, the Chinese University of Hong Kong
GS-DiT: Generating high-quality videos with advanced 4D control through efficient dense 3D point tracking and pseudo 4D Gaussian fields.
Ingredients: Blending Custom Photos with Video Diffusion Transformers
·2689 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Kunlun Inc.
Ingredients: A new framework customizes videos by blending multiple photos with video diffusion transformers, enabling realistic and personalized video generation while maintaining consistent identity.
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
·3152 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
VideoAnydoor: High-fidelity video object insertion with precise motion control, achieved via an end-to-end framework leveraging an ID extractor and a pixel warper for robust detail preservation and fine-grained motion control.
SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
·1895 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
SeedVR: A novel diffusion transformer revolutionizes generic video restoration by efficiently handling arbitrary video lengths and resolutions, achieving state-of-the-art performance.
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
·3843 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 MMLab, the Chinese University of Hong Kong
DiTCtrl achieves state-of-the-art multi-prompt video generation without retraining by cleverly controlling attention in a diffusion transformer, enabling smooth transitions between video segments.
LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis
·2715 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
LeviTor: Revolutionizing image-to-video synthesis with intuitive 3D trajectory control, generating realistic videos from static images by abstracting object masks into depth-aware control points.
AniDoc: Animation Creation Made Easier
·2223 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
AniDoc automates the colorization of line-art videos for cartoon animation, making animation creation easier!
VidTok: A Versatile and Open-Source Video Tokenizer
·2918 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Microsoft Research
VidTok: an open-source, top-performing video tokenizer.
Move-in-2D: 2D-Conditioned Human Motion Generation
·2569 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Adobe Research
Move-in-2D generates realistic human motion sequences conditioned on a 2D scene image and text prompt, overcoming limitations of existing approaches and improving video synthesis.