
Video Understanding

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
·2596 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Toronto
AC3D achieves precise 3D camera control in video diffusion transformers by analyzing camera motion’s spectral properties, optimizing pose conditioning, and using a curated dataset of dynamic videos.
WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model
·4024 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University
WF-VAE boosts video VAE performance with wavelet-driven energy flow and causal caching, enabling 2x higher throughput and 4x lower memory usage in latent video diffusion models.
DreamRunner: Fine-Grained Storytelling Video Generation with Retrieval-Augmented Motion Adaptation
·3751 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 UNC Chapel Hill
DREAMRUNNER generates high-quality storytelling videos by combining LLM-based hierarchical planning, retrieval-augmented motion adaptation, and a novel spatial-temporal region-based diffusion model for fine-grained control.
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
·2991 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 KAIST
CoordTok, a novel video tokenizer, drastically reduces the number of tokens needed to represent long videos, enabling memory-efficient training of diffusion models for high-quality long video generation.
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
·4302 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
MagicDriveDiT generates high-resolution, long street-view videos with precise control, overcoming the limitations of previous methods for autonomous driving.
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
·2784 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Washington
SAMURAI enhances the Segment Anything Model 2 for real-time, zero-shot visual object tracking by incorporating motion-aware memory and motion modeling, significantly improving accuracy and robustness.
AnimateAnything: Consistent and Controllable Animation for Video Generation
·2615 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
AnimateAnything: a unified approach enabling precise and consistent video manipulation via a novel optical flow representation and frequency-based stabilization.
Number it: Temporal Grounding Videos like Flipping Manga
·2758 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Southeast University
NumPro boosts video temporal grounding by overlaying frame numbers onto video frames, letting Vid-LLMs localize moments as easily as flipping through manga.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70-80% accuracy and enabling RPA.
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
·1627 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Alibaba
EgoVid-5M: the first high-quality, large-scale video-action dataset for egocentric video generation, enabling realistic human-centric world simulations.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
·2474 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google
ReCapture generates videos with novel camera angles from user videos using masked video fine-tuning, preserving scene motion and plausibly hallucinating unseen parts.
How Far is Video Generation from World Model: A Physical Law Perspective
·3657 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance Research
Scaling video generation models doesn’t guarantee they’ll learn physics; this study reveals they prioritize visual cues over true physical understanding.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
·3142 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Meta AI
Adaptive Caching (AdaCache) dramatically speeds up video generation with diffusion transformers by cleverly caching and reusing computations, tailoring the process to each video’s complexity and motion.
Learning Video Representations without Natural Videos
·3154 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ShanghaiTech University
High-performing video representation models can be trained using only synthetic videos and images, eliminating the need for large natural video datasets.