Video Understanding
Dynamic Concepts Personalization from Single Videos
·2668 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Snap Research
Set-and-Sequence makes personalizing video models for dynamic concepts achievable, enabling high-fidelity generation, editing, and composition!
Intuitive physics understanding emerges from self-supervised pretraining on natural videos
·4400 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Meta AI
AI models learn intuitive physics from self-supervised video pretraining.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
·4393 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Step-Video Team
Step-Video-T2V: A 30B parameter text-to-video model generating high-quality videos up to 204 frames, pushing the boundaries of video foundation models.
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
·3939 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Peking University
Next-Block Prediction (NBP) accelerates video generation with a semi-autoregressive model that predicts blocks of video content in parallel, yielding significantly faster inference.
Enhance-A-Video: Better Generated Video for Free
·3320 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 National University of Singapore
Enhance-A-Video boosts video generation quality without retraining, by enhancing cross-frame correlations in diffusion transformers, resulting in improved coherence and visual fidelity.
Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT
·3016 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Chinese University of Hong Kong
Lumina-Video: Efficient and flexible video generation using a multi-scale Next-DiT architecture with motion control.
Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
·4798 words·23 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
Efficient-vDiT accelerates video generation by 7.8x using sparse attention and multi-step distillation.
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
·3961 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Fudan University
VideoRoPE enhances video processing in Transformer models by introducing a novel 3D rotary position embedding that preserves spatio-temporal relationships, resulting in superior performance across var…
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
·4450 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
FlashVideo: Generate stunning high-resolution videos efficiently using a two-stage framework prioritizing fidelity and detail, achieving state-of-the-art results.
MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
·2615 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Chinese University of Hong Kong
MotionCanvas lets users design cinematic video shots with intuitive controls for camera and object movements, translating scene-space intentions into video animations.
Fast Video Generation with Sliding Tile Attention
·4012 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of California, San Diego
Sliding Tile Attention (STA) boosts video generation speed by 2.43-3.53x without losing quality by exploiting inherent data redundancy in video diffusion models.
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
·2337 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 UC Los Angeles
This paper introduces PointVid, a 3D-aware video generation framework using 3D point regularization to enhance video realism and address common issues like object morphing.
DynVFX: Augmenting Real Videos with Dynamic Content
·3393 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Weizmann Institute of Science
DynVFX: Effortlessly integrate dynamic content into real videos using simple text prompts. Zero-shot learning and novel attention mechanisms deliver seamless and realistic results.
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
·3510 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Meta AI
VideoJAM enhances video generation by jointly learning appearance and motion representations, achieving state-of-the-art motion coherence.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
·4575 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Nanyang Technological University
Video-MMMU benchmark systematically evaluates Large Multimodal Models' knowledge acquisition from videos across multiple disciplines and cognitive stages, revealing significant gaps between human and …
Temporal Preference Optimization for Long-Form Video Understanding
·2626 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Stanford University
Boosting long-form video understanding, Temporal Preference Optimization (TPO) enhances video-LLMs by leveraging preference learning. It achieves this through a self-training method using preference …
Improving Video Generation with Human Feedback
·4418 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
Human feedback boosts video generation! The new VideoReward model and alignment algorithms significantly improve video quality and prompt alignment, exceeding prior methods.
EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
·2578 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ByteDance
EchoVideo generates high-fidelity, identity-preserving videos by cleverly fusing text and image features, overcoming limitations of prior methods.
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
·4089 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ByteDance
Video Depth Anything achieves consistent depth estimation for super-long videos by enhancing Depth Anything V2 with a spatial-temporal head and a novel temporal consistency loss, setting a new state-o…
DiffuEraser: A Diffusion Model for Video Inpainting
·2356 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Alibaba Group
DiffuEraser, a novel video inpainting model based on stable diffusion, surpasses existing methods by using injected priors and temporal consistency improvements for superior results.