
Video Understanding

MIVE: New Design and Benchmark for Multi-Instance Video Editing
·7714 words·37 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 KAIST
Edit many objects at once in videos! MIVE does it accurately without affecting other areas, a big step for AI video editing.
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
·4018 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanjing University
InstanceCap improves text-to-video generation through detailed, instance-aware captions.
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
·6779 words·32 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Georgia Institute of Technology
Gaze-LLE achieves state-of-the-art gaze estimation by using a frozen DINOv2 encoder and a lightweight decoder, simplifying architecture and improving efficiency.
Video Motion Transfer with Diffusion Transformers
·3141 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Oxford
DiTFlow: training-free video motion transfer using Diffusion Transformers, enabling realistic motion control in synthesized videos via Attention Motion Flow.
ObjCtrl-2.5D: Training-free Object Control with Camera Poses
·3506 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
ObjCtrl-2.5D: Training-free, precise image-to-video object control using 3D trajectories and camera poses.
Mobile Video Diffusion
·3393 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Qualcomm AI Research
MobileVD: The first mobile-optimized video diffusion model, achieving 523x efficiency improvement over state-of-the-art with minimal quality loss, enabling realistic video generation on smartphones.
MoViE: Mobile Diffusion for Video Editing
·2482 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Qualcomm AI Research
MoViE achieves 12 FPS video editing on mobile phones by optimizing existing image editing models, a major breakthrough in on-device video processing.
Mind the Time: Temporally-Controlled Multi-Event Video Generation
·4541 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Toronto
MinT: Generating coherent videos with precisely timed, multiple events via temporal control, surpassing existing methods.
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
·276 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Fudan University
LiFT leverages human feedback, including reasoning, to effectively align text-to-video models with human preferences, significantly improving video quality.
Mimir: Improving Video Diffusion Models for Precise Text Understanding
·3398 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Ant Group
Mimir: A novel framework that harmonizes LLMs and video diffusion models, producing high-quality videos with superior text comprehension.
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
·2511 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
VideoGen-of-Thought (VGoT) creates high-quality, multi-shot videos by collaboratively generating scripts, keyframes, and video clips, ensuring narrative consistency and visual coherence.
OmniCreator: Self-Supervised Unified Generation with Universal Editing
·5399 words·26 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
OmniCreator: Self-supervised unified image+video generation & universal editing.
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
·3776 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Department of Electrical and Computer Engineering, North South University
VideoLights: a novel framework for joint video highlight detection and moment retrieval that boosts performance via feature refinement and cross-modal, cross-task alignment, achieving state-of-the-art results.
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
·4589 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Mohamed Bin Zayed University of Artificial Intelligence
PhysGame benchmark unveils video LLMs’ weaknesses in understanding physical commonsense from gameplay videos, prompting the creation of PhysVLM, a knowledge-enhanced model that outperforms existing models.
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
·1734 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 01.AI
Presto: a novel video diffusion model that generates 15-second, high-quality videos with unparalleled long-range coherence and rich content, achieved through a segmented cross-attention mechanism and content-rich video data curation.
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
·3029 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Waterloo
VISTA synthesizes long-duration, high-resolution video instruction data, creating VISTA-400K and HRVideoBench to significantly boost video LMM performance.
Video Depth without Video Models
·3150 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Carnegie Mellon University
RollingDepth: Achieving state-of-the-art video depth estimation without using complex video models, by cleverly extending a single-image depth estimator.
Trajectory Attention for Fine-grained Video Motion Control
·4421 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
Trajectory Attention enhances video motion control by injecting trajectory information, improving precision and long-range consistency in video generation.
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
·271 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Alibaba Group
TeaCache: a training-free method that speeds up video diffusion models by up to 4.41x with minimal quality loss by caching intermediate outputs.
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
·4076 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
TAPTRv3 achieves state-of-the-art long-video point tracking by using spatial and temporal context to enhance feature querying, surpassing previous methods and demonstrating strong performance.