Video Understanding

Number it: Temporal Grounding Videos like Flipping Manga
·2758 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Southeast University
Boosting video temporal grounding, NumPro empowers Vid-LLMs by adding frame numbers, making temporal localization as easy as flipping through manga.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70-80% accuracy and enabling RPA.
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
·1627 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Alibaba
EgoVid-5M: First high-quality dataset for egocentric video generation, enabling realistic human-centric world simulations.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
·2474 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google
ReCapture generates videos with novel camera angles from user videos using masked video fine-tuning, preserving scene motion and plausibly hallucinating unseen parts.
How Far is Video Generation from World Model: A Physical Law Perspective
·3657 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Bytedance Research
Scaling video generation models doesn’t guarantee they’ll learn physics; this study reveals they prioritize visual cues over true physical understanding.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
·3142 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Meta AI
Adaptive Caching (AdaCache) dramatically speeds up video generation with diffusion transformers by cleverly caching and reusing computations, tailoring the process to each video's complexity and motion.
Learning Video Representations without Natural Videos
·3154 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ShanghaiTech University
High-performing video representation models can be trained using only synthetic videos and images, eliminating the need for large natural video datasets.