
Video Understanding

MIVE: New Design and Benchmark for Multi-Instance Video Editing
·7714 words·37 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 KAIST
Edit many objects at once in videos! MIVE does it accurately without affecting other areas, a big step for AI video editing.
InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption
·4018 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanjing University
InstanceCap improves text-to-video generation through detailed, instance-aware captions.
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
·6779 words·32 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Georgia Institute of Technology
Gaze-LLE achieves state-of-the-art gaze estimation by using a frozen DINOv2 encoder and a lightweight decoder, simplifying architecture and improving efficiency.
Video Motion Transfer with Diffusion Transformers
·3141 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Oxford
DiTFlow: training-free video motion transfer using Diffusion Transformers, enabling realistic motion control in synthesized videos via Attention Motion Flow.
ObjCtrl-2.5D: Training-free Object Control with Camera Poses
·3506 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
ObjCtrl-2.5D: Training-free, precise image-to-video object control using 3D trajectories and camera poses.
Mobile Video Diffusion
·3393 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Qualcomm AI Research
MobileVD: The first mobile-optimized video diffusion model, achieving 523x efficiency improvement over state-of-the-art with minimal quality loss, enabling realistic video generation on smartphones.
MoViE: Mobile Diffusion for Video Editing
·2482 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Qualcomm AI Research
MoViE achieves 12 FPS video editing on mobile phones by optimizing existing image editing models, a major breakthrough in on-device video processing.
Mind the Time: Temporally-Controlled Multi-Event Video Generation
·4541 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Toronto
MinT: Generating coherent videos with precisely timed, multiple events via temporal control, surpassing existing methods.
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
·276 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Fudan University
LiFT leverages human feedback, including reasoning, to effectively align text-to-video models with human preferences, significantly improving video quality.
Mimir: Improving Video Diffusion Models for Precise Text Understanding
·3398 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Ant Group
Mimir: A novel framework that harmonizes LLMs and video diffusion models, producing high-quality videos with superior text comprehension.
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
·2511 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
VideoGen-of-Thought (VGoT) creates high-quality, multi-shot videos by collaboratively generating scripts, keyframes, and video clips, ensuring narrative consistency and visual coherence.
OmniCreator: Self-Supervised Unified Generation with Universal Editing
·5399 words·26 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
OmniCreator: Self-supervised unified image+video generation & universal editing.
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
·3776 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Department of Electrical and Computer Engineering, North South University
VideoLights: a novel framework for joint video highlight detection and moment retrieval that boosts performance via feature refinement and cross-modal, cross-task alignment, achieving state-of-the-art results.
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
·4589 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Mohamed Bin Zayed University of Artificial Intelligence
PhysGame benchmark unveils video LLMs’ weaknesses in understanding physical commonsense from gameplay videos, prompting the creation of PhysVLM, a knowledge-enhanced model that outperforms existing models.
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
·1734 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 01.AI
Presto: a novel video diffusion model that generates 15-second, high-quality videos with unparalleled long-range coherence and rich content, achieved through a segmented cross-attention mechanism and content-rich video data curation.
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
·3029 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Waterloo
VISTA synthesizes long-duration, high-resolution video instruction data, creating VISTA-400K and HRVideoBench to significantly boost video LMM performance.
Video Depth without Video Models
·3150 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Carnegie Mellon University
RollingDepth: Achieving state-of-the-art video depth estimation without using complex video models, by cleverly extending a single-image depth estimator.
Trajectory Attention for Fine-grained Video Motion Control
·4421 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
Trajectory Attention enhances video motion control by injecting trajectory information, improving precision and long-range consistency in video generation.
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
·271 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Alibaba Group
TeaCache: a training-free method that speeds up video diffusion models by up to 4.41x with minimal quality loss by caching intermediate outputs.
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
·4076 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
TAPTRv3 achieves state-of-the-art long-video point tracking by using spatial and temporal context to enhance feature querying, surpassing previous methods and demonstrating strong performance.