Video Understanding

Video Token Merging for Long Video Understanding
·2290 words·11 mins
Computer Vision Video Understanding 🏢 Korea University
Researchers boost long-form video understanding efficiency by 6.89x and reduce memory usage by 84% using a novel learnable video token merging algorithm.
Video Diffusion Models are Training-free Motion Interpreter and Controller
·2252 words·11 mins
Computer Vision Video Understanding 🏢 Peking University
Training-free video motion control achieved via novel Motion Feature (MOFT) extraction from existing video diffusion models, offering architecture-agnostic insights and high performance.
VFIMamba: Video Frame Interpolation with State Space Models
·2179 words·11 mins
Computer Vision Video Understanding 🏢 Tencent AI Lab
VFIMamba uses state-space models for efficient and dynamic video frame interpolation, achieving state-of-the-art results by introducing a novel Mixed-SSM Block and curriculum learning.
TrackIME: Enhanced Video Point Tracking via Instance Motion Estimation
·2140 words·11 mins
Video Understanding 🏢 KAIST
TrackIME enhances video point tracking by pruning the search space, improving both accuracy and efficiency.
Towards Multi-Domain Learning for Generalizable Video Anomaly Detection
·2936 words·14 mins
Computer Vision Video Understanding 🏢 Kyung Hee University
Researchers propose Multi-Domain learning for Video Anomaly Detection (MDVAD) to create generalizable models handling conflicting abnormality criteria across diverse datasets, improving accuracy and a…
Temporally Consistent Atmospheric Turbulence Mitigation with Neural Representations
·1994 words·10 mins
Computer Vision Video Understanding 🏢 University of Maryland
ConVRT: A novel framework restores turbulence-distorted videos by decoupling spatial and temporal information in a neural representation, achieving temporally consistent mitigation.
TAPTRv2: Attention-based Position Update Improves Tracking Any Point
·1868 words·9 mins
Computer Vision Video Understanding 🏢 South China University of Technology
TAPTRv2 enhances point tracking by introducing an attention-based position update, eliminating cost-volume reliance for improved accuracy and efficiency.
SyncVIS: Synchronized Video Instance Segmentation
·2160 words·11 mins
Computer Vision Video Understanding 🏢 University of Hong Kong
SyncVIS: A new framework for video instance segmentation achieves state-of-the-art results by synchronously modeling video and frame-level information, overcoming limitations of asynchronous approache…
StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences
·2803 words·14 mins
AI Generated Computer Vision Video Understanding 🏢 Peking University
StreamFlow accelerates video optical flow estimation by 44% via a streamlined in-batch multi-frame pipeline and innovative spatiotemporal modeling, achieving state-of-the-art results.
Splatter a Video: Video Gaussian Representation for Versatile Processing
·2610 words·13 mins
Computer Vision Video Understanding 🏢 University of Hong Kong
Researchers introduce Video Gaussian Representation (VGR) for versatile video processing, embedding videos into explicit 3D Gaussians for intuitive motion and appearance modeling.
Slot State Space Models
·2613 words·13 mins
Computer Vision Video Understanding 🏢 Rutgers University
SlotSSMs: a novel framework for modular sequence modeling, achieving significant performance gains by incorporating independent mechanisms and sparse interactions into State Space Models.
SF-V: Single Forward Video Generation Model
·1607 words·8 mins
Computer Vision Video Understanding 🏢 Snap Inc.
Researchers developed SF-V, a single-step image-to-video generation model, achieving a 23x speedup compared to existing models without sacrificing quality, paving the way for real-time video synthesis…
ReVideo: Remake a Video with Motion and Content Control
·2423 words·12 mins
Computer Vision Video Understanding 🏢 Peking University
ReVideo enables precise local video editing by independently controlling content and motion, overcoming limitations of existing methods and paving the way for advanced video manipulation.
OPEL: Optimal Transport Guided ProcedurE Learning
·2652 words·13 mins
Computer Vision Video Understanding 🏢 Purdue University
OPEL: a novel optimal transport framework for procedure learning, significantly outperforms SOTA methods by aligning similar video frames and relaxing strict temporal assumptions.
OnlineTAS: An Online Baseline for Temporal Action Segmentation
·2736 words·13 mins
AI Generated Computer Vision Video Understanding 🏢 National University of Singapore
OnlineTAS, a novel framework, achieves state-of-the-art performance in online temporal action segmentation by using an adaptive memory and a post-processing method to mitigate over-segmentation.
On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
·2133 words·11 mins
Computer Vision Video Understanding 🏢 Shanghai Jiao Tong University
MM-Det, a novel algorithm, uses multimodal learning and spatiotemporal attention to detect diffusion-generated videos, achieving state-of-the-art performance on the new DVF dataset.
NVRC: Neural Video Representation Compression
·1996 words·10 mins
Computer Vision Video Understanding 🏢 Visual Information Lab, University of Bristol, UK
NVRC: A novel end-to-end neural video codec achieves a 23% coding gain over VVC VTM by optimizing representation compression.
NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction
·2374 words·12 mins
Video Understanding 🏢 Tongji University
NeuroClips: fMRI-to-video reconstruction that produces high-fidelity, smooth video of up to 6 s at 8 FPS by decoding both high-level semantics and low-level perception flows.
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing
·2217 words·11 mins
Computer Vision Video Understanding 🏢 National Yang Ming Chiao Tung University
NaRCan: High-quality video editing via diffusion priors and hybrid deformation fields.
Multi-view Masked Contrastive Representation Learning for Endoscopic Video Analysis
·2187 words·11 mins
Computer Vision Video Understanding 🏢 Xiangtan University
Multi-view Masked Contrastive Representation Learning (M²CRL) significantly boosts endoscopic video analysis by using a novel multi-view masking strategy and contrastive learning, achieving state-of-t…