Computer Vision

GameFactory: Creating New Games with Generative Interactive Videos

14 January 2025·3286 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Hong Kong

GameFactory uses AI to generate entirely new games within diverse, open-domain scenes by learning action controls from a small dataset and transferring them to pre-trained video models.

Do generative video models learn physical principles from watching videos?

14 January 2025·3121 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google DeepMind

Generative video models struggle to understand physics despite producing visually realistic videos; the Physics-IQ benchmark reveals this critical limitation, highlighting the need for improved physical r…

The GAN is dead; long live the GAN! A Modern GAN Baseline

9 January 2025·2531 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Brown University

R3GAN: A modernized GAN baseline achieves state-of-the-art results with a simple, stable loss function and modern architecture, debunking the myth that GANs are hard to train.

An Empirical Study of Autoregressive Pre-training from Videos

9 January 2025·5733 words·27 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 UC Berkeley

Toto, a new autoregressive video model, achieves competitive performance across various benchmarks by pre-training on over 1 trillion visual tokens, demonstrating the effectiveness of scaling video mo…

SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images

8 January 2025·2783 words·14 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Stability AI

SPAR3D: Fast, accurate single-image 3D reconstruction via a novel two-stage approach using point clouds for high-fidelity mesh generation.

On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis

8 January 2025·285 words·2 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tsinghua University

This paper unveils critical thresholds for efficient visual autoregressive model computation, proving sub-quartic time is impossible beyond a certain input matrix norm while establishing efficient app…

MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting

7 January 2025·3325 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Electronics and Telecommunications Research Institute

MoDec-GS: a novel framework achieving 70% model size reduction in dynamic 3D Gaussian splatting while improving visual quality by cleverly decomposing complex motions and optimizing temporal intervals…

Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback

7 January 2025·3489 words·17 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Fudan University

DOLPHIN: AI automates scientific research from idea generation to experimental validation.

Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

7 January 2025·3018 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.

Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation

7 January 2025·5463 words·26 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Cambridge

Chirpy3D: Generating creative, high-quality 3D birds with intricate details by learning a continuous part latent space from 2D images.

TransPixar: Advancing Text-to-Video Generation with Transparency

6 January 2025·2458 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.

Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation

6 January 2025·3304 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Meta

Through-The-Mask uses mask-based motion trajectories to generate realistic videos from images and text, overcoming limitations of existing methods in handling complex multi-object motion.

STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution

6 January 2025·3762 words·18 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanjing University

STAR uses text-to-video models for realistic, temporally consistent real-world video super-resolution, improving image quality and detail.

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

6 January 2025·3666 words·18 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University

MotionBench, a new benchmark, reveals that existing video models struggle with fine-grained motion understanding. To address this, the authors propose TE Fusion, a novel architecture that improves mo…

GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

5 January 2025·2867 words·14 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Multimedia Laboratory, the Chinese University of Hong Kong

GS-DiT: Generating high-quality videos with advanced 4D control through efficient dense 3D point tracking and pseudo 4D Gaussian fields.

DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

5 January 2025·2694 words·13 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Science and Technology of China

DepthMaster tames diffusion models for faster, more accurate monocular depth estimation by aligning generative features with high-quality semantic features and adaptively balancing low and high-freque…

MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control

4 January 2025·3209 words·16 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu

MagicFace achieves high-fidelity facial expression editing via AU control, preserving identity and background using a diffusion model and ID encoder, significantly outperforming existing methods.

Ingredients: Blending Custom Photos with Video Diffusion Transformers

3 January 2025·2689 words·13 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Kunlun Inc.

Ingredients: A new framework customizes videos by blending multiple photos with video diffusion transformers, enabling realistic and personalized video generation while maintaining consistent identity…

VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control

2 January 2025·3152 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology

VideoAnydoor: High-fidelity video object insertion with precise motion control, achieved via an end-to-end framework leveraging an ID extractor and a pixel warper for robust detail preservation and fi…

SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization

2 January 2025·4234 words·20 mins

AI Generated 🤗 Daily Papers Computer Vision Action Recognition 🏢 Unmanned System Research Institute, Northwestern Polytechnical University

SeFAR, a novel semi-supervised framework for fine-grained action recognition, achieves state-of-the-art results by using dual-level temporal modeling, moderate temporal perturbation, and adaptive regu…