Computer Vision

AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
·2125 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Alibaba Tongyi Lab
AnyStory: A unified framework enables high-fidelity personalized image generation for single and multiple subjects, addressing subject fidelity challenges in existing methods.
RepVideo: Rethinking Cross-Layer Representation for Video Generation
·2785 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
RepVideo enhances text-to-video generation by enriching feature representations, resulting in significantly improved temporal coherence and spatial detail.
Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
·2366 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Rochester
Ouroboros-Diffusion: A novel tuning-free long video generation framework achieving unprecedented content consistency by cleverly integrating information across frames via latent sampling and cross-frame attention.
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities
·3972 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tencent AI Lab
CityDreamer4D generates realistic, unbounded 4D city models by cleverly separating dynamic objects (like vehicles) from static elements (buildings, roads), using multiple neural fields for enhanced realism.
GameFactory: Creating New Games with Generative Interactive Videos
·3286 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Hong Kong
GameFactory uses AI to generate entirely new games within diverse, open-domain scenes by learning action controls from a small dataset and transferring them to pre-trained video models.
Do generative video models learn physical principles from watching videos?
·3121 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google DeepMind
Generative video models struggle to understand physics despite producing visually realistic videos; the Physics-IQ benchmark reveals this critical limitation, highlighting the need for improved physical reasoning.
The GAN is dead; long live the GAN! A Modern GAN Baseline
·2531 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Brown University
R3GAN: A modernized GAN baseline achieves state-of-the-art results with a simple, stable loss function and modern architecture, debunking the myth that GANs are hard to train.
An Empirical Study of Autoregressive Pre-training from Videos
·5733 words·27 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 UC Berkeley
Toto, a new autoregressive video model, achieves competitive performance across various benchmarks by pre-training on over 1 trillion visual tokens, demonstrating the effectiveness of scaling video models.
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
·2783 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Stability AI
SPAR3D: Fast, accurate single-image 3D reconstruction via a novel two-stage approach using point clouds for high-fidelity mesh generation.
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
·285 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tsinghua University
This paper unveils critical thresholds for efficient visual autoregressive model computation, proving sub-quartic time is impossible beyond a certain input matrix norm while establishing efficient approximation criteria below it.
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
·3325 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Electronics and Telecommunications Research Institute
MoDec-GS: a novel framework achieving 70% model size reduction in dynamic 3D Gaussian splatting while improving visual quality by cleverly decomposing complex motions and optimizing temporal intervals.
Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback
·3489 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Fudan University
DOLPHIN: AI automates scientific research from idea generation to experimental validation.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
·3018 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
·5463 words·26 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Cambridge
Chirpy3D: Generating creative, high-quality 3D birds with intricate details by learning a continuous part latent space from 2D images.
TransPixar: Advancing Text-to-Video Generation with Transparency
·2458 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
·3304 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Meta
Through-The-Mask uses mask-based motion trajectories to generate realistic videos from images and text, overcoming limitations of existing methods in handling complex multi-object motion.
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
·3762 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanjing University
STAR: A novel approach uses text-to-video models for realistic, temporally consistent real-world video super-resolution, improving image quality and detail.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
·3666 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
MotionBench, a new benchmark, reveals that existing video models struggle with fine-grained motion understanding. To address this, the authors propose TE Fusion, a novel architecture that improves motion understanding.
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
·2867 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Multimedia Laboratory, the Chinese University of Hong Kong
GS-DiT: Generating high-quality videos with advanced 4D control through efficient dense 3D point tracking and pseudo 4D Gaussian fields.
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
·2694 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Science and Technology of China
DepthMaster tames diffusion models for faster, more accurate monocular depth estimation by aligning generative features with high-quality semantic features and adaptively balancing low- and high-frequency information.