Computer Vision

An Empirical Study of Autoregressive Pre-training from Videos
·5733 words·27 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 UC Berkeley
Toto, a new autoregressive video model, achieves competitive performance across various benchmarks by pre-training on over 1 trillion visual tokens, demonstrating the effectiveness of scaling video mo…
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
·2783 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Stability AI
SPAR3D: Fast, accurate single-image 3D reconstruction via a novel two-stage approach using point clouds for high-fidelity mesh generation.
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
·285 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tsinghua University
This paper unveils critical thresholds for efficient visual autoregressive model computation, proving sub-quartic time is impossible beyond a certain input matrix norm while establishing efficient app…
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
·3325 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Electronics and Telecommunications Research Institute
MoDec-GS: a novel framework achieving 70% model size reduction in dynamic 3D Gaussian splatting while improving visual quality by cleverly decomposing complex motions and optimizing temporal intervals…
Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback
·3489 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Fudan University
DOLPHIN: an AI framework that automates scientific research from idea generation through experimental validation.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
·3018 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
·5463 words·26 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Cambridge
Chirpy3D: Generating creative, high-quality 3D birds with intricate details by learning a continuous part latent space from 2D images.
TransPixar: Advancing Text-to-Video Generation with Transparency
·2458 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
·3304 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Meta
Through-The-Mask uses mask-based motion trajectories to generate realistic videos from images and text, overcoming limitations of existing methods in handling complex multi-object motion.
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
·3762 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanjing University
STAR: A novel approach uses text-to-video models for realistic, temporally consistent real-world video super-resolution, improving image quality and detail.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
·3666 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
MotionBench, a new benchmark, reveals that existing video models struggle with fine-grained motion understanding. To address this, the authors propose TE Fusion, a novel architecture that improves mo…
GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
·2867 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Multimedia Laboratory, the Chinese University of Hong Kong
GS-DiT: Generating high-quality videos with advanced 4D control through efficient dense 3D point tracking and pseudo 4D Gaussian fields.
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
·2694 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Science and Technology of China
DepthMaster tames diffusion models for faster, more accurate monocular depth estimation by aligning generative features with high-quality semantic features and adaptively balancing low and high-freque…
MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
·3209 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu
MagicFace achieves high-fidelity facial expression editing via AU control, preserving identity and background using a diffusion model and ID encoder, significantly outperforming existing methods.
Ingredients: Blending Custom Photos with Video Diffusion Transformers
·2689 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Kunlun Inc.
Ingredients: A new framework customizes videos by blending multiple photos with video diffusion transformers, enabling realistic and personalized video generation while maintaining consistent identity…
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
·3152 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
VideoAnydoor: High-fidelity video object insertion with precise motion control, achieved via an end-to-end framework leveraging an ID extractor and a pixel warper for robust detail preservation and fi…
SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
·4234 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Action Recognition 🏢 Unmanned System Research Institute, Northwestern Polytechnical University
SeFAR: a novel semi-supervised framework for fine-grained action recognition, achieves state-of-the-art results by using dual-level temporal modeling, moderate temporal perturbation, and adaptive regu…
SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
·1895 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
SeedVR: A novel diffusion transformer revolutionizes generic video restoration by efficiently handling arbitrary video lengths and resolutions, achieving state-of-the-art performance.
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
·3436 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Huazhong University of Science and Technology
LightningDiT resolves the optimization dilemma in latent diffusion models by aligning latent space with pre-trained vision models, achieving state-of-the-art ImageNet 256x256 generation with over 21x …
MLLM-as-a-Judge for Image Safety without Human Labeling
·6596 words·31 mins
AI Generated 🤗 Daily Papers Computer Vision Image Classification 🏢 Meta AI
Zero-shot image safety judgment is achieved using MLLMs and a novel method called CLUE, objectifying safety rules, and significantly reducing the need for human labeling.