Computer Vision

Structured 3D Latents for Scalable and Versatile 3D Generation
·4249 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
Unified 3D latent representation (SLAT) enables versatile high-quality 3D asset generation, significantly outperforming existing methods.
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
·4589 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Mohamed Bin Zayed University of Artificial Intelligence
PhysGame benchmark unveils video LLMs’ weaknesses in understanding physical commonsense from gameplay videos, prompting the creation of PhysVLM, a knowledge-enhanced model that outperforms existing mo…
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
·2297 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Science and Technology of China
One-shot image to realistic, animatable talking avatar! Novel pipeline uses diffusion models and a hybrid 3DGS-mesh representation, achieving seamless generalization and precise control.
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
·2333 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 SketchX, CVSSP, University of Surrey
NitroFusion achieves high-fidelity single-step image generation using a dynamic adversarial training approach with a specialized discriminator pool, dramatically improving speed and quality.
Negative Token Merging: Image-based Adversarial Feature Guidance
·2311 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Washington
NegToMe: Image-based adversarial guidance improves image generation diversity and reduces similarity to copyrighted content without training, simply by using images instead of negative text prompts.
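The summary only names the mechanism, so here is a minimal sketch of what image-based adversarial feature guidance could look like, assuming transformer tokens and a simple push-apart extrapolation; the matching rule, the strength, and where it is applied inside the diffusion model are assumptions rather than NegToMe's exact recipe.

```python
import torch
import torch.nn.functional as F

def negative_token_merging(src_tokens, ref_tokens, strength=0.1):
    """Push each source token slightly *away* from its most similar token in a
    reference image (the opposite of ordinary token merging).
    src_tokens: (N, D) tokens of the image being generated
    ref_tokens: (M, D) tokens of the reference image (e.g. a copyrighted asset)."""
    sim = F.normalize(src_tokens, dim=-1) @ F.normalize(ref_tokens, dim=-1).T  # (N, M) cosine similarities
    matched = ref_tokens[sim.argmax(dim=-1)]                 # best-matching reference token per source token
    return src_tokens + strength * (src_tokens - matched)    # extrapolate away from the match

# toy usage on random features
src = torch.randn(77, 64)
ref = torch.randn(77, 64)
print(negative_token_merging(src, ref).shape)  # torch.Size([77, 64])
```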
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
·1734 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 01.AI
Presto, a novel video diffusion model, generates 15-second, high-quality videos with unparalleled long-range coherence and rich content via a segmented cross-attention mechanism and the L…
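The blurb points at segmented cross-attention without detail; a minimal sketch of the idea, assuming the video tokens are split along time and each segment attends only to its own sub-caption (segment count, weight sharing, and how outputs are fused are assumptions):

```python
import torch
import torch.nn as nn

class SegmentedCrossAttention(nn.Module):
    """Split video tokens into temporal segments; each segment cross-attends to the
    text embedding of its own sub-caption, so different parts of a long video can
    follow different descriptions."""
    def __init__(self, dim, heads=8, segments=4):
        super().__init__()
        self.segments = segments
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, text_tokens_per_segment):
        # video_tokens: (B, T, D), T divisible by `segments`
        # text_tokens_per_segment: list of `segments` tensors, each (B, L_i, D)
        chunks = video_tokens.chunk(self.segments, dim=1)
        out = [self.attn(q, kv, kv)[0] for q, kv in zip(chunks, text_tokens_per_segment)]
        return torch.cat(out, dim=1)

# toy usage
layer = SegmentedCrossAttention(dim=64, heads=4, segments=4)
video = torch.randn(2, 16, 64)
texts = [torch.randn(2, 8, 64) for _ in range(4)]
print(layer(video, texts).shape)  # torch.Size([2, 16, 64])
```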
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
·3029 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Waterloo
VISTA synthesizes long-duration, high-resolution video instruction data, creating VISTA-400K and HRVideoBench to significantly boost video LMM performance.
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
·5447 words·26 mins
AI Generated 🤗 Daily Papers Computer Vision Action Recognition 🏢 Yonsei University
DisCoRD: Rectified flow decodes discrete motion tokens into continuous, natural movement, balancing faithfulness and realism.
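A rough sketch of the decoding idea the summary hints at: integrate a token-conditioned rectified-flow velocity field from noise to a continuous motion frame. The network, the motion dimensionality, and the Euler step count below are placeholders, not DisCoRD's actual architecture.

```python
import torch
import torch.nn as nn

class TokenConditionedVelocity(nn.Module):
    """Stand-in decoder: predicts a rectified-flow velocity conditioned on
    embeddings of the discrete motion tokens."""
    def __init__(self, motion_dim=66, token_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + token_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, x, cond, t):
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, cond, t], dim=-1))

def decode_with_rectified_flow(velocity, token_emb, steps=10):
    """Euler-integrate the flow from Gaussian noise to continuous motion."""
    x = torch.randn(token_emb.shape[0], 66)       # start from noise
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        x = x + velocity(x, token_emb, t) / steps
    return x

vel = TokenConditionedVelocity()
tokens = torch.randn(4, 32)                        # embeddings of 4 discrete motion tokens
print(decode_with_rectified_flow(vel, tokens).shape)  # torch.Size([4, 66])
```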
AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos
·2678 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
AlphaTablets: A novel 3D plane representation enabling accurate, consistent, and flexible 3D planar reconstruction from monocular videos, achieving state-of-the-art results.
Video Depth without Video Models
·3150 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Carnegie Mellon University
RollingDepth: Achieving state-of-the-art video depth estimation without using complex video models, by cleverly extending a single-image depth estimator.
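A hedged NumPy sketch of the rolling idea, assuming a snippet-level depth estimator (here the hypothetical `snippet_depth_fn`) and scale-shift least-squares alignment on overlapping frames; RollingDepth's actual snippet dilation and global optimization are not reproduced.

```python
import numpy as np

def align_scale_shift(pred, ref):
    """Least-squares scale s and shift b so that s*pred + b ≈ ref on the overlap."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, b = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
    return s, b

def rolling_video_depth(frames, snippet_depth_fn, window=3, stride=1):
    """Run a single-image/snippet depth estimator over overlapping windows, then
    stitch the affine-invariant depth maps into one consistent sequence."""
    depths = [None] * len(frames)
    for start in range(0, len(frames) - window + 1, stride):
        snip = snippet_depth_fn(frames[start:start + window])     # (window, H, W)
        overlap = [i for i in range(start, start + window) if depths[i] is not None]
        if overlap:  # align this snippet to what is already stitched
            pred = np.stack([snip[i - start] for i in overlap])
            ref = np.stack([depths[i] for i in overlap])
            s, b = align_scale_shift(pred, ref)
            snip = s * snip + b
        for i in range(window):
            if depths[start + i] is None:
                depths[start + i] = snip[i]
    return np.stack(depths)

# toy usage with a fake depth estimator (pretend depth = gray level)
fake = lambda f: np.stack([fr.mean(-1) for fr in f])
frames = [np.random.rand(4, 4, 3) for _ in range(6)]
print(rolling_video_depth(frames, fake, window=3).shape)  # (6, 4, 4)
```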
Trajectory Attention for Fine-grained Video Motion Control
·4421 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
Trajectory Attention enhances video motion control by injecting trajectory information, improving precision and long-range consistency in video generation.
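A minimal sketch of what attending along trajectories could look like: gather the feature under each tracked point in every frame and run attention over that sequence, so information flows along motion paths rather than fixed grid positions. How the result is fused back into the video branch is an assumption left out here.

```python
import torch
import torch.nn as nn

class TrajectoryAttention(nn.Module):
    """Attention along point trajectories across frames."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, trajs):
        # feats: (T, H, W, D) per-frame feature maps
        # trajs: (N, T, 2) integer (y, x) positions of N tracked points over T frames
        T, H, W, D = feats.shape
        flat = feats.reshape(T, H * W, D)
        idx = trajs[..., 0] * W + trajs[..., 1]           # (N, T) flattened positions
        gathered = flat[torch.arange(T), idx, :]          # (N, T, D): one feature per point per frame
        out, _ = self.attn(gathered, gathered, gathered)  # attend along each trajectory
        return out

# toy usage
feats = torch.randn(8, 16, 16, 64)
trajs = torch.randint(0, 16, (10, 8, 2))
print(TrajectoryAttention(64)(feats, trajs).shape)  # torch.Size([10, 8, 64])
```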
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
·271 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Alibaba Group
TeaCache: a training-free method boosts video diffusion model speed by up to 4.41x with minimal quality loss by cleverly caching intermediate outputs.
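A hedged sketch of timestep-embedding-driven caching: skip the expensive denoiser forward pass whenever the timestep embedding has barely moved since the last computed step. The raw accumulated-distance threshold below is an assumption; the paper's calibrated criterion and exact cached quantity may differ.

```python
import torch

class TeaCacheLikeWrapper:
    """Reuse the cached denoiser output when the timestep embedding suggests the
    output would change very little; otherwise run the full forward pass."""
    def __init__(self, model, embed_fn, threshold=0.05):
        self.model, self.embed_fn, self.threshold = model, embed_fn, threshold
        self.prev_emb, self.cached_out, self.accum = None, None, 0.0

    def __call__(self, x, t):
        emb = self.embed_fn(t)
        if self.prev_emb is not None and self.cached_out is not None:
            rel_l1 = (emb - self.prev_emb).abs().mean() / (self.prev_emb.abs().mean() + 1e-8)
            self.accum += rel_l1.item()
            if self.accum < self.threshold:      # outputs unlikely to have changed much
                self.prev_emb = emb
                return self.cached_out           # skip the expensive forward pass
        self.accum = 0.0
        self.prev_emb = emb
        self.cached_out = self.model(x, t)       # full (slow) forward pass
        return self.cached_out

# toy usage with stand-in model and embedding
model = lambda x, t: x * 0.9
embed = lambda t: torch.full((128,), float(t) / 1000.0)
wrapper = TeaCacheLikeWrapper(model, embed)
x = torch.randn(1, 4, 8, 8)
for t in range(1000, 990, -1):
    x = wrapper(x, t)
print(x.shape)
```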
Open-Sora Plan: Open-Source Large Video Generation Model
·4618 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Open-Sora Plan introduces an open-source large video generation model capable of producing long, high-resolution videos from a variety of user inputs.
Efficient Track Anything
·2319 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 Meta AI
EfficientTAMs achieve comparable video object segmentation accuracy to SAM 2 with ~2x speedup using lightweight ViTs and efficient cross-attention.
TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models
·6566 words·31 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Machine Learning Group, CITEC, Bielefeld University
TryOffDiff reconstructs realistic garment images from single photos of clothed people, tackling virtual try-off to address the limitations of virtual try-on.
TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video
·4076 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
TAPTRv3 achieves state-of-the-art long-video point tracking by cleverly using spatial and temporal context to enhance feature querying, surpassing previous methods and demonstrating strong performance…
ROICtrl: Boosting Instance Control for Visual Generation
·3855 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Show Lab, National University of Singapore
ROICtrl boosts instance control in visual generation by combining ROI-Align with a new ROI-Unpool operation for regional instance control, achieving precise control over each instance with high efficiency.
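A small sketch of the crop-process-paste pattern the summary describes: `roi_align` is torchvision's real operator, while `roi_unpool` below is a hypothetical stand-in that simply resizes and pastes features back into the box, not ROICtrl's actual ROI-Unpool implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def roi_unpool(feat, rois, boxes):
    """Hypothetical inverse of ROI-Align: resize each processed ROI feature back to
    its box size and paste it into the full feature map."""
    out = feat.clone()
    for roi, (b, x1, y1, x2, y2) in zip(rois, boxes.tolist()):
        b, x1, y1, x2, y2 = int(b), int(x1), int(y1), int(x2), int(y2)
        h, w = max(y2 - y1, 1), max(x2 - x1, 1)
        out[b, :, y1:y1 + h, x1:x1 + w] = F.interpolate(
            roi[None], size=(h, w), mode="bilinear", align_corners=False)[0]
    return out

# toy usage: crop instance regions, process them, paste them back
feat = torch.randn(1, 16, 32, 32)                     # (B, C, H, W) feature map
boxes = torch.tensor([[0, 4.0, 4.0, 20.0, 20.0],      # (batch_idx, x1, y1, x2, y2)
                      [0, 10.0, 12.0, 30.0, 28.0]])
rois = roi_align(feat, boxes, output_size=(8, 8))     # fixed-size per-instance features
rois = rois * 1.1                                     # stand-in for per-instance conditioning
print(roi_unpool(feat, rois, boxes).shape)            # torch.Size([1, 16, 32, 32])
```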
Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters
·4458 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tencent PCG
Make-It-Animatable: Instantly create animation-ready 3D characters, regardless of pose or shape, using a novel data-driven framework.
FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion
·5402 words·26 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Cambridge
FAM Diffusion: Generate high-res images seamlessly from pre-trained diffusion models, solving structural and texture inconsistencies without retraining!
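A hedged sketch of the frequency-modulation half of the idea: keep low frequencies (global structure) from an upsampled native-resolution result and high frequencies (texture) from the high-resolution denoising state. The FFT cutoff and hard circular mask are assumptions, and the attention-modulation part is not shown.

```python
import torch
import torch.nn.functional as F

def frequency_modulate(highres_latent, lowres_latent, cutoff=0.25):
    """Blend low frequencies of the upsampled low-res latent with high frequencies
    of the high-res latent in Fourier space."""
    guide = F.interpolate(lowres_latent, size=highres_latent.shape[-2:],
                          mode="bilinear", align_corners=False)
    hf = torch.fft.fftshift(torch.fft.fft2(highres_latent), dim=(-2, -1))
    lf = torch.fft.fftshift(torch.fft.fft2(guide), dim=(-2, -1))
    H, W = highres_latent.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).float()   # 1 inside the low-frequency disk
    blended = lf * mask + hf * (1 - mask)
    return torch.fft.ifft2(torch.fft.ifftshift(blended, dim=(-2, -1))).real

# toy usage
low = torch.randn(1, 4, 64, 64)     # latent denoised at native resolution
high = torch.randn(1, 4, 128, 128)  # latent being denoised at high resolution
print(frequency_modulate(high, low).shape)  # torch.Size([1, 4, 128, 128])
```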
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
·3896 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Google DeepMind
CAT4D: Create realistic 4D scenes from single-view videos using a novel multi-view video diffusion model.