Computer Vision
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
·2511 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
VideoGen-of-Thought (VGoT) creates high-quality, multi-shot videos by collaboratively generating scripts, keyframes, and video clips, ensuring narrative consistency and visual coherence.
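The collaborative flow described above can be pictured as a three-stage pipeline. A minimal sketch, assuming placeholder callables for the script, keyframe, and clip generators (the names and signatures are illustrative, not VGoT's actual interfaces):

```python
# Placeholder sketch of a script -> keyframe -> clip pipeline; the three model callables
# and their signatures are assumptions for illustration, not VGoT's real components.
def generate_multishot_video(story_prompt, script_model, keyframe_model, clip_model):
    shots = script_model(story_prompt)            # shot-level script descriptions
    keyframes, clips = [], []
    for shot in shots:
        # condition each keyframe on the previous one to keep identities and style consistent
        prev = keyframes[-1] if keyframes else None
        keyframe = keyframe_model(shot, reference=prev)
        keyframes.append(keyframe)
        clips.append(clip_model(shot, keyframe))  # animate the keyframe into a short clip
    return clips
```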
SNOOPI: Supercharged One-step Diffusion Distillation with Proper Guidance
·4159 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 VinAI Research
SNOOPI supercharges one-step diffusion model distillation with enhanced guidance, achieving state-of-the-art performance by stabilizing training and enabling negative prompt control.
Scaling Image Tokenizers with Grouped Spherical Quantization
·7140 words·34 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Jülich Supercomputing Centre
GSQ-GAN, a novel image tokenizer, achieves superior reconstruction quality with 16x downsampling using grouped spherical quantization, enabling efficient scaling for high-fidelity image generation.
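The grouped spherical quantization named in the summary can be illustrated as: split the latent channels into groups, project each group onto the unit sphere, and snap it to the nearest unit-norm codeword. A rough sketch, where the group layout, codebook shape, and straight-through trick are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def grouped_spherical_quantize(z, codebook):
    """z: (B, C, H, W) latents; codebook: (G, K, C // G) unit-norm codewords per group."""
    B, C, H, W = z.shape
    G, K, d = codebook.shape
    # split channels into G groups and project each group onto the unit sphere
    zg = z.permute(0, 2, 3, 1).reshape(B, H, W, G, d)
    zg = F.normalize(zg, dim=-1)
    # nearest codeword per group by cosine similarity (codewords are unit-norm too)
    sim = torch.einsum("bhwgd,gkd->bhwgk", zg, codebook)
    idx = sim.argmax(dim=-1)                                      # (B, H, W, G)
    q = codebook[torch.arange(G, device=z.device), idx]           # (B, H, W, G, d)
    q = zg + (q - zg).detach()                                    # straight-through estimator
    return q.reshape(B, H, W, C).permute(0, 3, 1, 2), idx
```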
OmniCreator: Self-Supervised Unified Generation with Universal Editing
·5399 words·26 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
OmniCreator: a self-supervised framework unifying image and video generation with universal editing.
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
·3776 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Department of Electrical and Computer Engineering, North South University
VideoLights: a novel framework for joint video highlight detection & moment retrieval, boosts performance via feature refinement, cross-modal & cross-task alignment, achieving state-of-the-art results…
TinyFusion: Diffusion Transformers Learned Shallow
·4225 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 National University of Singapore
TinyFusion, a novel learnable depth pruning method, crafts efficient shallow diffusion transformers with superior post-fine-tuning performance, achieving a 2x speedup with less than 7% of the original…
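The learnable depth pruning can be pictured as a trainable keep/skip gate per transformer block, with only the highest-scoring blocks kept at the end. A toy sketch under that assumption (the sigmoid gating and top-k selection are illustrative, not TinyFusion's actual differentiable scheme):

```python
import torch
import torch.nn as nn

class GatedDepth(nn.Module):
    """Wraps a stack of blocks with learnable keep/skip gates."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.gate_logits = nn.Parameter(torch.zeros(len(blocks)))

    def forward(self, x):
        for block, logit in zip(self.blocks, self.gate_logits):
            g = torch.sigmoid(logit)
            x = (1 - g) * x + g * block(x)   # g -> 0 effectively removes the block
        return x

    def prune(self, keep):
        # keep the blocks with the highest learned gates, preserving their order
        keep_idx = self.gate_logits.topk(keep).indices.sort().values
        return nn.Sequential(*(self.blocks[int(i)] for i in keep_idx))
```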
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
·3884 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Yandex Research
SWITTI: a novel scale-wise transformer achieves 7x faster text-to-image generation than state-of-the-art diffusion models, while maintaining competitive image quality.
Structured 3D Latents for Scalable and Versatile 3D Generation
·4249 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tsinghua University
Unified 3D latent representation (SLAT) enables versatile high-quality 3D asset generation, significantly outperforming existing methods.
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
·4589 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Mohamed Bin Zayed University of Artificial Intelligence
PhysGame benchmark unveils video LLMs’ weaknesses in understanding physical commonsense from gameplay videos, prompting the creation of PhysVLM, a knowledge-enhanced model that outperforms existing models.
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
·2297 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Science and Technology of China
From a single image to a realistic, animatable whole-body talking avatar: a novel pipeline combines diffusion models with a hybrid 3DGS-mesh representation, achieving seamless generalization and precise control.
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
·2333 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 SketchX, CVSSP, University of Surrey
NitroFusion achieves high-fidelity single-step image generation using a dynamic adversarial training approach with a specialized discriminator pool, dramatically improving speed and quality.
Negative Token Merging: Image-based Adversarial Feature Guidance
·2311 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Washington
NegToMe: Image-based adversarial guidance improves image generation diversity and reduces similarity to copyrighted content without training, simply by using images instead of negative text prompts.
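The training-free guidance described above can be sketched as pushing each token of the image being generated away from its best-matching token in the reference image's features; the cosine matching and the push strength alpha below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def negative_token_merge(gen_tokens, ref_tokens, alpha=0.1):
    """gen_tokens: (N, D) features of the image being generated; ref_tokens: (M, D) reference features."""
    sim = F.normalize(gen_tokens, dim=-1) @ F.normalize(ref_tokens, dim=-1).T   # (N, M)
    matched = ref_tokens[sim.argmax(dim=-1)]          # closest reference token for each generated token
    # extrapolate away from the matched reference feature (negative guidance)
    return gen_tokens + alpha * (gen_tokens - matched)
```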
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
·1734 words·9 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 01.AI
Presto: a novel video diffusion model generates 15-second, high-quality videos with unparalleled long-range coherence and rich content, achieved through a segmented cross-attention mechanism and the L…
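A simplified picture of segmented cross-attention: the video token sequence is split along time, and each temporal segment cross-attends only to its own sub-caption. The even split, flattened token shape, and single-head attention in this sketch are simplifying assumptions:

```python
import torch

def segmented_cross_attention(video_tokens, sub_captions, to_q, to_k, to_v):
    """video_tokens: (T, D); sub_captions: list of S tensors of shape (L_i, D); to_* are nn.Linear(D, D)."""
    T, D = video_tokens.shape
    seg_len = T // len(sub_captions)
    outputs = []
    for i, caption in enumerate(sub_captions):
        segment = video_tokens[i * seg_len:(i + 1) * seg_len]       # tokens for this time segment
        q, k, v = to_q(segment), to_k(caption), to_v(caption)
        attn = torch.softmax(q @ k.T / D ** 0.5, dim=-1)            # (seg_len, L_i)
        outputs.append(attn @ v)                                    # each segment sees only its caption
    return torch.cat(outputs, dim=0)
```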
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation
·3029 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of Waterloo
VISTA synthesizes long-duration, high-resolution video instruction data, creating VISTA-400K and HRVideoBench to significantly boost video LMM performance.
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
·5447 words·26 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Action Recognition
🏢 Yonsei University
DisCoRD: Rectified flow decodes discrete motion tokens into continuous, natural movement, balancing faithfulness and realism.
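The decoding step, discrete tokens in, continuous motion out, can be pictured as integrating a learned rectified-flow velocity field from noise to motion, conditioned on the token embeddings. A sketch with an assumed Euler solver, step count, and conditioning interface:

```python
import torch

@torch.no_grad()
def rectified_flow_decode(velocity_net, token_emb, motion_shape, steps=16):
    """velocity_net(x, t, cond) predicts velocity; token_emb: (B, L, D) embedded discrete motion tokens."""
    x = torch.randn(motion_shape)                          # start from Gaussian noise
    for i in range(steps):
        t = torch.full((motion_shape[0],), i / steps)      # current flow time in [0, 1)
        v = velocity_net(x, t, token_emb)                  # conditioned velocity prediction
        x = x + v / steps                                  # Euler step along the (near-)straight path
    return x                                               # continuous motion sequence
```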
AlphaTablets: A Generic Plane Representation for 3D Planar Reconstruction from Monocular Videos
·2678 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tsinghua University
AlphaTablets: A novel 3D plane representation enabling accurate, consistent, and flexible 3D planar reconstruction from monocular videos, achieving state-of-the-art results.
Video Depth without Video Models
·3150 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Carnegie Mellon University
RollingDepth: Achieving state-of-the-art video depth estimation without using complex video models, by cleverly extending a single-image depth estimator.
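One way to picture the approach: run the single-image estimator over short overlapping snippets and stitch per-frame predictions with least-squares scale/shift alignment on the overlaps. The window size, stride, and simple blending below are assumptions, not RollingDepth's exact alignment scheme:

```python
import torch

def fit_scale_shift(src, dst):
    """Least-squares s, t so that s * src + t approximates dst (affine-invariant depth alignment)."""
    x, y = src.flatten(), dst.flatten()
    A = torch.stack([x, torch.ones_like(x)], dim=1)
    s, t = torch.linalg.lstsq(A, y.unsqueeze(1)).solution.squeeze()
    return s, t

def rolling_depth(frames, depth_model, window=3, stride=1):
    """frames: list of (C, H, W) tensors; depth_model maps one frame to an (H, W) depth map."""
    preds = [None] * len(frames)
    for start in range(0, len(frames) - window + 1, stride):
        for offset, frame in enumerate(frames[start:start + window]):
            d = depth_model(frame)
            i = start + offset
            if preds[i] is None:
                preds[i] = d
            else:
                s, t = fit_scale_shift(d, preds[i])      # align the new snippet to what is already there
                preds[i] = 0.5 * (preds[i] + s * d + t)  # blend for temporal consistency
    return torch.stack(preds)
```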
Trajectory Attention for Fine-grained Video Motion Control
·4421 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Nanyang Technological University
Trajectory Attention enhances video motion control by injecting trajectory information, improving precision and long-range consistency in video generation.
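A toy version of attention along point trajectories: gather the feature under each trajectory point in every frame and let those per-trajectory tokens attend to one another across time. The (y, x) trajectory format and single-head attention are assumptions:

```python
import torch

def trajectory_attention(feats, trajs, qkv):
    """feats: (T, H, W, D) per-frame features; trajs: (N, T, 2) integer (y, x) points; qkv: nn.Linear(D, 3*D)."""
    T, H, W, D = feats.shape
    # gather the feature under each trajectory point in each frame -> (N, T, D)
    tokens = feats[torch.arange(T), trajs[..., 0], trajs[..., 1]]
    q, k, v = qkv(tokens).chunk(3, dim=-1)
    # tokens on the same trajectory attend to each other across time
    attn = torch.softmax(q @ k.transpose(-1, -2) / D ** 0.5, dim=-1)   # (N, T, T)
    return attn @ v
```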
Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model
·271 words·2 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Alibaba Group
TeaCache: a training-free method that speeds up video diffusion models by up to 4.41x with minimal quality loss by caching and reusing intermediate outputs across timesteps.
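The caching idea can be sketched as a wrapper that compares the current timestep embedding against the previous one and reuses the cached output when the change is small; the relative-difference test and threshold below are illustrative assumptions, not TeaCache's actual reuse criterion:

```python
import torch

class TimestepEmbeddingCache:
    """Wraps a diffusion backbone and skips recomputation when the timestep embedding barely changes."""
    def __init__(self, model, threshold=0.05):
        self.model = model
        self.threshold = threshold
        self.prev_emb = None
        self.cached_out = None

    @torch.no_grad()
    def __call__(self, x, t_emb, **kwargs):
        if self.prev_emb is not None and self.cached_out is not None:
            rel_change = (t_emb - self.prev_emb).norm() / (self.prev_emb.norm() + 1e-8)
            if rel_change < self.threshold:
                return self.cached_out            # reuse: the output is unlikely to differ much
        out = self.model(x, t_emb, **kwargs)
        self.prev_emb, self.cached_out = t_emb, out
        return out
```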
Open-Sora Plan: Open-Source Large Video Generation Model
·4618 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
Open-Sora Plan introduces an open-source large video generation model that produces long, high-resolution videos from a variety of user inputs.