Computer Vision

RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
·2754 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Science and Technology of China
RelaCtrl: Relevance-guided control boosts diffusion transformer efficiency, cutting parameters by intelligently allocating resources.
Dynamic Concepts Personalization from Single Videos
·2668 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Snap Research
Personalizing video models for dynamic concepts is now achievable with Set-and-Sequence, enabling high-fidelity generation, editing, and composition!
Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework
·2585 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Scene Understanding 🏢 MBZUAI
New geolocation dataset & reasoning framework enhance accuracy and interpretability by leveraging human gameplay data.
MagicArticulate: Make Your 3D Models Articulation-Ready
·4321 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Nanyang Technological University
MagicArticulate automates 3D model animation preparation by generating skeletons and skinning weights, overcoming the limitations of prior manual methods and introducing Articulation-XL, a large-scale benchmark.
Intuitive physics understanding emerges from self-supervised pretraining on natural videos
·4400 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Meta AI
AI models learn intuitive physics from self-supervised video pretraining.
Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening
·2525 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Diffusion-Sharpening enhances diffusion model fine-tuning by optimizing sampling trajectories, achieving faster convergence and high inference efficiency without extra NFEs, leading to improved alignment.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
·4393 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Step-Video Team
Step-Video-T2V: A 30B parameter text-to-video model generating high-quality videos up to 204 frames, pushing the boundaries of video foundation models.
Cluster and Predict Latent Patches for Improved Masked Image Modeling
·7222 words·34 mins
AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 Meta FAIR
CAPI: a novel masked image modeling framework boosts self-supervised visual representation learning by predicting latent clusterings, achieving state-of-the-art ImageNet accuracy and mIoU.
VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
·3389 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Fudan University
VidCRAFT3 enables high-quality image-to-video generation with precise control over camera movement, object motion, and lighting, pushing the boundaries of visual content creation.
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
·3939 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Peking University
Next-Block Prediction (NBP) revolutionizes video generation by using a semi-autoregressive model that predicts blocks of video content simultaneously, resulting in significantly faster inference.
MRS: A Fast Sampler for Mean Reverting Diffusion based on ODE and SDE Solvers
·2884 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 School of Artificial Intelligence, University of Chinese Academy of Sciences
MRS: a novel, training-free sampler, drastically speeds up controllable image generation using Mean Reverting Diffusion, achieving 10-20x speedup across various tasks.
Magic 1-For-1: Generating One Minute Video Clips within One Minute
·1947 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Magic 1-For-1 generates one-minute video clips in under a minute by cleverly factorizing the generation task and employing optimization techniques.
Enhance-A-Video: Better Generated Video for Free
·3320 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 National University of Singapore
Enhance-A-Video boosts video generation quality without retraining, by enhancing cross-frame correlations in diffusion transformers, resulting in improved coherence and visual fidelity.
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
·2951 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Texas at Austin
TripoSG: High-fidelity 3D shapes synthesized via large-scale rectified flow models, pushing image-to-3D generation to new heights.
Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT
·3016 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Chinese University of Hong Kong
Lumina-Video: Efficient and flexible video generation using a multi-scale Next-DiT architecture with motion control.
Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
·4798 words·23 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
Efficient-vDiT accelerates video generation by 7.8x using sparse attention and multi-step distillation.
CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers
·2569 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Hong Kong University of Science and Technology
CustomVideoX: Zero-shot personalized video generation, exceeding existing methods in quality & consistency via 3D reference attention and dynamic adaptation.
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
·1752 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tongyi Lab, Alibaba Group
Animate Anyone 2 creates high-fidelity character animations by incorporating environmental context, resulting in seamless character-environment integration and more realistic object interactions.
Dual Caption Preference Optimization for Diffusion Models
·4961 words·24 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Arizona State University
Dual Caption Preference Optimization (DCPO) significantly boosts diffusion model image quality by using paired captions to resolve data distribution conflicts and irrelevant prompt issues.
3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection
·3328 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 Shanghai University
3CAD: A new large-scale, real-world dataset with diverse 3C product anomalies boosts unsupervised anomaly detection, enabling superior algorithm development via a novel Coarse-to-Fine framework.