Computer Vision

FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
·4390 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tencent AI Lab
FreeSplatter: a novel feed-forward framework that reconstructs high-quality 3D scenes from uncalibrated sparse-view images, estimating camera poses in seconds.
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
·2401 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Nanyang Technological University
FreeScale generates stunning 8K images and high-fidelity videos without retraining.
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
·2812 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Virginia Tech
Edit images precisely with AI, no masks needed!
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
·3185 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 CUHK MMLab
EasyRef uses multimodal LLMs to generate images from multiple references, overcoming limitations of prior methods by capturing consistent visual elements and offering improved zero-shot generalization…
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
·3252 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DisPose disentangles pose guidance for controllable human image animation, generating diverse animations while preserving appearance consistency using only sparse skeleton pose input, eliminating the …
Arbitrary-steps Image Super-resolution via Diffusion Inversion
·3889 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Nanyang Technological University
InvSR: a novel image super-resolution technique using diffusion inversion, enabling flexible sampling steps for efficient and high-fidelity results.
Video Motion Transfer with Diffusion Transformers
·3141 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Oxford
DiTFlow: training-free video motion transfer using Diffusion Transformers, enabling realistic motion control in synthesized videos via Attention Motion Flow.
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
·3117 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Hong Kong
UniReal: a universal framework for image generation and editing, unifying diverse tasks via learning real-world dynamics from video data, achieving highly realistic and versatile results.
STIV: Scalable Text and Image Conditioned Video Generation
·5285 words·25 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Apple
STIV: A novel, scalable method for text and image-conditioned video generation, systematically improving model architectures, training, and data curation for superior performance.
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
·6546 words·31 mins
AI Generated 🤗 Daily Papers Computer Vision Document Parsing 🏢 Shanghai AI Laboratory
OmniDocBench, a novel benchmark, tackles limitations in current document parsing by introducing a diverse, high-quality dataset with comprehensive annotations, enabling fair multi-level evaluation of …
ObjCtrl-2.5D: Training-free Object Control with Camera Poses
·3506 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
ObjCtrl-2.5D: Training-free, precise image-to-video object control using 3D trajectories and camera poses.
Mobile Video Diffusion
·3393 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Qualcomm AI Research
MobileVD: The first mobile-optimized video diffusion model, achieving a 523x efficiency improvement over the state of the art with minimal quality loss, enabling realistic video generation on smartphones.
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
·3186 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Stanford University
FiVA dataset and its adaptation framework enable unprecedented fine-grained control over visual attributes in text-to-image generation, empowering users to craft highly customized images.
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
·3317 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Institute of Automation, Chinese Academy of Sciences
FireFlow accelerates rectified flow inversion, making image semantic editing both faster and higher quality.
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
·4676 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Shanghai Artificial Intelligence Laboratory
Introducing Evaluation Agent, a faster, more flexible human-like framework for evaluating visual generative AI.
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
·3918 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DiffSensei: A new framework generates customized manga with dynamic multi-character control using multi-modal LLMs and diffusion models, outperforming existing methods.
MoViE: Mobile Diffusion for Video Editing
·2482 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Qualcomm AI Research
MoViE: Mobile Diffusion for Video Editing achieves 12 FPS video editing on mobile phones by optimizing existing image editing models, a major breakthrough in on-device video processing.
EMOv2: Pushing 5M Vision Model Frontier
·6258 words·30 mins
AI Generated 🤗 Daily Papers Computer Vision Image Classification 🏢 Tencent AI Lab
EMOv2 achieves state-of-the-art performance across diverse vision tasks using a novel Meta Mobile Block, pushing the frontier of lightweight 5M-parameter models.
Mind the Time: Temporally-Controlled Multi-Event Video Generation
·4541 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Toronto
MinT: Generating coherent videos with multiple, precisely timed events via temporal control, surpassing existing methods.
Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
·2984 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Robotics 🏢 UC Berkeley
RAPL efficiently aligns robots with human preferences using minimal feedback by aligning visual representations before reward learning.