Computer Vision

EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
·3185 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 CUHK MMLab
EasyRef uses multimodal LLMs to generate images from multiple references, overcoming limitations of prior methods by capturing consistent visual elements and offering improved zero-shot generalization…
DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
·3252 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DisPose disentangles pose guidance for controllable human image animation, generating diverse animations while preserving appearance consistency using only sparse skeleton pose input, eliminating the …
Arbitrary-steps Image Super-resolution via Diffusion Inversion
·3889 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Nanyang Technological University
InvSR: a novel image super-resolution technique using diffusion inversion, enabling flexible sampling steps for efficient and high-fidelity results.
Video Motion Transfer with Diffusion Transformers
·3141 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Oxford
DiTFlow: training-free video motion transfer using Diffusion Transformers, enabling realistic motion control in synthesized videos via Attention Motion Flow.
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
·3117 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Hong Kong
UniReal: a universal framework for image generation and editing, unifying diverse tasks via learning real-world dynamics from video data, achieving highly realistic and versatile results.
STIV: Scalable Text and Image Conditioned Video Generation
·5285 words·25 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Apple
STIV: A novel, scalable method for text and image-conditioned video generation, systematically improving model architectures, training, and data curation for superior performance.
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
·6546 words·31 mins
AI Generated 🤗 Daily Papers Computer Vision Document Parsing 🏢 Shanghai AI Laboratory
OmniDocBench, a novel benchmark, tackles limitations in current document parsing by introducing a diverse, high-quality dataset with comprehensive annotations, enabling fair multi-level evaluation of …
ObjCtrl-2.5D: Training-free Object Control with Camera Poses
·3506 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
ObjCtrl-2.5D: Training-free, precise image-to-video object control using 3D trajectories and camera poses.
Mobile Video Diffusion
·3393 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Qualcomm AI Research
MobileVD: The first mobile-optimized video diffusion model, achieving 523x efficiency improvement over state-of-the-art with minimal quality loss, enabling realistic video generation on smartphones.
FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models
·3186 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Stanford University
FiVA dataset and its adaptation framework enable unprecedented fine-grained control over visual attributes in text-to-image generation, empowering users to craft highly customized images.
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
·3317 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Institute of Automation, Chinese Academy of Sciences
FireFlow provides fast inversion of rectified flow models, making semantic image editing faster and better.
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
·4676 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Shanghai Artificial Intelligence Laboratory
Introducing Evaluation Agent, a faster, more flexible human-like framework for evaluating visual generative AI.
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
·3918 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DiffSensei: a new framework that generates customized manga with dynamic multi-character control using multi-modal LLMs and diffusion models, outperforming existing methods.
MoViE: Mobile Diffusion for Video Editing
·2482 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Qualcomm AI Research
MoViE achieves 12 FPS video editing on mobile phones by optimizing existing image editing models, a major breakthrough in on-device video processing.
EMOv2: Pushing 5M Vision Model Frontier
·6258 words·30 mins
AI Generated 🤗 Daily Papers Computer Vision Image Classification 🏢 Tencent AI Lab
EMOv2 achieves state-of-the-art performance in various vision tasks using a novel Meta Mobile Block, pushing the 5M parameter lightweight model frontier.
Mind the Time: Temporally-Controlled Multi-Event Video Generation
·4541 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Toronto
MinT: Generating coherent videos with precisely timed, multiple events via temporal control, surpassing existing methods.
Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards for Visuomotor Robot Policy Alignment
·2984 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Robotics 🏢 UC Berkeley
RAPL efficiently aligns robots with human preferences using minimal feedback by aligning visual representations before reward learning.
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
·276 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Fudan University
LiFT leverages human feedback, including reasoning, to effectively align text-to-video models with human preferences, significantly improving video quality.
ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality
·2050 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Zhejiang University
ZipAR accelerates autoregressive image generation by up to 91% through parallel decoding leveraging spatial locality in images, making high-resolution image generation significantly faster.
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
·3379 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 VinAI Research
SwiftEdit achieves lightning-fast, high-quality text-guided image editing in just 0.23 seconds via a novel one-step diffusion process.