Computer Vision
MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control
·3209 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Center for Machine Vision and Signal Analysis, Faculty of Information Technology and Electrical Engineering, University of Oulu
MagicFace achieves high-fidelity facial expression editing via action-unit (AU) control, preserving identity and background with a diffusion model and ID encoder, and significantly outperforms existing methods.
Ingredients: Blending Custom Photos with Video Diffusion Transformers
·2689 words·13 mins·
Computer Vision
Video Understanding
🏢 Kunlun Inc.
Ingredients: A new framework customizes videos by blending multiple photos with video diffusion transformers, enabling realistic and personalized video generation while maintaining consistent identity…
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control
·3152 words·15 mins·
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
VideoAnydoor: High-fidelity video object insertion with precise motion control, achieved via an end-to-end framework leveraging an ID extractor and a pixel warper for robust detail preservation and fi…
SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization
·4234 words·20 mins·
Computer Vision
Action Recognition
🏢 Unmanned System Research Institute, Northwestern Polytechnical University
SeFAR: a novel semi-supervised framework for fine-grained action recognition, achieves state-of-the-art results by using dual-level temporal modeling, moderate temporal perturbation, and adaptive regu…
SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration
·1895 words·9 mins·
Computer Vision
Video Understanding
🏢 Nanyang Technological University
SeedVR: A novel diffusion transformer revolutionizes generic video restoration by efficiently handling arbitrary video lengths and resolutions, achieving state-of-the-art performance.
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
·3436 words·17 mins·
Computer Vision
Image Generation
🏢 Huazhong University of Science and Technology
LightningDiT resolves the optimization dilemma in latent diffusion models by aligning latent space with pre-trained vision models, achieving state-of-the-art ImageNet 256x256 generation with over 21x …
MLLM-as-a-Judge for Image Safety without Human Labeling
·6596 words·31 mins·
Computer Vision
Image Classification
🏢 Meta AI
Zero-shot image safety judgment is achieved using MLLMs and a novel method called CLUE, which objectifies safety rules and significantly reduces the need for human labeling.
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
·8988 words·43 mins·
Computer Vision
Image Generation
🏢 Tsinghua University
VisionReward, a novel reward model, surpasses existing methods by precisely capturing multi-dimensional human preferences for image and video generation, enabling more accurate and stable model optimi…
Edicho: Consistent Image Editing in the Wild
·2565 words·13 mins·
Computer Vision
Image Generation
🏢 Hong Kong University of Science and Technology
Edicho: a novel training-free method for consistent image editing across diverse images, achieving precise consistency by leveraging explicit correspondence.
Bringing Objects to Life: 4D generation from 3D objects
·2761 words·13 mins·
Computer Vision
Image Generation
🏢 Bar-Ilan University
3to4D: animates any 3D object with text prompts, preserving visual quality while achieving realistic motion.
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
·4442 words·21 mins·
Computer Vision
Image Generation
🏢 Tencent AI Lab
VideoMaker achieves high-fidelity zero-shot customized video generation by cleverly harnessing the inherent power of video diffusion models, eliminating the need for extra feature extraction and injec…
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models
·3061 words·15 mins·
Computer Vision
3D Vision
🏢 Meta AI
PartGen generates compositional 3D objects with meaningful parts from text, images, or unstructured 3D data using multi-view diffusion models, enabling flexible 3D part editing.
Orient Anything: Learning Robust Object Orientation Estimation from Rendering 3D Models
·3014 words·15 mins·
Computer Vision
3D Vision
🏢 Zhejiang University
Orient Anything: Learning robust object orientation estimation directly from rendered 3D models, achieving state-of-the-art accuracy on real images.
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
·3843 words·19 mins·
Computer Vision
Video Understanding
🏢 MMLab, the Chinese University of Hong Kong
DiTCtrl achieves state-of-the-art multi-prompt video generation without retraining by cleverly controlling attention in a diffusion transformer, enabling smooth transitions between video segments.
DepthLab: From Partial to Complete Depth
·2516 words·12 mins·
Computer Vision
3D Vision
🏢 HKU
DepthLab: a novel image-conditioned depth inpainting model enhances downstream 3D tasks by effectively completing partial depth information, showing superior performance and generalization.
SBS Figures: Pre-training Figure QA from Stage-by-Stage Synthesized Images
·2647 words·13 mins·
Computer Vision
Visual Question Answering
🏢 Kyoto University
SBS Figures creates a massive, high-quality figure QA dataset via a novel stage-by-stage synthesis pipeline, enabling efficient pre-training of visual language models.
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
·3841 words·19 mins·
Computer Vision
Image Generation
🏢 Tsinghua University
Distilled Decoding (DD) drastically speeds up image generation from autoregressive models by using flow matching to enable one-step sampling, achieving significant speedups while maintaining acceptabl…
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
·4398 words·21 mins·
Computer Vision
Image Generation
🏢 National University of Singapore
CLEAR: Conv-Like Linearization boosts pre-trained Diffusion Transformers, achieving 6.3x faster 8K image generation with minimal quality loss.
UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency
·3351 words·16 mins·
Computer Vision
Image Generation
🏢 ETH Zurich
UIP2P: Unsupervised instruction-based image editing achieves high-fidelity edits by enforcing Cycle Edit Consistency, eliminating the need for ground-truth data.
Parallelized Autoregressive Visual Generation
·4274 words·21 mins·
Computer Vision
Image Generation
🏢 Peking University
Boosting autoregressive visual generation speed by 3.6-9.5x, this research introduces parallelized token generation while preserving model simplicity and generation quality.