Computer Vision
Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
·2982 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 NVIDIA
DIFIX3D+ improves 3D reconstructions by reducing artifacts via single-step diffusion models, enhancing novel-view synthesis quality and consistency.
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
·2523 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
DREAM ENGINE: Text-image interleaved control made easy, unifying text and visual cues for creative image generation.
Mobius: Text to Seamless Looping Video Generation via Latent Shift
·2353 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Chongqing University of Posts and Telecommunications, China
Mobius generates seamless looping videos from text using latent shift, repurposing pre-trained models without training.
Efficient Gaussian Splatting for Monocular Dynamic Scene Rendering via Sparse Time-Variant Attribute Modeling
·3037 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 National University of Singapore
EDGS: Achieves faster, high-quality dynamic scene rendering through sparse time-variant attribute modeling and filtering of static areas.
Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting
·3674 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tsinghua University
ArtGS: Builds interactable replicas of complex articulated objects via Gaussian Splatting, with state-of-the-art quality and efficiency.
X-Dancer: Expressive Music to Human Dance Video Generation
·1759 words·9 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 UC San Diego
X-Dancer: Expressive dance video generation from music and a single image!
VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing
·2983 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ReLER Lab, AAII, University of Technology Sydney
VideoGrain: Fine-grained video editing via space-time attention!
Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model
·3468 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Scene Understanding
🏢 Hong Kong Center for Construction Robotics, The Hong Kong University of Science and Technology
Plane-DUSt3R: Leveraging pre-trained models for unposed sparse views room layout reconstruction, enhancing robustness and generalization.
GCC: Generative Color Constancy via Diffusing a Color Checker
·362 words·2 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 National Yang Ming Chiao Tung University
GCC: Color constancy through diffusion, inpainting a color checker for stable illumination estimation.
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
·3965 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Segmentation
🏢 Zhejiang University
DICEPTION: A generalist diffusion model for visual perceptual tasks.
M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment
·1433 words·7 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Hefei University of Technology
M3-AGIQA: A multimodal AI solution that comprehensively assesses AI-generated image quality, achieving state-of-the-art performance by distilling online MLLM capabilities into a local model.
RelaCtrl: Relevance-Guided Efficient Control for Diffusion Transformers
·2754 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Science and Technology of China
RelaCtrl: Relevance-guided control boosts diffusion transformer efficiency, cutting parameters by intelligently allocating resources.
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
·1606 words·8 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 National University of Singapore
PhotoDoodle: Mimicking artistic image editing with personalized decorative elements through learning from few-shot pairwise data.
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
·3707 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Scene Understanding
🏢 MBZUAI
KITAB-Bench: A new multi-domain Arabic OCR benchmark to bridge the performance gap with English OCR technologies.
Dynamic Concepts Personalization from Single Videos
·2668 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Snap Research
Personalizing video models for dynamic concepts is now achievable with Set-and-Sequence: enabling high-fidelity generation, editing, and composition!
CrossOver: 3D Scene Cross-Modal Alignment
·5760 words·28 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Scene Understanding
🏢 Stanford University
CrossOver: Flexible scene-level cross-modal alignment via modality-agnostic embeddings, unlocking robust 3D scene understanding.
JL1-CD: A New Benchmark for Remote Sensing Change Detection and a Robust Multi-Teacher Knowledge Distillation Framework
·3675 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Segmentation
🏢 Tsinghua University
JL1-CD: New all-inclusive dataset & multi-teacher knowledge distillation framework for robust remote sensing change detection, achieving state-of-the-art results!
Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework
·2585 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Scene Understanding
🏢 MBZUAI
New geolocation dataset & reasoning framework enhance accuracy and interpretability by leveraging human gameplay data.
MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
·221 words·2 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 SenseTime Research
MaskGWM: Improves driving world models by using video mask reconstruction for better generalization.
MagicArticulate: Make Your 3D Models Articulation-Ready
·4321 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Nanyang Technological University
MagicArticulate automates 3D model animation preparation by generating skeletons and skinning weights, overcoming prior manual methods’ limitations, and introducing Articulation-XL, a large-scale benc…