Computer Vision

KMM: Key Frame Mask Mamba for Extended Motion Generation
·2527 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
KMM: Key Frame Mask Mamba generates extended, diverse human motion from text prompts by innovatively masking key frames in the Mamba architecture and using contrastive learning for improved text-motion alignment.
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
·2454 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tencent AI Lab
StdGEN: Generate high-quality, semantically decomposed 3D characters from a single image in minutes, enabling flexible customization for various applications.
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
·4041 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 MIT
SVDQuant boosts 4-bit diffusion models by absorbing outliers via low-rank components, achieving 3.5x memory reduction and 3x speedup on 12B parameter models.
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
·3777 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Toronto
SG-I2V: Zero-shot controllable image-to-video generation using a self-guided approach that leverages pre-trained models for precise object and camera motion control.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
·2474 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google
ReCapture generates videos with novel camera angles from user videos using masked video fine-tuning, preserving scene motion and plausibly hallucinating unseen parts.
GazeGen: Gaze-Driven User Interaction for Visual Content Generation
·2843 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Human-AI Interaction 🏢 Harvard University
GazeGen uses real-time gaze tracking to enable intuitive hands-free visual content creation and editing, setting a new standard for accessible AR/VR interaction.
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
·2263 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
DimensionX generates photorealistic 3D and 4D scenes from a single image via controllable video diffusion, enabling precise manipulation of spatial structure and temporal dynamics.
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
·5135 words·25 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 SSE, CUHKSZ, China
GarVerseLOD introduces a novel dataset and framework for high-fidelity 3D garment reconstruction from a single image, achieving unprecedented robustness via a hierarchical approach and a massive dataset with multiple levels of detail.
Correlation of Object Detection Performance with Visual Saliency and Depth Estimation
·1673 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Object Detection 🏢 Dept. of Artificial Intelligence, University of Malta
Visual saliency boosts object detection accuracy more than depth estimation, especially for larger objects, offering valuable insights for model and dataset improvement.
Training-free Regional Prompting for Diffusion Transformers
·1817 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Training-free Regional Prompting for FLUX boosts compositional text-to-image generation by cleverly manipulating attention mechanisms, achieving fine-grained control without retraining.
How Far is Video Generation from World Model: A Physical Law Perspective
·3657 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance Research
Scaling video generation models doesn’t guarantee they’ll learn physics; this study reveals they prioritize visual cues over true physical understanding.
GenXD: Generating Any 3D and 4D Scenes
·2731 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 National University of Singapore
GenXD: A unified model generating high-quality 3D & 4D scenes from any number of images, advancing the field of dynamic scene generation.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
·3142 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Meta AI
Adaptive Caching (AdaCache) dramatically speeds up video generation with diffusion transformers by cleverly caching and reusing computations, tailoring the process to each video’s complexity and motion.
DreamPolish: Domain Score Distillation With Progressive Geometry Generation
·2197 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
DreamPolish: A new text-to-3D model generates highly detailed 3D objects with polished surfaces and realistic textures using progressive geometry refinement and a novel domain score distillation technique.
Randomized Autoregressive Visual Generation
·4145 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 ByteDance
Randomized Autoregressive Modeling (RAR) sets a new state-of-the-art in image generation by cleverly introducing randomness during training to improve the model’s ability to learn from bidirectional contexts.
Constant Acceleration Flow
·3289 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Korea University
Constant Acceleration Flow (CAF) dramatically speeds up diffusion model generation by using a constant acceleration equation, outperforming state-of-the-art methods with improved accuracy and few-step generation quality.
Learning Video Representations without Natural Videos
·3154 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ShanghaiTech University
High-performing video representation models can be trained using only synthetic videos and images, eliminating the need for large natural video datasets.
In-Context LoRA for Diffusion Transformers
·392 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tongyi Lab
In-Context LoRA empowers existing text-to-image models for high-fidelity multi-image generation by simply concatenating images and using minimal task-specific LoRA tuning.
DELTA: Dense Efficient Long-range 3D Tracking for any video
·3706 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 UMass Amherst
DELTA: A new method efficiently tracks every pixel in 3D space from monocular videos, enabling accurate motion estimation across entire videos with state-of-the-art accuracy and over 8x speed improvement.
HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models
·2152 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
HelloMeme enhances text-to-image models by integrating spatial knitting attentions, enabling high-fidelity meme video generation while preserving model generalization.