Computer Vision

Number it: Temporal Grounding Videos like Flipping Manga
·2758 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Southeast University
Boosting video temporal grounding, NumPro empowers Vid-LLMs by adding frame numbers, making temporal localization as easy as flipping through manga.
MagicQuill: An Intelligent Interactive Image Editing System
·4923 words·24 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 HKUST
MagicQuill: an intelligent interactive image editing system enabling intuitive, precise image edits via brushstrokes and real-time intent prediction by a multimodal LLM.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70-80% accuracy and enabling RPA.
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
·1627 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Alibaba
EgoVid-5M: First high-quality dataset for egocentric video generation, enabling realistic human-centric world simulations.
Wavelet Latent Diffusion (WaLa): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
·3736 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Autodesk
WaLa: a billion-parameter 3D generative model using wavelet encodings achieves state-of-the-art results, generating high-quality 3D shapes in seconds.
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
·2630 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
GaussianAnything: Interactive point cloud latent diffusion enables high-quality, editable 3D models from images or text, overcoming existing 3D generation limitations.
SAMPart3D: Segment Any Part in 3D Objects
·3136 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Hong Kong
SAMPart3D: Zero-shot 3D part segmentation across granularities, scaling to large datasets & handling part ambiguity.
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
·3438 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Waterloo
OmniEdit, a novel instruction-based image editing model, surpasses existing methods by leveraging specialist supervision and high-quality data, achieving superior performance across diverse editing tasks.
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
·3087 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 NVIDIA Research
Edify Image: groundbreaking pixel-perfect photorealistic image generation using cascaded pixel-space diffusion models with a novel Laplacian diffusion process, enabling diverse applications.
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
·3359 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 NVIDIA Research
Add-it: Training-free object insertion in images using pretrained diffusion models by cleverly balancing information from the scene, text prompt, and generated image, achieving state-of-the-art results.
KMM: Key Frame Mask Mamba for Extended Motion Generation
·2527 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
KMM: Key Frame Mask Mamba generates extended, diverse human motion from text prompts by innovatively masking key frames in the Mamba architecture and using contrastive learning for improved text-motion alignment.
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
·2454 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tencent AI Lab
StdGEN: Generate high-quality, semantically decomposed 3D characters from a single image in minutes, enabling flexible customization for various applications.
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
·4041 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 MIT
SVDQuant boosts 4-bit diffusion models by absorbing outliers via low-rank components, achieving 3.5x memory reduction and 3x speedup on 12B parameter models.
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
·3777 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Toronto
SG-I2V: Zero-shot controllable image-to-video generation using a self-guided approach that leverages pre-trained models for precise object and camera motion control.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
·2474 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Google
ReCapture generates videos with novel camera angles from user videos using masked video fine-tuning, preserving scene motion and plausibly hallucinating unseen parts.
GazeGen: Gaze-Driven User Interaction for Visual Content Generation
·2843 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Human-AI Interaction 🏢 Harvard University
GazeGen uses real-time gaze tracking to enable intuitive hands-free visual content creation and editing, setting a new standard for accessible AR/VR interaction.
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
·2263 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
DimensionX generates photorealistic 3D and 4D scenes from a single image via controllable video diffusion, enabling precise manipulation of spatial structure and temporal dynamics.
GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details
·5135 words·25 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 SSE, CUHKSZ, China
GarVerseLOD introduces a novel dataset and framework for high-fidelity 3D garment reconstruction from a single image, achieving unprecedented robustness via a hierarchical approach.
Correlation of Object Detection Performance with Visual Saliency and Depth Estimation
·1673 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Object Detection 🏢 Dept. of Artificial Intelligence University of Malta
Visual saliency boosts object detection accuracy more than depth estimation, especially for larger objects, offering valuable insights for model and dataset improvement.
Training-free Regional Prompting for Diffusion Transformers
·1817 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
Training-free Regional Prompting for FLUX boosts compositional text-to-image generation by cleverly manipulating attention mechanisms, achieving fine-grained control without retraining.