Computer Vision
AnimateAnything: Consistent and Controllable Animation for Video Generation
·2615 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
AnimateAnything: A unified approach enabling precise & consistent video manipulation via a novel optical flow representation and frequency stabilization.
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
·2623 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Quality Assessment
🏢 State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
SEAGULL: A novel network uses vision-language instruction tuning to assess image quality for regions of interest (ROIs) with high accuracy, leveraging masks and a new dataset for fine-grained IQA.
Number it: Temporal Grounding Videos like Flipping Manga
·2758 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Southeast University
Boosting video temporal grounding, NumPro empowers Vid-LLMs by adding frame numbers, making temporal localization as easy as flipping through manga.
FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
·2555 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Tencent
FitDiT boosts virtual try-on realism by enhancing garment details via Diffusion Transformers, improving texture and size accuracy for high-fidelity virtual fashion.
MagicQuill: An Intelligent Interactive Image Editing System
·4923 words·24 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 HKUST
MagicQuill: an intelligent interactive image editing system enabling intuitive, precise image edits via brushstrokes and real-time intent prediction by a multimodal LLM.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70-80% accuracy and enabling RPA.
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
·1627 words·8 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Alibaba
EgoVid-5M: First high-quality dataset for egocentric video generation, enabling realistic human-centric world simulations.
Wavelet Latent Diffusion (WaLa): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
·3736 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Autodesk
WaLa: a billion-parameter 3D generative model using wavelet encodings achieves state-of-the-art results, generating high-quality 3D shapes in seconds.
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
·2630 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Peking University
GaussianAnything: Interactive point cloud latent diffusion enables high-quality, editable 3D models from images or text, overcoming existing 3D generation limitations.
SAMPart3D: Segment Any Part in 3D Objects
·3136 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Hong Kong
SAMPart3D: Zero-shot 3D part segmentation across granularities, scaling to large datasets & handling part ambiguity.
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
·3438 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Waterloo
OmniEdit, a novel instruction-based image editing model, surpasses existing methods by leveraging specialist supervision and high-quality data, achieving superior performance across diverse editing tasks.
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
·3087 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 NVIDIA Research
Edify Image: photorealistic image generation using cascaded pixel-space diffusion models with a novel Laplacian diffusion process, enabling diverse applications.
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
·3359 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 NVIDIA Research
Add-it: Training-free object insertion in images using pretrained diffusion models, cleverly balancing information from the scene, text prompt, and generated image to achieve state-of-the-art results.
KMM: Key Frame Mask Mamba for Extended Motion Generation
·2527 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Peking University
KMM: Key Frame Mask Mamba generates extended, diverse human motion from text prompts by innovatively masking key frames in the Mamba architecture and using contrastive learning for improved text-motion alignment.
StdGEN: Semantic-Decomposed 3D Character Generation from Single Images
·2454 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Tencent AI Lab
StdGEN: Generate high-quality, semantically decomposed 3D characters from a single image in minutes, enabling flexible customization for various applications.
SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
·4041 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 MIT
SVDQuant boosts 4-bit diffusion models by absorbing outliers via low-rank components, achieving 3.5x memory reduction and 3x speedup on 12B parameter models.
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
·3777 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Toronto
SG-I2V: Zero-shot controllable image-to-video generation using a self-guided approach that leverages pre-trained models for precise object and camera motion control.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
·2474 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Google
ReCapture generates videos with novel camera angles from user videos using masked video fine-tuning, preserving scene motion and plausibly hallucinating unseen parts.
GazeGen: Gaze-Driven User Interaction for Visual Content Generation
·2843 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Human-AI Interaction
🏢 Harvard University
GazeGen uses real-time gaze tracking to enable intuitive hands-free visual content creation and editing, setting a new standard for accessible AR/VR interaction.
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
·2263 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Tsinghua University
DimensionX generates photorealistic 3D and 4D scenes from a single image via controllable video diffusion, enabling precise manipulation of spatial structure and temporal dynamics.