Computer Vision
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
·4302 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
MagicDriveDiT generates high-resolution, long street-view videos with precise control, overcoming the limitations of previous methods for autonomous driving.
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
·3966 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Nanyang Technological University
VBench++: A new benchmark suite meticulously evaluates video generative models across 16 diverse dimensions, aligning with human perception for improved model development and fairer comparisons.
Stylecodes: Encoding Stylistic Information For Image Generation
·237 words·2 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 String
StyleCodes enables easy style sharing for image generation by encoding styles as compact strings, enhancing control and collaboration while minimizing quality loss.
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
·2784 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of Washington
SAMURAI enhances the Segment Anything Model 2 for real-time, zero-shot visual object tracking by incorporating motion-aware memory and motion modeling, significantly improving accuracy and robustness.
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
·3219 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Segmentation
🏢 Bilkent University
ITACLIP boosts training-free semantic segmentation by architecturally enhancing CLIP, integrating LLM-generated class descriptions, and employing image engineering, achieving state-of-the-art results.
Generative World Explorer
·1739 words·9 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Johns Hopkins University
Generative World Explorer (Genex) enables agents to imaginatively explore environments, updating beliefs with generated observations for better decision-making.
Continuous Speculative Decoding for Autoregressive Image Generation
·1799 words·9 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Chinese Academy of Sciences
Researchers have developed Continuous Speculative Decoding, boosting autoregressive image generation speed by up to 2.33x while maintaining image quality.
AnimateAnything: Consistent and Controllable Animation for Video Generation
·2615 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
AnimateAnything: A unified approach enabling precise & consistent video manipulation via a novel optical flow representation and frequency stabilization.
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
·2623 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Quality Assessment
🏢 State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
SEAGULL: A novel network uses vision-language instruction tuning to assess image quality for regions of interest (ROIs) with high accuracy, leveraging masks and a new dataset for fine-grained IQA.
Number it: Temporal Grounding Videos like Flipping Manga
·2758 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Southeast University
Boosting video temporal grounding, NumPro empowers Vid-LLMs by adding frame numbers, making temporal localization as easy as flipping through manga.
FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
·2555 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Tencent
FitDiT boosts virtual try-on realism by enhancing garment details via Diffusion Transformers, improving texture and size accuracy for high-fidelity virtual fashion.
MagicQuill: An Intelligent Interactive Image Editing System
·4923 words·24 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 HKUST
MagicQuill: an intelligent interactive image editing system enabling intuitive, precise image edits via brushstrokes and real-time intent prediction by a multimodal LLM.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70-80% accuracy and enabling robotic process automation (RPA).
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
·1627 words·8 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Alibaba
EgoVid-5M: First high-quality dataset for egocentric video generation, enabling realistic human-centric world simulations.
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
·3736 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Autodesk
WaLa: a billion-parameter 3D generative model using wavelet encodings achieves state-of-the-art results, generating high-quality 3D shapes in seconds.
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
·2630 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Peking University
GaussianAnything: Interactive point cloud latent diffusion enables high-quality, editable 3D models from images or text, overcoming existing 3D generation limitations.
SAMPart3D: Segment Any Part in 3D Objects
·3136 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Hong Kong
SAMPart3D: Zero-shot 3D part segmentation across granularities, scaling to large datasets & handling part ambiguity.
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
·3438 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Waterloo
OmniEdit, a novel instruction-based image editing model, surpasses existing methods by leveraging specialist supervision and high-quality data, achieving superior performance across diverse editing ta…
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
·3087 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 NVIDIA Research
Edify Image: groundbreaking pixel-perfect photorealistic image generation using cascaded pixel-space diffusion models with a novel Laplacian diffusion process, enabling diverse applications including …
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
·3359 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 NVIDIA Research
Add-it: Training-free object insertion in images using pretrained diffusion models by cleverly balancing information from the scene, text prompt, and generated image, achieving state-of-the-art result…