Computer Vision

MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
·4302 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
MagicDriveDiT generates high-resolution, long street-view videos with precise control for autonomous driving, overcoming the limitations of previous methods.
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
·3966 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Nanyang Technological University
VBench++: A new benchmark suite meticulously evaluates video generative models across 16 diverse dimensions, aligning with human perception for improved model development and fairer comparisons.
Stylecodes: Encoding Stylistic Information For Image Generation
·237 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 String
StyleCodes enables easy style sharing for image generation by encoding styles as compact strings, enhancing control and collaboration while minimizing quality loss.
SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
·2784 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of Washington
SAMURAI enhances the Segment Anything Model 2 for real-time, zero-shot visual object tracking by incorporating motion-aware memory and motion modeling, significantly improving accuracy and robustness.
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements
·3219 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 Bilkent University
ITACLIP boosts training-free semantic segmentation by architecturally enhancing CLIP, integrating LLM-generated class descriptions, and employing image engineering, achieving state-of-the-art results.
Generative World Explorer
·1739 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Johns Hopkins University
Generative World Explorer (Genex) enables agents to imaginatively explore environments, updating beliefs with generated observations for better decision-making.
Continuous Speculative Decoding for Autoregressive Image Generation
·1799 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Chinese Academy of Sciences
Researchers have developed Continuous Speculative Decoding, boosting autoregressive image generation speed by up to 2.33x while maintaining image quality.
AnimateAnything: Consistent and Controllable Animation for Video Generation
·2615 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
AnimateAnything: A unified approach enabling precise & consistent video manipulation via a novel optical flow representation and frequency stabilization.
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning
·2623 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Image Quality Assessment 🏢 State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA
SEAGULL: A novel network uses vision-language instruction tuning to assess image quality for regions of interest (ROIs) with high accuracy, leveraging masks and a new dataset for fine-grained IQA.
Number it: Temporal Grounding Videos like Flipping Manga
·2758 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Southeast University
Boosting video temporal grounding, NumPro empowers Vid-LLMs by adding frame numbers, making temporal localization as easy as flipping through manga.
FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on
·2555 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tencent
FitDiT boosts virtual try-on realism by enhancing garment details via Diffusion Transformers, improving texture and size accuracy for high-fidelity virtual fashion.
MagicQuill: An Intelligent Interactive Image Editing System
·4923 words·24 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 HKUST
MagicQuill: an intelligent interactive image editing system enabling intuitive, precise image edits via brushstrokes and real-time intent prediction by a multimodal LLM.
Sharingan: Extract User Action Sequence from Desktop Recordings
·9852 words·47 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
Sharingan extracts user action sequences from desktop recordings using novel VLM-based methods, achieving 70-80% accuracy and enabling robotic process automation (RPA).
EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation
·1627 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Alibaba
EgoVid-5M: First high-quality dataset for egocentric video generation, enabling realistic human-centric world simulations.
Wavelet Latent Diffusion (Wala): Billion-Parameter 3D Generative Model with Compact Wavelet Encodings
·3736 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Autodesk
WaLa: a billion-parameter 3D generative model using wavelet encodings achieves state-of-the-art results, generating high-quality 3D shapes in seconds.
GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation
·2630 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Peking University
GaussianAnything: Interactive point cloud latent diffusion enables high-quality, editable 3D models from images or text, overcoming existing 3D generation limitations.
SAMPart3D: Segment Any Part in 3D Objects
·3136 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Hong Kong
SAMPart3D: Zero-shot 3D part segmentation across granularities, scaling to large datasets & handling part ambiguity.
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
·3438 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Waterloo
OmniEdit, a novel instruction-based image editing model, surpasses existing methods by leveraging specialist supervision and high-quality data, achieving superior performance across diverse editing ta…
Edify Image: High-Quality Image Generation with Pixel Space Laplacian Diffusion Models
·3087 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 NVIDIA Research
Edify Image: groundbreaking pixel-perfect photorealistic image generation using cascaded pixel-space diffusion models with a novel Laplacian diffusion process, enabling diverse applications including …
Add-it: Training-Free Object Insertion in Images With Pretrained Diffusion Models
·3359 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 NVIDIA Research
Add-it: Training-free object insertion in images using pretrained diffusion models by cleverly balancing information from the scene, text prompt, and generated image, achieving state-of-the-art result…