
Computer Vision

Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
·4088 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Image Classification 🏢 Johns Hopkins University
Smaller image patches improve vision transformer performance, defying conventional wisdom and revealing a new scaling law for enhanced visual understanding.
MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
·2615 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Chinese University of Hong Kong
MotionCanvas lets users design cinematic video shots with intuitive controls for camera and object movements, translating scene-space intentions into video animations.
Fast Video Generation with Sliding Tile Attention
·4012 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of California, San Diego
Sliding Tile Attention (STA) boosts video generation speed by 2.43-3.53x without losing quality by exploiting inherent data redundancy in video diffusion models.
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
·2337 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of California, Los Angeles
This paper introduces PointVid, a 3D-aware video generation framework using 3D point regularization to enhance video realism and address common issues like object morphing.
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices
·3325 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Ulsan National Institute of Science and Technology
On-device Sora makes high-quality, diffusion-based text-to-video generation possible on smartphones, overcoming computational and memory limitations through novel techniques.
DynVFX: Augmenting Real Videos with Dynamic Content
·3393 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Weizmann Institute of Science
DynVFX: Effortlessly integrate dynamic content into real videos using simple text prompts. Zero-shot learning and novel attention mechanisms deliver seamless and realistic results.
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
·3510 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Meta AI
VideoJAM enhances video generation by jointly learning appearance and motion representations, achieving state-of-the-art motion coherence.
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
·4621 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Singapore University of Technology and Design
MotionLab: One framework to rule them all! Unifying human motion generation & editing via a novel Motion-Condition-Motion paradigm, boosting efficiency and generalization.
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
·2129 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 ByteDance
OmniHuman-1: Scaling up one-stage conditioned human animation through novel mixed-condition training.
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
·2423 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Show Lab, National University of Singapore
LayerTracer innovatively synthesizes cognitive-aligned layered SVGs via diffusion transformers, bridging the gap between AI and professional design standards by learning from a novel dataset of sequen…
Inverse Bridge Matching Distillation
·4522 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Skolkovo Institute of Science and Technology
Boosting Diffusion Bridge Models: A new distillation technique accelerates inference speed by 4x to 100x, sometimes even improving image quality!
Improved Training Technique for Latent Consistency Models
·3409 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Rutgers University
Researchers significantly enhance latent consistency models' performance by introducing Cauchy loss, mitigating outlier effects, and employing novel training strategies, thus bridging the gap with dif…
A Study on the Performance of U-Net Modifications in Retroperitoneal Tumor Segmentation
·1561 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 University of British Columbia
ViLU-Net, a novel U-Net modification using Vision-xLSTM, achieves superior retroperitoneal tumor segmentation accuracy and efficiency, exceeding existing state-of-the-art methods.
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
·3227 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DiffSplat repurposes 2D image diffusion models to natively generate high-quality 3D Gaussian splats, overcoming limitations in existing 3D generation methods.
iFormer: Integrating ConvNet and Transformer for Mobile Application
·7046 words·34 mins
AI Generated 🤗 Daily Papers Computer Vision Image Classification 🏢 Shanghai Jiao Tong University
iFormer: A new family of mobile hybrid vision networks that expertly blends ConvNeXt's fast local feature extraction with the efficient global modeling of self-attention, achieving top-tier accuracy a…
Relightable Full-Body Gaussian Codec Avatars
·3832 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 ETH Zurich
Relightable Full-Body Gaussian Codec Avatars: Realistic, animatable full-body avatars are now possible using learned radiance transfer and efficient 3D Gaussian splatting.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
·4575 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
Video-MMMU benchmark systematically evaluates Large Multimodal Models' knowledge acquisition from videos across multiple disciplines and cognitive stages, revealing significant gaps between human and …
Temporal Preference Optimization for Long-Form Video Understanding
·2626 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Stanford University
Boosting long-form video understanding, Temporal Preference Optimization (TPO) enhances video-LLMs by leveraging preference learning. It achieves this through a self-training method using preference …
Improving Video Generation with Human Feedback
·4418 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
Human feedback boosts video generation! New VideoReward model & alignment algorithms significantly improve video quality and user prompt alignment, exceeding prior methods.
EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
·2578 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance
EchoVideo generates high-fidelity, identity-preserving videos by cleverly fusing text and image features, overcoming limitations of prior methods.