Computer Vision
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
·4088 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Classification
🏢 Johns Hopkins University
Smaller image patches improve vision transformer performance, defying conventional wisdom and revealing a new scaling law for enhanced visual understanding.
MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
·2615 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Chinese University of Hong Kong
MotionCanvas lets users design cinematic video shots with intuitive controls for camera and object movements, translating scene-space intentions into video animations.
Fast Video Generation with Sliding Tile Attention
·4012 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of California, San Diego
Sliding Tile Attention (STA) speeds up video generation by 2.43-3.53x without losing quality, exploiting inherent data redundancy in video diffusion models.
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
·2337 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 UC Los Angeles
This paper introduces PointVid, a 3D-aware video generation framework using 3D point regularization to enhance video realism and address common issues like object morphing.
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices
·3325 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Ulsan National Institute of Science and Technology
On-device Sora makes high-quality, diffusion-based text-to-video generation possible on smartphones, overcoming computational and memory limitations through novel techniques.
DynVFX: Augmenting Real Videos with Dynamic Content
·3393 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Weizmann Institute of Science
DynVFX: Effortlessly integrate dynamic content into real videos using simple text prompts. Zero-shot learning and novel attention mechanisms deliver seamless and realistic results.
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
·3510 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Meta AI
VideoJAM enhances video generation by jointly learning appearance and motion representations, achieving state-of-the-art motion coherence.
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
·4621 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Singapore University of Technology and Design
MotionLab: One framework to rule them all! Unifying human motion generation & editing via a novel Motion-Condition-Motion paradigm, boosting efficiency and generalization.
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
·2129 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 ByteDance
OmniHuman-1: Scaling up one-stage conditioned human animation through novel mixed-condition training.
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
·2423 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Show Lab, National University of Singapore
LayerTracer innovatively synthesizes cognitive-aligned layered SVGs via diffusion transformers, bridging the gap between AI and professional design standards by learning from a novel dataset of sequen…
Inverse Bridge Matching Distillation
·4522 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Skolkovo Institute of Science and Technology
Boosting Diffusion Bridge Models: A new distillation technique accelerates inference speed by 4x to 100x, sometimes even improving image quality!
Improved Training Technique for Latent Consistency Models
·3409 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Rutgers University
Researchers significantly enhance latent consistency models’ performance by introducing Cauchy loss, mitigating outlier effects, and employing novel training strategies, thus bridging the gap with dif…
A Study on the Performance of U-Net Modifications in Retroperitoneal Tumor Segmentation
·1561 words·8 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Segmentation
🏢 University of British Columbia
ViLU-Net, a novel U-Net modification using Vision-xLSTM, achieves superior retroperitoneal tumor segmentation accuracy and efficiency, exceeding existing state-of-the-art methods.
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
·3227 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
DIFFSPLAT repurposes 2D image diffusion models to natively generate high-quality 3D Gaussian splats, overcoming limitations in existing 3D generation methods.
iFormer: Integrating ConvNet and Transformer for Mobile Application
·7046 words·34 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Classification
🏢 Shanghai Jiao Tong University
iFormer: A new family of mobile hybrid vision networks that expertly blends ConvNeXt’s fast local feature extraction with the efficient global modeling of self-attention, achieving top-tier accuracy a…
Relightable Full-Body Gaussian Codec Avatars
·3832 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 ETH Zurich
Relightable Full-Body Gaussian Codec Avatars: Realistic, animatable full-body avatars are now possible using learned radiance transfer and efficient 3D Gaussian splatting.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
·4575 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Nanyang Technological University
Video-MMMU benchmark systematically evaluates Large Multimodal Models’ knowledge acquisition from videos across multiple disciplines and cognitive stages, revealing significant gaps between human and …
Temporal Preference Optimization for Long-Form Video Understanding
·2626 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Stanford University
Temporal Preference Optimization (TPO) boosts long-form video understanding in video-LLMs by leveraging preference learning, via a self-training method using preference …
Improving Video Generation with Human Feedback
·4418 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
Human feedback boosts video generation! New VideoReward model & alignment algorithms significantly improve video quality and user prompt alignment, exceeding prior methods.
EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
·2578 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ByteDance
EchoVideo generates high-fidelity, identity-preserving videos by cleverly fusing text and image features, overcoming limitations of prior methods.