Computer Vision
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
·3961 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Fudan University
VideoRoPE enhances video processing in Transformer models by introducing a novel 3D rotary position embedding that preserves spatio-temporal relationships, resulting in superior performance across var…
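The core idea of a 3D rotary embedding can be sketched in a few lines: split the head dimension into three channel groups and apply a standard 1D RoPE rotation along each of the temporal, vertical, and horizontal axes. The sketch below is a generic factorization under the assumption that the head dimension is divisible by 6; it is not VideoRoPE's specific frequency allocation.

```python
import numpy as np

def rope_1d(feats, pos, base=10000.0):
    """Standard 1D RoPE: rotate channel pairs by position-dependent angles."""
    d = feats.shape[-1]                          # must be even
    freqs = base ** (-np.arange(0, d, 2) / d)    # (d/2,) rotation frequencies
    ang = pos[..., None] * freqs                 # (n, d/2) angles
    cos, sin = np.cos(ang), np.sin(ang)
    f1, f2 = feats[..., 0::2], feats[..., 1::2]
    out = np.empty_like(feats)
    out[..., 0::2] = f1 * cos - f2 * sin
    out[..., 1::2] = f1 * sin + f2 * cos
    return out

def rope_3d(feats, t, y, x):
    """Factorize over (t, y, x): one channel group per axis."""
    d = feats.shape[-1] // 3                     # assumes head dim divisible by 6
    return np.concatenate([
        rope_1d(feats[..., :d], t),
        rope_1d(feats[..., d:2 * d], y),
        rope_1d(feats[..., 2 * d:], x),
    ], axis=-1)

# 8 tokens from a 2x2x2 video grid, head dim 12 (4 channels per axis)
t, y, x = np.unravel_index(np.arange(8), (2, 2, 2))
q = rope_3d(np.random.randn(8, 12), t, y, x)
```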
Goku: Flow Based Video Generative Foundation Models
·3430 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Hong Kong
Goku: a novel family of joint image-and-video generation models uses rectified flow Transformers, achieving industry-leading performance with a robust data pipeline and training infrastructure.
FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
·4450 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
FlashVideo: Generate stunning high-resolution videos efficiently using a two-stage framework prioritizing fidelity and detail, achieving state-of-the-art results.
AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting
·4072 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 National Yang Ming Chiao Tung University
AuraFusion360: High-quality 360° scene inpainting achieved via novel augmented unseen region alignment and a new benchmark dataset.
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
·4088 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Classification
🏢 Johns Hopkins University
Smaller image patches improve vision transformer performance, defying conventional wisdom and revealing a new scaling law for enhanced visual understanding.
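The "50,176 tokens" in the title follows directly from the patchification arithmetic: an H×W image cut into p×p patches yields (H/p)·(W/p) tokens, so a standard 224×224 input at patch size 1 produces 224² = 50,176 tokens. A quick check:

```python
# Token count for a ViT: an H x W image cut into p x p patches
# yields (H // p) * (W // p) tokens.
def num_tokens(h, w, p):
    assert h % p == 0 and w % p == 0, "image must tile evenly"
    return (h // p) * (w // p)

for p in (16, 8, 4, 2, 1):             # shrinking the patch size
    print(p, num_tokens(224, 224, p))  # 196, 784, 3136, 12544, 50176
```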
MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
·2615 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Chinese University of Hong Kong
MotionCanvas lets users design cinematic video shots with intuitive controls for camera and object movements, translating scene-space intentions into video animations.
Fast Video Generation with Sliding Tile Attention
·4012 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of California, San Diego
Sliding Tile Attention (STA) boosts video generation speed by 2.43-3.53x without losing quality, exploiting inherent data redundancy in video diffusion models.
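The idea behind tile-local attention can be illustrated with a mask: each query attends only to keys whose tile lies within a fixed window, so cost scales with the window size rather than the full sequence length. This 1D mask is illustrative only; STA itself operates on 3D video tiles inside a fused attention kernel.

```python
import numpy as np

def tile_attention_mask(n_tokens, tile, window):
    """Boolean mask: token i may attend to token j iff their tiles
    are within `window` tiles of each other (1D sketch)."""
    tiles = np.arange(n_tokens) // tile
    return np.abs(tiles[:, None] - tiles[None, :]) <= window

mask = tile_attention_mask(n_tokens=4096, tile=64, window=1)
print(mask.mean())  # fraction of attention pairs actually computed (~0.046)
```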
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
·2337 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of California, Los Angeles
This paper introduces PointVid, a 3D-aware video generation framework using 3D point regularization to enhance video realism and address common issues like object morphing.
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices
·3325 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Ulsan National Institute of Science and Technology
On-device Sora makes high-quality, diffusion-based text-to-video generation possible on smartphones, overcoming computational and memory limitations through novel techniques.
DynVFX: Augmenting Real Videos with Dynamic Content
·3393 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Weizmann Institute of Science
DynVFX: Effortlessly integrate dynamic content into real videos using simple text prompts. Zero-shot learning and novel attention mechanisms deliver seamless and realistic results.
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
·3510 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Meta AI
VideoJAM enhances video generation by jointly learning appearance and motion representations, achieving state-of-the-art motion coherence.
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
·4621 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Singapore University of Technology and Design
MotionLab: One framework to rule them all! Unifying human motion generation & editing via a novel Motion-Condition-Motion paradigm, boosting efficiency and generalization.
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
·2129 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 ByteDance
OmniHuman-1: Scaling up one-stage conditioned human animation through novel mixed-condition training.
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
·2423 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Show Lab, National University of Singapore
LayerTracer innovatively synthesizes cognitive-aligned layered SVGs via diffusion transformers, bridging the gap between AI and professional design standards by learning from a novel dataset of sequen…
Inverse Bridge Matching Distillation
·4522 words·22 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Skolkovo Institute of Science and Technology
Boosting Diffusion Bridge Models: A new distillation technique accelerates inference speed by 4x to 100x, sometimes even improving image quality!
Improved Training Technique for Latent Consistency Models
·3409 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Rutgers University
Researchers significantly enhance latent consistency models’ performance by introducing Cauchy loss, mitigating outlier effects, and employing novel training strategies, thus bridging the gap with dif…
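The Cauchy (Lorentzian) loss named here is, in its common form, log(1 + (r/γ)²): quadratic near zero like squared error but only logarithmic in the tails, so outlier residuals in latent space contribute bounded gradients. A minimal sketch, where γ and the mean reduction are assumptions, not the paper's exact recipe:

```python
import numpy as np

def cauchy_loss(pred, target, gamma=1.0):
    """Robust Cauchy loss: mean of log(1 + (r / gamma)^2).
    Behaves like MSE for small residuals, damps large outliers."""
    r = pred - target
    return np.mean(np.log1p((r / gamma) ** 2))

# Compare tail behavior against MSE on an outlier-heavy residual
r = np.array([0.1, 0.5, 10.0])
print(cauchy_loss(r, np.zeros(3)))  # ~1.62: outlier adds only log(101)
print(np.mean(r ** 2))              # ~33.4: outlier dominates squared error
```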
A Study on the Performance of U-Net Modifications in Retroperitoneal Tumor Segmentation
·1561 words·8 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Segmentation
🏢 University of British Columbia
ViLU-Net, a novel U-Net modification using Vision-xLSTM, achieves superior retroperitoneal tumor segmentation accuracy and efficiency, exceeding existing state-of-the-art methods.
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
·3227 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
DiffSplat repurposes 2D image diffusion models to natively generate high-quality 3D Gaussian splats, overcoming limitations in existing 3D generation methods.
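For context, a 3D Gaussian splat is conventionally parameterized by a mean, anisotropic scales, a rotation quaternion, an opacity, and spherical-harmonic color coefficients, with covariance Σ = R S Sᵀ Rᵀ. A minimal sketch of that standard layout (field names are illustrative, not DiffSplat's API):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One 3D Gaussian primitive in the usual splatting parameterization."""
    mean: np.ndarray       # (3,)  center in world space
    scale: np.ndarray      # (3,)  per-axis std devs (often stored in log-space)
    rotation: np.ndarray   # (4,)  unit quaternion (w, x, y, z)
    opacity: float         # alpha in [0, 1]
    sh_coeffs: np.ndarray  # (K, 3) spherical-harmonic color coefficients

    def covariance(self):
        """Sigma = R S S^T R^T from quaternion + scales."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T

splat = GaussianSplat(
    mean=np.zeros(3),
    scale=np.array([0.1, 0.1, 0.02]),         # flat, disc-like Gaussian
    rotation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity orientation
    opacity=0.8,
    sh_coeffs=np.zeros((16, 3)),              # degree-3 SH -> 16 coefficients
)
print(splat.covariance())
```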
iFormer: Integrating ConvNet and Transformer for Mobile Application
·7046 words·34 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Classification
🏢 Shanghai Jiao Tong University
iFormer: A new family of mobile hybrid vision networks that expertly blends ConvNeXt’s fast local feature extraction with the efficient global modeling of self-attention, achieving top-tier accuracy a…
Relightable Full-Body Gaussian Codec Avatars
·3832 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 ETH Zurich
Relightable Full-Body Gaussian Codec Avatars: Realistic, animatable full-body avatars are now possible using learned radiance transfer and efficient 3D Gaussian splatting.