
Computer Vision

Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
·4088 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Image Classification 🏢 Johns Hopkins University
Smaller image patches improve vision transformer performance, defying conventional wisdom and revealing a new scaling law for enhanced visual understanding.
MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
·2615 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Chinese University of Hong Kong
MotionCanvas lets users design cinematic video shots with intuitive controls for camera and object movements, translating scene-space intentions into video animations.
Fast Video Generation with Sliding Tile Attention
·4012 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of California, San Diego
Sliding Tile Attention (STA) boosts video generation speed by 2.43-3.53x without losing quality by exploiting inherent data redundancy in video diffusion models.
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach
·2337 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 University of California, Los Angeles
This paper introduces PointVid, a 3D-aware video generation framework using 3D point regularization to enhance video realism and address common issues like object morphing.
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices
·3325 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Ulsan National Institute of Science and Technology
On-device Sora makes high-quality, diffusion-based text-to-video generation possible on smartphones, overcoming computational and memory limitations through novel techniques.
DynVFX: Augmenting Real Videos with Dynamic Content
·3393 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Weizmann Institute of Science
DynVFX: Effortlessly integrate dynamic content into real videos using simple text prompts. Zero-shot learning and novel attention mechanisms deliver seamless and realistic results.
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
·3510 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Meta AI
VideoJAM enhances video generation by jointly learning appearance and motion representations, achieving state-of-the-art motion coherence.
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
·4621 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Singapore University of Technology and Design
MotionLab: One framework to rule them all! Unifying human motion generation & editing via a novel Motion-Condition-Motion paradigm, boosting efficiency and generalization.
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
·2129 words·10 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 ByteDance
OmniHuman-1: Scaling up one-stage conditioned human animation through novel mixed-condition training.
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
·2423 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Show Lab, National University of Singapore
LayerTracer innovatively synthesizes cognitive-aligned layered SVGs via diffusion transformers, bridging the gap between AI and professional design standards by learning from a novel dataset of sequen…
Inverse Bridge Matching Distillation
·4522 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Skolkovo Institute of Science and Technology
Boosting Diffusion Bridge Models: A new distillation technique accelerates inference speed by 4x to 100x, sometimes even improving image quality!
Improved Training Technique for Latent Consistency Models
·3409 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Rutgers University
Researchers significantly enhance latent consistency models' performance by introducing Cauchy loss, mitigating outlier effects, and employing novel training strategies, thus bridging the gap with dif…
A Study on the Performance of U-Net Modifications in Retroperitoneal Tumor Segmentation
·1561 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 University of British Columbia
ViLU-Net, a novel U-Net modification using Vision-xLSTM, achieves superior retroperitoneal tumor segmentation accuracy and efficiency, exceeding existing state-of-the-art methods.
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
·3227 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Peking University
DiffSplat repurposes 2D image diffusion models to natively generate high-quality 3D Gaussian splats, overcoming limitations in existing 3D generation methods.
iFormer: Integrating ConvNet and Transformer for Mobile Application
·7046 words·34 mins
AI Generated 🤗 Daily Papers Computer Vision Image Classification 🏢 Shanghai Jiao Tong University
iFormer: A new family of mobile hybrid vision networks that expertly blends ConvNeXt's fast local feature extraction with the efficient global modeling of self-attention, achieving top-tier accuracy a…
Relightable Full-Body Gaussian Codec Avatars
·3832 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 ETH Zurich
Relightable Full-Body Gaussian Codec Avatars: Realistic, animatable full-body avatars are now possible using learned radiance transfer and efficient 3D Gaussian splatting.
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
·4575 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanyang Technological University
Video-MMMU benchmark systematically evaluates Large Multimodal Models' knowledge acquisition from videos across multiple disciplines and cognitive stages, revealing significant gaps between human and …
Temporal Preference Optimization for Long-Form Video Understanding
·2626 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Stanford University
Boosting long-form video understanding, Temporal Preference Optimization (TPO) enhances video-LLMs by leveraging preference learning. It achieves this through a self-training method using preference …
Improving Video Generation with Human Feedback
·4418 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Tsinghua University
Human feedback boosts video generation! New VideoReward model & alignment algorithms significantly improve video quality and user prompt alignment, exceeding prior methods.
EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion
·2578 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance
EchoVideo generates high-fidelity, identity-preserving videos by cleverly fusing text and image features, overcoming limitations of prior methods.