Computer Vision
How Far is Video Generation from World Model: A Physical Law Perspective
·3657 words·18 mins
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Bytedance Research
Scaling video generation models doesn’t guarantee they’ll learn physics; this study reveals they prioritize visual cues over true physical understanding.
GenXD: Generating Any 3D and 4D Scenes
·2731 words·13 mins
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 National University of Singapore
GenXD: A unified model generating high-quality 3D & 4D scenes from any number of images, advancing the field of dynamic scene generation.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
·3142 words·15 mins
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Meta AI
Adaptive Caching (AdaCache) dramatically speeds up video generation with diffusion transformers by cleverly caching and reusing computations, tailoring the process to each video’s complexity and motion.
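A minimal sketch of the general idea of caching across denoising steps: reuse a block’s output when its input has barely changed. The block function, drift metric, and threshold below are illustrative stand-ins, not the paper’s actual AdaCache schedule.

```python
# Illustrative sketch only: reuse a (hypothetical) expensive block's output across
# adjacent denoising steps when its input drift is small. Not the paper's implementation.
import numpy as np

class CachedBlock:
    def __init__(self, block_fn, threshold=0.05):
        self.block_fn = block_fn      # stand-in for an attention/MLP block
        self.threshold = threshold    # how much input drift forces a recompute
        self._last_input = None
        self._last_output = None

    def __call__(self, x):
        if self._last_input is not None:
            drift = np.linalg.norm(x - self._last_input) / (np.linalg.norm(self._last_input) + 1e-8)
            if drift < self.threshold:
                return self._last_output          # cache hit: skip the heavy call
        self._last_input, self._last_output = x.copy(), self.block_fn(x)
        return self._last_output

def expensive_block(x):
    # Stand-in for one transformer block inside a diffusion transformer.
    return np.tanh(x) * 1.5

if __name__ == "__main__":
    block = CachedBlock(expensive_block, threshold=0.05)
    x = np.linspace(-1.0, 1.0, 8)
    for step in range(4):
        _ = block(x + 0.001 * step)   # inputs change slowly across denoising steps
```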
DreamPolish: Domain Score Distillation With Progressive Geometry Generation
·2197 words·11 mins
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Peking University
DreamPolish: A new text-to-3D model generates highly detailed 3D objects with polished surfaces and realistic textures using progressive geometry refinement and a novel domain score distillation technique.
Randomized Autoregressive Visual Generation
·4145 words·20 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 ByteDance
Randomized Autoregressive Modeling (RAR) sets a new state-of-the-art in image generation by cleverly introducing randomness during training to improve the model’s ability to learn from bidirectional contexts.
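A minimal sketch of the randomized-ordering idea: train on randomly permuted token orders early on and anneal back toward the standard raster order. The schedule and token setup are illustrative assumptions, not ByteDance’s actual recipe.

```python
# Illustrative sketch only: pick the token order for one autoregressive training
# step, moving from random permutations toward raster order as training progresses.
import random

def training_order(tokens, progress, rng):
    """`progress` runs from 0.0 (always a random permutation) to 1.0 (always raster order)."""
    if rng.random() < progress:
        return list(tokens)           # standard raster order
    order = list(tokens)
    rng.shuffle(order)                # randomized permutation order
    return order

if __name__ == "__main__":
    rng = random.Random(0)
    image_tokens = list(range(16))    # a tiny 4x4 grid of token ids
    for progress in (0.0, 0.5, 1.0):
        print(progress, training_order(image_tokens, progress, rng))
```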
Constant Acceleration Flow
·3289 words·16 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Korea University
Constant Acceleration Flow (CAF) dramatically speeds up diffusion model generation by using a constant acceleration equation, outperforming state-of-the-art methods with improved accuracy and few-step generation.
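A minimal sketch of the constant-acceleration kinematics behind the few-step update, x(t) = x0 + v0·t + ½·a·t², stepped exactly from noise to data. The closed-form choices of v0 and a (via `accel_scale`) are illustrative; CAF itself learns these quantities from data.

```python
# Illustrative sketch only: walk a constant-acceleration path from a noise sample
# x0 to a data sample x1 in a few exact steps. Not the trained CAF model.
import numpy as np

def few_step_sample(x0, x1, num_steps=2, accel_scale=2.0):
    """Constant acceleration a and initial velocity v0 chosen so that x(1) = x1 exactly."""
    a = accel_scale * (x1 - x0)
    v0 = (x1 - x0) - 0.5 * a
    x, dt = x0.copy(), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v_t = v0 + a * t                      # velocity at the start of the step
        x = x + v_t * dt + 0.5 * a * dt**2    # exact update under constant acceleration
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise, data = rng.standard_normal(4), np.ones(4)
    print(few_step_sample(noise, data, num_steps=2))  # lands exactly on `data`
```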
Learning Video Representations without Natural Videos
·3154 words·15 mins
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ShanghaiTech University
High-performing video representation models can be trained using only synthetic videos and images, eliminating the need for large natural video datasets.
In-Context LoRA for Diffusion Transformers
·392 words·2 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Tongyi Lab
In-Context LoRA empowers existing text-to-image models for high-fidelity multi-image generation by simply concatenating images and using minimal task-specific LoRA tuning.
DELTA: Dense Efficient Long-range 3D Tracking for any video
·3706 words·18 mins
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 UMass Amherst
DELTA: A new method efficiently tracks every pixel in 3D space from monocular videos, enabling accurate motion estimation across entire videos with state-of-the-art accuracy and over 8x speed improvement.
HelloMeme: Integrating Spatial Knitting Attentions to Embed High-Level and Fidelity-Rich Conditions in Diffusion Models
·2152 words·11 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
HelloMeme enhances text-to-image models by integrating spatial knitting attentions, enabling high-fidelity meme video generation while preserving model generalization.
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
·3392 words·16 mins
AI Generated
🤗 Daily Papers
Computer Vision
Visual Question Answering
🏢 University of California, Berkeley
DynaMath, a novel benchmark, reveals that state-of-the-art VLMs struggle with variations of simple math problems, showcasing their reasoning fragility. It offers 501 high-quality seed questions that can be dynamically varied to generate new problem instances.
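A minimal sketch of the seed-question idea: a question written as a small program so that concrete variants and their ground-truth answers can be regenerated on demand. The example question is hypothetical, not one of the benchmark’s 501 seeds.

```python
# Illustrative sketch only: a hypothetical seed question that produces many
# concrete variants, in the spirit of DynaMath's dynamic generation.
import random

def seed_question(rng: random.Random):
    """Generate one variant of a simple slope question plus its ground-truth answer."""
    x1, y1 = rng.randint(0, 5), rng.randint(0, 5)
    x2 = x1 + rng.randint(1, 5)           # keep x2 != x1 so the slope is defined
    y2 = rng.randint(0, 10)
    question = f"A line passes through ({x1}, {y1}) and ({x2}, {y2}). What is its slope?"
    answer = (y2 - y1) / (x2 - x1)
    return question, answer

if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(3):                    # three dynamically generated variants
        q, a = seed_question(rng)
        print(q, "->", round(a, 3))
```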