Paper Reviews by AI

Long-Video Audio Synthesis with Multi-Agent Collaboration

13 March 2025·2152 words·11 mins

AI Generated 🤗 Daily Papers Multimodal Learning Audio-Visual Learning 🏢 Hong Kong University of Science and Technology

LVAS-Agent: Multi-agent system conquers long-video audio synthesis with collaborative dubbing, script, design, & more!

Long Context Tuning for Video Generation

13 March 2025·2260 words·11 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 The Chinese University of Hong Kong

LCT: Fine-tunes single-shot video diffusion models for coherent multi-shot video generation without extra parameters!

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

13 March 2025·1730 words·9 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Qiyuan Tech

Light-R1: Trains long COT models from scratch using curriculum SFT, DPO, and RL, achieving SOTA performance and strong generalization with limited resources.

LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

13 March 2025·2424 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Alibaba Group

LHM: Animatable 3D avatars from a single image in seconds.

Large-scale Pre-training for Grounded Video Caption Generation

13 March 2025·2703 words·13 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Czech Institute of Informatics, Robotics and Cybernetics

GROVE: Pre-training on large-scale data for grounded video caption generation.

KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

13 March 2025·1710 words·9 mins

AI Generated 🤗 Daily Papers AI Applications Robotics 🏢 Tsinghua University

KUDA unifies dynamics learning and visual prompting with keypoints for open-vocabulary robot manipulation.

Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

13 March 2025·3607 words·17 mins

AI Generated 🤗 Daily Papers Machine Learning Deep Learning 🏢 University of Central Florida

KArAt: Can Learnable Attention Beat Standard Attention in Vision Transformers?

GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

13 March 2025·2562 words·13 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Huazhong University of Science & Technology

GroundingSuite: A new benchmark that measures complex multi-granular pixel grounding to overcome current dataset limitations and push forward vision-language understanding.

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

13 March 2025·2532 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 CUHK MMLab

GoT: Reasoning guides vivid image generation and editing!

From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

13 March 2025·1953 words·10 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Paris-Saclay University

SPIRE: Adds speech to text-only LLMs, maintaining text performance via discretized speech and continued pre-training.

FlowTok: Flowing Seamlessly Across Text and Image Tokens

13 March 2025·2984 words·15 mins

AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 ByteDance Seed

FlowTok: Seamlessly flows across text and image tokens!

ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

13 March 2025·2550 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Westlake University

ETCH: Equivariantly fitting bodies to clothed humans through tightness for better pose and shape accuracy.

Distilling Diversity and Control in Diffusion Models

13 March 2025·4046 words·19 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Northeastern University

Distilling diffusion models?💡 This paper shows you how to retain base model diversity while keeping the distilled model’s speed!

CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

13 March 2025·5298 words·25 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Maryland, College Park

CoSTA*: A cost-sensitive agent that navigates AI tools to edit images with high quality at low cost, balancing the two according to user preferences!

CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance

13 March 2025·1806 words·9 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 ByteDance Intelligent Creation

CINEMA: MLLM-guided coherent multi-subject video generation for consistent and controllable content creation.

Charting and Navigating Hugging Face's Model Atlas

13 March 2025·3697 words·18 mins

AI Generated 🤗 Daily Papers Machine Learning Deep Learning 🏢 School of Computer Science and Engineering

Navigating millions of models is hard. This paper charts Hugging Face, revealing model relationships and attribute predictions.

Autoregressive Image Generation with Randomized Parallel Decoding

13 March 2025·3693 words·18 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Westlake University

ARPG: Generates high-quality images with randomized parallel decoding, outperforming existing methods in efficiency, memory, and quality.

4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

13 March 2025·2631 words·13 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Harvard University

4D LangSplat learns 4D language fields for dynamic scenes using multimodal large language models, enabling time-sensitive open-vocabulary queries.

TPDiff: Temporal Pyramid Video Diffusion Model

12 March 2025·2081 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 National University of Singapore

TPDiff accelerates video diffusion by progressively increasing frame rates during diffusion, optimizing computational efficiency with a novel stage-wise training strategy.

Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models

12 March 2025·410 words·2 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 KAIST

New ‘Silent Branding Attack’ poisons text-to-image models, embedding brand logos without text prompts, raising ethical issues for image generation tools.