Paper Reviews by AI
2025
Long-Video Audio Synthesis with Multi-Agent Collaboration
·2152 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Audio-Visual Learning
🏢 Hong Kong University of Science and Technology
LVAS-Agent: Multi-agent system conquers long-video audio synthesis with collaborative dubbing, script, design, & more!
Long Context Tuning for Video Generation
·2260 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 The Chinese University of Hong Kong
LCT: Fine-tunes single-shot video diffusion models for coherent multi-shot video generation without extra parameters!
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
·1730 words·9 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Qiyuan Tech
Light-R1: Trains long COT models from scratch using curriculum SFT, DPO, and RL, achieving SOTA performance and strong generalization with limited resources.
LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds
·2424 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Alibaba Group
LHM: Animatable 3D avatars from a single image in seconds.
Large-scale Pre-training for Grounded Video Caption Generation
·2703 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Czech Institute of Informatics, Robotics and Cybernetics
GROVE: Pre-training on large-scale data for grounded video caption generation.
KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation
·1710 words·9 mins·
AI Generated
🤗 Daily Papers
AI Applications
Robotics
🏢 Tsinghua University
KUDA unifies dynamics learning and visual prompting with keypoints for open-vocabulary robot manipulation.
Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
·3607 words·17 mins·
AI Generated
🤗 Daily Papers
Machine Learning
Deep Learning
🏢 University of Central Florida
KArAt: Can Learnable Attention Beat Standard Attention in Vision Transformers?
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
·2562 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Huazhong University of Science & Technology
GroundingSuite: A new benchmark that measures complex multi-granular pixel grounding to overcome current dataset limitations and push forward vision-language understanding.
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
·2532 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 CUHK MMLab
GoT: Reasoning guides vivid image generation and editing!
From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
·1953 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Paris-Saclay University
SPIRE: Adds speech to text-only LLMs, maintaining text performance via discretized speech and continued pre-training.
FlowTok: Flowing Seamlessly Across Text and Image Tokens
·2984 words·15 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Generation
🏢 ByteDance Seed
FlowTok: Seamlessly flows across text and image tokens!
ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
·2550 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Westlake University
ETCH: Equivariantly fitting bodies to clothed humans through tightness for better pose and shape accuracy.
Distilling Diversity and Control in Diffusion Models
·4046 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Northeastern University
Distilling diffusion models?💡 This paper shows you how to retain base model diversity while keeping the distilled model’s speed!
CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
·5298 words·25 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Maryland, College Park
CoSTA*: A cost-sensitive agent that plans a path through AI tools to edit images, balancing quality against cost according to user preferences!
CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance
·1806 words·9 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 ByteDance Intelligent Creation
CINEMA: MLLM-guided coherent multi-subject video generation for consistent and controllable content creation.
Charting and Navigating Hugging Face's Model Atlas
·3697 words·18 mins·
AI Generated
🤗 Daily Papers
Machine Learning
Deep Learning
🏢 School of Computer Science and Engineering
Navigating millions of models is hard. This paper charts Hugging Face, revealing model relationships and attribute predictions.
Autoregressive Image Generation with Randomized Parallel Decoding
·3693 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Westlake University
ARPG: Generates high-quality images with randomized parallel decoding, outperforming existing methods in efficiency, memory, and quality.
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
·2631 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Harvard University
4D LangSplat learns 4D language fields for dynamic scenes using multimodal large language models, enabling time-sensitive open-vocabulary queries.
TPDiff: Temporal Pyramid Video Diffusion Model
·2081 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 National University of Singapore
TPDiff accelerates video diffusion by progressively increasing frame rates during diffusion, optimizing computational efficiency with a novel stage-wise training strategy.
Silent Branding Attack: Trigger-free Data Poisoning Attack on Text-to-Image Diffusion Models
·410 words·2 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 KAIST
New ‘Silent Branding Attack’ poisons text-to-image models, embedding brand logos without text prompts, raising ethical issues for image generation tools.