Paper Reviews by AI
2025
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
·2626 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of Liverpool
KDTalker: Accurate & efficient audio-driven talking portrait via implicit keypoint-based spatiotemporal diffusion, unlocking diverse & realistic animations.
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
·5687 words·27 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 KAIST AI
SIGHTATION: A BLV-aligned dataset utilizing sighted user feedback to enhance diagram descriptions generated by VLMs, improving accessibility for visually impaired learners.
Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation
·2806 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Hong Kong University of Science and Technology
Rewards Are Enough!
Pensez: Less Data, Better Reasoning -- Rethinking French LLM
·3508 words·17 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Université Grenoble Alpes
Pensez: Strategic fine-tuning beats massive data for superior reasoning in French LLMs, challenging conventional wisdom.
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
·5602 words·27 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Apple
MM-Spatial enhances multimodal LLMs with 3D spatial reasoning via a novel dataset and benchmark, improving performance on spatial understanding tasks.
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
·2607 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Nanjing University
TVC mitigates visual forgetting in multimodal LLMs, enhancing reasoning by strategically re-introducing and compressing visual information.
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
·4473 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Stanford University
MicroVQA: A new benchmark to test visual-question-answering in microscopy-based research.
Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
·2576 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Shanghai Artificial Intelligence Laboratory
Infinite Mobility: Procedural generation of high-fidelity articulated objects for scalable embodied AI training.
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
·5931 words·28 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Fudan University
ADR balances visual-language models by adaptively calibrating long-tail data, boosting LLaVA 1.5 by 4.36% without increasing training data volume.
Free-form language-based robotic reasoning and grasping
·1651 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Fondazione Bruno Kessler
FreeGrasp: enabling robots to grasp by interpreting instructions and reasoning about object spatial relationships.
Edit Transfer: Learning Image Editing via Vision In-Context Relations
·3168 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Communication University of China
Edit Transfer: Learns an image edit from a single example and applies it to new images, surpassing text/reference-based methods!
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
·3109 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Zhejiang University
DreamRenderer: Taming attribute control in large-scale text-to-image models with a plug-and-play, training-free approach for enhanced content creation.
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
·2841 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
DeepPerception enhances MLLMs with cognitive visual perception, achieving superior grounding through knowledge integration & reasoning.
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
·2181 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
BlobCtrl: Precisely edit images at the element level with a unified, flexible framework, bridging the gap between generation and editing.
Φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
·3341 words·16 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Shanghai AI Lab
Φ-Decoding: Adaptive foresight sampling balances inference-time exploration and exploitation for better LLM reasoning.
STEVE: A Step Verification Pipeline for Computer-use Agent Training
·3895 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 CUHK
STEVE: Step-verifying computer-use agent training.
PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models
·4158 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Understanding
🏢 HIT
PEBench: A new benchmark for machine unlearning in multimodal language models, enhancing secure multimodal model development.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
·3237 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 NUS
A comprehensive survey of multimodal chain-of-thought (MCoT) reasoning, bridging the gap in existing literature and fostering innovation towards multimodal AGI.
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
·2497 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 HIT
MPBench: A multimodal benchmark for identifying errors in reasoning processes.
MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
·2743 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Zhejiang University
MagicID: ID-consistent & dynamic-preserved video customization via hybrid preference optimization.