Paper Reviews by AI
2025
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
·2626 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 University of Liverpool
KDTalker: Accurate & efficient audio-driven talking portrait via implicit keypoint-based spatiotemporal diffusion, unlocking diverse & realistic animations.
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
·5687 words·27 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 KAIST AI
SIGHTATION: A BLV-aligned dataset utilizing sighted user feedback to enhance diagram descriptions generated by VLMs, improving accessibility for visually impaired learners.
Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation
·2806 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Hong Kong University of Science and Technology
Rewards Are Enough!
Pensez: Less Data, Better Reasoning -- Rethinking French LLM
·3508 words·17 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Université Grenoble Alpes
Pensez: Strategic fine-tuning beats massive data for superior reasoning in French LLMs, challenging conventional wisdom.
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
·5602 words·27 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Apple
MM-Spatial enhances multimodal LLMs with 3D spatial reasoning via a novel dataset and benchmark, improving performance on spatial understanding tasks.
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
·2607 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Nanjing University
TVC mitigates visual forgetting in multimodal LLMs, enhancing reasoning by strategically re-introducing and compressing visual information.
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
·4473 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Stanford University
MicroVQA: A new benchmark to test visual-question-answering in microscopy-based research.
Infinite Mobility: Scalable High-Fidelity Synthesis of Articulated Objects via Procedural Generation
·2576 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Shanghai Artificial Intelligence Laboratory
Infinite Mobility: Procedural generation of high-fidelity articulated objects for scalable embodied AI training.
From Head to Tail: Towards Balanced Representation in Large Vision-Language Models through Adaptive Data Calibration
·5931 words·28 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Fudan University
ADR balances visual-language models by adaptively calibrating long-tail data, boosting LLaVA 1.5 by 4.36% without increasing training data volume.
Free-form language-based robotic reasoning and grasping
·1651 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Fondazione Bruno Kessler
FreeGrasp: enabling robots to grasp by interpreting instructions and reasoning about object spatial relationships.
Edit Transfer: Learning Image Editing via Vision In-Context Relations
·3168 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Communication University of China
Edit Transfer: Learns an image edit from a single example and applies it to new images, surpassing text/reference-based methods!
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
·3109 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Zhejiang University
DreamRenderer: Taming attribute control in large-scale text-to-image models with a plug-and-play, training-free approach for enhanced content creation.
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
·2841 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
DeepPerception enhances MLLMs with cognitive visual perception, achieving superior grounding through knowledge integration & reasoning.
BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing
·2181 words·11 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Peking University
BlobCtrl: Precisely edit images at the element level with a unified, flexible framework, bridging the gap between generation and editing.
Φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
·3341 words·16 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Shanghai AI Lab
Φ-Decoding: Adaptive foresight sampling balances inference-time exploration and exploitation for better LLM reasoning.
STEVE: A Step Verification Pipeline for Computer-use Agent Training
·3895 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 CUHK
STEVE: Step-verifying computer-use agent training.
PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models
·4158 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Understanding
🏢 HIT
PEBench: A new benchmark for machine unlearning in multimodal language models, enhancing secure multimodal model development.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
·3237 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 NUS
A comprehensive survey of multimodal chain-of-thought (MCoT) reasoning, bridging the gap in existing literature and fostering innovation towards multimodal AGI.
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
·2497 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 HIT
MPBench: A multimodal benchmark for identifying errors in reasoning processes.
MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
·2743 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Zhejiang University
MagicID: ID-consistent & dynamic-preserved video customization via hybrid preference optimization.