Paper Reviews by AI
2025
MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
·2082 words·10 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Question Answering
🏢 Yale University
MCTS-RAG: Combines Monte Carlo Tree Search with Retrieval-Augmented Generation to enhance small LMs’ reasoning on complex tasks.
FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
·370 words·2 mins
AI Generated
🤗 Daily Papers
Speech and Audio
Speech Recognition
🏢 Stevens Institute of Technology
FINAUDIO: First benchmark for financial audio LLMs, enhancing financial audio analysis and investment decisions.
Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields
·4642 words·22 mins
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 UCLA
Feature4X: 4D agentic AI from monocular video with Gaussian feature fields.
DINeMo: Learning Neural Mesh Models with no 3D Annotations
·1595 words·8 mins
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Johns Hopkins University
DINeMo: Learns 3D models with no 3D annotations, leveraging pseudo-correspondence from visual foundation models for enhanced pose estimation.
BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
·10790 words·51 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Tsinghua University
BIZGEN: Article-level Visual Text Rendering for Infographics Generation
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
·2885 words·14 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Central South University
LongTextAR advances long-text image generation via a novel tokenizer, enabling accurate, controllable, and high-fidelity text rendering in images.
ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
·2349 words·12 mins
AI Generated
🤗 Daily Papers
AI Applications
Autonomous Vehicles
🏢 Zhejiang University
ADS-Edit: Empowering autonomous driving with multimodal knowledge editing!
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
·1979 words·10 mins
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 A-M-Team
Boost LLM reasoning by having models ‘Think Twice’! This novel method iteratively refines answers, significantly enhancing accuracy on complex tasks.
Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals
·4505 words·22 mins
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Stanford University
Opt-CWM: Self-supervised motion learning via counterfactual optimization, achieving state-of-the-art without labels!
Scaling Vision Pre-Training to 4K Resolution
·6421 words·31 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 UC Berkeley
PS3 scales CLIP vision pre-training to 4K resolution with near-constant cost, achieving state-of-the-art performance in multi-modal LLMs.
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
·3935 words·19 mins
AI Generated
🤗 Daily Papers
Machine Learning
Deep Learning
🏢 National University of Singapore
LogQuant: Log-distributed 2-bit quantization of the KV cache with superior accuracy preservation.
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
·2895 words·14 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Shanghai AI Laboratory
MLLMs still struggle with spatial reasoning! LEGO-Puzzles benchmark reveals critical deficiencies, paving the way for AI advancement.
Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing
·2020 words·10 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 KAIST
Inference-time scaling for flow models enhances alignment with user preferences via stochastic generation and budget allocation.
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
·3412 words·17 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 ARC Lab, Tencent PCG
Visually perfect generations aren’t always optimal! GenHancer finds that subtly imperfect generations can greatly improve vision-centric tasks.
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
·3191 words·15 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Chinese Academy of Sciences
HAVEN: A new benchmark to tackle the hallucination issue in video understanding of large multimodal models!
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
·3847 words·19 mins
AI Generated
🤗 Daily Papers
AI Applications
Robotics
🏢 Shanghai AI Lab
Dita: Scales a diffusion transformer for generalist robot policies, enabling 10-shot learning in complex, real-world tasks.
Attention IoU: Examining Biases in CelebA using Attention Maps
·3919 words·19 mins
AI Generated
🤗 Daily Papers
Computer Vision
Image Classification
🏢 Princeton University
Attention-IoU reveals model biases by analyzing attention maps, offering insights beyond dataset labels and improving debiasing techniques.
AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
·2413 words·12 mins
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Beihang University
AccVideo accelerates video diffusion by 8.5x with a synthetic dataset and trajectory-based distillation, maintaining quality and enabling higher resolution video generation.
Video-T1: Test-Time Scaling for Video Generation
·3231 words·16 mins
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Tsinghua University
Video-T1 enhances video generation through test-time scaling, improving quality and consistency by viewing generation as a search for optimal video trajectories.
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
·4635 words·22 mins
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 MBZUAI
Video SimpleQA: A New Benchmark for Factuality Evaluation in Large Video Language Models.