
Paper Reviews by AI

2025

An Empirical Study of Autoregressive Pre-training from Videos
·5733 words·27 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 UC Berkeley
Toto, a new autoregressive video model, achieves competitive performance across various benchmarks by pre-training on over 1 trillion visual tokens, demonstrating the effectiveness of scaling video mo…
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
·5517 words·26 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
URSA-7B: A new multimodal model significantly improves chain-of-thought reasoning in mathematics!
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
·2783 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Stability AI
SPAR3D: Fast, accurate single-image 3D reconstruction via a novel two-stage approach using point clouds for high-fidelity mesh generation.
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
·3910 words·19 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Microsoft Research
Small language models can master complex math reasoning using self-evolved deep thinking via Monte Carlo Tree Search, surpassing larger models in performance.
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
·285 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tsinghua University
This paper unveils critical thresholds for efficient visual autoregressive model computation, proving sub-quartic time is impossible beyond a certain input matrix norm while establishing efficient app…
LLM4SR: A Survey on Large Language Models for Scientific Research
·2870 words·14 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 University of Texas at Dallas
LLMs revolutionize scientific research! This survey reveals their transformative potential across hypothesis discovery, experiment planning, writing, and peer review, guiding future research.
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
·2599 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhejiang University
InfiGUIAgent, a novel multimodal GUI agent, leverages a two-stage training pipeline to achieve advanced reasoning and GUI interaction capabilities, outperforming existing models in benchmarks.
EpiCoder: Encompassing Diversity and Complexity in Code Generation
·5051 words·24 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
EpiCoder revolutionizes code generation by using feature trees to create diverse and complex training data, resulting in state-of-the-art performance on various benchmarks.
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
·3036 words·15 mins
AI Generated 🤗 Daily Papers Natural Language Processing Named Entity Recognition 🏢 Boğaziçi University
First-ever resources (NER dataset, dependency treebank, and corpus) and models for historical Turkish NLP are introduced, significantly advancing research capabilities in this underexplored field.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
·4541 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
Sa2VA marries SAM2 and LLaVA for dense grounded image and video understanding, achieving state-of-the-art results on multiple benchmarks.
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
·3721 words·18 mins
AI Generated 🤗 Daily Papers Natural Language Processing Text Generation 🏢 Chinese Academy of Sciences
PPTAgent, a novel two-stage framework, significantly improves automatic presentation generation by leveraging an edit-based workflow and a new evaluation metric, outperforming existing end-to-end meth…
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
·3325 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Electronics and Telecommunications Research Institute
MoDec-GS: a novel framework achieving 70% model size reduction in dynamic 3D Gaussian splatting while improving visual quality by cleverly decomposing complex motions and optimizing temporal intervals…
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
·5398 words·26 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Key Laboratory of Intelligent Information Processing
LLaVA-Mini achieves comparable performance to state-of-the-art LMMs using only one vision token, drastically reducing computational cost and latency.
Entropy-Guided Attention for Private LLMs
·5203 words·25 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 New York University
Boosting private LLMs’ efficiency and security, this research introduces an entropy-guided attention mechanism and PI-friendly layer normalization to mitigate the overheads of nonlinear operations.
Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback
·3489 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Fudan University
DOLPHIN: AI automates scientific research from idea generation to experimental validation.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
·3018 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
·5463 words·26 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Cambridge
Chirpy3D: Generating creative, high-quality 3D birds with intricate details by learning a continuous part latent space from 2D images.
TransPixar: Advancing Text-to-Video Generation with Transparency
·2458 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
·3304 words·16 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Meta
Through-The-Mask uses mask-based motion trajectories to generate realistic videos from images and text, overcoming limitations of existing methods in handling complex multi-object motion.
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
·3762 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Nanjing University
STAR: A novel approach uses text-to-video models for realistic, temporally consistent real-world video super-resolution, improving image quality and detail.