Paper Reviews by AI
2025
An Empirical Study of Autoregressive Pre-training from Videos
·5733 words·27 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 UC Berkeley
Toto, a new autoregressive video model, achieves competitive performance across various benchmarks by pre-training on over 1 trillion visual tokens, demonstrating the effectiveness of scaling video mo…
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
·5517 words·26 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Tsinghua University
URSA-7B: A new multimodal model significantly improves chain-of-thought reasoning in mathematics!
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images
·2783 words·14 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Stability AI
SPAR3D: Fast, accurate single-image 3D reconstruction via a novel two-stage approach using point clouds for high-fidelity mesh generation.
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
·3910 words·19 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Microsoft Research
Small language models can master complex math reasoning using self-evolved deep thinking via Monte Carlo Tree Search, surpassing larger models in performance.
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
·285 words·2 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Tsinghua University
This paper unveils critical thresholds for efficient visual autoregressive model computation, proving sub-quartic time is impossible beyond a certain input matrix norm while establishing efficient app…
LLM4SR: A Survey on Large Language Models for Scientific Research
·2870 words·14 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 University of Texas at Dallas
LLMs revolutionize scientific research! This survey reveals their transformative potential across hypothesis discovery, experiment planning, writing, and peer review, guiding future research.
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
·2599 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
InfiGUIAgent, a novel multimodal GUI agent, leverages a two-stage training pipeline to achieve advanced reasoning and GUI interaction capabilities, outperforming existing models in benchmarks.
EpiCoder: Encompassing Diversity and Complexity in Code Generation
·5051 words·24 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Tsinghua University
EpiCoder revolutionizes code generation by using feature trees to create diverse and complex training data, resulting in state-of-the-art performance on various benchmarks.
Building Foundations for Natural Language Processing of Historical Turkish: Resources and Models
·3036 words·15 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Named Entity Recognition
🏢 Boğaziçi University
First-ever resources (NER dataset, dependency treebank, and corpus) and models for historical Turkish NLP are introduced, significantly advancing research capabilities in this underexplored field.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
·4541 words·22 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
Sa2VA marries SAM2 and LLaVA for dense grounded image and video understanding, achieving state-of-the-art results on multiple benchmarks.
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
·3721 words·18 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Text Generation
🏢 Chinese Academy of Sciences
PPTAgent, a novel two-stage framework, significantly improves automatic presentation generation by leveraging an edit-based workflow and a new evaluation metric, outperforming existing end-to-end meth…
MoDec-GS: Global-to-Local Motion Decomposition and Temporal Interval Adjustment for Compact Dynamic 3D Gaussian Splatting
·3325 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Electronics and Telecommunications Research Institute
MoDec-GS: a novel framework achieving 70% model size reduction in dynamic 3D Gaussian splatting while improving visual quality by cleverly decomposing complex motions and optimizing temporal intervals…
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
·5398 words·26 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Key Laboratory of Intelligent Information Processing
LLaVA-Mini achieves comparable performance to state-of-the-art LMMs using only one vision token, drastically reducing computational cost and latency.
Entropy-Guided Attention for Private LLMs
·5203 words·25 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 New York University
Boosting private LLMs’ efficiency and security, this research introduces an entropy-guided attention mechanism and PI-friendly layer normalization to mitigate the overheads of nonlinear operations.
Dolphin: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback
·3489 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Fudan University
DOLPHIN: AI automates scientific research from idea generation to experimental validation.
Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
·3018 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
Diffusion as Shader (DaS) achieves versatile video control by using 3D tracking videos as control signals in a unified video diffusion model, enabling precise manipulation across diverse tasks.
Chirpy3D: Continuous Part Latents for Creative 3D Bird Generation
·5463 words·26 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Cambridge
Chirpy3D: Generating creative, high-quality 3D birds with intricate details by learning a continuous part latent space from 2D images.
TransPixar: Advancing Text-to-Video Generation with Transparency
·2458 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
TransPixar generates high-quality videos with transparency by jointly training RGB and alpha channels, outperforming sequential generation methods.
Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation
·3304 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Meta
Through-The-Mask uses mask-based motion trajectories to generate realistic videos from images and text, overcoming limitations of existing methods in handling complex multi-object motion.
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
·3762 words·18 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Nanjing University
STAR: A novel approach uses text-to-video models for realistic, temporally consistent real-world video super-resolution, improving image quality and detail.