Paper Reviews by AI
2025
XAttention: Block Sparse Attention with Antidiagonal Scoring
·2960 words·14 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Tsinghua University
XAttention: Antidiagonal scoring unlocks block-sparse attention, slashing compute costs in long-context Transformers without sacrificing accuracy.
When Less is Enough: Adaptive Token Reduction for Efficient Image Representation
·2005 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Visual Question Answering
🏢 AIRI
Efficient image representation via adaptive token reduction.
VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling
·1204 words·6 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 EverEx
VideoRFSplat: Direct text-to-3D Gaussian Splatting with flexible pose and multi-view joint modeling, bypassing SDS refinement!
Unleashing Vecset Diffusion Model for Fast Shape Generation
·3881 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 MMLab, CUHK
FlashVDM enables fast 3D shape generation by accelerating both VAE decoding and diffusion sampling.
Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens
·3099 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 DP Technology
Uni-3DAR: Autoregressive framework unifies 3D generation/understanding, compressing spatial tokens for faster, versatile AI.
Ultra-Resolution Adaptation with Ease
·2457 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 National University of Singapore
URA: Ultra-resolution adaptation made easy! Uses synthetic data & minor weight tuning for efficient, high-res text-to-image diffusion models.
Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering
·1842 words·9 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Question Answering
🏢 Pohang University of Science and Technology
Typed-RAG enhances non-factoid QA by type-aware decomposition, refining retrieval and generation for nuanced, user-aligned answers.
Tokenize Image as a Set
·3037 words·15 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of Science and Technology of China
TokenSet: Tokenizing images as unordered sets for dynamic capacity allocation and robust generation, breaking from fixed-position latent codes.
Survey on Evaluation of LLM-based Agents
·396 words·2 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hebrew University of Jerusalem
A comprehensive survey on evaluation methodologies for LLM-based agents, analyzing benchmarks and frameworks across key dimensions like capabilities, applications, and generalist performance.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
·3774 words·18 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Rice University
A survey of model-, output-, and prompt-based strategies for efficient reasoning in LLMs, mitigating ‘overthinking’ for faster, cheaper real-world applications.
Sonata: Self-Supervised Learning of Reliable Point Representations
·2429 words·12 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Hong Kong
Sonata: Reliable 3D point cloud self-supervised learning through self-distillation, achieving SOTA with less data.
Scale-wise Distillation of Diffusion Models
·3863 words·19 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Yandex Research
SWD: Scale-wise distillation of diffusion models achieves faster image generation by upscaling resolution during denoising, outperforming counterparts with similar computation.
SALT: Singular Value Adaptation with Low-Rank Transformation
·1957 words·10 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Segmentation
🏢 Mohamed Bin Zayed University of Artificial Intelligence
SALT: Fine-tuning SAM for medical images using Singular Value Adaptation with Low-Rank Transformation for efficient, robust segmentation.
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
·1719 words·9 mins·
AI Generated
🤗 Daily Papers
Machine Learning
Reinforcement Learning
🏢 VNU University of Science, Vietnam
RL fine-tuning enhances reasoning in small LLMs, achieving competitive performance with limited resources, despite optimization & length challenges.
Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models
·3300 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Westlake University
VidKV: Achieves 1.x-bit KV cache quantization for VideoLLMs, maintaining performance without retraining.
NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
·4268 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Simon Fraser University
NuiScene: Enables efficient & unbounded outdoor scene generation by encoding scene chunks as uniform vector sets and outpainting.
MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion
·2769 words·13 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Renmin University of China
MathFusion: Instruction Fusion enhances LLM’s math problem-solving!
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
·4169 words·20 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Fudan University
MagicMotion: A controllable video generation framework enabling precise object motion control through dense-to-sparse trajectory guidance.
M3: 3D-Spatial MultiModal Memory
·2710 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 UC San Diego
M3: Gaussian-integrated memory system for multimodal 3D scene understanding with foundation models.
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
·2805 words·14 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
ActVLP: Enhancing VLMs through visual-linguistic guidance for superior action-based decision-making in interactive environments.