Paper Reviews by AI

XAttention: Block Sparse Attention with Antidiagonal Scoring

20 March 2025·2960 words·14 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University

XAttention: Antidiagonal scoring unlocks block-sparse attention, slashing compute costs in long-context Transformers without sacrificing accuracy.

When Less is Enough: Adaptive Token Reduction for Efficient Image Representation

20 March 2025·2005 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision Visual Question Answering 🏢 AIRI

Efficient image representation via adaptive token reduction.

VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

20 March 2025·1204 words·6 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 EverEx

VideoRFSplat: Direct text-to-3D Gaussian Splatting with flexible pose and multi-view joint modeling, bypassing SDS refinement!

Unleashing Vecset Diffusion Model for Fast Shape Generation

20 March 2025·3881 words·19 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 MMLab, CUHK

FlashVDM enables fast 3D shape generation by accelerating both VAE decoding and diffusion sampling.

Uni-3DAR: Unified 3D Generation and Understanding via Autoregression on Compressed Spatial Tokens

20 March 2025·3099 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 DP Technology

Uni-3DAR: An autoregressive framework that unifies 3D generation and understanding by operating on compressed spatial tokens.

Ultra-Resolution Adaptation with Ease

20 March 2025·2457 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 National University of Singapore

URA: Ultra-resolution adaptation made easy! Uses synthetic data & minor weight tuning for efficient, high-res text-to-image diffusion models.

Typed-RAG: Type-aware Multi-Aspect Decomposition for Non-Factoid Question Answering

20 March 2025·1842 words·9 mins

AI Generated 🤗 Daily Papers Natural Language Processing Question Answering 🏢 Pohang University of Science and Technology

Typed-RAG enhances non-factoid QA by type-aware decomposition, refining retrieval and generation for nuanced, user-aligned answers.

Tokenize Image as a Set

20 March 2025·3037 words·15 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Science and Technology of China

TokenSet: Tokenizing images as unordered sets for dynamic capacity allocation and robust generation, breaking from fixed-position latent codes.

Survey on Evaluation of LLM-based Agents

20 March 2025·396 words·2 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Hebrew University of Jerusalem

A comprehensive survey on evaluation methodologies for LLM-based agents, analyzing benchmarks and frameworks across key dimensions like capabilities, applications, and generalist performance.

Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

20 March 2025·3774 words·18 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Rice University

A survey of model-based, output-based, and prompt-based strategies for efficient reasoning in LLMs, mitigating ‘overthinking’ to make reasoning faster and cheaper in real-world applications.

Sonata: Self-Supervised Learning of Reliable Point Representations

20 March 2025·2429 words·12 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Hong Kong

Sonata: Reliable 3D point cloud self-supervised learning through self-distillation, achieving SOTA with less data.

Scale-wise Distillation of Diffusion Models

20 March 2025·3863 words·19 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Yandex Research

SWD: Scale-wise distillation of diffusion models achieves faster image generation by upscaling resolution during denoising, outperforming counterparts with similar computation.

SALT: Singular Value Adaptation with Low-Rank Transformation

20 March 2025·1957 words·10 mins

AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 Mohamed Bin Zayed University of Artificial Intelligence

SALT: Fine-tuning SAM for medical images using Singular Value Adaptation with Low-Rank Transformation for efficient, robust segmentation.

Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

20 March 2025·1719 words·9 mins

AI Generated 🤗 Daily Papers Machine Learning Reinforcement Learning 🏢 VNU University of Science, Vietnam

RL fine-tuning enhances reasoning in small LLMs, achieving competitive performance with limited resources, despite optimization & length challenges.

Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models

20 March 2025·3300 words·16 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Westlake University

VidKV: Achieves 1.x-bit KV cache quantization for VideoLLMs, maintaining performance without retraining.

NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

20 March 2025·4268 words·21 mins

AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Simon Fraser University

NuiScene: Enables efficient & unbounded outdoor scene generation by encoding scene chunks as uniform vector sets and outpainting.

MathFusion: Enhancing Mathematic Problem-solving of LLM through Instruction Fusion

20 March 2025·2769 words·13 mins

AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Renmin University of China

MathFusion: Instruction Fusion enhances LLM’s math problem-solving!

MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance

20 March 2025·4169 words·20 mins

AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Fudan University

MagicMotion: A controllable video generation framework enabling precise object motion control through dense-to-sparse trajectory guidance.

M3: 3D-Spatial MultiModal Memory

20 March 2025·2710 words·13 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 UC San Diego

M3: Gaussian-integrated memory system for multimodal 3D scene understanding with foundation models.

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

20 March 2025·2805 words·14 mins

AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University

ActVLP: Enhancing VLMs through visual-linguistic guidance for superior action-based decision-making in interactive environments.