Paper Reviews by AI
2025
Investigating Human-Aligned Large Language Model Uncertainty
·1326 words·7 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Vanderbilt University
This research explores how well LLM uncertainty measures align with human uncertainty, finding that Bayesian and top-k entropy measures show promise.
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
·4997 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 National Key Laboratory for Novel Software Technology, Nanjing University
CapArena: A benchmark for detailed image captioning in the LLM era, revealing metric biases and advancing automated evaluation.
Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
·4598 words·22 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Peking University
Being-0: A humanoid robot agent achieves complex tasks by integrating a vision-language model with modular skills, enhancing efficiency and real-time performance.
Basic Category Usage in Vision Language Models
·1339 words·7 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tennessee Tech University
VLMs exhibit human-like object categorization, favoring basic levels and mirroring biological/expertise nuances, suggesting learned cognitive behaviors.
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
·3299 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 UCLA
Reflect-DiT: Inference-time scaling for text-to-image diffusion transformers via in-context reflection.
Hyperbolic Safety-Aware Vision-Language Models
·3785 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Modena and Reggio Emilia, Italy
HySAC: A hyperbolic framework for safety-aware vision-language models, improving content moderation and interpretability.
VGGT: Visual Geometry Grounded Transformer
·3346 words·16 mins·
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 University of Oxford
VGGT: a fast, end-to-end transformer that infers complete 3D scene attributes from multiple views, outperforming optimization-based methods.
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
·222 words·2 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Queen Mary University of London
V-STaR: A new benchmark to evaluate Video-LLMs in video spatio-temporal reasoning, revealing gaps in current models’ understanding.
Towards a Unified Copernicus Foundation Model for Earth Vision
·4400 words·21 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Scene Understanding
🏢 Technical University of Munich
Unified Copernicus Foundation Model for Earth Vision: A multimodal approach to improving the scalability, versatility, and adaptability of Earth observation (EO) models.
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
·2617 words·13 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Zhejiang University
ReCamMaster: Re-shoots a single source video via generative rendering with controllable camera movement, enabling novel perspectives and enhanced video creation.
MTV-Inpaint: Multi-Task Long Video Inpainting
·3551 words·17 mins·
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 City University of Hong Kong
MTV-Inpaint: A unified framework for multi-task long video inpainting, enabling versatile object insertion, scene completion, editing, and removal.
Implicit Bias-Like Patterns in Reasoning Models
·3234 words·16 mins·
AI Generated
🤗 Daily Papers
AI Theory
Fairness
🏢 Washington University in St. Louis
AI reasoning models reveal bias-like patterns, processing association-incompatible info with more computational effort, mirroring human implicit biases.
GKG-LLM: A Unified Framework for Generalized Knowledge Graph Construction
·2910 words·14 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Information Extraction
🏢 Xi'an Jiaotong University
GKG-LLM: Unifying knowledge graph construction with a novel three-stage framework, improving domain adaptation and resource efficiency.
API Agents vs. GUI Agents: Divergence and Convergence
·2038 words·10 mins·
AI Generated
🤗 Daily Papers
AI Applications
Robotics
🏢 Microsoft
API vs. GUI Agents: Understanding the divergence and convergence in LLM-based automation.
Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning
·2655 words·13 mins·
AI Generated
🤗 Daily Papers
AI Applications
Robotics
🏢 Shanghai Jiao Tong University
ADC: Human-collaborative adversarial perturbations make robotic data collection more efficient, reducing data requirements and improving imitation learning robustness.
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
·3847 words·19 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Fudan University
D2PO: World modeling enhances embodied task planning by jointly optimizing state prediction and action selection, leading to more efficient execution.
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
·2529 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Waterloo
VisualWebInstruct: Scales up multimodal instruction data via web search, enhancing VLMs’ reasoning for complex tasks.
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
·2233 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Tsinghua University
UniGoal: A novel framework for universal zero-shot goal-oriented navigation, outperforming task-specific methods with a unified approach.
Transformers without Normalization
·4050 words·20 mins·
AI Generated
🤗 Daily Papers
Machine Learning
Deep Learning
🏢 FAIR, Meta
Transformers can achieve state-of-the-art performance without normalization layers via Dynamic Tanh (DyT), offering a simpler and more efficient alternative.
New Trends for Modern Machine Translation with Large Reasoning Models
·518 words·3 mins·
AI Generated
🤗 Daily Papers
Natural Language Processing
Machine Translation
🏢 University of Edinburgh
Large reasoning models (LRMs) transform machine translation with reasoning, handling context, culture, and nuance for better translations.