2025-03-27

2025

ViLBench: A Suite for Vision-Language Process Reward Modeling
·373 words·2 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 UC Santa Cruz
ViLBench: Vision-Language Process Reward Modeling Suite
Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models
·393 words·2 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 KAIST
Fixing fine-tuned diffusion models! With richer unconditional priors, fine-tuned models generate better images and videos.
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
·1746 words·9 mins
AI Generated 🤗 Daily Papers Natural Language Processing Question Answering 🏢 University of Washington
Open Deep Search (ODS): Democratizing Search with Open-source Reasoning Agents.
MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search
·2082 words·10 mins
AI Generated 🤗 Daily Papers Natural Language Processing Question Answering 🏢 Yale University
MCTS-RAG: Combines Monte Carlo Tree Search with Retrieval-Augmented Generation to enhance small LMs’ reasoning on complex tasks.
DINeMo: Learning Neural Mesh Models with no 3D Annotations
·1595 words·8 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Johns Hopkins University
DINeMo: Learns 3D models with no 3D annotations, leveraging pseudo-correspondence from visual foundation models for enhanced pose estimation.
BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation
·10790 words·51 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tsinghua University
BizGen: Article-level Visual Text Rendering for Infographics Generation
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
·2885 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Central South University
LongTextAR advances long-text image generation via a novel tokenizer, enabling accurate, controllable, and high-fidelity text rendering in images.
ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
·2349 words·12 mins
AI Generated 🤗 Daily Papers AI Applications Autonomous Vehicles 🏢 Zhejiang University
ADS-Edit: Empowering autonomous driving with multimodal knowledge editing!
Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals
·4505 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Stanford University
Opt-CWM: Self-supervised motion learning via counterfactual optimization, achieving state-of-the-art without labels!
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
·3935 words·19 mins
AI Generated 🤗 Daily Papers Machine Learning Deep Learning 🏢 National University of Singapore
LogQuant: 2-bit quantization for KV cache, superior accuracy!
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
·2895 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Shanghai AI Laboratory
MLLMs still struggle with spatial reasoning! LEGO-Puzzles benchmark reveals critical deficiencies, paving the way for AI advancement.
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
·3412 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 ARC Lab, Tencent PCG
Visually perfect generations aren’t always optimal! GenHancer finds that subtly imperfect generations can greatly improve vision-centric tasks.
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
·3847 words·19 mins
AI Generated 🤗 Daily Papers AI Applications Robotics 🏢 Shanghai AI Lab
Dita: Scales a diffusion transformer for generalist robot policies, enabling 10-shot learning in complex, real-world tasks.
Attention IoU: Examining Biases in CelebA using Attention Maps
·3919 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Classification 🏢 Princeton University
Attention-IoU reveals model biases by analyzing attention maps, offering insights beyond dataset labels and improving debiasing techniques.
AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
·2413 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Beihang University
AccVideo accelerates video diffusion by 8.5x with a synthetic dataset and trajectory-based distillation, maintaining quality and enabling higher resolution video generation.
PathoHR: Breast Cancer Survival Prediction on High-Resolution Pathological Images
·1466 words·7 mins
AI Generated 🤗 Daily Papers AI Applications Healthcare 🏢 XJLTU
PathoHR: Boost breast cancer survival prediction with high-resolution pathology images!
Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image
·2762 words·13 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Oxford
Motion blur, usually a problem, is now a solution! This paper estimates camera motion from motion-blurred images, acting like an IMU.