Paper Reviews by AI

2024

Style-Friendly SNR Sampler for Style-Driven Generation
·4866 words·23 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Seoul National University
Style-friendly SNR sampler biases diffusion model training towards higher noise levels, enabling it to learn and generate images with higher style fidelity.
One to rule them all: natural language to bind communication, perception and action
·1627 words·8 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 University of Milan
AI-powered robots now understand and execute complex natural language commands, adapting seamlessly to dynamic environments thanks to a new architecture integrating LLMs, perception, and planning.
OminiControl: Minimal and Universal Control for Diffusion Transformer
·3446 words·17 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 National University of Singapore
OminiControl: A minimal, universal framework efficiently integrates image conditions into diffusion transformers, enabling diverse and precise control over image generation.
Morph: A Motion-free Physics Optimization Framework for Human Motion Generation
·2160 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tencent AI Lab
Morph: a novel motion-free physics optimization framework drastically enhances human motion generation’s physical plausibility using synthetic data, achieving state-of-the-art quality.
MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
·4779 words·23 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Hong Kong Polytechnic University
MolReFlect achieves state-of-the-art molecule-text alignment by using a teacher-student LLM framework that generates fine-grained alignments, improving accuracy and explainability.
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
·2416 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Nanjing University
This survey paper offers a comprehensive overview of Multimodal Large Language Model (MLLM) evaluation, systematically categorizing benchmarks and methods, and identifying gaps for future research, th…
Material Anything: Generating Materials for Any 3D Object via Diffusion
·4056 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Northwestern Polytechnical University
Material Anything: Generate realistic materials for ANY 3D object via diffusion!
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
·3534 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NTU, Singapore
Large multimodal models’ inner workings are demystified using a novel framework that identifies, interprets, and even steers their internal features, opening the door to safer, more reliable AI.
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
·2991 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 KAIST
CoordTok: a novel video tokenizer drastically reduces token count for long videos, enabling memory-efficient training of diffusion models for high-quality, long video generation.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
·2924 words·14 mins
AI Generated 🤗 Daily Papers AI Applications Autonomous Vehicles 🏢 Institute of Artificial Intelligence, Huazhong University of Science and Technology
DiffusionDrive: a novel truncated diffusion model achieves real-time, high-quality end-to-end autonomous driving by leveraging multi-mode action distributions and significantly reducing computational …
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
·2221 words·11 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Ajou University
UnifiedCrawl efficiently harvests massive monolingual datasets for low-resource languages from Common Crawl, enabling affordable LLM adaptation via QLoRA, significantly improving performance.
Stable Flow: Vital Layers for Training-Free Image Editing
·2773 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Snap Research
Stable Flow achieves diverse, consistent image editing without training by strategically injecting source image features into specific ‘vital’ layers of a diffusion transformer model. This training-f…
SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image Segmentation
·2952 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 Stanford University
SegBook, a large-scale benchmark, reveals that fine-tuning full-body CT pre-trained models significantly improves performance on various downstream medical image segmentation tasks, particularly for s…
Novel View Extrapolation with Video Diffusion Priors
·2381 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Nanyang Technological University
ViewExtrapolator leverages Stable Video Diffusion to realistically extrapolate novel views far beyond training data, dramatically improving the quality of 3D scene generation.
MyTimeMachine: Personalized Facial Age Transformation
·3186 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of North Carolina at Chapel Hill
MyTimeMachine personalizes facial age transformation using just 50 personal photos, outperforming existing methods by generating re-aged faces that closely match a person’s actual appearance at variou…
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
·2284 words·11 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Alibaba International Digital Commerce
Marco-o1: a novel large reasoning model surpasses existing LLMs by using Chain-of-Thought, Monte Carlo Tree Search, and reflection mechanisms to excel in open-ended problem-solving, particularly in co…
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
·4302 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
MagicDriveDiT generates high-resolution, long street-view videos with precise control, overcoming the limitations of previous methods in autonomous driving.
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
·2697 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent AI Lab
Insight-V: A multi-agent system enhances multi-modal LLMs’ visual reasoning by generating high-quality long-chain reasoning data and employing a two-stage training pipeline, achieving significant perf…
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
·4473 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Washington
GMAI-VL-5.5M & GMAI-VL: A new multimodal medical dataset and vision-language model achieve state-of-the-art results in various medical tasks.
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
·5261 words·25 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 ETH Zurich
LLMs’ hallucinations stem from entity recognition: SAEs reveal model ‘self-knowledge’, causally affecting whether it hallucinates or refuses to answer. This mechanism is even repurposed by chat finet…