Paper Reviews by AI
2024
Style-Friendly SNR Sampler for Style-Driven Generation
·4866 words·23 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Seoul National University
Style-friendly SNR sampler biases diffusion model training towards higher noise levels, enabling it to learn and generate images with higher style fidelity.
One to rule them all: natural language to bind communication, perception and action
·1627 words·8 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 University of Milan
AI-powered robots now understand and execute complex natural language commands, adapting seamlessly to dynamic environments thanks to a new architecture integrating LLMs, perception, and planning.
OminiControl: Minimal and Universal Control for Diffusion Transformer
·3446 words·17 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 National University of Singapore
OminiControl: A minimal, universal framework efficiently integrates image conditions into diffusion transformers, enabling diverse and precise control over image generation.
Morph: A Motion-free Physics Optimization Framework for Human Motion Generation
·2160 words·11 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Tencent AI Lab
Morph: a novel motion-free physics optimization framework drastically enhances human motion generation’s physical plausibility using synthetic data, achieving state-of-the-art quality.
MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
·4779 words·23 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Hong Kong Polytechnic University
MolReFlect achieves state-of-the-art molecule-text alignment by using a teacher-student LLM framework that generates fine-grained alignments, improving accuracy and explainability.
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
·2416 words·12 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Nanjing University
This survey paper offers a comprehensive overview of Multimodal Large Language Model (MLLM) evaluation, systematically categorizing benchmarks and methods, and identifying gaps for future research, th…
Material Anything: Generating Materials for Any 3D Object via Diffusion
·4056 words·20 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Northwestern Polytechnical University
Material Anything: Generate realistic materials for ANY 3D object via diffusion!
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
·3534 words·17 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NTU, Singapore
Large multimodal models’ inner workings are demystified using a novel framework that identifies, interprets, and even steers their internal features, opening the door to safer, more reliable AI.
Efficient Long Video Tokenization via Coordinated-based Patch Reconstruction
·2991 words·15 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 KAIST
CoordTok: a novel video tokenizer drastically reduces token count for long videos, enabling memory-efficient training of diffusion models for high-quality, long video generation.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
·2924 words·14 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
AI Applications
Autonomous Vehicles
🏢 Institute of Artificial Intelligence, Huazhong University of Science and Technology
DiffusionDrive: a novel truncated diffusion model achieves real-time, high-quality end-to-end autonomous driving by leveraging multi-mode action distributions and significantly reducing computational …
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
·2221 words·11 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Ajou University
UnifiedCrawl efficiently harvests massive monolingual datasets for low-resource languages from Common Crawl, enabling affordable LLM adaptation via QLoRA, significantly improving performance.
Stable Flow: Vital Layers for Training-Free Image Editing
·2773 words·14 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 Snap Research
Stable Flow achieves diverse, consistent image editing without training by strategically injecting source image features into specific ‘vital’ layers of a diffusion transformer model. This training-f…
SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image Segmentation
·2952 words·14 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Segmentation
🏢 Stanford University
SegBook: a large-scale benchmark, reveals that fine-tuning full-body CT pre-trained models significantly improves performance on various downstream medical image segmentation tasks, particularly for s…
Novel View Extrapolation with Video Diffusion Priors
·2381 words·12 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
3D Vision
🏢 Nanyang Technological University
ViewExtrapolator leverages Stable Video Diffusion to realistically extrapolate novel views far beyond training data, dramatically improving the quality of 3D scene generation.
MyTimeMachine: Personalized Facial Age Transformation
·3186 words·15 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Image Generation
🏢 University of North Carolina at Chapel Hill
MyTimeMachine personalizes facial age transformation using just 50 personal photos, outperforming existing methods by generating re-aged faces that closely match a person’s actual appearance at variou…
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
·2284 words·11 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 Alibaba International Digital Commerce
Marco-01: a novel large reasoning model surpasses existing LLMs by using Chain-of-Thought, Monte Carlo Tree Search, and reflection mechanisms to excel in open-ended problem-solving, particularly in co…
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
·4302 words·21 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Computer Vision
Video Understanding
🏢 Hong Kong University of Science and Technology
MagicDriveDiT generates high-resolution, long street-view videos with precise control, exceeding limitations of previous methods in autonomous driving.
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
·2697 words·13 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
Insight-V: A multi-agent system enhances multi-modal LLMs’ visual reasoning by generating high-quality long-chain reasoning data and employing a two-stage training pipeline, achieving significant perf…
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
·4473 words·21 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Washington
GMAI-VL-5.5M & GMAI-VL: A new multimodal medical dataset and vision-language model achieve state-of-the-art results in various medical tasks.
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
·5261 words·25 mins·
loading
·
loading
AI Generated
🤗 Daily Papers
Natural Language Processing
Large Language Models
🏢 ETH Zurich
LLMs’ hallucinations stem from entity recognition: SAEs reveal model ‘self-knowledge’, causally affecting whether it hallucinates or refuses to answer. This mechanism is even repurposed by chat finet…