Paper Reviews by AI

2024

Morph: A Motion-free Physics Optimization Framework for Human Motion Generation
·2160 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Tencent AI Lab
Morph: a novel motion-free physics optimization framework drastically enhances human motion generation’s physical plausibility using synthetic data, achieving state-of-the-art quality.
MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts
·4779 words·23 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Hong Kong Polytechnic University
MolReFlect achieves state-of-the-art molecule-text alignment by using a teacher-student LLM framework that generates fine-grained alignments, improving accuracy and explainability.
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
·2416 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Nanjing University
This survey paper offers a comprehensive overview of Multimodal Large Language Model (MLLM) evaluation, systematically categorizing benchmarks and methods, and identifying gaps for future research, th…
Material Anything: Generating Materials for Any 3D Object via Diffusion
·4056 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Northwestern Polytechnical University
Material Anything generates realistic materials for any 3D object via diffusion.
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
·3534 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NTU, Singapore
Large multimodal models’ inner workings are demystified using a novel framework that identifies, interprets, and even steers their internal features, opening the door to safer, more reliable AI.
Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction
·2991 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 KAIST
CoordTok: a novel video tokenizer drastically reduces token count for long videos, enabling memory-efficient training of diffusion models for high-quality, long video generation.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
·2924 words·14 mins
AI Generated 🤗 Daily Papers AI Applications Autonomous Vehicles 🏢 Institute of Artificial Intelligence, Huazhong University of Science and Technology
DiffusionDrive: a novel truncated diffusion model achieves real-time, high-quality end-to-end autonomous driving by leveraging multi-mode action distributions and significantly reducing computational …
UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages
·2221 words·11 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Ajou University
UnifiedCrawl efficiently harvests massive monolingual datasets for low-resource languages from Common Crawl, enabling affordable LLM adaptation via QLoRA, significantly improving performance.
Stable Flow: Vital Layers for Training-Free Image Editing
·2773 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Snap Research
Stable Flow achieves diverse, consistent image editing without training by strategically injecting source image features into specific ‘vital’ layers of a diffusion transformer model. This training-f…
SegBook: A Simple Baseline and Cookbook for Volumetric Medical Image Segmentation
·2952 words·14 mins
AI Generated 🤗 Daily Papers Computer Vision Image Segmentation 🏢 Stanford University
SegBook: a large-scale benchmark reveals that fine-tuning full-body CT pre-trained models significantly improves performance on various downstream medical image segmentation tasks, particularly for s…
Novel View Extrapolation with Video Diffusion Priors
·2381 words·12 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Nanyang Technological University
ViewExtrapolator leverages Stable Video Diffusion to realistically extrapolate novel views far beyond training data, dramatically improving the quality of 3D scene generation.
MyTimeMachine: Personalized Facial Age Transformation
·3186 words·15 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of North Carolina at Chapel Hill
MyTimeMachine personalizes facial age transformation using just 50 personal photos, outperforming existing methods by generating re-aged faces that closely match a person’s actual appearance at variou…
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
·2284 words·11 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Alibaba International Digital Commerce
Marco-o1: a novel large reasoning model surpasses existing LLMs by using Chain-of-Thought, Monte Carlo Tree Search, and reflection mechanisms to excel in open-ended problem-solving, particularly in co…
MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
·4302 words·21 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
MagicDriveDiT generates high-resolution, long street-view videos with precise control, exceeding limitations of previous methods in autonomous driving.
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
·2697 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent AI Lab
Insight-V: A multi-agent system enhances multi-modal LLMs’ visual reasoning by generating high-quality long-chain reasoning data and employing a two-stage training pipeline, achieving significant perf…
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
·4473 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Washington
GMAI-VL-5.5M & GMAI-VL: A new multimodal medical dataset and vision-language model achieve state-of-the-art results in various medical tasks.
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
·5261 words·25 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 ETH Zurich
LLMs’ hallucinations stem from entity recognition: sparse autoencoders (SAEs) reveal the model’s ‘self-knowledge’, causally affecting whether it hallucinates or refuses to answer. This mechanism is even repurposed by chat finet…
When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
·4011 words·19 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 National University of Singapore
AnchorAttention enhances long-context LLMs by mitigating BFloat16’s disruptive effects on RoPE, improving performance and speeding up training.
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
·3767 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong Baptist University
VideoAutoArena automates large multimodal model (LMM) evaluation using simulated users, offering a cost-effective and scalable solution compared to traditional human annotation.
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
·3966 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Nanyang Technological University
VBench++: A new benchmark suite meticulously evaluates video generative models across 16 diverse dimensions, aligning with human perception for improved model development and fairer comparisons.