Paper Reviews by AI

2024

Personalized Multimodal Large Language Models: A Survey
·599 words·3 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of California, San Diego
This survey reveals the exciting advancements in personalized multimodal large language models (MLLMs), offering a novel taxonomy, highlighting key challenges and applications, ultimately pushing the …
OmniCreator: Self-Supervised Unified Generation with Universal Editing
·5399 words·26 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Hong Kong University of Science and Technology
OmniCreator: Self-supervised unified image+video generation & universal editing.
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
·4800 words·23 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Peking University
Imperfect OCR hinders Retrieval-Augmented Generation (RAG). OHRBench, a new benchmark, reveals this cascading impact, showing that current OCR solutions are insufficient for building high-quality RAG knowledge bases. …
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
·5843 words·28 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Understanding 🏢 CUHK MMLab
AV-Odyssey Bench reveals that current multimodal LLMs struggle with basic audio-visual understanding, prompting the development of a comprehensive benchmark for more effective evaluation.
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
·3550 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
X-Prompt: a novel autoregressive vision-language model achieves universal in-context image generation by efficiently compressing contextual information and using a unified training framework for super…
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
·4300 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 NVIDIA
VLsI: Verbalized Layers-to-Interactions efficiently transfers knowledge from large to small VLMs using layer-wise natural language distillation, achieving significant performance gains without scaling…
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
·3776 words·18 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Department of Electrical and Computer Engineering, North South University
VideoLights: a novel framework for joint video highlight detection & moment retrieval, boosts performance via feature refinement, cross-modal & cross-task alignment, achieving state-of-the-art results…
Towards Universal Soccer Video Understanding
·2836 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
Soccer video understanding gets a major boost with SoccerReplay-1988, the largest multi-modal dataset, and MatchVision, a new visual-language model achieving state-of-the-art performance on event clas…
Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning
·1712 words·9 mins
AI Generated 🤗 Daily Papers Natural Language Processing Text Classification 🏢 Telecom SudParis
Few-shot learning empowers cross-lingual audio abuse detection using pre-trained models, achieving high accuracy in low-resource Indian languages.
TinyFusion: Diffusion Transformers Learned Shallow
·4225 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 National University of Singapore
TinyFusion, a novel learnable depth pruning method, crafts efficient shallow diffusion transformers with superior post-fine-tuning performance, achieving a 2x speedup with less than 7% of the original…
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
·3884 words·19 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 Yandex Research
Switti: a novel scale-wise transformer achieves 7x faster text-to-image generation than state-of-the-art diffusion models, while maintaining competitive image quality.
Structured 3D Latents for Scalable and Versatile 3D Generation
·4249 words·20 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 Tsinghua University
Unified 3D latent representation (SLAT) enables versatile high-quality 3D asset generation, significantly outperforming existing methods.
PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos
·4589 words·22 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 Mohamed Bin Zayed University of Artificial Intelligence
PhysGame benchmark unveils video LLMs’ weaknesses in understanding physical commonsense from gameplay videos, prompting the creation of PhysVLM, a knowledge-enhanced model that outperforms existing mo…
One Shot, One Talk: Whole-body Talking Avatar from a Single Image
·2297 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision 3D Vision 🏢 University of Science and Technology of China
From a single image to a realistic, animatable talking avatar! A novel pipeline combines diffusion models with a hybrid 3DGS-mesh representation, achieving seamless generalization and precise control.
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
·5107 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 UC Los Angeles
OmniFlow: a novel generative model masters any-to-any multi-modal generation, outperforming existing models and offering flexible control!
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
·2333 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 SketchX, CVSSP, University of Surrey
NitroFusion achieves high-fidelity single-step image generation using a dynamic adversarial training approach with a specialized discriminator pool, dramatically improving speed and quality.
Negative Token Merging: Image-based Adversarial Feature Guidance
·2311 words·11 mins
AI Generated 🤗 Daily Papers Computer Vision Image Generation 🏢 University of Washington
NegToMe: image-based adversarial guidance improves image generation diversity and reduces similarity to copyrighted content without any training, simply by using images instead of negative text prompts.
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
·3719 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 South China University of Technology
LSceneLLM boosts large 3D scene understanding by adaptively focusing on task-relevant visual details using LLMs’ visual preferences, surpassing existing methods on multiple benchmarks.
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
·1734 words·9 mins
AI Generated 🤗 Daily Papers Computer Vision Video Understanding 🏢 01.AI
Presto: a novel video diffusion model generates 15-second, high-quality videos with unparalleled long-range coherence and rich content, achieved through a segmented cross-attention mechanism and the L…
Free Process Rewards without Process Labels
·3126 words·15 mins
AI Generated 🤗 Daily Papers Natural Language Processing Large Language Models 🏢 Tsinghua University
Train high-performing Process Reward Models (PRMs) cheaply using only outcome-level labels, eliminating the need for costly step-by-step annotations!