Recent
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
·5843 words·28 mins
AI Generated
Daily Papers
Multimodal Learning
Multimodal Understanding
CUHK MMLab
AV-Odyssey Bench reveals that current multimodal LLMs struggle with basic audio-visual understanding, prompting the development of a comprehensive benchmark for more effective evaluation.
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
·4800 words·23 mins
AI Generated
Daily Papers
Natural Language Processing
Large Language Models
Peking University
Imperfect OCR hinders Retrieval-Augmented Generation (RAG). OHRBench, a new benchmark, reveals this cascading impact, showing that current OCR solutions are insufficient for high-quality RAG knowledge bases. …
OmniCreator: Self-Supervised Unified Generation with Universal Editing
·5399 words·26 mins
AI Generated
Daily Papers
Computer Vision
Video Understanding
Hong Kong University of Science and Technology
OmniCreator: Self-supervised unified image+video generation & universal editing.
Scaling Image Tokenizers with Grouped Spherical Quantization
·7140 words·34 mins
AI Generated
Daily Papers
Computer Vision
Image Generation
Jülich Supercomputing Centre
GSQ-GAN, a novel image tokenizer, achieves superior reconstruction quality with 16x downsampling using grouped spherical quantization, enabling efficient scaling for high-fidelity image generation.
VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation
·2511 words·12 mins
AI Generated
Daily Papers
Computer Vision
Video Understanding
Hong Kong University of Science and Technology
VideoGen-of-Thought (VGoT) creates high-quality, multi-shot videos by collaboratively generating scripts, keyframes, and video clips, ensuring narrative consistency and visual coherence.
Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input
·2871 words·14 mins
AI Generated
Daily Papers
Multimodal Learning
Vision-Language Models
Polytechnic of Turin
AIUTA minimizes user input in instance navigation by leveraging agent self-dialogue and dynamic interaction, achieving state-of-the-art performance.
Free Process Rewards without Process Labels
·3126 words·15 mins
AI Generated
Daily Papers
Natural Language Processing
Large Language Models
Tsinghua University
Train high-performing Process Reward Models (PRMs) cheaply using only outcome-level labels, eliminating the need for costly step-by-step annotations!
Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
·1734 words·9 mins
AI Generated
Daily Papers
Computer Vision
Video Understanding
01.AI
Presto: a novel video diffusion model generates 15-second, high-quality videos with unparalleled long-range coherence and rich content, achieved through a segmented cross-attention mechanism and the L…
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
·3719 words·18 mins
AI Generated
Daily Papers
Multimodal Learning
Vision-Language Models
South China University of Technology
LSceneLLM boosts large 3D scene understanding by adaptively focusing on task-relevant visual details using LLMs’ visual preferences, surpassing existing methods on multiple benchmarks.