Multimodal Learning
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
·4915 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
·4251 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Pennsylvania
CoSyn: Code-guided synthetic data for scaling text-rich image understanding, achieving SOTA via targeted multimodal data generation!
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC
·2325 words·11 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 MAIS, Institute of Automation, Chinese Academy of Sciences, China
PC-Agent: A new hierarchical multi-agent framework that improves complex task automation on PCs by 32%!
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO
·402 words·2 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Menlo Research
AlphaMaze enhances LLMs’ spatial intelligence via GRPO, achieving 93% accuracy in maze navigation and showing emergent reasoning.
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
·5226 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Sydney
RealSyn: A new, scalable multimodal dataset advances vision-language learning by effectively using interleaved image-text documents.
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
·2594 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
mmMamba: a novel framework creates linear-complexity multimodal models via distillation, drastically improving efficiency without sacrificing performance.
Magma: A Foundation Model for Multimodal AI Agents
·5533 words·26 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Embodied AI
🏢 Microsoft Research
Magma: a new foundation model for multimodal AI agents excels at bridging verbal and spatial intelligence, achieving state-of-the-art performance across various tasks, including UI navigation and robotic manipulation.
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
·4398 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Tsinghua University
video-SALMONN-o1: An open-source audio-visual LLM enhances video understanding with a novel reasoning-intensive dataset and the pDPO method, achieving significant accuracy gains.
InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
·1563 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Reallm Labs
InfiR: Efficient, small AI models rival larger ones in reasoning, slashing costs and boosting privacy for wider AI use.
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
·2102 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
HermesFlow seamlessly bridges the understanding-generation gap in MLLMs using a novel Pair-DPO framework and self-play optimization on homologous data, achieving significant performance improvements.
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
·4310 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
HealthGPT: A novel medical vision-language model unifying comprehension and generation via heterogeneous knowledge adaptation, achieving superior performance on various medical tasks.
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
·2430 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Cambridge
ZeroBench: a new visual reasoning benchmark that proves impossible for current large multimodal models, pushing the boundaries of AI visual understanding.
Exploring the Potential of Encoder-free Architectures in 3D LMMs
·3414 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Northwestern Polytechnical University
Encoder-free 3D LMMs match the state of the art, achieving results comparable to significantly larger models.
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
·3464 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Hong Kong University of Science and Technology
ThinkDiff empowers text-to-image diffusion models with multimodal reasoning by aligning vision-language models to an LLM decoder, achieving state-of-the-art results on in-context reasoning benchmarks.
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
·5073 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Dalian University of Technology (DLUT)
EVEv2.0: A novel encoder-free vision-language model outperforms existing approaches by using a divide-and-conquer architecture and a data-efficient training strategy, achieving strong vision-reasoning performance.
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
·3420 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
Show-o Turbo dramatically speeds up multimodal understanding and generation by leveraging parallel decoding and consistency distillation, achieving significant performance gains with fewer sampling steps.
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
·5172 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA Research
QLIP: A new visual tokenizer unifying autoregressive multimodal understanding & generation with state-of-the-art reconstruction and zero-shot performance!
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
·2102 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Ola: a novel 7B parameter omni-modal language model achieves state-of-the-art performance across image, video, and audio tasks using a progressive modality alignment training strategy.
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
·4880 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Rutgers University
VISTA steers LVLMs away from hallucinations by cleverly adjusting token rankings during inference, improving visual grounding and semantic coherence.
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
·3250 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Singapore University of Technology and Design
GPT models’ multimodal reasoning abilities are tracked over time on challenging visual puzzles, revealing surprisingly steady improvement and cost trade-offs.