Vision-Language Models
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
·4915 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
SigLIP 2: Multilingual vision-language encoders delivering improved semantic understanding, localization, and dense features.
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
·4251 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Pennsylvania
CoSyn: Code-guided synthetic data for scaling text-rich image understanding, achieving SOTA via targeted multimodal data generation!
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO
·402 words·2 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Menlo Research
AlphaMaze enhances LLMs’ spatial intelligence via GRPO, achieving 93% accuracy in maze navigation and showing emergent reasoning.
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
·5226 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Sydney
RealSyn: A new, scalable multimodal dataset revolutionizes vision-language learning by effectively using interleaved image-text documents.
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
·2594 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
mmMamba: a novel framework creates linear-complexity multimodal models via distillation, drastically improving efficiency without sacrificing performance.
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
·2102 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
HermesFlow seamlessly bridges the understanding-generation gap in MLLMs using a novel Pair-DPO framework and self-play optimization on homologous data, achieving significant performance improvements.
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
·4310 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
HealthGPT: A novel medical vision-language model unifying comprehension and generation via heterogeneous knowledge adaptation, achieving superior performance on various medical tasks.
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
·2430 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Cambridge
ZeroBench, a new visual reasoning benchmark, proves impossible for current large multimodal models, pushing the boundaries of AI visual understanding.
Exploring the Potential of Encoder-free Architectures in 3D LMMs
·3414 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Northwestern Polytechnical University
Encoder-free 3D LMMs outperform state-of-the-art approaches, achieving results comparable to those of significantly larger models.
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
·5073 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Dalian University of Technology
EVEv2.0: A novel encoder-free vision-language model that outperforms existing approaches via a divide-and-conquer architecture and a data-efficient training strategy, achieving strong vision-reasoning performance.
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
·3420 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
Show-o Turbo dramatically speeds up multimodal understanding and generation by leveraging parallel decoding and consistency distillation, achieving significant performance gains with fewer sampling steps.
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
·5172 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA Research
QLIP: A new visual tokenizer unifying autoregressive multimodal understanding & generation with state-of-the-art reconstruction and zero-shot performance!
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
·2102 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Ola: a novel 7B parameter omni-modal language model achieves state-of-the-art performance across image, video and audio tasks using a progressive modality alignment training strategy.
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
·4880 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Rutgers University
VISTA steers LVLMs away from hallucinations by cleverly adjusting token rankings during inference, improving visual grounding and semantic coherence.
Baichuan-Omni-1.5 Technical Report
·3756 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Baichuan Inc.
Baichuan-Omni-1.5: An open-source omni-modal LLM achieving SOTA performance across multiple modalities.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
·4124 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 DAMO Academy, Alibaba Group
VideoLLaMA 3: Vision-centric training yields state-of-the-art image & video understanding!
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
·4361 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
FilmAgent: A multi-agent framework automates end-to-end virtual film production using LLMs, exceeding single-agent performance in a collaborative workflow.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
·2690 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-Reward: A novel multi-modal reward model boosting Large Vision Language Model performance.
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
·3786 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
New multimodal safety test suite (MSTS) reveals vision-language models’ vulnerabilities and underscores the unique challenges of multimodal inputs.
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
·3561 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
Multimodal LLMs can now evaluate art aesthetics with human-level accuracy using a novel dataset (MM-StyleBench) and prompting method (ArtCoT), significantly improving AI alignment in artistic evaluation.