Multimodal Learning
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
·4310 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Peking University
HealthGPT: A novel medical vision-language model unifying comprehension and generation via heterogeneous knowledge adaptation, achieving superior performance on various medical tasks.
ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models
·2430 words·12 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Cambridge
ZeroBench, a new visual reasoning benchmark, proves impossible for current large multimodal models, pushing the boundaries of AI visual understanding.
Exploring the Potential of Encoder-free Architectures in 3D LMMs
·3414 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Northwestern Polytechnical University
Encoder-free 3D LMMs rival the state of the art, achieving results comparable to those of significantly larger models.
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
·3464 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Hong Kong University of Science and Technology
ThinkDiff empowers text-to-image diffusion models with multimodal reasoning by aligning vision-language models to an LLM decoder, achieving state-of-the-art results on in-context reasoning benchmarks.
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
·5073 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 DLUT
EVEv2.0: A novel encoder-free vision-language model outperforms existing approaches by using a divide-and-conquer architecture and a data-efficient training strategy, achieving strong vision-reasoning…
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
·3420 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
Show-o Turbo dramatically speeds up multimodal understanding and generation by leveraging parallel decoding and consistency distillation, achieving significant performance gains with fewer sampling st…
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
·5172 words·25 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 NVIDIA Research
QLIP: A new visual tokenizer unifying autoregressive multimodal understanding & generation with state-of-the-art reconstruction and zero-shot performance!
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment
·2102 words·10 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Ola: a novel 7B parameter omni-modal language model achieves state-of-the-art performance across image, video and audio tasks using a progressive modality alignment training strategy.
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering
·4880 words·23 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Rutgers University
VISTA steers LVLMs away from hallucinations by cleverly adjusting token rankings during inference, improving visual grounding and semantic coherence.
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
·3250 words·16 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Multimodal Reasoning
🏢 Singapore University of Technology and Design
GPT models’ multimodal reasoning abilities are tracked over time on challenging visual puzzles, revealing surprisingly steady improvement and cost trade-offs.
Baichuan-Omni-1.5 Technical Report
·3756 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Baichuan Inc.
Baichuan-Omni-1.5: An open-source omni-modal LLM achieving SOTA performance across multiple modalities.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
·4124 words·20 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 DAMO Academy, Alibaba Group
VideoLLaMA3: Vision-centric training yields state-of-the-art image & video understanding!
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces
·4361 words·21 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
FILMAGENT: A multi-agent framework automates end-to-end virtual film production using LLMs, exceeding single-agent performance in a collaborative workflow.
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
·4964 words·24 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Human-AI Interaction
🏢 ByteDance Seed, Tsinghua University
UI-TARS, a novel native GUI agent, achieves state-of-the-art performance by solely using screenshots as input, eliminating the need for complex agent frameworks and expert-designed workflows.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
·2690 words·13 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-Reward: A novel multi-modal reward model boosting Large Vision Language Model performance.
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
·3786 words·18 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
New multimodal safety test suite (MSTS) reveals vision-language models’ vulnerabilities and underscores the unique challenges of multimodal inputs.
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
·3561 words·17 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Hong Kong Polytechnic University
Multimodal LLMs can now evaluate art aesthetics with human-level accuracy using a novel dataset (MM-StyleBench) and prompting method (ArtCoT), significantly improving AI alignment in artistic evaluation.
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
·1663 words·8 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Cross-Modal Retrieval
🏢 Noah's Ark Lab, Huawei
MMDocIR, a new benchmark dataset, enables better evaluation of multi-modal document retrieval systems by providing page-level and layout-level annotations for diverse long documents and questions.
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
·4505 words·22 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Parameter-Inverted Image Pyramid Networks (PIIP) drastically cut visual model computing costs without sacrificing accuracy by using smaller models for higher-resolution images and larger models for lo…
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
·22812 words·108 mins·
AI Generated
🤗 Daily Papers
Multimodal Learning
Vision-Language Models
🏢 University of Würzburg
Centurio: a 100-language LVLM achieves state-of-the-art multilingual performance by strategically incorporating non-English data in training, proving that multilingualism doesn’t hinder English profi…