
Multimodal Learning

UI-TARS: Pioneering Automated GUI Interaction with Native Agents
·4964 words·24 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 ByteDance Seed, Tsinghua University
UI-TARS, a novel native GUI agent, achieves state-of-the-art performance using only screenshots as input, eliminating the need for complex agent frameworks and expert-designed workflows.
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
·2690 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2.5-Reward: A novel multi-modal reward model boosting Large Vision Language Model performance.
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
·3786 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Google DeepMind
New multimodal safety test suite (MSTS) reveals vision-language models’ vulnerabilities and underscores the unique challenges of multimodal inputs.
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
·3561 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong Polytechnic University
Multimodal LLMs can now evaluate art aesthetics with human-level accuracy using a novel dataset (MM-StyleBench) and prompting method (ArtCoT), significantly improving AI alignment in artistic evaluation.
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
·1663 words·8 mins
AI Generated 🤗 Daily Papers Multimodal Learning Cross-Modal Retrieval 🏢 Noah's Ark Lab, Huawei
MMDocIR, a new benchmark dataset, enables better evaluation of multi-modal document retrieval systems by providing page-level and layout-level annotations for diverse long documents and questions.
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
·4505 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Parameter-Inverted Image Pyramid Networks (PIIP) drastically cut visual model computing costs without sacrificing accuracy by using smaller models for higher-resolution images and larger models for lower-resolution images.
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
·22812 words·108 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Würzburg
Centurio, a 100-language LVLM, achieves state-of-the-art multilingual performance by strategically incorporating non-English data in training, showing that multilingualism doesn't hinder English proficiency.
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
·2599 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Zhejiang University
InfiGUIAgent, a novel multimodal GUI agent, leverages a two-stage training pipeline to achieve advanced reasoning and GUI interaction capabilities, outperforming existing models in benchmarks.
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
·4541 words·22 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
Sa2VA marries SAM2 and LLaVA for dense grounded image and video understanding, achieving state-of-the-art results on multiple benchmarks.
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
·5398 words·26 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Key Laboratory of Intelligent Information Processing
LLaVA-Mini achieves comparable performance to state-of-the-art LMMs using only one vision token, drastically reducing computational cost and latency.
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
·2565 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong
Dispider: A novel system enabling real-time interaction with video LLMs via disentangled perception, decision, and reaction modules for efficient, accurate responses to streaming video.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
·2577 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent Youtu Lab
VITA-1.5 achieves near real-time vision and speech interaction by using a novel three-stage training method that progressively integrates speech data into an LLM, enabling fluent conversations.
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
·2983 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Reasoning 🏢 Gaoling School of Artificial Intelligence, Renmin University of China
Virgo, a new multimodal slow-thinking system, significantly improves MLLM reasoning by fine-tuning with text-based long-form thought data, demonstrating performance comparable to commercial systems.
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
·4036 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 College of Computer Science and Technology, Zhejiang University
New multimodal textbook dataset boosts Vision-Language Model (VLM) performance!
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
·3571 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 DAMO Academy, Alibaba Group
VideoRefer Suite boosts video LLM understanding by introducing a large-scale, high-quality object-level video instruction dataset, a versatile spatial-temporal object encoder model, and a comprehensive benchmark.
On the Compositional Generalization of Multimodal LLMs for Medical Imaging
·5637 words·27 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong, Shenzhen
Multimodal LLMs for medical imaging now generalize better via compositional generalization, leveraging relationships between image features (modality, anatomy, task) to understand unseen images and im…
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
·3641 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Oxford
OS-Genesis: Reverse task synthesis revolutionizes GUI agent training by generating high-quality trajectory data without human supervision, drastically boosting performance on challenging benchmarks.
From Elements to Design: A Layered Approach for Automatic Graphic Design Composition
·3329 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Xi'an Jiaotong University
LaDeCo: a layered approach to automatic graphic design composition, generating high-quality designs by sequentially composing elements into semantic layers.
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
·3509 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai AI Laboratory
Task Preference Optimization (TPO) significantly boosts multimodal large language models’ visual understanding by aligning them with fine-grained visual tasks via learnable task tokens, achieving 14.6…
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
·2929 words·14 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Toronto
MMFactory: A universal framework for vision-language tasks, offering diverse programmatic solutions based on user needs and constraints, outperforming existing methods.