
Multimodal Learning

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
·2697 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tencent AI Lab
Insight-V: A multi-agent system enhances multi-modal LLMs’ visual reasoning by generating high-quality long-chain reasoning data and employing a two-stage training pipeline, achieving significant perf…
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI
·4473 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Washington
GMAI-VL-5.5M & GMAI-VL: A new multimodal medical dataset and vision-language model achieve state-of-the-art results in various medical tasks.
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
·3767 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Hong Kong Baptist University
VideoAutoArena automates large multimodal model (LMM) evaluation using simulated users, offering a cost-effective and scalable solution compared to traditional human annotation.
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
·3633 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Vivo AI Lab
BlueLM-V-3B: Algorithm and system co-design enables efficient, real-time multimodal language model deployment on mobile devices.
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts
·2224 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Metabrain AGI Lab
Awaker2.5-VL: A novel Mixture-of-Experts architecture stably scales MLLMs, solving multi-task conflict with parameter efficiency and achieving state-of-the-art performance.
SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
·2599 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Multimodal Generation 🏢 Roblox
SmoothCache: A universal technique boosts Diffusion Transformer inference speed by 8-71% across modalities, without sacrificing quality!
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
·2726 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Peking University
LLaVA-o1: A novel vision-language model achieves superior reasoning performance through structured, multi-stage processing and efficient inference-time scaling, surpassing even larger, closed-source m…
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
·4459 words·21 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Researchers boost multimodal reasoning in LLMs with Mixed Preference Optimization (MPO) and a large-scale preference dataset (MMPR), significantly improving reasoning accuracy and achieving performance …
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
·4045 words·19 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Tsinghua University
JanusFlow harmonizes autoregression and rectified flow for unified multimodal understanding and generation, achieving state-of-the-art results on standard benchmarks.
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
·2584 words·13 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
VideoGLaMM: a new large multimodal model achieves precise pixel-level visual grounding in videos by seamlessly integrating a dual vision encoder, a spatio-temporal decoder, and a large language model.
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
·2445 words·12 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Microsoft Research
LLM2CLIP boosts CLIP’s performance by cleverly integrating LLMs, enabling it to understand longer, more complex image captions and achieving state-of-the-art results across various benchmarks.
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM Data Contamination
·3165 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong, Shenzhen
MM-Detect: a novel framework detects contamination in multimodal LLMs, enhancing benchmark reliability by identifying training set leakage and improving performance evaluations.
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation
·2197 words·11 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 University of Technology Sydney
TIP-I2V: A million-scale dataset provides 1.7 million real user text & image prompts for image-to-video generation, boosting model development and safety.
Inference Optimal VLMs Need Only One Visual Token but Larger Models
·3063 words·15 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Carnegie Mellon University
Inference-optimal Vision-Language Models (VLMs) achieve the best compute-accuracy trade-off by pairing a single visual token with a larger language model.
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
·3766 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 Tsinghua University
AndroidLab, a novel framework, systematically benchmarks Android autonomous agents, improving LLM and LMM success rates on 138 tasks via a unified environment and open-source dataset.
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
·3628 words·18 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Shanghai AI Laboratory
OS-Atlas: A new open-source toolkit and model dramatically improves GUI agent performance by providing a massive dataset and innovative training methods, enabling superior generalization to unseen int…
BenchX: A Unified Benchmark Framework for Medical Vision-Language Pretraining on Chest X-Rays
·3405 words·16 mins
AI Generated 🤗 Daily Papers Multimodal Learning Vision-Language Models 🏢 Institute of High Performance Computing (IHPC)
BenchX: A unified benchmark framework reveals surprising MedVLP performance, challenging existing conclusions and advancing research.
Survey of User Interface Design and Interaction Techniques in Generative AI Applications
·3567 words·17 mins
AI Generated 🤗 Daily Papers Multimodal Learning Human-AI Interaction 🏢 UC San Diego
This study provides a comprehensive taxonomy of user interface design and interaction techniques in generative AI, offering valuable insights for developers and researchers aiming to enhance user expe…