Multimodal Learning
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
·2912 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Institute of Automation, Chinese Academy of Sciences
OneRef: Unified one-tower model surpasses existing methods in visual grounding and segmentation by leveraging a novel Mask Referring Modeling paradigm.
On the Comparison between Multi-modal and Single-modal Contrastive Learning
·455 words·3 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 RIKEN AIP
Multi-modal contrastive learning surpasses single-modal contrastive learning by leveraging inter-modal correlations to improve feature learning and downstream task performance, as demonstrated through a novel theoretical analysis.
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
·2191 words·11 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University
OmniTokenizer: A transformer-based tokenizer achieving state-of-the-art image and video reconstruction by leveraging a novel spatial-temporal decoupled architecture and progressive training strategy.
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
·3479 words·17 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Peking University
OmniJARVIS: Unified vision-language-action tokenization enables open-world instruction-following agents via unified multimodal interaction data.
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
·3418 words·17 mins·
Multimodal Learning
Vision-Language Models
🏢 Skywork AI
OMG-LLaVA: A single model elegantly bridges image, object, and pixel-level reasoning for superior visual understanding.
Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding
·1696 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 Baidu
Octopus, a novel multi-modal LLM, uses parallel visual recognition and sequential understanding to achieve a 5x speedup on visual grounding and improved accuracy on various MLLM tasks.
Novel Object Synthesis via Adaptive Text-Image Harmony
·4696 words·23 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science and Engineering, Nanjing University of Science and Technology
Researchers created a novel object synthesis method, Adaptive Text-Image Harmony (ATIH), that harmoniously blends image and text inputs to generate creative, composite objects.
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
·2229 words·11 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
Contrastive vision-language models (VLMs) trained only on English data significantly underperform on culturally diverse benchmarks. This paper reveals this bias, proposes novel evaluation metrics, and…
No 'Zero-Shot' Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
·6344 words·30 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
Multimodal models’ impressive ‘zero-shot’ performance hinges on the frequency of concepts in their training data, not inherent generalization ability; exponentially more data is needed for linear improvements in performance.
NeuroBOLT: Resting-state EEG-to-fMRI Synthesis with Multi-dimensional Feature Mapping
·2012 words·10 mins·
Multimodal Learning
Cross-Modal Retrieval
🏢 Vanderbilt University
NeuroBOLT: Resting-state EEG-to-fMRI synthesis using multi-dimensional feature mapping.
Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction
·2574 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Institute of Automation, Chinese Academy of Sciences
Researchers enhanced brain recording-based visual reconstruction using a novel Vision Transformer 3D framework integrated with LLMs, achieving superior performance in visual reconstruction, captioning…
MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities
·3642 words·18 mins·
Multimodal Learning
Multimodal Understanding
🏢 ETH Zurich
MultiOOD benchmark and novel A2D & NP-Mix algorithms drastically improve multimodal out-of-distribution detection.
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
·2386 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 IBM Research
Large Multimodal Models (LMMs) are limited by their context length during many-shot in-context learning. This paper introduces Multimodal Task Vectors (MTV), a method to compress numerous in-context …
Multimodal Large Language Models Make Text-to-Image Generative Models Align Better
·4263 words·21 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Preference data generated by multimodal LLMs improves the alignment of text-to-image generative models.
Multilingual Diversity Improves Vision-Language Representations
·2777 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting vision-language models: Multilingual data improves performance on English-centric benchmarks.
Multi-Object Hallucination in Vision Language Models
·2226 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Michigan
LVLMs often hallucinate objects, a problem worsened when multiple objects are present. This paper introduces ROPE, a novel automated evaluation protocol that reveals how object class distribution and…
Multi-modal Transfer Learning between Biological Foundation Models
·2170 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 InstaDeep
IsoFormer, a novel multi-modal model, accurately predicts RNA transcript isoform expression by integrating DNA, RNA, and protein sequence information, achieving state-of-the-art results.
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
·2418 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 CUHK MMLab
MoVA, a novel MLLM, enhances multimodal understanding by adaptively routing and fusing task-specific vision experts for improved generalization across diverse image content.
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
·2462 words·12 mins·
Multimodal Learning
Multimodal Generation
🏢 Zhejiang University
MoMu-Diffusion: a novel framework that learns long-term motion-music synchronization, generating realistic and beat-matched sequences surpassing existing methods.
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
·2146 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology, Shenzhen
MoME, a novel Mixture of Multimodal Experts, significantly improves generalist Multimodal Large Language Models (MLLMs) by mitigating task interference through specialized vision and language experts,…