Multimodal Learning
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
·2912 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Institute of Automation, Chinese Academy of Sciences
OneRef: Unified one-tower model surpasses existing methods in visual grounding and segmentation by leveraging a novel Mask Referring Modeling paradigm.
On the Comparison between Multi-modal and Single-modal Contrastive Learning
·455 words·3 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 RIKEN AIP
Multi-modal contrastive learning surpasses single-modal contrastive learning by leveraging inter-modal correlations to improve feature learning and downstream task performance, as demonstrated through a novel theoretical analysis.
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
·2191 words·11 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University
OmniTokenizer: A transformer-based tokenizer achieving state-of-the-art image and video reconstruction by leveraging a novel spatial-temporal decoupled architecture and progressive training strategy.
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
·3479 words·17 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Peking University
OmniJARVIS: Unified vision-language-action tokenization enables open-world instruction-following agents via unified multimodal interaction data.
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
·3418 words·17 mins·
Multimodal Learning
Vision-Language Models
🏢 Skywork AI
OMG-LLaVA: A single model elegantly bridges image, object, and pixel-level reasoning for superior visual understanding.
Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding
·1696 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 Baidu
Octopus, a novel multi-modal LLM, uses parallel visual recognition and sequential understanding to achieve a 5x speedup on visual grounding and improved accuracy on various MLLM tasks.
Novel Object Synthesis via Adaptive Text-Image Harmony
·4696 words·23 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science and Engineering, Nanjing University of Science and Technology
Researchers created a novel object synthesis method, Adaptive Text-Image Harmony (ATIH), that harmoniously blends image and text inputs to generate creative, composite objects.
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
·2229 words·11 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
Contrastive vision-language models (VLMs) trained only on English data significantly underperform on culturally diverse benchmarks. This paper reveals this bias, proposes novel evaluation metrics, and…
No 'Zero-Shot' Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
·6344 words·30 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
Multimodal models’ impressive ‘zero-shot’ performance hinges on the frequency of concepts in their training data, not inherent generalization ability; exponentially more data is needed for linear improvements in performance.
NeuroBOLT: Resting-state EEG-to-fMRI Synthesis with Multi-dimensional Feature Mapping
·2012 words·10 mins·
Multimodal Learning
Cross-Modal Retrieval
🏢 Vanderbilt University
NeuroBOLT: Resting-state EEG-to-fMRI synthesis using multi-dimensional feature mapping.
Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction
·2574 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Institute of Automation, Chinese Academy of Sciences
Researchers enhanced brain recording-based visual reconstruction using a novel Vision Transformer 3D framework integrated with LLMs, achieving superior performance in visual reconstruction, captioning…
MultiOOD: Scaling Out-of-Distribution Detection for Multiple Modalities
·3642 words·18 mins·
Multimodal Learning
Multimodal Understanding
🏢 ETH Zurich
MultiOOD benchmark and novel A2D & NP-Mix algorithms drastically improve multimodal out-of-distribution detection.
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
·2386 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 IBM Research
Large Multimodal Models (LMMs) are limited by their context length during many-shot in-context learning. This paper introduces Multimodal Task Vectors (MTV), a method to compress numerous in-context …
Multimodal Large Language Models Make Text-to-Image Generative Models Align Better
·4263 words·21 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Preference data generated by multimodal LLMs improves the alignment of text-to-image generative models.
Multilingual Diversity Improves Vision-Language Representations
·2777 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting vision-language models: Multilingual data improves performance on English-centric benchmarks.
Multi-Object Hallucination in Vision Language Models
·2226 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Michigan
LVLMs often hallucinate objects, a problem worsened when multiple objects are present. This paper introduces ROPE, a novel automated evaluation protocol that reveals how object class distribution and…
Multi-modal Transfer Learning between Biological Foundation Models
·2170 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 InstaDeep
IsoFormer, a novel multi-modal model, accurately predicts RNA transcript isoform expression by integrating DNA, RNA, and protein sequence information, achieving state-of-the-art results.
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
·2418 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 CUHK MMLab
MoVA, a novel MLLM, enhances multimodal understanding by adaptively routing and fusing task-specific vision experts for improved generalization across diverse image content.
MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
·2462 words·12 mins·
Multimodal Learning
Multimodal Generation
🏢 Zhejiang University
MoMu-Diffusion: a novel framework that learns long-term motion-music synchronization, generating realistic and beat-matched sequences surpassing existing methods.
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
·2146 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology, Shenzhen
MoME, a novel Mixture of Multimodal Experts, significantly improves generalist Multimodal Large Language Models (MLLMs) by mitigating task interference through specialized vision and language experts,…