Vision-Language Models
Novel Object Synthesis via Adaptive Text-Image Harmony
·4696 words·23 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science and Engineering, Nanjing University of Science and Technology
Researchers created a novel object synthesis method, Adaptive Text-Image Harmony (ATIH), that harmoniously blends image and text inputs to generate creative, composite objects.
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
·2229 words·11 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
Contrastive vision-language models (VLMs) trained only on English data significantly underperform on culturally diverse benchmarks. This paper reveals this bias, proposes novel evaluation metrics, and…
No 'Zero-Shot' Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
·6344 words·30 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
Multimodal models’ impressive ‘zero-shot’ performance hinges on the frequency of concepts in their pretraining data, not on inherent generalization ability; exponentially more data is needed for linear improvements in performance.
Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction
·2574 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Institute of Automation, Chinese Academy of Sciences
Researchers enhanced brain recording-based visual reconstruction using a novel Vision Transformer 3D framework integrated with LLMs, achieving superior performance in visual reconstruction, captioning…
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
·2386 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 IBM Research
Large Multimodal Models (LMMs) are limited by their context length during many-shot in-context learning. This paper introduces Multimodal Task Vectors (MTV), a method to compress numerous in-context examples so that many-shot multimodal in-context learning remains feasible within limited context windows.
Multimodal Large Language Models Make Text-to-Image Generative Models Align Better
·4263 words·21 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Preference data generated by multimodal large language models improves text-to-image alignment.
Multilingual Diversity Improves Vision-Language Representations
·2777 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting vision-language models: Multilingual data improves performance on English-centric benchmarks.
Multi-Object Hallucination in Vision Language Models
·2226 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Michigan
LVLMs often hallucinate objects, a problem worsened when multiple objects are present. This paper introduces ROPE, a novel automated evaluation protocol that reveals how object class distribution and…
Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention
·3055 words·15 mins·
AI Generated
Natural Language Processing
Vision-Language Models
🏢 Department of Computer Science, Purdue University
D-LISA: Dynamic modules and language-informed spatial attention advance multi-object 3D grounding, surpassing state-of-the-art accuracy by 12.8%.
Multi-modal Transfer Learning between Biological Foundation Models
·2170 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 InstaDeep
IsoFormer, a novel multi-modal model, accurately predicts RNA transcript isoform expression by integrating DNA, RNA, and protein sequence information, achieving state-of-the-art results.
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
·2418 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 CUHK MMLab
MoVA, a novel MLLM, enhances multimodal understanding by adaptively routing and fusing task-specific vision experts for improved generalization across diverse image content.
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
·2146 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology, Shenzhen
MoME, a novel Mixture of Multimodal Experts, significantly improves generalist Multimodal Large Language Models (MLLMs) by mitigating task interference through specialized vision and language experts,…
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
·4224 words·20 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
MoLE: Mixture of Low-rank Experts enhances human-centric text-to-image diffusion models by using low-rank modules trained on high-quality face and hand datasets to improve the realism of generated faces and hands.
MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling
·2749 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Alibaba Group
MoGenTS revolutionizes human motion generation by quantizing individual joints into 2D tokens, enabling efficient spatial-temporal modeling and significantly outperforming existing methods.
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
·3087 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Beijing Jiaotong University
Mobile-Agent-v2 uses a three-agent collaborative framework (planning, decision, reflection) to improve mobile device operation accuracy by over 30%, overcoming the limitations of single-agent architectures.
MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins
·2970 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science, National Engineering Research Center for Multimedia Software and Institute of Artificial Intelligence, Wuhan University
MMSite: a novel multi-modal framework accurately identifies protein active sites using protein sequences and textual descriptions, achieving state-of-the-art performance.
Mitigating Object Hallucination via Concentric Causal Attention
·2174 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanyang Technological University
Concentric Causal Attention (CCA) significantly reduces object hallucination in LVLMs by cleverly reorganizing visual tokens to mitigate the impact of long-term decay in Rotary Position Encoding.
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
·3263 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 KAIST
Meteor: Mamba-based Traversal of Rationale achieves significant vision-language improvements by efficiently embedding multifaceted rationales in a large language model, without scaling the model or using additional vision encoders.
MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts
·2803 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Artificial Intelligence, University of Chinese Academy of Sciences
MemVLT: Adaptive Vision-Language Tracking leverages memory to generate dynamic prompts, surpassing existing methods by adapting to changing target appearances.
Membership Inference Attacks against Large Vision-Language Models
·3357 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 LIONS, EPFL
First benchmark for detecting training data in large vision-language models (VLLMs) improves data security.