Vision-Language Models
Novel Object Synthesis via Adaptive Text-Image Harmony
·4696 words·23 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science and Engineering, Nanjing University of Science and Technology
Researchers created a novel object synthesis method, Adaptive Text-Image Harmony (ATIH), that harmoniously blends image and text inputs to generate creative, composite objects.
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
·2229 words·11 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
Contrastive vision-language models (VLMs) trained only on English data significantly underperform on culturally diverse benchmarks. This paper reveals this bias, proposes novel evaluation metrics, and…
No 'Zero-Shot' Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
·6344 words·30 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
Multimodal models’ impressive ‘zero-shot’ performance hinges on the frequency of concepts in their pretraining data, not on inherent generalization ability; exponentially more data is needed for linear improvements in performance.
Neuro-Vision to Language: Enhancing Brain Recording-based Visual Reconstruction and Language Interaction
·2574 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Institute of Automation, Chinese Academy of Sciences
Researchers enhanced brain recording-based visual reconstruction using a novel Vision Transformer 3D framework integrated with LLMs, achieving superior performance in visual reconstruction, captioning…
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
·2386 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 IBM Research
Large Multimodal Models (LMMs) are limited by their context length during many-shot in-context learning. This paper introduces Multimodal Task Vectors (MTV), a method to compress numerous in-context examples so that many-shot multimodal in-context learning remains feasible within limited context windows.
Multimodal Large Language Models Make Text-to-Image Generative Models Align Better
·4263 words·21 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Preference data generated by multimodal large language models improves text-to-image alignment.
Multilingual Diversity Improves Vision-Language Representations
·2777 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Boosting vision-language models: Multilingual data improves performance on English-centric benchmarks.
Multi-Object Hallucination in Vision Language Models
·2226 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Michigan
LVLMs often hallucinate objects, a problem worsened when multiple objects are present. This paper introduces ROPE, a novel automated evaluation protocol that reveals how object class distribution and…
Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention
·3055 words·15 mins·
AI Generated
Natural Language Processing
Vision-Language Models
🏢 Department of Computer Science, Purdue University
D-LISA: Dynamic modules and language-informed spatial attention advance multi-object 3D grounding, surpassing state-of-the-art accuracy by 12.8%.
Multi-modal Transfer Learning between Biological Foundation Models
·2170 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 InstaDeep
IsoFormer, a novel multi-modal model, accurately predicts RNA transcript isoform expression by integrating DNA, RNA, and protein sequence information, achieving state-of-the-art results.
MoVA: Adapting Mixture of Vision Experts to Multimodal Context
·2418 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 CUHK MMLab
MoVA, a novel MLLM, enhances multimodal understanding by adaptively routing and fusing task-specific vision experts for improved generalization across diverse image content.
MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models
·2146 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology, Shenzhen
MoME, a novel Mixture of Multimodal Experts, significantly improves generalist Multimodal Large Language Models (MLLMs) by mitigating task interference through specialized vision and language experts,…
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
·4224 words·20 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
MoLE: Mixture of Low-rank Experts enhances human-centric text-to-image diffusion models by using low-rank modules trained on high-quality face and hand datasets to improve the realism of generated faces and hands.
MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling
·2749 words·13 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Alibaba Group
MoGenTS revolutionizes human motion generation by quantizing individual joints into 2D tokens, enabling efficient spatial-temporal modeling and significantly outperforming existing methods.
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration
·3087 words·15 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Beijing Jiaotong University
Mobile-Agent-v2 uses a three-agent collaborative framework (planning, decision, reflection) to improve mobile device operation accuracy by over 30%, overcoming the limitations of single-agent architectures.
MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins
·2970 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science, National Engineering Research Center for Multimedia Software and Institute of Artificial Intelligence, Wuhan University
MMSite: a novel multi-modal framework accurately identifies protein active sites using protein sequences and textual descriptions, achieving state-of-the-art performance.
Mitigating Object Hallucination via Concentric Causal Attention
·2174 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Nanyang Technological University
Concentric Causal Attention (CCA) significantly reduces object hallucination in LVLMs by cleverly reorganizing visual tokens to mitigate the impact of long-term decay in Rotary Position Encoding.
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
·3263 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 KAIST
Meteor: Mamba-based Traversal of Rationale achieves significant vision-language improvements by efficiently embedding multifaceted rationales in a large language model, without scaling the model or using additional vision encoders.
MemVLT: Vision-Language Tracking with Adaptive Memory-based Prompts
·2803 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Artificial Intelligence, University of Chinese Academy of Sciences
MemVLT: Adaptive Vision-Language Tracking leverages memory to generate dynamic prompts, surpassing existing methods by adapting to changing target appearances.
Membership Inference Attacks against Large Vision-Language Models
·3357 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 LIONS, EPFL
First benchmark for detecting training data in large vision-language models (VLLMs) improves data security.