Multimodal Learning
LocCa: Visual Pretraining with Location-aware Captioners
·2114 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
LocCa, a novel visual pretraining paradigm, uses location-aware captioning tasks to boost downstream localization performance while maintaining holistic task capabilities.
LLMs Can Evolve Continually on Modality for X-Modal Reasoning
·2222 words·11 mins·
Multimodal Learning
Multimodal Reasoning
🏢 Dalian University of Technology
PathWeave: a novel framework that enables multimodal LLMs to continually evolve across modalities, achieving performance comparable to the state of the art with 98.73% less training burden!
LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Control and Rendering
·2138 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
LiveScene: Language-embedded interactive radiance fields efficiently reconstruct and control complex scenes with multiple interactive objects, achieving state-of-the-art results.
Listenable Maps for Zero-Shot Audio Classifiers
·2601 words·13 mins·
Multimodal Learning
Audio-Visual Learning
🏢 Fondazione Bruno Kessler
LMAC-ZS: the first decoder-based method for explaining zero-shot audio classifiers, improving the transparency and trustworthiness of AI.
Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
·2315 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
LipFD: a novel method that leverages audio-visual inconsistencies to accurately spot lip-syncing deepfakes, outperforming existing methods and introducing a high-quality dataset for future research.
LG-VQ: Language-Guided Codebook Learning
·3656 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology
LG-VQ: A novel language-guided codebook learning framework boosts multi-modal performance.
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
·2448 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
Visual tokens unlock extended text contexts in multi-modal models!
Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models
·2923 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Southeast University
Lever-LM configures effective in-context demonstrations for large vision-language models using a small language model, significantly improving their performance on visual question answering and image captioning.
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
·2514 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
IPRM, a new Iterative and Parallel Reasoning Mechanism, boosts complex visual reasoning by combining step-by-step and simultaneous computations, outperforming existing methods in accuracy and …
Learning Spatially-Aware Language and Audio Embeddings
·3744 words·18 mins·
Multimodal Learning
Audio-Visual Learning
🏢 Georgia Institute of Technology
ELSA: a new model that learns spatially aware language and audio embeddings, achieving state-of-the-art performance in semantic retrieval and 3D sound source localization.
Learning Cortico-Muscular Dependence through Orthonormal Decomposition of Density Ratios
·2506 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Department of Bioengineering, Imperial College London
FMCA-T unveils cortico-muscular dependence through orthonormal decomposition of density ratios, enhancing movement classification and revealing channel-temporal dependencies.
Learning 1D Causal Visual Representation with De-focus Attention Networks
·2168 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
De-focus Attention Networks achieve comparable performance to 2D non-causal models using 1D causal visual representation, solving the ‘over-focus’ issue in existing 1D causal vision models.
LaSe-E2V: Towards Language-guided Semantic-aware Event-to-Video Reconstruction
·2343 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
LaSe-E2V: Language-guided semantic-aware event-to-video reconstruction uses text descriptions to improve video quality and consistency.
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
·2911 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Apple
Kaleido Diffusion boosts the diversity of images generated by diffusion models without sacrificing quality, using autoregressive latent modeling to add more control and interpretability to the image generation process.
Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
·1862 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Courant Institute of Mathematical Sciences
I2M2: a novel framework that jointly models inter- and intra-modality dependencies for multi-modal learning, achieving superior performance across diverse real-world datasets.
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
·3090 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
The SpatialEval benchmark reveals that current vision-language models struggle with spatial reasoning, highlighting the need for improved multimodal models that effectively integrate visual and textual information.
IPO: Interpretable Prompt Optimization for Vision-Language Models
·3712 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 AIM Lab, University of Amsterdam
This paper introduces IPO, a novel interpretable prompt optimizer for vision-language models. IPO uses large language models (LLMs) to dynamically generate human-understandable prompts, improving accuracy.
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
·4104 words·20 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Harvard University
SpLiCE unlocks CLIP’s potential by transforming its dense, opaque representations into sparse, human-interpretable concept embeddings.
Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge
·3358 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 Vrije Universiteit Brussel
CLIP’s zero-shot image classification decisions are made interpretable using a novel mutual-knowledge approach based on textual concepts, demonstrating effective and human-friendly analysis across diverse settings.
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
·2071 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2-4KHD pioneers high-resolution image understanding in LVLMs, scaling processing from 336 pixels to 4K HD and beyond, achieving state-of-the-art results on multiple benchmarks.