Multimodal Learning
LocCa: Visual Pretraining with Location-aware Captioners
·2114 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Google DeepMind
LocCa, a novel visual pretraining paradigm, uses location-aware captioning tasks to boost downstream localization performance while maintaining holistic task capabilities.
LLMs Can Evolve Continually on Modality for X-Modal Reasoning
·2222 words·11 mins·
Multimodal Learning
Multimodal Reasoning
🏢 Dalian University of Technology
PathWeave: a novel framework that enables multimodal LLMs to continually evolve across modalities, achieving performance comparable to the state of the art with 98.73% less training burden!
LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Control and Rendering
·2138 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Fudan University
LiveScene: Language-embedded interactive radiance fields efficiently reconstruct and control complex scenes with multiple interactive objects, achieving state-of-the-art results.
Listenable Maps for Zero-Shot Audio Classifiers
·2601 words·13 mins·
Multimodal Learning
Audio-Visual Learning
🏢 Fondazione Bruno Kessler
LMAC-ZS: the first decoder-based method for explaining zero-shot audio classifiers, improving the transparency and trustworthiness of AI.
Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
·2315 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
LipFD: a novel method that leverages audio-visual inconsistencies to accurately spot lip-syncing deepfakes, outperforming existing methods and introducing a high-quality dataset for future research.
LG-VQ: Language-Guided Codebook Learning
·3656 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology
LG-VQ: A novel language-guided codebook learning framework boosts multi-modal performance.
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
·2448 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Show Lab, National University of Singapore
Visual tokens unlock extended text contexts in multi-modal models!
Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models
·2923 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Southeast University
Lever-LM configures effective in-context demonstrations for large vision-language models using a small language model, significantly improving their performance on visual question answering and image captioning.
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
·2514 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
IPRM, a new Iterative and Parallel Reasoning Mechanism, boosts complex visual reasoning by combining step-by-step and simultaneous computations, outperforming existing methods in accuracy and …
Learning Spatially-Aware Language and Audio Embeddings
·3744 words·18 mins·
Multimodal Learning
Audio-Visual Learning
🏢 Georgia Institute of Technology
ELSA: a new model that learns spatially aware language and audio embeddings, achieving state-of-the-art performance in semantic retrieval and 3D sound source localization.
Learning Cortico-Muscular Dependence through Orthonormal Decomposition of Density Ratios
·2506 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Department of Bioengineering, Imperial College London
FMCA-T unveils cortico-muscular dependence through orthonormal decomposition of density ratios, enhancing movement classification and revealing channel-temporal dependencies.
Learning 1D Causal Visual Representation with De-focus Attention Networks
·2168 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
De-focus Attention Networks achieve comparable performance to 2D non-causal models using 1D causal visual representation, solving the ‘over-focus’ issue in existing 1D causal vision models.
LaSe-E2V: Towards Language-guided Semantic-aware Event-to-Video Reconstruction
·2343 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
LaSe-E2V: Language-guided semantic-aware event-to-video reconstruction uses text descriptions to improve video quality and consistency.
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
·2911 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Apple
Kaleido Diffusion boosts the diversity of images generated by diffusion models without sacrificing quality, using autoregressive latent modeling to add more control and interpretability to the image generation process.
Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
·1862 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Courant Institute of Mathematical Sciences
I2M2: a novel framework that jointly models inter- and intra-modality dependencies for multi-modal learning, achieving superior performance across diverse real-world datasets.
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
·3090 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
The SpatialEval benchmark reveals that current vision-language models struggle with spatial reasoning, highlighting the need for improved multimodal models that effectively integrate visual and textual information.
IPO: Interpretable Prompt Optimization for Vision-Language Models
·3712 words·18 mins·
Multimodal Learning
Vision-Language Models
🏢 AIM Lab, University of Amsterdam
This paper introduces IPO, a novel interpretable prompt optimizer for vision-language models. IPO uses large language models (LLMs) to dynamically generate human-understandable prompts, improving accuracy.
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
·4104 words·20 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Harvard University
SpLiCE unlocks CLIP’s potential by transforming its dense, opaque representations into sparse, human-interpretable concept embeddings.
Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge
·3358 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 Vrije Universiteit Brussel
CLIP’s zero-shot image classification decisions are made interpretable using a novel mutual-knowledge approach based on textual concepts, demonstrating effective and human-friendly analysis across diverse settings.
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
·2071 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2-4KHD pioneers high-resolution image understanding in LVLMs, scaling processing from 336 pixels to 4K HD and beyond, achieving state-of-the-art results on multiple benchmarks.