Vision-Language Models

Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
·2911 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Apple
Kaleido Diffusion boosts the diversity of images generated by diffusion models without sacrificing quality, using autoregressive latent modeling to add more control and interpretability to the image g…
Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
·1862 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Courant Institute of Mathematical Sciences
I2M2: a novel framework that improves multi-modal learning by jointly modeling inter- and intra-modality dependencies, achieving superior performance across diverse real-world datasets.
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
·3090 words·15 mins
Multimodal Learning Vision-Language Models 🏢 Microsoft Research
SpatialEval benchmark reveals that current vision-language models struggle with spatial reasoning, highlighting the need for improved multimodal models that effectively integrate visual and textual in…
IPO: Interpretable Prompt Optimization for Vision-Language Models
·3712 words·18 mins
Multimodal Learning Vision-Language Models 🏢 AIM Lab, University of Amsterdam
This paper introduces IPO, a novel interpretable prompt optimizer for vision-language models. IPO uses large language models (LLMs) to dynamically generate human-understandable prompts, improving acc…
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
·4104 words·20 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Harvard University
SpLiCE unlocks CLIP’s potential by transforming its dense, opaque representations into sparse, human-interpretable concept embeddings.
Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge
·3358 words·16 mins
Multimodal Learning Vision-Language Models 🏢 Vrije Universiteit Brussel
CLIP’s zero-shot image classification decisions are made interpretable using a novel mutual-knowledge approach based on textual concepts, demonstrating effective and human-friendly analysis across div…
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
·2071 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2-4KHD pioneers high-resolution image understanding in LVLMs, scaling processing from 336 pixels to 4K HD and beyond, achieving state-of-the-art results on multiple benchmarks.
Interfacing Foundation Models' Embeddings
·2676 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 UW-Madison
FIND, a lightweight transformer interface, seamlessly aligns foundation models’ embeddings for unified image and dataset-level understanding, enabling generalizable, interleaved performance on segment…
InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction
·2177 words·11 mins
Multimodal Learning Vision-Language Models 🏢 University of Illinois Urbana-Champaign
InterDreamer: Zero-shot text-guided 3D human-object interaction generation without paired data, achieved via decoupled semantic and dynamic modeling, using LLMs and a physics-based world model.
InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint
·2703 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong
InterControl: Zero-shot multi-person interaction generation by precisely controlling every joint using only single-person data.
Instruction-Guided Visual Masking
·3666 words·18 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Instruction-Guided Visual Masking (IVM) boosts multimodal instruction following by precisely focusing models on relevant image regions via visual masking, achieving state-of-the-art results on multipl…
InstructG2I: Synthesizing Images from Multimodal Attributed Graphs
·1973 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Illinois at Urbana-Champaign
INSTRUCTG2I, a novel graph-context-conditioned diffusion model, generates images from multimodal attributed graphs, addressing challenges in graph size, dependencies, and controllability.
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
·6925 words·33 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Sorbonne University
Frozen LLMs surprisingly excel at multimodal tasks; this paper reveals that their success stems from an implicit multimodal alignment effect, paving the way for efficient LMMs.
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
·1541 words·8 mins
Natural Language Processing Vision-Language Models 🏢 Xiamen University
I2EBench, a new benchmark for instruction-based image editing (IIE), provides a comprehensive evaluation framework with 16 dimensions aligned with human perception to assess IIE models objectively.
HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid
·2769 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
A humanoid robot learns to rearrange objects following vision and language instructions, achieving remarkable success on diverse tasks in a novel dataset.
How Control Information Influences Multilingual Text Image Generation and Editing?
·2075 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
TextGen enhances multilingual visual text generation and editing by optimizing control information using Fourier analysis and a two-stage framework, achieving state-of-the-art results.
Homology Consistency Constrained Efficient Tuning for Vision-Language Models
·1675 words·8 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
Constraining vision-language model tuning via persistent homology ensures consistent image-text alignment, improving few-shot learning and domain generalization.
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
·2062 words·10 mins
Multimodal Learning Vision-Language Models 🏢 ECE & 2IPAI, Seoul National University
This paper introduces HVFA, a novel OCR-free document understanding framework that combines MLLMs with multi-scale visual features, achieving superior performance across a range of document understanding tasks.
HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model
·2128 words·10 mins
AI Generated Natural Language Processing Vision-Language Models 🏢 AICV Lab, University of Arkansas
HENASY, a novel egocentric video-language model, uses a compositional approach to assemble scene entities for improved interpretability and performance.
HAWK: Learning to Understand Open-World Video Anomalies
·3198 words·16 mins
Natural Language Processing Vision-Language Models 🏢 Hong Kong University of Science and Technology
HAWK: a novel framework leveraging interactive VLMs and motion modality achieves state-of-the-art performance in open-world video anomaly understanding, generating descriptions and answering questions…