Vision-Language Models

Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling
·2911 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Apple
Kaleido Diffusion boosts the diversity of images generated by diffusion models without sacrificing quality, using autoregressive latent modeling to add more control and interpretability to the image g…
Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
·1862 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Courant Institute of Mathematical Sciences
I2M2: a novel framework that improves multi-modal learning by jointly modeling inter- and intra-modality dependencies, achieving superior performance across diverse real-world datasets.
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
·3090 words·15 mins
Multimodal Learning Vision-Language Models 🏢 Microsoft Research
SpatialEval benchmark reveals that current vision-language models struggle with spatial reasoning, highlighting the need for improved multimodal models that effectively integrate visual and textual in…
IPO: Interpretable Prompt Optimization for Vision-Language Models
·3712 words·18 mins
Multimodal Learning Vision-Language Models 🏢 AIM Lab, University of Amsterdam
This paper introduces IPO, a novel interpretable prompt optimizer for vision-language models. IPO uses large language models (LLMs) to dynamically generate human-understandable prompts, improving acc…
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
·4104 words·20 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Harvard University
SpLiCE unlocks CLIP’s potential by transforming its dense, opaque representations into sparse, human-interpretable concept embeddings.
Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge
·3358 words·16 mins
Multimodal Learning Vision-Language Models 🏢 Vrije Universiteit Brussel
CLIP’s zero-shot image classification decisions are made interpretable using a novel mutual-knowledge approach based on textual concepts, demonstrating effective and human-friendly analysis across div…
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
·2071 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Shanghai Artificial Intelligence Laboratory
InternLM-XComposer2-4KHD pioneers high-resolution image understanding in LVLMs, scaling processing from 336 pixels to 4K HD and beyond, achieving state-of-the-art results on multiple benchmarks.
Interfacing Foundation Models' Embeddings
·2676 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 UW-Madison
FIND, a lightweight transformer interface, seamlessly aligns foundation models’ embeddings for unified image and dataset-level understanding, enabling generalizable, interleaved performance on segment…
InterDreamer: Zero-Shot Text to 3D Dynamic Human-Object Interaction
·2177 words·11 mins
Multimodal Learning Vision-Language Models 🏢 University of Illinois Urbana-Champaign
InterDreamer: Zero-shot text-guided 3D human-object interaction generation without paired data, achieved via decoupled semantic and dynamic modeling, using LLMs and a physics-based world model.
InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint
·2703 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Chinese University of Hong Kong
InterControl: Zero-shot multi-person interaction generation by precisely controlling every joint using only single-person data.
Instruction-Guided Visual Masking
·3666 words·18 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Instruction-Guided Visual Masking (IVM) boosts multimodal instruction following by precisely focusing models on relevant image regions via visual masking, achieving state-of-the-art results on multipl…
InstructG2I: Synthesizing Images from Multimodal Attributed Graphs
·1973 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Illinois at Urbana-Champaign
INSTRUCTG2I, a novel graph-context-conditioned diffusion model, generates images from multimodal attributed graphs, addressing challenges in graph size, dependencies, and controllability.
Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs
·6925 words·33 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Sorbonne University
Frozen LLMs surprisingly excel at multimodal tasks; this paper reveals that their success stems from an implicit multimodal alignment effect, paving the way for efficient LMMs.
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
·1541 words·8 mins
Natural Language Processing Vision-Language Models 🏢 Xiamen University
I2EBench, a new benchmark for instruction-based image editing (IIE), provides a comprehensive evaluation framework with 16 dimensions aligned with human perception to assess IIE models objectively.
HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid
·2769 words·13 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Shanghai Jiao Tong University
A humanoid robot learns to rearrange objects following vision and language instructions, achieving remarkable success on diverse tasks in a novel dataset.
How Control Information Influences Multilingual Text Image Generation and Editing?
·2075 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
TextGen enhances multilingual visual text generation and editing by optimizing control information using Fourier analysis and a two-stage framework, achieving state-of-the-art results.
Homology Consistency Constrained Efficient Tuning for Vision-Language Models
·1675 words·8 mins
Multimodal Learning Vision-Language Models 🏢 University of Science and Technology of China
Constraining vision-language model tuning via persistent homology ensures consistent image-text alignment, improving few-shot learning and domain generalization.
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
·2062 words·10 mins
Multimodal Learning Vision-Language Models 🏢 ECE & 2IPAI, Seoul National University
This paper introduces HVFA, a novel OCR-free document understanding framework that combines MLLMs with multi-scale visual features, achieving superior performance across a range of document understanding tasks.
HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model
·2128 words·10 mins
AI Generated Natural Language Processing Vision-Language Models 🏢 AICV Lab, University of Arkansas
HENASY, a novel egocentric video-language model, uses a compositional approach to assemble scene entities for improved interpretability and performance.
HAWK: Learning to Understand Open-World Video Anomalies
·3198 words·16 mins
Natural Language Processing Vision-Language Models 🏢 Hong Kong University of Science and Technology
HAWK: a novel framework leveraging interactive VLMs and motion modality achieves state-of-the-art performance in open-world video anomaly understanding, generating descriptions and answering questions…