Multimodal Learning
Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
·3022 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Google
The λ-Harmonic reward function and Reward Preference Optimization (RPO) improve subject-driven text-to-image generation, delivering faster training and state-of-the-art results with a simpler setup.
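To make the reward concrete: a λ-Harmonic-style reward can be read as a weighted harmonic mean of a prompt-alignment score and a subject-fidelity score, so an image that satisfies only one objective is penalized. A minimal sketch, assuming both scores are CLIP-style similarities in (0, 1]; the function and parameter names are illustrative, not the paper's API:

```python
def harmonic_reward(text_score: float, subject_score: float, lam: float = 0.5) -> float:
    """Weighted harmonic mean of prompt-alignment and subject-fidelity scores.
    The harmonic mean penalizes images that satisfy only one objective."""
    eps = 1e-8
    return 1.0 / (lam / (text_score + eps) + (1.0 - lam) / (subject_score + eps))

# Strong subject fidelity but weak prompt alignment still yields a low reward.
print(harmonic_reward(text_score=0.2, subject_score=0.9))  # ~0.33
```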
Stabilizing Zero-Shot Prediction: A Novel Antidote to Forgetting in Continual Vision-Language Tasks
·2243 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
ZAF, a novel replay-free continual learning method for vision-language models, significantly reduces forgetting by stabilizing zero-shot predictions.
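The stabilization idea can be sketched as an auxiliary regularizer that anchors the adapted model's zero-shot predictions to those of the frozen pre-trained model; this is a simplified stand-in for ZAF's objective, and the temperature and weighting here are assumptions:

```python
import torch
import torch.nn.functional as F

def zero_shot_stability_loss(adapted_logits: torch.Tensor,
                             frozen_logits: torch.Tensor,
                             temperature: float = 1.0) -> torch.Tensor:
    """KL term pulling the adapted model's zero-shot distribution toward the
    frozen pre-trained model's (a simplified stand-in for ZAF's stabilization)."""
    p_frozen = F.softmax(frozen_logits / temperature, dim=-1)
    log_p_adapted = F.log_softmax(adapted_logits / temperature, dim=-1)
    return F.kl_div(log_p_adapted, p_frozen, reduction="batchmean")

# total = task_loss + beta * zero_shot_stability_loss(adapted_logits, frozen_logits)
```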
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
·3123 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 UC San Diego
SpatialRGPT enhances Vision-Language Models’ spatial reasoning by integrating 3D scene graphs and depth information, achieving significant performance gains on spatial reasoning tasks.
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
·2401 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
SpatialPIN boosts vision-language models’ spatial reasoning by cleverly combining prompting techniques with 3D foundation models, enabling zero-shot performance on a variety of spatial tasks.
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
·2449 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Harvard University
SocialGPT cleverly leverages Vision Foundation Models and Large Language Models for zero-shot social relation reasoning, achieving competitive results and offering interpretable outputs via prompt opt…
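The "greedy segment optimization" pattern is easy to illustrate: split the prompt into segments, then pick the best candidate for each segment in order, scoring the partially assembled prompt. A sketch under those assumptions; candidate generation and the scorer below are toys, not the paper's specifics:

```python
def greedy_segment_search(segment_candidates, score_fn):
    """Greedily choose, segment by segment, the candidate maximizing the
    score of the prompt assembled so far (a sketch of greedy segment
    optimization; scoring and candidates are assumptions)."""
    chosen = []
    for candidates in segment_candidates:
        chosen.append(max(candidates, key=lambda c: score_fn(chosen + [c])))
    return chosen

# Toy scorer that prefers longer prompts, for illustration only.
candidates = [["Describe each person.", "List the people."],
              ["Infer their social relation.", "Guess how they are related."]]
print(greedy_segment_search(candidates, score_fn=lambda segs: len(" ".join(segs))))
```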
Slot-VLM: Object-Event Slots for Video-Language Modeling
·4378 words·21 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Microsoft Research
Slot-VLM generates semantically decomposed video tokens using an Object-Event Slots module, improving video-language model performance.
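The "slots" follow the slot-attention pattern: a small set of query vectors compete, via a softmax over the slot axis, for the video tokens, yielding a compact, decomposed token set. A minimal one-iteration sketch in PyTorch, independent of the paper's exact module:

```python
import torch

def slot_attention_step(slots, inputs, to_q, to_k, to_v):
    """One slot-attention round: softmax over the *slot* axis makes slots
    compete for tokens, which are then aggregated per slot by weighted mean.
    slots: (B, S, D) slot states; inputs: (B, N, D) video tokens."""
    q, k, v = to_q(slots), to_k(inputs), to_v(inputs)
    attn = torch.einsum("bsd,bnd->bsn", q, k) * q.shape[-1] ** -0.5
    attn = attn.softmax(dim=1)                              # compete across slots
    attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)   # normalize per slot
    return torch.einsum("bsn,bnd->bsd", attn, v)            # updated slots

B, S, N, D = 2, 4, 64, 32
to_q, to_k, to_v = (torch.nn.Linear(D, D) for _ in range(3))
print(slot_attention_step(torch.randn(B, S, D), torch.randn(B, N, D),
                          to_q, to_k, to_v).shape)  # torch.Size([2, 4, 32])
```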
Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models
·3521 words·17 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Cyber Science and Engineering, Southeast University, Nanjing, China
Single Image Unlearning (SIU) efficiently removes visual data from Multimodal Large Language Models (MLLMs) using only one image per concept, outperforming existing methods and defending against attac…
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
·2883 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Southeast University
SimVG: A simpler, faster visual grounding framework with decoupled multi-modal fusion, achieving state-of-the-art performance.
Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
·3138 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Maryland, College Park
Shadowcast: A new data poisoning attack manipulates vision-language models by injecting visually similar, yet deceptively misleading, image-text pairs, causing them to generate false information.
SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation
·2121 words·10 mins·
Multimodal Learning
Embodied AI
🏢 Tsinghua University
SG-Nav achieves state-of-the-art zero-shot object navigation by leveraging a novel 3D scene graph to provide rich context for LLM-based reasoning.
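The core mechanism, serializing a 3D scene graph into text the LLM can reason over, can be sketched as follows; the graph schema and prompt phrasing are illustrative assumptions, not SG-Nav's exact prompt:

```python
def scene_graph_to_prompt(nodes, edges, goal):
    """Serialize (subject, relation, object) scene-graph triples into an LLM
    prompt (schema and phrasing are assumptions, not SG-Nav's exact prompt).
    nodes: {id: label}; edges: [(subj_id, relation, obj_id)]."""
    facts = [f"- {nodes[s]} {rel} {nodes[o]}" for s, rel, o in edges]
    return ("You are guiding a robot. Observed scene relations:\n"
            + "\n".join(facts)
            + f"\nWhere should the robot explore next to find a {goal}? Answer briefly.")

nodes = {0: "sofa", 1: "TV", 2: "coffee table"}
edges = [(1, "is in front of", 0), (2, "is next to", 0)]
print(scene_graph_to_prompt(nodes, edges, goal="remote control"))
```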
SeTAR: Out-of-Distribution Detection with Selective Low-Rank Approximation
·5697 words·27 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Southern University of Science and Technology
SeTAR: Training-free OOD detection via selective low-rank approximation, improving robustness and efficiency.
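The low-rank step itself is a truncated SVD of selected weight matrices; choosing which matrices to truncate and at what rank is the search SeTAR contributes. A sketch of the truncation only, in NumPy:

```python
import numpy as np

def low_rank_approx(W: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation of W via truncated SVD. SeTAR applies
    this selectively to chosen weight matrices; the layer/rank selection is
    the search the paper describes, not shown here."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

W = np.random.randn(512, 512)
print(np.linalg.matrix_rank(low_rank_approx(W, rank=64)))  # 64
```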
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
·2539 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 UNC Chapel Hill
SELMA boosts text-to-image fidelity by merging skill-specific models trained on automatically generated image-text datasets.
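Merging skill experts is often as simple as a weighted parameter average of their deltas (e.g., LoRA updates); a minimal sketch under that assumption, which may differ from SELMA's exact merging rule:

```python
import torch

def merge_expert_deltas(experts, weights=None):
    """Weighted average of skill-specific parameter deltas (e.g., LoRA
    updates). `experts` is a list of {param_name: tensor} dicts with
    identical keys; `weights` defaults to uniform."""
    if weights is None:
        weights = [1.0 / len(experts)] * len(experts)
    return {name: sum(w * e[name] for w, e in zip(weights, experts))
            for name in experts[0]}

e1 = {"lora.A": torch.ones(4, 4)}
e2 = {"lora.A": 3 * torch.ones(4, 4)}
print(merge_expert_deltas([e1, e2])["lora.A"][0, 0])  # tensor(2.)
```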
Self-Calibrated Tuning of Vision-Language Models for Out-of-Distribution Detection
·2209 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
Self-Calibrated Tuning (SCT) enhances vision-language model OOD detection by adaptively weighting OOD regularization based on prediction uncertainty, mitigating issues caused by inaccurate feature ext…
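One plausible reading of the adaptive weighting: scale each sample's OOD regularization by the model's prediction confidence, so samples whose extracted surrogate-OOD features are least trustworthy contribute less. A simplified sketch with max softmax probability as the confidence proxy; a hedged reading, not SCT's exact formulation:

```python
import torch
import torch.nn.functional as F

def sct_style_loss(id_logits, labels, ood_reg_per_sample):
    """Classification loss plus per-sample OOD regularization scaled by
    prediction confidence (max softmax probability). A simplified proxy
    for SCT's self-calibrated weighting, not the paper's exact loss."""
    ce = F.cross_entropy(id_logits, labels)
    confidence = id_logits.softmax(dim=-1).amax(dim=-1).detach()
    return ce + (confidence * ood_reg_per_sample).mean()

loss = sct_style_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)), torch.rand(8))
```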
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
·3346 words·16 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 ByteDance Inc.
Boosting vision-language model performance, Contrastive ALignment (CAL) prioritizes visually correlated text tokens during training via a simple, computationally efficient re-weighting strategy, signi…
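The re-weighting idea can be sketched by contrasting token logits computed with and without the image: target tokens whose log-likelihood rises most when the image is present are treated as visually correlated and receive larger loss weights. A hedged sketch; the paper's normalization details may differ:

```python
import torch
import torch.nn.functional as F

def visually_weighted_lm_loss(logits_with_img, logits_without_img, targets):
    """Per-token LM loss re-weighted by how much the image raises each target
    token's log-likelihood (a sketch of contrastive alignment; normalization
    details are assumptions). logits_*: (B, T, V); targets: (B, T)."""
    lp_w = F.log_softmax(logits_with_img, -1).gather(-1, targets[..., None]).squeeze(-1)
    lp_wo = F.log_softmax(logits_without_img, -1).gather(-1, targets[..., None]).squeeze(-1)
    weights = (lp_w - lp_wo).clamp(min=0).detach()      # visual contribution
    weights = weights / (weights.mean() + 1e-8)         # keep loss scale stable
    nll = F.cross_entropy(logits_with_img.flatten(0, 1), targets.flatten(),
                          reduction="none").view_as(targets)
    return (weights * nll).mean()

B, T, V = 2, 5, 100
loss = visually_weighted_lm_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                                 torch.randint(0, V, (B, T)))
```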
Seeing Beyond the Crop: Using Language Priors for Out-of-Bounding Box Keypoint Prediction
·2045 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Waterloo
TokenCLIPose leverages language priors to predict human keypoints beyond the bounding box, significantly improving pose estimation accuracy on ice hockey, lacrosse, and CrowdPose datasets.
SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge
·2158 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Shanghai AI Laboratory
SearchLVLMs: A plug-and-play framework efficiently augments large vision-language models with up-to-date internet knowledge via hierarchical filtering, significantly improving accuracy on visual quest…
SceneCraft: Layout-Guided 3D Scene Generation
·2040 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Shanghai Jiao Tong University
SceneCraft generates highly detailed indoor scenes from user-provided textual descriptions and spatial layouts, overcoming limitations of previous text-to-3D methods in scale and control.
Scene Graph Generation with Role-Playing Large Language Models
·2597 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
SDSGG outperforms leading scene graph generation methods by using LLMs to create scene-specific descriptions, adapting to diverse visual relations.
RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation
·1948 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Peking University
RoboMamba: a novel robotic VLA model efficiently combines reasoning and action, achieving high speeds and accuracy while requiring minimal fine-tuning.
Right this way: Can VLMs Guide Us to See More to Answer Questions?
·2940 words·14 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 UC Santa Cruz
VLMs struggle when visual information is insufficient to answer a question; this work introduces a novel Directional Guidance task and a data augmentation framework, significantly improving VLM performance by teaching them …