Vision-Language Models
Toward a Stable, Fair, and Comprehensive Evaluation of Object Hallucination in Large Vision-Language Models
·2235 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Harbin Institute of Technology
LeHaCE: a novel framework for evaluating object hallucination in LVLMs, improving evaluation stability and fairness by accounting for instruction-induced image description length variations.
TFG: Unified Training-Free Guidance for Diffusion Models
·3585 words·17 mins·
Multimodal Learning
Vision-Language Models
🏢 Stanford University
TFG: a unified, training-free guidance framework that boosts diffusion model performance by efficiently searching an algorithm-agnostic design space.
Textual Training for the Hassle-Free Removal of Unwanted Visual Data: Case Studies on OOD and Hateful Image Detection
·2145 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Seoul National University
Hassle-Free Textual Training (HFTT) uses only textual data to effectively remove unwanted visual data from AI training datasets, significantly reducing human annotation needs.
Text-Infused Attention and Foreground-Aware Modeling for Zero-Shot Temporal Action Detection
·2535 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Dept. of Artificial Intelligence, Korea University
Ti-FAD: a novel zero-shot temporal action detection model that outperforms state-of-the-art methods by enhancing text-related visual focus and foreground awareness.
Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models
·2675 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science and Engineering, Tianjin University of Technology
Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) significantly improves vision-language model robustness against adversarial attacks by aligning and constraining text-guided attention, achievi…
Text-DiFuse: An Interactive Multi-Modal Image Fusion Framework based on Text-modulated Diffusion Model
·2167 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Wuhan University
Text-DiFuse: A novel interactive multi-modal image fusion framework leverages text-modulated diffusion models for superior performance in complex scenarios.
Text-Aware Diffusion for Policy Learning
·3340 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 Brown University
Text-Aware Diffusion for Policy Learning (TADPoLe) uses pretrained diffusion models for zero-shot reward generation, enabling natural language-driven policy learning without manual reward design.
Testing Semantic Importance via Betting
·4904 words·24 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Johns Hopkins University
This work presents statistically grounded methods to rank semantic concept importance in black-box models, using conditional independence testing for both global and local interpretations.
Temporal Sentence Grounding with Relevance Feedback in Videos
·2432 words·12 mins·
Natural Language Processing
Vision-Language Models
🏢 Peking University
The RaTSG network tackles Temporal Sentence Grounding with Relevance Feedback (TSG-RF) by discerning query relevance at multiple granularities before selectively grounding segments.
Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation
·2449 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
Tactile DreamFusion: High-resolution tactile sensing enhances 3D generation, creating realistic geometric details previously unattainable.
Tackling Uncertain Correspondences for Multi-Modal Entity Alignment
·1671 words·8 mins·
Multimodal Learning
Vision-Language Models
🏢 Hong Kong University of Science and Technology
TMEA: a novel approach that significantly boosts multi-modal entity alignment accuracy by effectively handling uncertain correspondences between modalities, improving data integration for diverse knowledge…
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
·2422 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Science and Technology of China
TabPedia: a novel large vision-language model that achieves superior visual table understanding by seamlessly integrating diverse tasks via a concept synergy mechanism, evaluated on a new benchmark.
T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
·3387 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 UC Santa Barbara
T2V-Turbo breaks the quality bottleneck of video consistency models by integrating mixed reward feedback during consistency distillation, enabling high-quality video generation with significantly fast…
Synergistic Dual Spatial-aware Generation of Image-to-text and Text-to-image
·2896 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Tianjin University
Synergistic Dual Spatial-aware Generation boosts image-to-text and text-to-image accuracy using a novel 3D scene graph and dual learning framework.
Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning
·3022 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 Google
The λ-Harmonic reward function and Reward Preference Optimization (RPO) improve subject-driven text-to-image generation, enabling faster training and state-of-the-art results with a simpler setup.
Stabilizing Zero-Shot Prediction: A Novel Antidote to Forgetting in Continual Vision-Language Tasks
·2243 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
ZAF: a novel replay-free continual learning method for vision-language models that significantly reduces forgetting by stabilizing zero-shot predictions.
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
·3123 words·15 mins·
Multimodal Learning
Vision-Language Models
🏢 UC San Diego
SpatialRGPT enhances Vision-Language Models’ spatial reasoning by integrating 3D scene graphs and depth information, achieving significant performance gains on spatial reasoning tasks.
SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors
·2401 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Oxford
SpatialPIN boosts vision-language models’ spatial reasoning by cleverly combining prompting techniques with 3D foundation models, achieving zero-shot performance on various spatial tasks.
SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization
·2449 words·12 mins·
Multimodal Learning
Vision-Language Models
🏢 Harvard University
SocialGPT cleverly leverages Vision Foundation Models and Large Language Models for zero-shot social relation reasoning, achieving competitive results and offering interpretable outputs via prompt opt…
SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM
·4826 words·23 mins·
AI Generated
Natural Language Processing
Vision-Language Models
🏢 School of Data Science, Fudan University
SlowFocus significantly improves fine-grained temporal understanding in video LLMs by using mixed-frequency sampling and a novel multi-frequency attention mechanism.