
Vision-Language Models

Referring Human Pose and Mask Estimation In the Wild
·2191 words·11 mins
Multimodal Learning Vision-Language Models 🏢 University of Western Australia
RefHuman: a new dataset and UniPHD model achieve state-of-the-art referring human pose and mask estimation in the wild, using text or positional prompts.
Referencing Where to Focus: Improving Visual Grounding with Referential Query
·2958 words·14 mins
Multimodal Learning Vision-Language Models 🏢 National Key Laboratory of Human-Machine Hybrid Augmented Intelligence
RefFormer boosts visual grounding accuracy by intelligently adapting queries using multi-level image features, effectively guiding the decoder towards the target object.
Recognize Any Regions
·2350 words·12 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of Surrey
RegionSpot efficiently integrates pretrained localization and vision-language models for superior open-world object recognition, achieving significant performance gains with minimal training.
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models
·2441 words·12 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Stanford University
RaVL: a novel approach that accurately discovers and effectively mitigates spurious correlations in fine-tuned vision-language models, improving zero-shot classification accuracy.
QUEST: Quadruple Multimodal Contrastive Learning with Constraints and Self-Penalization
·1744 words·9 mins
Multimodal Learning Vision-Language Models 🏢 Beihang University
QUEST: Quadruple Multimodal Contrastive Learning tackles feature suppression by using quaternion embedding to extract unique information while penalizing the excessive influence of shared information.
Q-VLM: Post-training Quantization for Large Vision-Language Models
·2070 words·10 mins
Multimodal Learning Vision-Language Models 🏢 Tsinghua University
Q-VLM: A novel post-training quantization framework significantly compresses large vision-language models, boosting inference speed without sacrificing accuracy.
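This summary does not describe Q-VLM's actual quantization procedure. Purely as background on what post-training quantization means in general, here is a minimal, assumed-generic sketch that quantizes a weight tensor to int8 with a single symmetric scale; it is not the paper's method and all names are illustrative.

```python
# Generic post-training weight quantization sketch -- NOT the Q-VLM method.
import torch

def quantize_int8(w: torch.Tensor):
    """Map a float weight tensor to int8 using one symmetric scale."""
    scale = w.abs().max() / 127.0                      # largest magnitude -> 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float tensor for use at inference time."""
    return q.to(torch.float32) * scale

w = torch.randn(256, 256)                              # stand-in for a layer's weights
q, s = quantize_int8(w)
print("max abs error:", (w - dequantize(q, s)).abs().max().item())
```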
Propensity Score Alignment of Unpaired Multimodal Data
·2058 words·10 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of British Columbia
Unlocking multimodal learning’s potential with propensity scores: This novel approach aligns unpaired data across modalities, significantly improving representation learning.
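As background on the generic idea of propensity-score matching only (not the paper's alignment procedure), the sketch below fits a classifier to distinguish two groups of samples and pairs each sample in one group with its nearest neighbour in propensity score. For simplicity it assumes both modalities are already embedded as feature vectors of equal dimension, and every name is hypothetical.

```python
# Generic propensity-score matching sketch -- NOT the paper's alignment method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(100, 16))              # hypothetical modality-A features
feats_b = rng.normal(size=(120, 16)) + 0.1        # hypothetical modality-B features

# Propensity score: probability that a sample came from modality B.
X = np.vstack([feats_a, feats_b])
y = np.concatenate([np.zeros(len(feats_a)), np.ones(len(feats_b))])
clf = LogisticRegression(max_iter=1000).fit(X, y)
score_a = clf.predict_proba(feats_a)[:, 1]
score_b = clf.predict_proba(feats_b)[:, 1]

# Pair each A sample with the B sample whose propensity score is closest.
nn = NearestNeighbors(n_neighbors=1).fit(score_b.reshape(-1, 1))
_, idx = nn.kneighbors(score_a.reshape(-1, 1))
pseudo_pairs = list(zip(range(len(feats_a)), idx.ravel()))
print(pseudo_pairs[:5])
```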
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
·2063 words·10 mins
Multimodal Learning Vision-Language Models 🏢 University of Strasbourg
PeskaVLP: Hierarchical knowledge augmentation boosts surgical video-language pretraining!
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
·4267 words·21 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Nanjing University
Prism: a novel framework disentangles perception and reasoning in Vision-Language Models (VLMs) for improved model assessment and efficient VLM development.
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
·2633 words·13 mins
Multimodal Learning Vision-Language Models 🏢 University of Oxford
Pre-trained text-to-image diffusion models create highly effective, versatile representations for embodied AI control, surpassing previous methods.
PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation
·3082 words·15 mins
Multimodal Learning Vision-Language Models 🏢 Sun Yat-Sen University
PIVOT-R, a novel primitive-driven waypoint-aware world model, significantly boosts robotic manipulation performance and efficiency via an asynchronous hierarchical executor.
PERIA: Perceive, Reason, Imagine, Act via Holistic Language and Vision Planning for Manipulation
·2974 words·14 mins
Multimodal Learning Vision-Language Models 🏢 College of Intelligence and Computing, Tianjin University
PERIA: Holistic language & vision planning for complex robotic manipulation!
ParallelEdits: Efficient Multi-Aspect Text-Driven Image Editing with Attention Grouping
·3634 words·18 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 University of Buffalo
ParallelEdits efficiently edits multiple image aspects simultaneously, guided by text prompts, surpassing sequential methods in speed and accuracy.
Pandora's Box: Towards Building Universal Attackers against Real-World Large Vision-Language Models
·2651 words·13 mins
Multimodal Learning Vision-Language Models 🏢 Peking University
Researchers developed a universal adversarial patch to fool real-world large vision-language models (LVLMs) across multiple tasks, without needing access to internal model details.
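For orientation only, the textbook white-box adversarial-patch loop is sketched below against a toy classifier. The paper's setting is harder (a universal patch, multiple tasks, no access to model internals), so treat this strictly as an assumed illustration with made-up names.

```python
# Textbook white-box adversarial patch sketch -- NOT the paper's universal attack.
import torch
import torch.nn.functional as F

model = torch.nn.Sequential(               # stand-in for a real vision model
    torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10)
)
image = torch.rand(1, 3, 64, 64)           # hypothetical input image
patch = torch.zeros(1, 3, 8, 8, requires_grad=True)
target = torch.tensor([3])                 # label the attacker wants predicted
opt = torch.optim.Adam([patch], lr=0.05)

for _ in range(100):
    patched = image.clone()
    patched[:, :, :8, :8] = torch.sigmoid(patch)    # keep patch pixels in [0, 1]
    loss = F.cross_entropy(model(patched), target)  # push prediction toward target
    opt.zero_grad()
    loss.backward()
    opt.step()
```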
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
·2912 words·14 mins
Multimodal Learning Vision-Language Models 🏢 Institute of Automation, Chinese Academy of Sciences
OneRef: Unified one-tower model surpasses existing methods in visual grounding and segmentation by leveraging a novel Mask Referring Modeling paradigm.
On the Comparison between Multi-modal and Single-modal Contrastive Learning
·455 words·3 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 RIKEN AIP
Multi-modal contrastive learning surpasses its single-modal counterpart by leveraging inter-modal correlations to improve feature learning and downstream task performance, as demonstrated through a novel theoretical analysis.
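For readers who want a concrete anchor for "multi-modal contrastive learning", a minimal CLIP-style symmetric InfoNCE loss is sketched below; it is a generic illustration, not the theoretical model analyzed in the paper.

```python
# Minimal CLIP-style multi-modal contrastive (InfoNCE) loss: matched image/text
# pairs attract, all other pairings in the batch repel. Generic illustration only.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature       # scaled cosine similarities
    labels = torch.arange(len(img))            # i-th image matches i-th caption
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

img_emb = torch.randn(8, 512)                  # hypothetical batch of embeddings
txt_emb = torch.randn(8, 512)
print(clip_contrastive_loss(img_emb, txt_emb).item())
```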
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
·2191 words·11 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
OmniTokenizer: A transformer-based tokenizer achieving state-of-the-art image and video reconstruction by leveraging a novel spatial-temporal decoupled architecture and progressive training strategy.
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
·3479 words·17 mins
AI Generated Multimodal Learning Vision-Language Models 🏢 Peking University
OmniJARVIS: Unified vision-language-action tokenization enables open-world instruction-following agents via unified multimodal interaction data.
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
·3418 words·17 mins
Multimodal Learning Vision-Language Models 🏢 Skywork AI
OMG-LLaVA: A single model elegantly bridges image, object, and pixel-level reasoning for superior visual understanding.
Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding
·1696 words·8 mins
Multimodal Learning Vision-Language Models 🏢 Baidu
Octopus, a novel multi-modal LLM, uses parallel visual recognition and sequential understanding to achieve 5x speedup on visual grounding and improved accuracy on various MLLM tasks.