Multimodal Learning
Yo'LLaVA: Your Personalized Language and Vision Assistant
·4272 words·21 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Wisconsin-Madison
Yo’LLaVA personalizes Large Multimodal Models (LMMs) to converse about specific subjects using just a few images, embedding concepts into latent tokens for efficient and effective personalized conversations.
XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation
·2133 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
XMask3D uses cross-modal mask reasoning to achieve state-of-the-art open vocabulary 3D semantic segmentation by aligning 2D and 3D features at the mask level, resulting in precise segmentation boundaries.
Wings: Learning Multimodal LLMs without Text-only Forgetting
·1958 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Alibaba International Digital Commerce
WINGS: A novel multimodal LLM combats ’text-only forgetting’ by using complementary visual and textual learners, achieving superior performance on text-only and visual tasks.
Why are Visually-Grounded Language Models Bad at Image Classification?
·3661 words·18 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 Stanford University
Visually-grounded Language Models (VLMs) surprisingly underperform in image classification. This study reveals that this is primarily due to a lack of sufficient classification data during VLM training.
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
·3423 words·17 mins·
Multimodal Learning
Vision-Language Models
🏢 University of California, Santa Barbara
T2IScoreScore objectively evaluates text-to-image prompt faithfulness metrics using semantic error graphs, revealing that simpler metrics surprisingly outperform complex, computationally expensive ones.
What matters when building vision-language models?
·2924 words·14 mins·
Multimodal Learning
Vision-Language Models
🏢 Hugging Face
Idefics2, a new 8B-parameter VLM, achieves state-of-the-art performance, closing the gap with much larger models by meticulously analyzing design choices and training methods.
What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights
·3275 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 University of Hong Kong
CLIP’s robustness to long-tailed pre-training data stems from its dynamic classification task and descriptive language supervision, offering transferable insights for improving model generalizability.
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
·2619 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Central South University
Unlocking the full potential of multi-modal in-context learning requires understanding its core factors. This research systematically explores these factors, highlighting the importance of a multi-mod…
WATT: Weight Average Test Time Adaptation of CLIP
·3263 words·16 mins·
Multimodal Learning
Vision-Language Models
🏢 ETS Montréal, Canada
WATT: a novel test-time adaptation method boosts CLIP’s performance on domain shifted images by cleverly averaging weights from multiple text prompts, achieving state-of-the-art results without extra …
Voila-A: Aligning Vision-Language Models with User's Gaze Attention
·2566 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 SKLSDE Lab, Beihang University
Voila-A enhances vision-language models by aligning their attention with user gaze, improving real-world application effectiveness and interpretability.
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
·2266 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Carnegie Mellon University
VLMs learn to generate their own memories by abstracting experiences from noisy demonstrations and human feedback, significantly boosting in-context learning performance.
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
·3475 words·17 mins·
Multimodal Learning
Vision-Language Models
🏢 Skywork AI
VITRON: a unified pixel-level Vision LLM excels in understanding, generating, segmenting, and editing images and videos.
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
·3551 words·17 mins·
AI Generated
Multimodal Learning
Vision-Language Models
🏢 University of Washington
Visual SKETCHPAD empowers multimodal language models (LLMs) with visual reasoning abilities by allowing them to generate intermediate sketches. This innovative framework substantially enhances LLM performance.
Visual Perception by Large Language Model’s Weights
·2070 words·10 mins·
Multimodal Learning
Vision-Language Models
🏢 Tencent AI Lab
VLORA: Boosting Multimodal LLMs efficiency by merging visual features into model weights instead of extending input sequences.
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
·2150 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Chinese Academy of Sciences
AcFormer, a novel vision-language connector for MLLMs, leverages ‘visual anchors’ to reduce computation cost by ~66% while improving accuracy.
VisMin: Visual Minimal-Change Understanding
·2710 words·13 mins·
Multimodal Learning
Vision-Language Models
🏢 Mila - Quebec AI Institute
VisMin benchmark evaluates visual-language models’ fine-grained understanding by identifying minimal image-text differences (object, attribute, count, spatial relation). Current VLMs struggle with spatial relations and counting.
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
·6701 words·32 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
VisionLLM v2 unifies visual perception, understanding, and generation, excelling in various vision tasks and achieving performance comparable to task-specific models.
Vision-Language Navigation with Energy-Based Policy
·1855 words·9 mins·
Multimodal Learning
Vision-Language Models
🏢 Zhejiang University
Energy-based Navigation Policy (ENP) revolutionizes Vision-Language Navigation by modeling joint state-action distributions, achieving superior performance across diverse benchmarks.
Vision-Language Models are Strong Noisy Label Detectors
·2173 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 School of Computer Science and Engineering, Southeast University
Vision-language models effectively detect noisy labels, improving image classification accuracy with DEFT.
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
·2294 words·11 mins·
Multimodal Learning
Vision-Language Models
🏢 Tsinghua University
Latent Compression Learning (LCL) revolutionizes vision model pre-training by effectively leveraging readily available interleaved image-text data, achieving performance comparable to models trained on paired image-text data.